{"id":133,"date":"2022-04-29T10:13:53","date_gmt":"2022-04-29T09:13:53","guid":{"rendered":"https:\/\/amlen.org\/?p=133"},"modified":"2022-04-29T10:13:53","modified_gmt":"2022-04-29T09:13:53","slug":"iot-code-smells-work-triggered-by-a-device-reconnection","status":"publish","type":"post","link":"https:\/\/amlen.org\/index.php\/2022\/04\/29\/iot-code-smells-work-triggered-by-a-device-reconnection\/","title":{"rendered":"IoT code smells: Work triggered by a device reconnection"},"content":{"rendered":"\n<p>The most important piece of advice I would give someone trying to design an IoT architecture that scales is to think very hard about an event where a large fraction (&gt;25%, or even 100%) of devices are suddenly disconnected at the same time.<\/p>\n\n\n\n<p>In IoT protocols like MQTT, data is sent over a long-lived network connection. A common pattern in device firmware is to do some &#8220;work&#8221; each time a connection is established, e.g. downloading config such as which topics to subscribe to. I&#8217;d argue this is often a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Code_smell\">code smell<\/a>.<\/p>\n\n\n\n<p>This pattern is fine when the number of devices is small, but if tens of thousands, hundreds of thousands or even millions of devices are disconnected at once and all try to reconnect, this &#8220;extra&#8221; work arrives at a time when the system as a whole is likely to be at its busiest.<\/p>\n\n\n\n<p>When this kind of disconnect event happens, the network infrastructure and the brokers may be able to scale and handle all the TLS connections being re-established. But if back-end applications are also scaling and struggling under load, a short reconnect event can spiral into a significant outage. 
The problem is especially serious when devices that do not receive the information they were expecting within a time-out reset\/reconnect, creating a <strong>vicious cycle<\/strong> of pain for everyone involved.<\/p>\n\n\n\n<p>Instead of devices querying setup\/config on every reconnect, a better pattern is to retrieve any required information on first connect, cache it client side, and refresh it at intervals chosen so that a significant fraction of the devices do not make the request at the same time.<\/p>\n\n\n\n<p>&#8220;But wait,&#8221; I hear the reader cry, &#8220;our IoT infrastructure is spread across multiple datacenters&#8221; (or even multiple regions). That is good, and it should reduce the number of such events, but in complex systems it is a brave architect who declares such an event will never happen. In my experience they have happened to multiple organisations with differently shaped IoT infrastructure (and they tend to stick in the mind).<\/p>\n\n\n\n<p>Apart from the belief that such events won&#8217;t happen, another reason this kind of re-bootstrapping is chosen is that each connection might be routed to a different system that requires slightly different config. Because of the uneven spike of traffic this pattern causes, I&#8217;d try hard to avoid such a design. If it is necessary, it is important to acknowledge the possibility of mass disconnect events, to have the spare capacity to handle them (or the ability to scale quickly enough), and to test the system thoroughly.<\/p>\n\n\n\n<p>I imagine there are lots of &#8220;IoT at scale&#8221; war stories out there. Are there any (anonymised?) 
accounts that people would recommend reading?<\/p>\n\n\n\n<p>  <\/p>\n","protected":false},"excerpt":{"rendered":"<p>The most important piece of advice I would give someone trying to design an IoT architecture that scales is to think very hard about an event where a large fraction (&gt;25% even 100%) of devices are suddenly disconnected at the same time. In IoT protocols like MQTT, data is sent over a long lived network&hellip; <a class=\"more-link\" href=\"https:\/\/amlen.org\/index.php\/2022\/04\/29\/iot-code-smells-work-triggered-by-a-device-reconnection\/\">Continue reading <span class=\"screen-reader-text\">IoT code smells: Work triggered by a device reconnection<\/span><\/a><\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-133","post","type-post","status-publish","format-standard","hentry","category-uncategorised","entry"],"_links":{"self":[{"href":"https:\/\/amlen.org\/index.php\/wp-json\/wp\/v2\/posts\/133","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/amlen.org\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/amlen.org\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/amlen.org\/index.php\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/amlen.org\/index.php\/wp-json\/wp\/v2\/comments?post=133"}],"version-history":[{"count":10,"href":"https:\/\/amlen.org\/index.php\/wp-json\/wp\/v2\/posts\/133\/revisions"}],"predecessor-version":[{"id":144,"href":"https:\/\/amlen.org\/index.php\/wp-json\/wp\/v2\/posts\/133\/revisions\/144"}],"wp:attachment":[{"href":"https:\/\/amlen.org\/index.php\/wp-json\/wp\/v2\/media?parent=133"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/amlen.org\/index.php\/wp-json\/wp\/v2\/categories?post=133"},{"taxonomy":"post_tag","embeddable":
true,"href":"https:\/\/amlen.org\/index.php\/wp-json\/wp\/v2\/tags?post=133"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}