The most important piece of advice I would give anyone designing an IoT architecture that scales is to think very hard about what happens when a large fraction of devices (more than 25%, or even 100%) are disconnected at the same time.
In IoT protocols like MQTT, data is sent over a long-lived network connection, and a common pattern in device firmware is to do some “work” each time a connection is established, e.g. downloading config such as which topics to subscribe to. I’d argue this is often a code smell.
This pattern is fine when the number of devices is small, but if tens of thousands, hundreds of thousands or even millions of devices are disconnected at once, this “extra” work arrives just when the system as a whole is likely to be at its busiest.
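To get a feel for the size of that “extra” work, here is a rough back-of-the-envelope sketch. The function name, the 60-second reconnect window and the two config requests per reconnect are all made-up illustrative assumptions, not measurements from any real system.

```python
def reconnect_storm_rps(devices: int, window_s: float,
                        requests_per_reconnect: int) -> float:
    """Average extra back-end requests per second while a fleet reconnects.

    Assumes (hypothetically) that reconnects are spread evenly over
    `window_s` seconds and that every reconnect triggers the same number
    of config requests.
    """
    return devices * requests_per_reconnect / window_s


if __name__ == "__main__":
    # Illustrative fleet sizes only.
    for fleet in (1_000, 100_000, 1_000_000):
        rps = reconnect_storm_rps(fleet, window_s=60.0, requests_per_reconnect=2)
        print(f"{fleet:>9} devices -> {rps:,.0f} extra requests/s")
```

With these assumed numbers, a thousand devices add a few dozen requests per second, while a million add tens of thousands per second, on top of the connection handshakes themselves.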
When this kind of disconnect event happens, the network infrastructure and the brokers may well scale and cope with all the TLS connections being re-established. But if the back-end applications are also scaling and struggling under load, a short reconnect event can spiral into a significant outage. The problem is especially serious when devices that do not receive the information they were expecting within a time-out reset and reconnect, creating a vicious cycle of pain for everyone involved.
Instead of having devices query setup/config on every reconnect, a better pattern is to retrieve any required information on first connect, cache it client side, and refresh it at intervals chosen so that a significant fraction of the devices never make the request at the same time.
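A minimal sketch of that pattern might look like the following. Everything here is an assumption for illustration: the `CachedConfig` class, the JSON file on local storage, the roughly daily refresh interval and the ±50% jitter are hypothetical choices, and `fetch` stands in for whatever call your firmware makes to retrieve its config.

```python
import json
import random
import time
from pathlib import Path
from typing import Callable, Optional

REFRESH_INTERVAL_S = 24 * 60 * 60  # assumed: refresh about once a day
JITTER_FRACTION = 0.5              # assumed: spread refreshes over +/-50%


def next_refresh_delay(base_s: float = REFRESH_INTERVAL_S,
                       jitter: float = JITTER_FRACTION,
                       rng: Optional[random.Random] = None) -> float:
    """Randomised delay so devices do not all refresh at the same moment."""
    source = rng if rng is not None else random
    return base_s * source.uniform(1.0 - jitter, 1.0 + jitter)


class CachedConfig:
    """Fetch config on first connect, then serve reconnects from local cache."""

    def __init__(self, path: str, fetch: Callable[[], dict],
                 now: Callable[[], float] = time.time) -> None:
        self.path = Path(path)
        self.fetch = fetch  # hypothetical back-end call
        self.now = now

    def _load(self) -> Optional[dict]:
        try:
            return json.loads(self.path.read_text())
        except (FileNotFoundError, json.JSONDecodeError):
            return None

    def _store(self, config: dict) -> None:
        entry = {"config": config,
                 "refresh_at": self.now() + next_refresh_delay()}
        self.path.write_text(json.dumps(entry))

    def on_connect(self) -> dict:
        """Call on every (re)connect: only the very first one hits the back end."""
        entry = self._load()
        if entry is not None:
            return entry["config"]  # reconnects never trigger a fetch
        config = self.fetch()
        self._store(config)
        return config

    def refresh_if_due(self) -> bool:
        """Call from a periodic timer, independent of connection events."""
        entry = self._load()
        if entry is None or self.now() < entry["refresh_at"]:
            return False
        try:
            config = self.fetch()
        except Exception:
            # Keep the old config and reschedule rather than retrying at once.
            self._store(entry["config"])
            return False
        self._store(config)
        return True
```

The point of the design is that `on_connect` and `refresh_if_due` are decoupled: a mass reconnect produces no config traffic at all, and the jittered `refresh_at` keeps the periodic refreshes from lining up across the fleet.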
“But wait” I hear the reader cry: “Our IoT infrastructure is spread across multiple datacenters” (or even multiple regions). That is good, and it should reduce the number of such events, but in complex systems it is a brave architect who declares such an event will never happen. In my experience they have happened to multiple organisations with differently shaped IoT infrastructure (and they tend to stick in the mind).
Apart from the belief that such events won’t happen, another reason this kind of re-bootstrapping gets chosen is that each connection might be routed to a different system that requires slightly different config. Because of the traffic spikes this pattern causes, I’d try hard to avoid such a design. If it is necessary, then it is important to acknowledge the possibility of mass disconnect events, to have the spare capacity to handle them (or the ability to scale quickly enough), and to test the system thoroughly.
I imagine there are lots of “IoT at scale” war stories out there. Are there any (anonymised?) accounts that people would recommend reading?