TinyML has made it possible to run deep learning models on microcontrollers. This matters because microcontrollers are very cheap, often costing under $0.50; they run on coin-cell batteries and suit a wide range of applications. Deploying a deep learning model on such a tiny device plays a big role in democratising artificial intelligence. That said, the limited memory of microcontrollers severely constrains tiny deep learning. For convolutional neural networks in particular, TinyML suffers from a memory bottleneck caused by an imbalanced memory distribution: the first few blocks use far more memory than the rest of the network.
To remedy this issue, researchers from MIT and IBM jointly introduced MCUNetV2, a deep learning architecture and inference scheme that fits its memory demands within the limits of microcontrollers.
The memory bottleneck is a well-researched area in TinyML. In the past, researchers have introduced techniques like pruning, quantisation, and neural architecture search (NAS) for efficient deep learning. These methods have not fully solved the problem because they focus on reducing the number of parameters and FLOPs and have little impact on the actual memory bottleneck. Most NAS methods reuse standard CNN backbone designs, which leads to an imbalanced memory distribution under per-layer inference. This restricts the input resolution, preventing good performance on tasks like object detection without patch-based inference scheduling.
This tight memory budget limits the feature map (activation) size, restricting models to a small capacity or a small input image size. Input resolutions in existing models stay below 224×224. This might be acceptable for tasks like image classification but not for dense prediction tasks like object detection, which limits the application of tiny deep learning to several real-life tasks. To accommodate the memory-intensive initial stage, the whole network has to be scaled down even though the majority of the network already has a small memory footprint.
In the paper titled “MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning”, the researchers first proposed a patch-by-patch execution order for the memory-intensive initial stage of CNNs. It operates on a small spatial region of the feature map at a time instead of the whole activation. This strategy cuts the peak memory of the initial stage and makes it possible to fit a larger input resolution.
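To see why operating on one patch at a time lowers peak memory, here is a toy back-of-the-envelope sketch (not the authors' implementation; all layer sizes are hypothetical and overlap halos are ignored for simplicity):

```python
# Toy illustration: peak activation memory of an initial conv stage,
# for whole-map (per-layer) inference versus patch-by-patch inference.
# Sizes are hypothetical int8 activations; halo overlap is ignored.

def act_bytes(h, w, c, bytes_per_elem=1):
    """Activation size in bytes for an h x w x c int8 feature map."""
    return h * w * c * bytes_per_elem

# Hypothetical first stage: 224x224x3 input -> 112x112x16 output.
in_h = in_w = 224
out_h = out_w = 112
in_c, out_c = 3, 16

# Per-layer inference must hold the full input and full output at once.
per_layer_peak = act_bytes(in_h, in_w, in_c) + act_bytes(out_h, out_w, out_c)

# Patch-based inference splits the output into a 4x4 grid and only keeps
# one input patch and one output patch in memory at a time.
grid = 4
patch_peak = (act_bytes(in_h // grid, in_w // grid, in_c)
              + act_bytes(out_h // grid, out_w // grid, out_c))

print(f"per-layer peak:   {per_layer_peak} bytes")   # 351232 bytes
print(f"patch-based peak: {patch_peak} bytes")       # 21952 bytes
print(f"reduction:        {per_layer_peak / patch_peak:.0f}x")  # 16x
```

Even in this simplified setting, a 4×4 patch grid cuts the peak activation memory of the first stage by roughly the number of patches, which is what lets a larger input resolution fit in the same SRAM.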
Because neighbouring output patches depend on shared input pixels, the input patches must overlap, leading to repeated computation and thus computational overhead. The overhead grows with the receptive field of the initial stage: a larger receptive field requires larger input patches and therefore more overlap. Considering this, the researchers proposed receptive field redistribution, which shifts the receptive field and workload to a later stage of the network, reducing the required patch size and the computational overhead caused by overlapping without compromising the performance of the network.
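The link between receptive field and overhead can be quantified with a rough model (hypothetical numbers, not from the paper): an output patch of side `p` from a stage with overall stride `s` and receptive field `r` needs an input patch of side `s*p + (r - s)`, so the ratio of patched-input pixels to image pixels measures the redundant compute.

```python
# Rough model: overlap overhead of patch-based inference as a function
# of the initial stage's receptive field. A larger receptive field means
# each output patch needs a larger (more overlapping) input patch.

def overhead(img, grid, stride, rf):
    """Ratio of patched-input pixels read to image pixels (>= 1.0)."""
    out = img // stride                   # output side of the stage
    p = out // grid                       # output patch side
    in_patch = stride * p + (rf - stride) # input side needed per patch
    return (grid * grid * in_patch ** 2) / img ** 2

img, grid, stride = 224, 4, 4   # 224x224 input, 4x4 patch grid
print(overhead(img, grid, stride, rf=4))    # 1.0: no overlap at all
print(overhead(img, grid, stride, rf=36))   # ~2.47: heavy redundancy
```

This is exactly the lever receptive field redistribution pulls: shrinking the early stage's receptive field (and pushing it to later, non-patched stages) drives this ratio back toward 1.0.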
They also exploited patch-based inference to enlarge the design space of the neural network, offering more freedom in trading off input resolution against model size. With neural architecture search, they jointly searched for an optimal deep model and its inference schedule for specific datasets and hardware, minimising the computational overhead under patch-based execution.
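The flavour of that joint search can be sketched as a toy grid search (all knobs, budgets, and proxy formulas here are hypothetical, not the paper's actual search space or cost model): enumerate architecture knobs together with the inference-schedule knob, discard configurations that exceed the SRAM budget, and keep the best-scoring survivor.

```python
# Toy sketch of jointly searching architecture knobs (input resolution,
# width multiplier) and an inference-schedule knob (patch grid) under a
# hypothetical 256 KB SRAM budget. Proxies below are made up.

SRAM_BUDGET = 256 * 1024  # bytes

def peak_mem(res, width, grid):
    """Crude peak-activation proxy: one patch of a stride-2 first stage."""
    out = res // 2                       # first-stage output side
    ch = int(16 * width)                 # first-stage channels
    return (out // grid) ** 2 * ch + (res // grid) ** 2 * 3

def score(res, width):
    """Accuracy proxy: favours higher resolution and wider networks."""
    return res * width

best = None
for res in (128, 160, 192, 224, 256):
    for width in (0.5, 0.75, 1.0):
        for grid in (1, 2, 4):
            if peak_mem(res, width, grid) <= SRAM_BUDGET:
                cand = (score(res, width), res, width, grid)
                best = max(best, cand) if best else cand

print(best)  # highest-scoring (score, res, width, grid) that fits
```

The point of the sketch is the coupling: without the patch-grid knob, the high-resolution configurations would blow the budget and be pruned, so searching the schedule together with the architecture unlocks operating points a backbone-only search cannot reach.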
MCUNetV2 was shown to improve the object detection performance on microcontrollers by 16.9 per cent. It also recorded ImageNet accuracy of 71.8 per cent. “Our study largely addresses the memory bottleneck in TinyML and paves the way for vision applications beyond classification,” said the authors.
With advancements like MCUNetV2, in the coming years, we can expect TinyML to find its way into several microcontrollers in homes, offices, hospitals, factories, etc., to enable applications that were previously impossible.
Read the full paper here.