An SRAM optimized approach for constant memory consumption and ultra-fast execution of ML classifiers on TinyML hardware
Date
2021-11-15

Authors
Sudharsan, Bharath
Yadav, Piyush
Breslin, John G.
Ali, Muhammad Intizar
Recommended Citation
Sudharsan, Bharath, Yadav, Piyush, Breslin, John G., & Ali, Muhammad Intizar. (2021). An SRAM optimized approach for constant memory consumption and ultra-fast execution of ML classifiers on TinyML hardware. Paper presented at the IEEE International Conference on Services Computing (SCC), Online Virtual Congress, 05-11 September. doi:10.1109/SCC53864.2021.00045
Abstract
With the introduction of ultra-low-power machine learning (TinyML), IoT devices are becoming smarter as they are driven by Machine Learning (ML) models. However, any increase in the training data results in a linear increase in the space complexity of the ML models, making it highly challenging to deploy such models on IoT devices with limited memory (TinyML hardware). To alleviate such memory issues, in this paper we present an SRAM-optimized classifier porting, stitching, and efficient deployment approach. The proposed method enables large classifiers to be comfortably executed on microcontroller unit (MCU) based IoT devices and to perform ultra-fast classifications while consuming 0 bytes of SRAM. We tested our SRAM-optimized approach by using it to port and execute 7 dataset-trained classifiers on 7 popular MCU boards, and report their inference time and memory (Flash and SRAM) consumption. The experimental results show that: (i) the classifiers ported using our proposed approach are of varied sizes but have constant SRAM consumption, so the approach enabled the deployment of larger ML classifier models even on the tiny ATmega328P MCU-based Arduino Nano, which has only 2 kB of SRAM; (ii) even the resource-constrained 8-bit MCUs performed unit inference faster (in less than a millisecond) than an NVIDIA Jetson Nano GPU and a Raspberry Pi 4 CPU; (iii) the majority of models produced 1-4x faster inference results in comparison with the models ported by the sklearn-porter, m2cgen, and emlearn libraries.
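
The paper details the actual porting and stitching procedure; purely as an illustration of the general idea of keeping model parameters out of SRAM, the following minimal Arduino-style sketch (not the authors' implementation) stores the weights of a hypothetical logistic-regression classifier in AVR flash via PROGMEM and streams them at inference time, so SRAM consumption stays constant regardless of model size. All identifiers and values here are assumptions for illustration.

    #include <avr/pgmspace.h>
    #include <math.h>

    // Hypothetical model: a 4-feature logistic-regression classifier.
    // Parameters live in flash (PROGMEM), not SRAM; values are illustrative.
    #define N_FEATURES 4
    const float WEIGHTS[N_FEATURES] PROGMEM = {0.42f, -1.30f, 0.07f, 2.15f};
    const float BIAS PROGMEM = -0.58f;

    // Predict the class of one sample. Each coefficient is streamed from
    // flash with pgm_read_float(), so SRAM usage is a few stack bytes no
    // matter how large the parameter array grows.
    int predict(const float *x) {
      float z = pgm_read_float(&BIAS);
      for (int i = 0; i < N_FEATURES; i++) {
        z += pgm_read_float(&WEIGHTS[i]) * x[i];
      }
      return (1.0f / (1.0f + expf(-z))) >= 0.5f;  // sigmoid threshold
    }

    void setup() {
      Serial.begin(9600);
      const float sample[N_FEATURES] = {5.1f, 3.5f, 1.4f, 0.2f};
      Serial.println(predict(sample));  // prints the predicted class (0 or 1)
    }

    void loop() {}

On an ATmega328P, a pattern like this leaves the 2 kB of SRAM free for application state, since the coefficients are read directly from the 32 kB flash during each inference.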