Comparing the inference speed of different TensorFlow implementations
Machine learning started off as research into emulating the human brain [1]. In recent years, however, the models being developed have moved further away from the structure of the human brain [2]. At the same time, state-of-the-art models are now able to perform certain tasks better than humans [3]. These rapid improvements over the last five to ten years have led to the utilisation of deep neural networks outside of the research domain [4]. Today most of the large tech companies, such as Google and Facebook, use neural networks as part of their products.
Not only datacentre-based applications, such as those of Google and Facebook, but also applications running at the edge of the network, for example on smartphones, require fast and efficient evaluation of neural networks to save energy and guarantee a quick response. Devices at the edge are particularly limited in computing power and often do not provide accelerators such as GPUs or dedicated neural network accelerators. It is therefore of great importance to have frameworks and libraries that enable fast inference of neural networks on an ordinary CPU.
In this post I compare the inference speed of two neural networks across three implementations: the OpenCV DNN module (OpenCV version 3.4.8), CppFlow, which is a wrapper around the TensorFlow C-API, and TensorFlow-Lite (TfLite), which is a flavour of the TensorFlow framework specialized for inference at the edge.
Setup
This section explains the setup that was used to measure the performance.
The Neural Networks
Two different convolutional neural networks (CNNs) were used for the evaluation. Both networks were trained from scratch with TensorFlow 1.15: the semantic segmentation network with the Keras API and the classification network with the NN-API.
Semantic Segmentation
The first CNN is a fully convolutional neural network used for semantic segmentation. Semantic segmentation is the process of assigning a class to every pixel of the input image. In this case the input is a greyscale image of size 128x128 acquired from the camera of an autonomous vehicle. The CNN determines for every pixel whether it is part of the road, which lane it belongs to, and whether it shows a special feature such as a pedestrian island.
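To make the per-pixel labelling concrete, here is a minimal sketch of how a class map could be derived from the network output. It assumes that the network emits one score per class for every pixel, stored pixel by pixel (height x width x classes), and that the class with the highest score wins. Neither the memory layout nor the helper name `argmax_per_pixel` comes from the original test script; both are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns, for every pixel, the index of the class with the highest score.
// Assumed layout: all class scores of pixel 0, then all scores of pixel 1, ...
std::vector<int> argmax_per_pixel(const std::vector<float>& scores,
                                  std::size_t pixels, std::size_t classes) {
    assert(classes > 0 && scores.size() == pixels * classes);
    std::vector<int> labels(pixels, 0);
    for (std::size_t p = 0; p < pixels; ++p) {
        const float* s = &scores[p * classes];  // scores of pixel p
        for (std::size_t c = 1; c < classes; ++c) {
            if (s[c] > s[labels[p]]) labels[p] = static_cast<int>(c);
        }
    }
    return labels;
}
```

For the 128x128 input described above, `pixels` would be 128 * 128, while `classes` depends on how many road features the network distinguishes.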
The network consists of three pairs of convolutional layers, each pair followed by a max-pooling layer. Each pooling step reduces the spatial size of the feature maps and enlarges the receptive field of the CNN. Upsampling is done with three transposed convolutional layers.
Structure of the Semantic Segmentation CNN, image created using NN-SVG [5]
Classification
The second CNN is used for the classification of traffic signs. The input of the network is a colour image of size 80x80.
The CNN consists of four convolutional layers, each followed by a max-pooling layer. After the convolutional layers there are two fully connected layers: the first consists of 2048 neurons, the second of 29 neurons whose outputs represent the probability distribution over the classes.
Structure of the Classification CNN, image created using NN-SVG [5]
Hardware used
The benchmark was run on a laptop with an Intel Core i5-3230M CPU, running Ubuntu 18.04.3 with kernel version 5.0.0-37. The laptop does not have a dedicated graphics card.
Test script
All measurements were made with a small C++ program. The program first loads the model (saved as a protobuf file for CppFlow and OpenCV, or as a tflite file for TensorFlow-Lite). It then runs the model ten times without measuring, because TensorFlow allocates its memory on the first run, which makes that run a lot slower. Finally it runs the model 1000 times, each time with new random input data. The random input stands in for real data and makes sure that no caching or other optimizations influence the inference. The runtime of every one of the 1000 runs is measured, and from this data the mean runtime, the standard deviation, the minimal runtime and the maximal runtime are calculated.
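The measurement loop described above could look roughly like the following sketch. It is not the original test program: the callable `infer` is a placeholder for the framework-specific forward pass, the names `Stats`, `summarize` and `benchmark` are hypothetical, the input is assumed to be a flat float buffer, and the standard deviation is computed as the population standard deviation, which the post does not specify.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Summary statistics over a set of runtimes, all in milliseconds.
struct Stats {
    double mean;
    double stddev;
    double min;
    double max;
};

// Computes mean, population standard deviation, minimum and maximum.
Stats summarize(const std::vector<double>& ms) {
    assert(!ms.empty());
    const double n = static_cast<double>(ms.size());
    const double mean = std::accumulate(ms.begin(), ms.end(), 0.0) / n;
    double squared_error = 0.0;
    for (double v : ms) squared_error += (v - mean) * (v - mean);
    const auto mm = std::minmax_element(ms.begin(), ms.end());
    return Stats{mean, std::sqrt(squared_error / n), *mm.first, *mm.second};
}

// Runs `infer` on fresh random input in every iteration. The first
// `warmup_runs` iterations are not recorded; the remaining `timed_runs`
// iterations are timed individually with a monotonic clock.
template <typename Infer>
Stats benchmark(Infer&& infer, std::size_t input_size,
                int warmup_runs = 10, int timed_runs = 1000) {
    std::mt19937 rng(42);  // fixed seed, chosen only for reproducibility
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> input(input_size);
    std::vector<double> runtimes;
    runtimes.reserve(static_cast<std::size_t>(timed_runs));

    for (int i = 0; i < warmup_runs + timed_runs; ++i) {
        for (float& v : input) v = dist(rng);  // new random input every run
        const auto start = std::chrono::steady_clock::now();
        infer(input);
        const auto stop = std::chrono::steady_clock::now();
        if (i >= warmup_runs) {
            runtimes.push_back(
                std::chrono::duration<double, std::milli>(stop - start).count());
        }
    }
    return summarize(runtimes);
}
```

Generating the input outside the timed region keeps the random number generator out of the measurement, so only the forward pass itself contributes to the reported runtime.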
The program is compiled with GCC-8 using maximal optimization (-O3) for this target (-march=native -mtune=native).
OpenCV is compiled from source to use all instructions available on the CPU, especially vector instructions (SIMD).
For TensorFlow-Lite there is an option to set the number of threads.
To test the influence of this parameter the benchmark is done with one to eight threads.
Results
Semantic Segmentation
Implementation | Mean (ms) | Standard deviation (ms) | Min (ms) | Max (ms) |
---|---|---|---|---|
OpenCV | 12.6762 | 3.24769 | 11.3089 | 34.6139 |
CppFlow | 13.4546 | 0.526951 | 12.5265 | 21.6826 |
TfLite (1 Thread) | 21.2997 | 0.285547 | 21.0454 | 24.4562 |
TfLite (2 Threads) | 21.0152 | 1.78953 | 16.5106 | 31.8352 |
TfLite (3 Threads) | 20.3686 | 1.2116 | 16.3625 | 25.7325 |
TfLite (4 Threads) | 18.0461 | 1.55253 | 15.7319 | 22.4034 |
TfLite (5 Threads) | 17.5016 | 1.53257 | 15.6677 | 27.479 |
TfLite (6 Threads) | 17.7587 | 1.637 | 15.6416 | 26.0748 |
TfLite (7 Threads) | 17.6306 | 1.51185 | 15.6903 | 25.3047 |
TfLite (8 Threads) | 17.7098 | 1.49929 | 15.7096 | 22.5983 |
Classification
Implementation | Mean (ms) | Standard deviation (ms) | Min (ms) | Max (ms) |
---|---|---|---|---|
OpenCV | 5.54286 | 0.827205 | 4.44979 | 12.9081 |
CppFlow | 6.14033 | 0.485032 | 5.43162 | 13.4802 |
TfLite (1 Thread) | 186.294 | 21.5911 | 179.202 | 395.874 |
TfLite (2 Threads) | 97.7576 | 5.2064 | 96.3869 | 181.526 |
TfLite (3 Threads) | 96.0766 | 3.03276 | 94.7009 | 138.22 |
TfLite (4 Threads) | 92.1016 | 6.11409 | 90.7117 | 262.123 |
TfLite (5 Threads) | 105.2 | 6.31152 | 98.5465 | 238.183 |
TfLite (6 Threads) | 108.448 | 8.0407 | 99.8085 | 276.516 |
TfLite (7 Threads) | 111.563 | 7.24448 | 100.883 | 167.064 |
TfLite (8 Threads) | 115.194 | 8.70729 | 102.432 | 243.71 |
Conclusion
For both networks OpenCV yields the lowest inference time on average, closely followed by CppFlow in both cases. The difference between these two frameworks can mainly be attributed to optimized instructions (such as SIMD): OpenCV uses them because it was compiled from source, whereas CppFlow does not, because the TensorFlow C-API was installed as a binary that has to run on a wide variety of systems.
One advantage of CppFlow over OpenCV is that CppFlow is guaranteed to support all operations that TensorFlow supports. OpenCV does not support some operations, and a model that uses one of them cannot be loaded.
The TensorFlow-Lite implementation is slower for both networks. For the semantic segmentation, TensorFlow-Lite is slower than CppFlow by a factor of about 1.6 with one thread and about 1.3 with five threads. For the classification task the gap is far larger: a factor of about 30 with one thread and still about 15 with four threads. It is also notable that the TensorFlow-Lite runtime scales poorly with the number of threads. Intuitively, the runtime should be inversely proportional to the number of threads, since most of the operations can easily be parallelized (which is the reason for the speed, and thus the popularity, of GPUs).
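The factors quoted above follow directly from the mean runtimes in the result tables. The short sketch below shows the arithmetic; the helper names `slowdown` and `ideal_runtime` are made up for this illustration, and `ideal_runtime` encodes the idealized assumption of perfect scaling with the number of threads, which real hardware will not reach.

```cpp
#include <cassert>

// Slowdown of a candidate relative to a baseline; values above 1 mean slower.
double slowdown(double candidate_ms, double baseline_ms) {
    assert(baseline_ms > 0.0);
    return candidate_ms / baseline_ms;
}

// Runtime expected with `threads` threads under perfect scaling.
double ideal_runtime(double single_thread_ms, int threads) {
    assert(threads > 0);
    return single_thread_ms / threads;
}

// Mean runtimes taken from the tables above:
//   slowdown(21.2997, 13.4546)  is about 1.58  (segmentation, TfLite 1 thread vs CppFlow)
//   slowdown(17.5016, 13.4546)  is about 1.30  (segmentation, TfLite 5 threads vs CppFlow)
//   slowdown(186.294, 6.14033)  is about 30.3  (classification, TfLite 1 thread vs CppFlow)
//   slowdown(92.1016, 6.14033)  is about 15.0  (classification, TfLite 4 threads vs CppFlow)
//   ideal_runtime(186.294, 4)   is about 46.6 ms, while 92.1016 ms was measured
```

Measured against OpenCV instead of CppFlow, the same calculation gives slightly larger factors, because OpenCV has the lower mean runtime for both networks.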
To verify the correctness of the TensorFlow-Lite installation, the same measurements were made on a different computer, using the r1.15 version of TensorFlow and TensorFlow-Lite (commit 590d6ee).
This computer is equipped with a faster CPU, an Intel Core i7-6700, so faster inference of the CNNs was expected.
In absolute terms, the inference took about half the time it took with the same setup on my laptop.
This improvement is primarily due to the faster CPU. Overall, inference with OpenCV or CppFlow on my laptop is still about ten times faster, even though the laptop has the slower CPU.
When researching this behaviour of TensorFlow-Lite, the only similar problems I found were bug reports on GitHub from 2017 and 2018 [6]. The cause of the discrepancy could not be found. As a result, TensorFlow-Lite cannot be recommended for now if an alternative such as OpenCV or CppFlow can be used.
References
- [1] A. L. Hodgkin, A. F. Huxley: A Quantitative Description of Membrane Current and its Application to Conduction and Excitation in Nerve. In: The Journal of Physiology, vol. 117, 1952, pp. 500–544
- [2] Laskar, Md Nasir Uddin, Luis Gonzalo Sánchez Giraldo and Odelia Schwartz. “Correspondence of Deep Neural Networks and the Brain for Visual Textures.” ArXiv abs/1806.02888 (2018): n. pag.
- [3] IJCNN2011 Competition Results: http://benchmark.ini.rub.de/?section=gtsrb&subsection=results
- [4] A. Luckow, M. Cook, N. Ashcraft, E. Weill, E. Djerekarov and B. Vorster, “Deep learning in the automotive industry: Applications and tools,” 2016 IEEE International Conference on Big Data (Big Data), Washington, DC, 2016, pp. 3759-3768.
- [5] NN-SVG: https://alexlenail.me/NN-SVG/
- [6] GitHub: “tflite runs much slower than tfmobile …” https://github.com/tensorflow/tensorflow/issues/21787