18 Jun 2020

Comparing the Inference Performance of Neural Networks with and Without Quantization

In my post “Getting started with the Google Coral EdgeTPU using C++” I described that models for the Google Coral EdgeTPU module need to be quantized to 8-bit fixed-point numbers instead of the commonly used 32- or 64-bit floating-point numbers. This reduces the resolution of all values used in the model, that is, the weights, the intermediate results and the predictions. Therefore, the quality of the predictions of the neural network is in most cases worse than without quantization.
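To make the loss of resolution concrete, here is a minimal sketch of an affine 8-bit quantization of a handful of values. The scheme (one scale and one zero point per list of values), the function names `quantize` and `dequantize`, and the example weights are assumptions for illustration; the actual conversion tool may choose its parameters differently.

```python
def quantize(values, num_bits=8):
    """Map floats to unsigned integers using an affine (scale, zero point) scheme."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # The represented range is extended so that it always contains 0.0.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    quantized = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Map the integers back to (approximate) floats."""
    return [scale * (q - zero_point) for q in quantized]


# Hypothetical weights, not taken from the actual model.
weights = [-0.73, -0.1, 0.0, 0.004, 0.25, 1.2]
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q)
print(scale, zero_point, max_error)
```

In this example every restored value is within half a quantization step (`scale / 2`) of the original. The step size grows with the range of the values, so a single large weight or activation reduces the resolution available for all others.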

In this blog post I will examine the differences in performance for a convolutional neural network using different metrics. This helps not only to understand the direct impact on the accuracy of the network but also to further characterize the differences in the outputs. The TensorFlow team themselves examined the effect of quantization on four widely used convolutional neural networks but only reported the accuracy as a metric (see: TensorFlow - Model optimization).


The network used is the same classification network as listed in the post “Comparing the inference speed of different tensorflow implementations”. To summarize the structure: it consists of four convolutional layers, each followed by a max-pooling operation and a ReLU nonlinearity. After the convolutional layers there are two fully connected layers with a softmax prediction at the output. The input is of size , the output consists of 30 classes. One of the classes (class 29) is the NO_CLASS class, which consists of samples that do not show any of the valid classes. See the appendix for the full list of classes and their semantic interpretation.

The models with and without quantization are exactly the same model: the quantized model was converted from the normal model using post-training quantization. The complete dataset consists of 53299 images. Using this dataset, two different models have been trained. For the first model most of the samples have been used for training (to be precise, 47969 of 53299), and all samples have been used for evaluation. This is how the model is used for the actual deployment. For the second model the dataset has been divided into a training set of size 38867 and a verification set of size 10114. The samples in the two subsets are completely different, as they were recorded at different times. Only the training set is used for training and only the verification set for evaluation. This training procedure differs from the one used for deployment, but it guarantees that the results are not influenced by overfitting. In the next sections the first setup is referred to as the full configuration and the second as the verification configuration.

The complete code used for testing can be found on my GitHub page: github.com/aul12/NeuralNetworkQuantizationPerformanceBenchmark.



The first metric is the accuracy of the classifier on the dataset. For this, the prediction is calculated as the class with the highest probability. A prediction is correct if it is the same as the label. The accuracy is then calculated as the number of correct predictions divided by the total number of samples.
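The definition above can be sketched in a few lines. The function names and the toy data are hypothetical and only serve to illustrate the calculation; they are not taken from the benchmark code.

```python
def predict(probabilities):
    """Return the index of the class with the highest probability."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


def accuracy(outputs, labels):
    """Fraction of samples whose predicted class equals the label."""
    correct = sum(predict(p) == label for p, label in zip(outputs, labels))
    return correct / len(labels)


# Hypothetical toy data: three classes, four samples.
outputs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
]
labels = [0, 1, 2, 1]
print(accuracy(outputs, labels))  # 3 of the 4 predictions match the label
```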

Model                | Accuracy (Full) | Accuracy (Verification)
Without Quantization | 99.85%          | 99.83%
With Quantization    | 96.93%          | 58.14%

The performance of the quantized model is about three percentage points worse than the performance of the normal model when comparing on the full configuration. This is a slightly better relative performance of the quantized model than the numbers reported by the TensorFlow team. When comparing using the verification configuration the results are vastly different: the floating-point model yields a similar result, but the accuracy of the quantized model drops from 96.93% to 58.14%, a decrease of 38.79 percentage points.

In addition to the accuracy, a confusion matrix has been created for both models and configurations; see the appendix for the full matrices and the semantic interpretation of the label numbers. As already implied by the accuracy, most of the entries of the confusion matrices are located on the main diagonal. Without quantization the incorrect classifications are distributed evenly throughout the confusion matrix for both configurations.
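As a reminder of how such a matrix is built, here is a minimal sketch. The convention (rows are true labels, columns are predictions) and the toy data are assumptions for illustration; the matrices in the appendix may use the transposed layout.

```python
def confusion_matrix(labels, predictions, num_classes):
    """matrix[i][j] counts the samples with true label i that were predicted as j."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for label, prediction in zip(labels, predictions):
        matrix[label][prediction] += 1
    return matrix


# Hypothetical toy data with three classes.
labels = [0, 0, 1, 1, 2, 2, 2]
predictions = [0, 0, 1, 2, 2, 2, 0]
matrix = confusion_matrix(labels, predictions, num_classes=3)
for row in matrix:
    print(row)

# The entries on the main diagonal are the correct classifications,
# so their sum divided by the number of samples is the accuracy.
diagonal = sum(matrix[i][i] for i in range(3))
print(diagonal / len(labels))
```

Off-diagonal entries show which pairs of classes are confused, which is the information used for the error analysis that follows.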

For the quantized model on the full dataset there are two main error sources:

  • The different speed limit signs cannot be differentiated as clearly as by the non-quantized model
  • In comparison to the non-quantized model, more images of type NO_CLASS are classified as a relevant class

The additional problems using the verification configuration are:

  • Left and right arrows cannot be differentiated as well as before
  • Sharp-Turn-Left and -Right cannot be differentiated as well as before
  • No passing start/end is often classified as a speed limit, probably due to the red circle on the signs

Output probabilities

The softmax activation function used in the last layer yields a valid probability density function (pdf) over the classes. This means that all values are in the range $[0, 1]$ and the sum over all 30 values is 1. The classification result is determined as the class with the maximal probability. In addition to this hard classification, the result can also be used for further filtering, especially in applications in which the same object gets classified multiple times.
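These two properties follow directly from the definition of the softmax function, as the following sketch shows. The logits are hypothetical values, and subtracting the maximum is a common numerical-stability trick, not necessarily what the network implementation does.

```python
import math


def softmax(logits):
    """Convert arbitrary real-valued scores to probabilities."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores of the last layer for four classes.
logits = [2.0, 1.0, -1.0, 0.5]
probabilities = softmax(logits)
print(probabilities)
print(sum(probabilities))
```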

To compare the quality of the pdfs, the probabilities are evaluated. A good pdf clearly shows the winner but additionally provides certainty information, i.e. the output pdf is not a one-hot distribution. For the comparison two different sets of values are compared:

  • the certainty, that is, the probability of a correct classification
  • all output values, that is, both the certainty and the probabilities for all other classes

For this evaluation the full configuration is used unless specified otherwise. The differences between the configurations are small. For both models, a histogram is calculated over each of these two sets of values. These histograms are given below; the raw values are listed further below.
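The two sets of values and their histograms can be sketched as follows. This assumes that the certainty is the probability assigned to the true label and that ten equally wide bins are used; both choices, as well as the toy data, are assumptions for illustration.

```python
def histogram(values, num_bins=10):
    """Count values from [0, 1] in num_bins equally wide bins."""
    counts = [0] * num_bins
    for v in values:
        index = min(int(v * num_bins), num_bins - 1)  # v == 1.0 goes into the last bin
        counts[index] += 1
    return counts


# Hypothetical outputs for three samples and three classes.
outputs = [
    [0.97, 0.02, 0.01],
    [0.04, 0.93, 0.03],
    [0.42, 0.33, 0.25],
]
labels = [0, 1, 2]

# Set 1: the certainty, here the probability assigned to the true label.
certainties = [p[label] for p, label in zip(outputs, labels)]
# Set 2: all output values of all samples.
all_values = [v for p in outputs for v in p]

print(histogram(certainties))
print(histogram(all_values))
```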

[Figure panels: Certainty Values with Quantization; Certainty Values without Quantization]

[Figure panels: Output Values with Quantization; Output Values without Quantization]

Range | Certainty (quantized) | Certainty (float) | Output (quantized) | Output (float)

The results are surprising: the non-quantized model in general yields more one-hot-like results than the quantized model, which yields a more uniform distribution. Considering the absolute scale, it can also be seen that the actual differences are relatively small; both distributions look very similar, especially when only considering the histograms.

To further characterize the pdfs, the (Shannon) entropy is calculated for every sample, and the average entropy over all samples is calculated for both models. The entropy can be used as a measure of the uniformity of a distribution. A completely uniform distribution for a pdf of dimension $n$ yields an entropy of $\log(n)$; for a one-hot distribution the entropy is $0$. To make the score independent of the dimension of the output space, the entropy is normalized by a factor of $\frac{1}{\log(n)}$, so that the score for a uniform distribution is $1$.
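A minimal sketch of this score is given below. The function names and the example distributions are hypothetical; terms with probability zero are skipped, following the usual convention that they contribute nothing to the entropy. The base of the logarithm cancels out in the normalized score.

```python
import math


def normalized_entropy(probabilities):
    """Shannon entropy of a distribution, divided by the entropy of a uniform one."""
    n = len(probabilities)
    entropy = -sum(p * math.log(p) for p in probabilities if p > 0.0)
    return entropy / math.log(n)


def average_normalized_entropy(outputs):
    """Mean of the normalized entropy over all samples."""
    return sum(normalized_entropy(p) for p in outputs) / len(outputs)


# Hypothetical example distributions over four classes.
uniform = [0.25, 0.25, 0.25, 0.25]
one_hot = [1.0, 0.0, 0.0, 0.0]
nearly_one_hot = [0.97, 0.01, 0.01, 0.01]

for pdf in (uniform, one_hot, nearly_one_hot):
    print(normalized_entropy(pdf))
print(average_normalized_entropy([uniform, one_hot]))
```

The uniform distribution scores 1 and the one-hot distribution scores 0 (up to floating-point rounding); every other distribution lies in between.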

The average entropy is given in the table below:

Model Average normalized Entropy (full) Average normalized entropy (verification)
With Quantization
Without Quantization

These numbers imply the same result as the histograms: the output pdf of the non-quantized model is strictly one-hot, while the quantized model has a more uniform distribution. Still, the differences are rather small.


The results are twofold. First, the accuracy of the quantized classifier is reduced drastically when evaluating on a different dataset. Depending on the application, the results can be sufficient, especially considering that the errors are predictable, but in general the loss in accuracy renders the quantized model unusable.

Secondly, the output pdf is no longer one-hot encoded but yields softer labels. For most applications the difference is negligible; even the outputs of the quantized model are nearly one-hot encoded.


Confusion Matrices


The classes are defined by the Carolo-Cup-Regulations, see the official rules for more information: wiki.ifr.ing.tu-bs.de/carolocup/system/files/Master-Cup%20Regulations.pdf

Name Label Number