20 Jun 2019

Sparse multilayer perceptrons: converting CNNs to MLPs - Part 2

This is the second part of my series about converting CNNs to MLPs; the first part can be read here. In this post I will first study the time and space complexity of the converted CNN, and then I will try to verify these results using actual CNNs converted to OpenCV-MLP models.

Complexity of CNNs and MLPs

As in part 1, I will simplify this derivation to grayscale images. I will use big O notation to compare the asymptotic behaviour.

Time complexity

For a single element of the feature map, the convolution requires as many multiplications as there are elements in the kernel, and one addition fewer. This calculation needs to be done for every element of the feature map, so the time complexity of the discrete convolution is proportional to the number of kernel elements times the number of feature map elements.

For the fully connected network, each element of the feature map is the result of the dot product of two vectors whose size equals the number of input elements. This takes one multiplication per vector element and, again, one addition fewer. Repeated for every output element, the time complexity is proportional to the number of input elements times the number of output elements.
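Written out as formulas, the two costs look as follows. The symbols are my own assumption for this sketch and not necessarily the notation of part 1: the input has $n \times n$ pixels, the kernel $k \times k$ elements and the feature map $m \times m$ elements.

```latex
\begin{align*}
T_{\text{conv}} &= \mathcal{O}\left(m^2 \cdot k^2\right) \\
T_{\text{MLP}}  &= \mathcal{O}\left(m^2 \cdot n^2\right)
\end{align*}
```

Since the kernel is usually much smaller than the image ($k \ll n$), the convolution is cheaper by a factor of roughly $n^2 / k^2$ under these assumptions.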

Space complexity

The kernel of the discrete convolution has a fixed size and each filter has a single bias value, so the space complexity of a single filter in a single layer depends only on the kernel size. In particular, it is independent of the input size.

For an MLP, the weight matrix has one entry per pair of input and output elements, and the bias has one entry per output element. The space complexity of a single filter in a single layer therefore grows with the product of input size and output size.

For both architectures, the space complexity scales linearly with the number of filters as well as the number of layers.
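As a sketch, the space requirements of a single filter in a single layer can be written down like this. The symbols are again my own assumption: an $n \times n$ input, a $k \times k$ kernel and an $m \times m$ feature map.

```latex
\begin{align*}
S_{\text{conv}} &= k^2 + 1 = \mathcal{O}\left(k^2\right) \\
S_{\text{MLP}}  &= n^2 \cdot m^2 + m^2 = \mathcal{O}\left(n^2 \cdot m^2\right)
\end{align*}
```

The first summand is the kernel or weight matrix, the second one the bias. For the MLP the bias grows from a single value to one value per output element.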

One thing to keep in mind is that the conversion to an MLP requires adapting pooling layers to normal fully connected layers. While a pooling layer requires no learned parameters, the converted layer is another fully connected layer, with a parameter count that scales like that of any other fully connected layer.

Comparison of the numbers using an actual CNN

This comparison is based on a small CNN consisting of 3 convolutional layers with a fixed kernel size; downsampling is achieved by using a stride of 3. The first layer consists of 32 filters, the following layers of 64 filters each. The convolutional layers are followed by a dense layer of size 1024 and the output layer of size 29.

Structure of the CNN

This network is used for a classification task on 3-channel images and is comparable in size to networks used for CIFAR-10 or MNIST.

Number of calculations

For this section I will only compare the number of operations required in the convolutional layers, as the number of operations for the dense layers is the same for both representations. The calculations are based on the formulas given above.

The convolutions in the CNN require far fewer multiply-accumulate (MAC) operations than the same layers after the conversion to an MLP. The number of operations increases substantially, and this is without paying attention to the now larger bias.

Number of weights

In this section I compare the number of weights. As in the last section, I will limit this to the weights in the convolutional layers. In contrast to the last section, these numbers are the actual numbers from the CNN presented above, not just estimates.

The converted MLP has many times more trainable weights than the CNN.
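To play with the numbers yourself, the counting rules from the complexity section can be put into a small helper. This is only a sketch under assumptions: the input resolution of 81 × 81 pixels, the kernel size of 3 and the missing padding are placeholders chosen by me and not the values of the actual network. Only the filter counts, the stride of 3 and the 3 input channels are taken from the description above.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Output width of a convolution without padding (an assumption)."""
    return (input_size - kernel_size) // stride + 1


def compare_layers(input_size, in_channels, layers):
    """Return (cnn_macs, mlp_macs, cnn_weights, mlp_weights), summed over all layers.

    `layers` is a list of (filters, kernel_size, stride) tuples.
    """
    cnn_macs = mlp_macs = cnn_weights = mlp_weights = 0
    size, channels = input_size, in_channels
    for filters, kernel_size, stride in layers:
        out_size = conv_output_size(size, kernel_size, stride)
        inputs = size * size * channels
        outputs = out_size * out_size * filters
        # CNN: every output element needs one MAC per kernel element.
        cnn_macs += outputs * kernel_size * kernel_size * channels
        # CNN: one kernel per (input channel, filter) pair, one bias per filter.
        cnn_weights += kernel_size * kernel_size * channels * filters + filters
        # MLP: every output element is a dot product over the whole input.
        mlp_macs += outputs * inputs
        # MLP: dense weight matrix plus one bias per output element.
        mlp_weights += outputs * inputs + outputs
        size, channels = out_size, filters
    return cnn_macs, mlp_macs, cnn_weights, mlp_weights


# Hypothetical configuration: 81x81 input and kernel size 3 are placeholders.
layers = [(32, 3, 3), (64, 3, 3), (64, 3, 3)]
cnn_macs, mlp_macs, cnn_weights, mlp_weights = compare_layers(81, 3, layers)
print("MAC ratio (MLP / CNN):   ", mlp_macs / cnn_macs)
print("weight ratio (MLP / CNN):", mlp_weights / cnn_weights)
```

With other input sizes or kernel sizes the absolute numbers change, but the weight ratio should stay well above the MAC ratio, because the CNN reuses each kernel for every output element.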


To summarize, a CNN is more efficient regarding both operations and space. This was expected, as most of our weight matrices consist of zeroes or a repeating filter. It is interesting to note that the difference in the number of weights is much larger than the difference in MAC operations; this is due to the reuse of the same kernel in CNNs.
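To see why the weight matrix is mostly zeroes and repeated kernel values, here is a minimal sketch that builds the dense matrix for a single-channel convolution and checks it against a direct implementation. It is not the code of the conversion script; the function names are made up for this example, and it assumes no padding and a row-by-row flattening of the image.

```python
def conv_as_dense(kernel, input_size, stride):
    """Build the dense weight matrix equivalent to a 2D convolution.

    Row r holds the weights of output element r. Like in most deep learning
    frameworks, the "convolution" is really a cross-correlation.
    """
    k = len(kernel)
    out_size = (input_size - k) // stride + 1
    matrix = [[0] * (input_size * input_size) for _ in range(out_size * out_size)]
    for oy in range(out_size):
        for ox in range(out_size):
            row = matrix[oy * out_size + ox]
            for ky in range(k):
                for kx in range(k):
                    iy = oy * stride + ky
                    ix = ox * stride + kx
                    # The same kernel value is copied into every row.
                    row[iy * input_size + ix] = kernel[ky][kx]
    return matrix


def convolve(image, kernel, stride):
    """Direct strided convolution without padding, returned as a flat list."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    return [
        sum(kernel[ky][kx] * image[oy * stride + ky][ox * stride + kx]
            for ky in range(k) for kx in range(k))
        for oy in range(out_size) for ox in range(out_size)
    ]


def dense(matrix, vector):
    """Fully connected layer without bias: one dot product per row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


image = [[4 * y + x for x in range(4)] for y in range(4)]
kernel = [[1, 2], [3, 4]]
matrix = conv_as_dense(kernel, input_size=4, stride=2)
flat_image = [pixel for row in image for pixel in row]

assert dense(matrix, flat_image) == convolve(image, kernel, stride=2)
zeros = sum(row.count(0) for row in matrix)
print(zeros, "of", len(matrix) * len(matrix[0]), "matrix entries are zero")
```

Even in this tiny example most of the matrix is empty, and the non-zero entries are just copies of the four kernel values.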

Comparison of the performance

The conversion script

The complete script is available on my GitHub page at github.com/aul12/Cnn2Mlp. It reads a TensorFlow .ckpt file and converts it into an OpenCV-MLP file; for more information on the usage, see my blog post From OpenCV to TensorFlow and back: fast neural networks using OpenCV and C++. Before you use the script, a little disclaimer: this script is only intended as a proof of concept and comes with some limitations:

  • the size of each input and the stride of the convolution is hard-coded in the script
  • stride is the only supported way of downsampling
  • every convolutional layer needs to have conv in its name to be converted
  • every layer needs to consist of a layer_name/kernel and a layer_name/bias tensor
  • the script orders the layers by name to get their order in the network, so keep this in mind when naming your layers (this primarily works because conv comes before dense in the alphabet)
  • the script produces multiple output files, named part_XY.xml, to reduce its memory usage; these files need to be concatenated, which can be done by running cat part_*.xml > all.xml
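The ordering by name is easy to get wrong, so here is a small sketch of what a plain alphabetical sort does. The layer names are hypothetical and the script may sort slightly differently; the example only illustrates the pitfall.

```python
# Hypothetical layer names, as they might appear in a checkpoint.
layer_names = ["dense_1", "conv_2", "dense_2", "conv_1", "conv_3"]

# Alphabetical order puts all conv layers before the dense layers.
ordered = sorted(layer_names)
print(ordered)

# Pitfall: without zero-padding, "conv_10" sorts before "conv_2".
tricky = sorted(["conv_2", "conv_10"])
print(tricky)
```

If you have ten or more layers of one kind, zero-padded indices such as conv_02 and conv_10 keep the alphabetical order identical to the order in the network.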


I tried to compare the results using the OpenCV-MLP implementation. Unfortunately, OpenCV is not able to load the model, probably due to the large size of the generated XML file.

This limits the evaluation to estimating the performance by extrapolating from the CNN. On the test device, the CNN needs about 10 ms per inference when running on the CPU (i7-6770HQ). Given the number of operations, the MLP is likely to need about 1 s for one inference.


In this series of blog posts I have shown that it is theoretically possible to convert CNNs to MLPs, but due to the huge size of the MLP the performance suffers dramatically. This can be seen as an interesting demonstration of why CNNs became so popular for computer vision tasks.
