Sparse multilayer perceptrons: converting CNNs to MLPs  Part 1
This is a followup post to my last blogpost (From OpenCV to TensorFlow and back: fast neural networks using OpenCV and C++) in which i wrote:
The library is also strictly limited to feedforward MLPs, neither CNNs nor architectures with feedback are possible.
While the part about feedback still holds, this is not the case with the part about CNNs. In this post i will present a way to convert a CNN (or any other feedforward neural network) into a MLP. In the first part i will explain the theoretical background for this conversion, in the second part (which can be read here: Sparse multilayer perceptrons: converting CNNs to MLPs  Part 2) i will compare the computational performances of the equivalent networks, primarily using the OpenCVMLP implementation running on a CPU.
From convolution to matrix products
In recent years impressive results in computer vision have been achieved. Most of them are thanks to the rise of deep convolutional networks, such as AlexNet, which started the CNNBoom in 2012.
In contrast to fully connected MLPs, one value of the output (often refered to as a feature map for CNNs) is not influenced by all inputs, but only by inputs which are close to the output. To exploit this simplification the input data requires some kind of spatial relation (a definition of “close”), such as pixel neighborhood in an image and it requires for features to be local.
Constraints
Without loss of generality i will restrict this derivation to input signals , such as monochrome (grayscale) images. This should suffice to demonstrate the approach while not overcomplicating things.
I will also limit this derivation to networks where the current layer is only connected to the next layer and no layers are skipped for some data (this is for example the case for feature pyramid networks), for these kind of networks the skipped layers need to be enlarger and setup such that they represent the identity.
In the field of neural networks the operation refered to as convolution is often actually implemented as a correlation. This doesn’t change much in this context, as changing from convolution to correlation and vice versa just requires mirroring the kernel on both axis, but certain properties of convolution do not hold (such as the convolution theorem). I will use the correlation instead of the convolution, but still refer to it as convolution to stay compliant with the tensorflow implementation.
To simplify this further i will ignore the topic of boundary conditions, more details can be found on wikipedia. My conversion script (as used in part 2) assumes a zero padding of the image (refered to as “kernel crop” in the article).
Derivation
Convolution
Assume is the input signal, is the convolution kernel (the “filter”) with . To keep things simple i demand an odd sized filter as convolution using an even sized kernel is not clearly defined, furthermore i use a square kernel, this is simply done to reduce the amount of (unnecessary) variables.
The discrete convolution is defined as ( denotes the convolution):
Generalized dot product
We define an operation as
which can be seen as a kind of generalized dot product.
Weight matrix
Furthermore we define a matrix as
Convolution and the generalized dot product
Using the generalized dot product we can see:
This proves, that the convolution is the same as our generalized dot product if the second operand is the weight matrix .
Simplifying the generalized dot product to the normal dot product
To illustrate the conversion we defined the generalized dot product, this allowed us to keep our input data in 2DForm. However most MLPs (for example the OpenCVImplementation which inspired this post) works only on 1DData that is vectors. So we need to convert our data into vectors, for this we define the operation as
as you might have seen i have defined the operation using rowmajor ordering. This is arbitrary, just be sure to use it consistently. If have chosen this order, because thats the order numpy and tensorflow are using.
Next we can show that reshaping and applying the normal dot product is equivalent to the generalized dot product:
Using this operation we can see that the generalized dot product which was equivalent to the convolution can be written as a normal dot product:
Combining everything to form a fully connected layer
We have now all the required theoretical foundations to convert our first CNN into a MLP. For the moment we will ignore the bias and the activation function, see the next section for more information on these topics. With these limitations a layer of our network is fully described by the weightmatrix . This matrix can be seen as a vector of weight vectors:
the input of our layer is the flattened image as seen above:
if we calculate the matrixvector product we can see, that our output is
by comparing each output value with the convolution from above, we can clearly see, how each weight vector needs to be defined:
with and .
In the initial step we flattened the input for our network, this step is only necessary for the first layer as the output of all layers (and as a result the input to all following layers) is already a vector.
Implementing other important operations
Activation functions
Activation functions are applied element wise, as a result they are not influenced by the shape of the data and can be used exactly the same way in the MLP as they are used in the CNN. If you are planning to use the OpenCV implementation keep in mind, that the library only supports the transfer function (see my last blog post (From OpenCV to TensorFlow and back: fast neural networks using OpenCV and C++) for more information on transfer functions in OpenCV).
Biases
A bias is added to every element of the feature map before the activation function. In a CNN the bias is independent of the position in the feature map, this means the bias can be represented as vector full of the same bias value in our MLP. So for a given we can define a as
Pooling
Pooling is used to reduce the size of the feature map (downsampling), for this multiple entrys of the feature map (usually in a grid) are combined to form a single value.

Max pooling: The most commonly used pooling operation is max pooling which simply takes the maximum of all input values. Due to the nonspace invariant properties of this operation can’t be trivially implemented using an MLP. To add max pooling to a model we need to rely on the properties of the ReLUTransfer function (inspired by this reddit post):
Furthermore we can show that we can implement the identity using ReLU
Combining these two properties we can see, that:
This can be easily implemented as a MLP, the first layer can be calculated as (the bias is ):
and the second weight matrix as:
if the second weight matrix is treated as a second layer the output is not the but to avoid this the second weight matrix needs to be combined with the following layer by multiplying the weight matrixes (for more information on combining weight matrixes see the section “Stride” below).
This requires the transfer funtion, which is not fully supported by OpenCV.

Average pooling: Average pooling can be implemented using an MLP. The approach to implement this pooling type, is to add another layer after a convolution layer with a weight matrix that calculates the average:
this method has one major drawback: the activation function is calculated twice instead of once (after the convolution and after the pooling). By using the taylor series of the we can prove, that
for small . This means we only need to compress the inputs to our activation function and expand them later on to avoid the effects of the transfer function. This compression can be achieved by scaling :
then the activation function can be applied. Now the next layer needs to get adapted as well, this can be done by simply scaling the next weight matrix with the inverse of :
This operation can not be combined with the calculation of the feature map into one layer (which would speed up calculations), the reason for this is, that ( is the nonlinear activation function):

Stride: By the definition a stride is not pooling, i will still include a stride in this section because a stride can be used for very efficient downsampling. A stride is simply used to only calculate certain points of the feature map, for the calculation there are two possible approaches, either by adding another layer, as seen above. This still requires scaling the feature map and assuming the as linear for small . The second approach is to calculate the second weight matrix and multiply them to form a combined weight matrix:
Multiple Filters
To be able to learn multiple features per layer multiple filters/kernels are used per layer. As these filters are indepentent they can be represented by different weight matrixes which are combined later on to form one large weight matrix.
Conclusion
In this post i have presented a way to convert most CNNs to sparse MLPs. All of the important operations can be ported but some of them require quite some efford to get them converted.
In the next post i will present a script to convert CNNs implemented in Tensorflow into OpenCVMLPs and evaluate their performance in terms of runtime and time and space complexity.