10 Jun 2019

Sparse multilayer perceptrons: converting CNNs to MLPs - Part 1

This is a follow-up post to my last blog-post (From OpenCV to TensorFlow and back: fast neural networks using OpenCV and C++) in which i wrote:

The library is also strictly limited to feed-forward MLPs, neither CNNs nor architectures with feedback are possible.

While the part about feedback still holds, this is not the case with the part about CNNs. In this post i will present a way to convert a CNN (or any other feedforward neural network) into a MLP. In the first part i will explain the theoretical background for this conversion, in the second part (which can be read here: Sparse multilayer perceptrons: converting CNNs to MLPs - Part 2) i will compare the computational performances of the equivalent networks, primarily using the OpenCV-MLP implementation running on a CPU.

From convolution to matrix products

In recent years impressive results in computer vision have been achieved. Most of them are thanks to the rise of deep convolutional networks, such as AlexNet, which started the CNN-Boom in 2012.

In contrast to fully connected MLPs, one value of the output (often refered to as a feature map for CNNs) is not influenced by all inputs, but only by inputs which are close to the output. To exploit this simplification the input data requires some kind of spatial relation (a definition of “close”), such as pixel neighborhood in an image and it requires for features to be local.

Constraints

Without loss of generality i will restrict this derivation to input signals $\in \mathbb{R}^{n \times m}$ , such as monochrome (grayscale) images. This should suffice to demonstrate the approach while not overcomplicating things.

I will also limit this derivation to networks where the current layer is only connected to the next layer and no layers are skipped for some data (this is for example the case for feature pyramid networks), for these kind of networks the skipped layers need to be enlarger and setup such that they represent the identity.

In the field of neural networks the operation refered to as convolution is often actually implemented as a correlation. This doesn’t change much in this context, as changing from convolution to correlation and vice versa just requires mirroring the kernel on both axis, but certain properties of convolution do not hold (such as the convolution theorem). I will use the correlation instead of the convolution, but still refer to it as convolution to stay compliant with the tensorflow implementation.

To simplify this further i will ignore the topic of boundary conditions, more details can be found on wikipedia. My conversion script (as used in part 2) assumes a zero padding of the image (refered to as “kernel crop” in the article).

Derivation

Convolution

Assume $M \in \mathbb{R}^{n \times m}$ is the input signal, $K \in \mathbb{R}^{(2 \cdot k + 1)\times(2 \cdot k + 1)}$ is the convolution kernel (the “filter”) with $n,m,k \in \mathbb{N}$ . To keep things simple i demand an odd sized filter as convolution using an even sized kernel is not clearly defined, furthermore i use a square kernel, this is simply done to reduce the amount of (unnecessary) variables.

The discrete convolution is defined as ( $\ast$ denotes the convolution):

$(M \ast K)(y,x) = \sum_{i=-k}^k \sum_{j=-k}^k M(y+i,x+j) \cdot K(i+k, j+k)$

Generalized dot product

We define an operation $\text{dot}: \mathbb{R}^{n \times m} \times \mathbb{R}^{n \times m} \to \mathbb{R}$ as

$\text{dot}(A,B) = \sum_{i=0}^n\sum_{j=0}^m A(i,j) \cdot B(i,j)$

which can be seen as a kind of generalized dot product.

Weight matrix

Furthermore we define a matrix $W_{\upsilon \chi} \in \mathbb{R}^{n \times m}$ as

$W_{\upsilon \chi}(\chi, \upsilon) = \begin{cases} K(y-\upsilon+k,x - \chi+k) & \chi-k \leq x \leq \chi+k \land \upsilon-k \leq y \leq \upsilon+k \\ 0 & \text{otherwise} \end{cases}$

Convolution and the generalized dot product

Using the generalized dot product we can see:

$\text{dot}(M, W_{\upsilon \chi}) = \sum_{i=0}^n \sum_{j=0}^m M(i.j) \cdot W_{\upsilon \chi}(i,j)$ $=\sum_{i=0}^{\upsilon-k-1} \sum_{j=0}^m M(i.j) \cdot W_{\upsilon \chi}(i,j) + \sum_{i=\upsilon-k}^{\upsilon+k} \sum_{j=0}^m M(i.j) \cdot W_{\upsilon \chi}(i,j)+ \sum_{i=\upsilon+k+1}^n \sum_{j=0}^m M(i.j) \cdot W_{\upsilon \chi}(i,j)$ $=\sum_{i=0}^{\upsilon-k-1} \sum_{j=0}^m M(i.j) \cdot 0 + \sum_{i=\upsilon-k}^{\upsilon+k} \sum_{j=0}^m M(i.j) \cdot W_{\upsilon \chi}(i,j)+ \sum_{i=\upsilon+k+1}^n \sum_{j=0}^m M(i.j) \cdot 0$ $=\sum_{i=\upsilon-k}^{\upsilon+k} \sum_{j=0}^m M(i.j) \cdot W_{\upsilon \chi}(i,j)$ $=\sum_{i=\upsilon-k}^{\upsilon+k}\left( \sum_{j=0}^{\chi-k-1} M(i.j) \cdot W_{\upsilon \chi}(i,j) + \sum_{j=\chi-k}^{\chi+k} M(i.j) \cdot W_{\upsilon \chi}(i,j) + \sum_{j=\chi+k+1}^m M(i.j) \cdot W_{\upsilon \chi}(i,j) \right)$ $=\sum_{i=\upsilon-k}^{\upsilon+k}\left( \sum_{j=0}^{\chi-k-1} M(i.j) \cdot 0 + \sum_{j=\chi-k}^{\chi+k} M(i.j) \cdot W_{\upsilon \chi}(i,j) + \sum_{j=\chi+k+1}^m M(i.j) \cdot 0 \right)$ $=\sum_{i=\upsilon-k}^{\upsilon+k}\sum_{j=\chi-k}^{\chi+k} M(i.j) \cdot W_{\upsilon \chi}(i,j)$ $=\sum_{i=\upsilon-k}^{\upsilon+k}\sum_{j=\chi-k}^{\chi+k} M(i.j) \cdot K(i-\upsilon+k,j-\chi+k)$ $=\sum_{i=-k}^{k}\sum_{j=-k}^{+k} M(i+\upsilon, j+\chi) \cdot K(i+k,j+k)$ $=(M \ast K)(\upsilon, \chi)$

This proves, that the convolution is the same as our generalized dot product if the second operand is the weight matrix $W_{\upsilon \chi}$ .

Simplifying the generalized dot product to the normal dot product

To illustrate the conversion we defined the generalized dot product, this allowed us to keep our input data in 2D-Form. However most MLPs (for example the OpenCV-Implementation which inspired this post) works only on 1D-Data that is vectors. So we need to convert our data into vectors, for this we define the operation $\text{flatten}: \mathbb{R}^{n \times m} \to \mathbb{R}^{n \cdot m}$ as

$\text{flatten}(M)(i) = x(i) = M(\text{floor}(i / n), i \mod n)$

as you might have seen i have defined the operation using row-major ordering. This is arbitrary, just be sure to use it consistently. If have chosen this order, because thats the order numpy and tensorflow are using.

Next we can show that reshaping and applying the normal dot product is equivalent to the generalized dot product:

$\text{dot}(A,B) = \sum_{i=0}^n\sum_{j=0}^m A(i,j) \cdot B(i,j)$ $= \sum_{j=0}^m\sum_{i=0}^n A(i,j) \cdot B(i,j)$ $= \sum_{i=0}^{n \cdot m} A(\text{floor}(i / n), i \mod n) \cdot B(\text{floor}(i / n), i \mod n)$ $= \sum_{i=0}^{n \cdot m} \text{flatten}(A)(i) \cdot \text{flatten}(B)(i)$ $= \text{flatten}(A)^\text{T} \cdot \text{flatten}(B)$

Using this operation we can see that the generalized dot product which was equivalent to the convolution can be written as a normal dot product:

$(M \ast K)(\upsilon, \chi) = \text{dot}(M, W_{\upsilon \chi}) = \text{flatten}(M)^\text{T} \cdot \text{flatten}(W_{\upsilon \chi})$

Combining everything to form a fully connected layer

We have now all the required theoretical foundations to convert our first CNN into a MLP. For the moment we will ignore the bias and the activation function, see the next section for more information on these topics. With these limitations a layer of our network is fully described by the weight-matrix $W \in \mathbb{R}^{(n \cdot m) \times (n \cdot m)}$ . This matrix can be seen as a vector of weight vectors:

$W = \begin{pmatrix} w_1^\text{T} \\ w_2^\text{T} \\ \vdots \\ w_n^\text{T} \\ \end{pmatrix}$

the input of our layer $a \in \mathbb{R}^{n \cdot m}$ is the flattened image as seen above:

$a = \text{flatten}(M)$

if we calculate the matrix-vector product we can see, that our output $a' \in \mathbb{R}^{n \cdot m}$ is

$a' = W \cdot a = \begin{pmatrix} w_1^\text{T} \cdot a \\ w_2^\text{T} \cdot a\\ \vdots \\ w_n^\text{T} \cdot a \\ \end{pmatrix} = \begin{pmatrix} w_1^\text{T} \cdot \text{flatten}(M) \\ w_2^\text{T} \cdot \text{flatten}(M) \\ \vdots \\ w_n^\text{T} \cdot \text{flatten}(M) \\ \end{pmatrix}$

by comparing each output value with the convolution from above, we can clearly see, how each weight vector needs to be defined:

$w_i = \text{flatten}(W_{\upsilon \chi})$

with $\upsilon = \text{floor}(i/n)$ and $\chi = i \mod n$ .

In the initial step we flattened the input for our network, this step is only necessary for the first layer as the output of all layers (and as a result the input to all following layers) is already a vector.

Implementing other important operations

Activation functions

Activation functions are applied element wise, as a result they are not influenced by the shape of the data and can be used exactly the same way in the MLP as they are used in the CNN. If you are planning to use the OpenCV implementation keep in mind, that the library only supports the $\tanh$ transfer function (see my last blog post (From OpenCV to TensorFlow and back: fast neural networks using OpenCV and C++) for more information on transfer functions in OpenCV).

Biases

A bias is added to every element of the feature map before the activation function. In a CNN the bias is independent of the position in the feature map, this means the bias $b$ can be represented as vector full of the same bias value in our MLP. So for a given $b_\text{CNN} \in \mathbb{R}$ we can define a $b_\text{MLP} \in \mathbb{R}^{n \cdot m}$ as

$b_\text{MLP} = \begin{pmatrix} b_\text{CNN} \\ b_\text{CNN} \\ \vdots \\ b_\text{CNN} \end{pmatrix}$

Pooling

Pooling is used to reduce the size of the feature map (downsampling), for this multiple entrys of the feature map (usually in a $2 \times 2$ grid) are combined to form a single value.

Max pooling: The most commonly used pooling operation is max pooling which simply takes the maximum of all input values. Due to the non-space invariant properties of this operation can’t be trivially implemented using an MLP. To add max pooling to a model we need to rely on the properties of the ReLU-Transfer function (inspired by this reddit post):
$\text{ReLU}(a-b) + b = \max(a,b)$
Furthermore we can show that we can implement the identity using ReLU
$\text{ReLU}(a) - \text{ReLU}(-a) = a$
Combining these two properties we can see, that:
$\max(a,b) = \text{ReLU}(a-b) + \text{ReLU}(b) - \text{ReLU}(-b)$
This can be easily implemented as a MLP, the first layer can be calculated as (the bias is $\vec{0}$ ):
$\begin{pmatrix}x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 1 & -1 \\ 0 & 1 \\ 0 & -1 \end{pmatrix} \cdot \begin{pmatrix} a \\ b \end{pmatrix}$
and the second weight matrix as:
$\max(a,b) = \begin{pmatrix}1 & 1 & -1\end{pmatrix} \cdot \text{ReLU} \left( \begin{pmatrix}x_1 \\ x_2 \\ x_3 \end{pmatrix} \right)$
if the second weight matrix is treated as a second layer the output is not the $\max(a,b)$ but $\text{ReLU}(\max(a,b))$ to avoid this the second weight matrix needs to be combined with the following layer by multiplying the weight matrixes (for more information on combining weight matrixes see the section “Stride” below).

This requires the $\text{ReLU}$ transfer funtion, which is not fully supported by OpenCV.
Average pooling: Average pooling can be implemented using an MLP. The approach to implement this pooling type, is to add another layer after a convolution layer with a weight matrix that calculates the average:
$W = \begin{pmatrix} 0.5 & 0.5 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & 0.5 & 0.5 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & 0.5 & 0.5 \end{pmatrix}$
this method has one major drawback: the activation function is calculated twice instead of once (after the convolution and after the pooling). By using the taylor series of the $\tanh$ we can prove, that
$x \approx \tanh(x)$
for small $x$ . This means we only need to compress the inputs to our activation function and expand them later on to avoid the effects of the transfer function. This compression can be achieved by scaling $W$ :
$\tilde{W} = a \cdot W;\ \ 0 < a \ll 1$
then the activation function can be applied. Now the next layer needs to get adapted as well, this can be done by simply scaling the next weight matrix $W'$ with the inverse of $a$ :
$\tilde{W'} = \frac{1}{a} \cdot W'$
This operation can not be combined with the calculation of the feature map into one layer (which would speed up calculations), the reason for this is, that ( $f$ is the nonlinear activation function):
$\frac{f(x) + f(y)}{2} \neq f\left(\frac{x + y}{2}\right)$
Stride: By the definition a stride $>1$ is not pooling, i will still include a stride in this section because a stride can be used for very efficient downsampling. A stride is simply used to only calculate certain points of the feature map, for the calculation there are two possible approaches, either by adding another layer, as seen above. This still requires scaling the feature map and assuming the $\tanh$ as linear for small $x$ . The second approach is to calculate the second weight matrix and multiply them to form a combined weight matrix:
$\tilde{W} = W \cdot W_\text{stride}$

Multiple Filters

To be able to learn multiple features per layer multiple filters/kernels are used per layer. As these filters are indepentent they can be represented by different weight matrixes which are combined later on to form one large weight matrix.

Conclusion

In this post i have presented a way to convert most CNNs to sparse MLPs. All of the important operations can be ported but some of them require quite some efford to get them converted.

In the next post i will present a script to convert CNNs implemented in Tensorflow into OpenCV-MLPs and evaluate their performance in terms of runtime and time and space complexity.