Convolutions and max layers
A great improvement in image classification accuracy was achieved with the invention of convolutional layers, demonstrated early on the MNIST database.

While the previous fully-connected layers perform a computation over all input values (all pixels, in the case of an image), a 2D convolution layer considers, for each output unit, only a small patch, window, or receptive field of N x N pixels of the 2D input image. The dimensions of the patch are the kernel dimensions, N is the kernel size, and the coefficients/parameters form the kernel.
At each position on the input image, the kernel produces a scalar, and the values at all positions form a matrix (2D tensor) called a feature map. Convolving the kernel over the input image as a sliding window creates a new output image. The stride of the kernel defines the number of pixels by which to shift the patch/window over the image: with a stride of 2, the convolution with the kernel is computed every 2 pixels.
For example, on a 224 x 224 input image, we get the following:
- A 2x2 kernel with stride 1 outputs a 223 x 223 feature map
- A 3x3 kernel with stride 1 outputs a 222 x 222 feature map
To keep the output feature map the same dimensions as the input image, there is a type of zero-padding called same or half that works as follows:
- Add a row and a column of zeros at the end of the input image in the case of a 2x2 kernel with stride 1
- Add two rows and two columns of zeros, one on each side of the input image, both vertically and horizontally, in the case of a 3x3 kernel with stride 1
So, the output dimensions are the same as the original ones, that is, a 224 x 224 feature map.
With zero-padding:
- A 2x2 kernel with stride 2 outputs a 112 x 112 feature map
- A 3x3 kernel with stride 2 outputs a 112 x 112 feature map
Without zero-padding, it gets more complicated:
- A 2x2 kernel with stride 2 outputs a 112 x 112 feature map
- A 3x3 kernel with stride 2 outputs a 111 x 111 feature map
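These sizes follow from the standard formula floor((W - K + 2P) / S) + 1, where W is the input size, K the kernel size, S the stride, and P the padding added on each side (note that the same/half padding for even kernels described above is asymmetric, so it is not covered by this symmetric form). A minimal Python helper, with a name of our choosing, reproduces the numbers above:

def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    # standard formula: floor((W - K + 2P) / S) + 1
    return (input_size - kernel_size + 2 * padding) // stride + 1

print(conv_output_size(224, 2))                       # 223
print(conv_output_size(224, 3))                       # 222
print(conv_output_size(224, 3, stride=2, padding=1))  # 112
print(conv_output_size(224, 3, stride=2))             # 111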
Note that the kernel dimensions and strides can differ for each dimension. In that case, we speak of the kernel width, kernel height, stride width, and stride height.
In one convolutional layer, it is possible to output multiple feature maps, each computed with a different kernel (and different kernel weights) and representing one feature. The terms outputs, neurons, kernels, features, feature maps, units, and output channels are used interchangeably to count these different convolutions. To be precise, neuron usually refers to a specific position within a feature map, kernel refers to the weights themselves, and the other terms refer to the result of the convolution operation. Since their numbers coincide, these words often describe the same thing. I'll use the words channels, outputs, and features.
The usual convolution operators apply to multi-channel inputs. This makes it possible to apply them to three-channel images (RGB images, for example), or to the output of another convolution, so that convolutions can be chained.
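Concretely, a kernel applied to a multi-channel input has one 2D slice of weights per input channel, and the per-channel results are summed into a single feature map. A naive NumPy sketch (a hypothetical helper for illustration only; stride 1, no padding):

import numpy as np

def conv2d_one_kernel(image, kernel):
    # image: (in_channels, H, W); kernel: (in_channels, kH, kW)
    _, H, W = image.shape
    _, kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # multiply the window by the kernel and sum over all channels
            out[i, j] = np.sum(image[:, i:i+kH, j:j+kW] * kernel)
    return out

rgb = np.random.rand(3, 224, 224)
kernel = np.random.rand(3, 5, 5)
print(conv2d_one_kernel(rgb, kernel).shape)  # (220, 220)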
Let's include two convolutions with a kernel size of 5 in front of the previous MLP model.

The 2D convolution operator requires a 4D tensor input. The first dimension is the batch size, the second is the number of input channels (this is the channel-first format; in the channel-last format, channels would be the last dimension), and the third and fourth are the height and width of the feature map. MNIST gray images (one channel) stored as one-dimensional vectors need to be reshaped into 28x28 matrices, where 28 is the image height and width:
layer0_input = x.reshape((batch_size, 1, 28, 28))
Then, adding a first convolution layer of 20 channels on top of the transformed input, we get this:
from theano.tensor.nnet import conv2d

n_conv1 = 20
W1 = shared_glorot_uniform((n_conv1, 1, 5, 5))
conv1_out = conv2d(
    input=layer0_input,
    filters=W1,
    filter_shape=(n_conv1, 1, 5, 5),
    input_shape=(batch_size, 1, 28, 28)
)
In this case, the Xavier initialization (named after its inventor, Xavier Glorot) multiplies the number of input/output channels by the number of parameters in the kernel, numpy.prod(shape[2:]) = 5 x 5 = 25, to get the fan-in and fan-out used in the initialization formula.
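For reference, a minimal sketch of what such a shared_glorot_uniform helper might look like (assuming the standard Glorot uniform bound sqrt(6 / (fan_in + fan_out))):

import numpy as np
import theano

def shared_glorot_uniform(shape, dtype=theano.config.floatX, name=''):
    # fan-in/fan-out: channels multiplied by the kernel's receptive field size
    if len(shape) == 2:
        fan_in, fan_out = shape
    else:
        receptive_field_size = np.prod(shape[2:])
        fan_in = shape[1] * receptive_field_size
        fan_out = shape[0] * receptive_field_size
    bound = np.sqrt(6. / (fan_in + fan_out))
    values = np.random.uniform(-bound, bound, size=shape).astype(dtype)
    return theano.shared(value=values, name=name, borrow=True)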
The 20 kernels of size 5x5 with stride 1 on the 28x28 input will produce 20 feature maps of size 24x24 (since 28 - 5 + 1 = 24). So the first convolution output has shape (batch_size, 20, 24, 24).
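To check such shapes numerically, one can compile a function that returns the convolution output and feed it random data (a sketch, assuming x is the symbolic T.matrix of flattened 784-pixel images used in the earlier MLP code):

import numpy as np
import theano

f = theano.function([x], conv1_out)
sample = np.random.rand(batch_size, 28 * 28).astype(theano.config.floatX)
print(f(sample).shape)  # (batch_size, 20, 24, 24)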
The best-performing nets use max-pooling layers to encourage translation invariance and stability to noise. A max-pooling layer performs a maximum operation over a sliding window/patch, keeping only one value per patch. It reduces the size of the feature maps, so the total computational complexity and the training time decrease:
from theano.tensor.signal import pool

pooled_out = pool.pool_2d(input=conv1_out, ws=(2, 2), ignore_border=True)
The output of the 2x2 max-pooling layer will be (batch_size, 20, 12, 12). The batch size and the number of channels stay constant; only the size of the feature maps has changed.
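For intuition, here is what 2x2 max pooling does to a small 4x4 map, sketched in plain NumPy (the reshape trick assumes non-overlapping windows, which is what ignore_border=True with ws=(2, 2) computes):

import numpy as np

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 1, 5, 2],
              [2, 2, 1, 3]])

# split into non-overlapping 2x2 blocks and take the max of each block
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[4 2]
               #  [2 5]]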
Adding a second convolutional layer of 50 channels and another max-pooling layer on top of the previous one leads to an output of shape (batch_size, 50, 4, 4), since the convolution reduces the 12x12 maps to 8x8 (12 - 5 + 1 = 8) and the pooling halves them to 4x4:
n_conv2 = 50
W2 = shared_glorot_uniform((n_conv2, n_conv1, 5, 5))
conv2_out = conv2d(
    input=pooled_out,
    filters=W2,
    filter_shape=(n_conv2, n_conv1, 5, 5),
    input_shape=(batch_size, n_conv1, 12, 12)
)
pooled2_out = pool.pool_2d(input=conv2_out, ws=(2, 2), ignore_border=True)
To create a classifier, we connect the MLP seen before on top, with its two fully-connected linear layers and a softmax. The flatten(2) call collapses the (batch_size, 50, 4, 4) output into a (batch_size, 50 * 4 * 4) = (batch_size, 800) matrix:
hidden_input = pooled2_out.flatten(2)

n_hidden = 500
W3 = shared_zeros((n_conv2 * 4 * 4, n_hidden), name='W3')
b3 = shared_zeros((n_hidden,), name='b3')
hidden_output = T.tanh(T.dot(hidden_input, W3) + b3)

n_out = 10
W4 = shared_zeros((n_hidden, n_out), name='W4')
b4 = shared_zeros((n_out,), name='b4')
model = T.nnet.softmax(T.dot(hidden_output, W4) + b4)

params = [W1, W2, W3, b3, W4, b4]
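For reference, a quick tally of the trainable parameters from the shapes above:
- W1: 20 x 1 x 5 x 5 = 500
- W2: 50 x 20 x 5 x 5 = 25,000
- W3: 800 x 500 = 400,000, plus b3: 500
- W4: 500 x 10 = 5,000, plus b4: 10
This gives a total of 431,010 parameters.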
Such a model is called a Convolutional Neural Network (CNN).
The full code is given in the 3-cnn.py file.
Training is much slower because the number of parameters has been multiplied again, and the use of the GPU makes a lot more sense: total training time on the GPU increases to 1 hour, 48 minutes, and 27 seconds. Training on the CPU would take days.
The training error drops to zero after a few iterations, partly due to overfitting. Let's see in the next section how to compute a test loss and accuracy that better reflect the model's performance.