
Get started with exploring MNIST

The MNIST dataset from http://yann.lecun.com/exdb/mnist/ consists of a training set of 60,000 samples and a testing set of 10,000 samples. As mentioned previously, the images were originally taken from NIST, then centered and resized to the same height and width (28 * 28 pixels).

Rather than downloading the ubyte files, train-images-idx3-ubyte.gz and train-labels-idx1-ubyte.gz, from the preceding website and merging them ourselves, we use a well-formatted dataset from the Kaggle competition Digit Recognizer, https://www.kaggle.com/c/digit-recognizer/. We can download the training dataset, train.csv, directly from https://www.kaggle.com/c/digit-recognizer/data. It is the only labeled dataset provided on the site, and we will use it to train classification models, evaluate them, and make predictions. Now let's load it up:

> data <- read.csv("train.csv")
> dim(data)
[1] 42000 785

We have 42,000 labeled samples available, and each sample has 784 features, which means each digit image has 784 (28 * 28) pixels. Take a look at the label and the first 5 features (pixels) for each of the first 6 data samples:

> head(data[1:6])
  label pixel0 pixel1 pixel2 pixel3 pixel4
1     1      0      0      0      0      0
2     0      0      0      0      0      0
3     1      0      0      0      0      0
4     4      0      0      0      0      0
5     0      0      0      0      0      0
6     0      0      0      0      0      0

The target label, ranging from 0 to 9, denotes the 10 digits:

> unique(unlist(data[1]))
[1] 1 0 4 7 3 5 8 9 2 6

The pixel variables range from 0 to 255, representing the brightness of the pixel; for example, 0 means black and 255 stands for white:

> min(data[2:785])
[1] 0
> max(data[2:785])
[1] 255
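Since many models expect inputs on a common scale, one optional preprocessing step, shown here only as a sketch (the data_scaled variable is our own, not part of the original workflow), is to rescale the pixels from [0, 255] to [0, 1]:

> # Optionally rescale all pixel columns (everything except the label) to [0, 1]
> data_scaled <- data
> data_scaled[, -1] <- data_scaled[, -1] / 255
> max(data_scaled[, -1])
[1] 1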

Now let's take a look at two samples. First, the fourth image:

> sample_4 <- matrix(as.numeric(data[4,-1]), nrow = 28, byrow = TRUE)
> image(sample_4, col = grey.colors(255))

Here we reshaped the feature vector of length 784 into a 28 * 28 matrix.
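To make the reshaping concrete, here is a tiny sketch with made-up values (not pixel data) showing that byrow = TRUE fills the matrix row by row:

> # A vector of six values filled into a 2 * 3 matrix row by row
> matrix(1:6, nrow = 2, byrow = TRUE)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6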

Second, the seventh image:

> sample_7 <- matrix(as.numeric(data[7,-1]), nrow = 28, byrow = TRUE)
> image(sample_7, col = grey.colors(255))

The result is as follows:

We noticed that the images are rotated 90 degrees counterclockwise. To view them properly, we need to rotate them 90 degrees clockwise, which we can do by reversing the elements in each column of the image matrix and then transposing it:

> # Rotate a matrix 90 degrees clockwise: reverse each column, then transpose
> rotate <- function(x) t(apply(x, 2, rev))
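As a quick sanity check on a toy matrix of our own (not actual pixel data), the helper moves the first row into the last column, which is exactly a 90-degree clockwise rotation:

> m <- matrix(1:4, nrow = 2, byrow = TRUE)
> m
     [,1] [,2]
[1,]    1    2
[2,]    3    4
> rotate(m)
     [,1] [,2]
[1,]    3    1
[2,]    4    2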

Now visualize the rotated images:


> image(rotate(sample_4), col = grey.colors(255))
> image(rotate(sample_7), col = grey.colors(255))

After seeing what the data and the images behind it look like, we do some more exploratory analysis on the labels and features. First, because this is a classification problem, it is good practice to check whether the classes in the data are balanced or unbalanced. Before doing so, we should transform the label from integer to factor:

> # Transform target variable "label" from integer to factor, in order to perform classification
> is.factor(data$label)
[1] FALSE
> data$label <- as.factor(data$label)
> is.factor(data$label)
[1] TRUE

Now, we can summarize the label distribution in counts:

> summary(data$label)
   0    1    2    3    4    5    6    7    8    9
4132 4684 4177 4351 4072 3795 4137 4401 4063 4188

Or combined with proportion (%):

> proportion <- prop.table(table(data$label)) * 100
> cbind(count=table(data$label), proportion=proportion)
  count proportion
0  4132   9.838095
1  4684  11.152381
2  4177   9.945238
3  4351  10.359524
4  4072   9.695238
5  3795   9.035714
6  4137   9.850000
7  4401  10.478571
8  4063   9.673810
9  4188   9.971429

The classes are fairly balanced.
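For a quick visual confirmation of this, a simple bar chart of the label counts (our own optional addition) does the job:

> # Bar chart of label counts to visually confirm that no digit dominates
> barplot(table(data$label), xlab = "Digit", ylab = "Count")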

Now, we explore the distribution of the features, that is, the pixels. As an example, we take the four pixels from the central 2 * 2 block (that is, pixel376, pixel377, pixel404, and pixel405) in each image and display their histograms for digits 1 to 9:

> central_block <- c("pixel376", "pixel377", "pixel404", "pixel405")
> par(mfrow=c(2, 2))
> for(i in 1:9) {
+ hist(c(as.matrix(data[data$label==i, central_block])),
+ main=sprintf("Histogram for digit %d", i),
+ xlab="Pixel value")
+ }

The resulting pixel brightness histograms for digits 1 to 4 are displayed as follows:

Histograms for digits 5 to 8:

And that for digit 9:

The brightness of the central pixels is distributed differently among these nine digits. For instance, most of the central pixels of digit 8 are bright, as 8 is usually written with strokes going through the center; digit 7, by contrast, is not usually written this way, so most of its central pixels are dark. Pixels taken from other positions can also be distinctly distributed among the different digits.
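To back this observation with numbers, an optional check of our own is to compare the average brightness of the central block across the ten digits; digits such as 8 should come out noticeably brighter than 7:

> # Average central-block brightness per digit label
> central_means <- tapply(rowMeans(data[, central_block]), data$label, mean)
> round(central_means, 1)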

The exploratory analysis we just conducted helps move us forward with building classification models based on pixels.
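As a small sketch of that next step (the actual split used by the upcoming models may differ), we could hold out part of the labeled data for evaluation:

> # Reserve 30% of the labeled samples for evaluation; the ratio and seed are assumptions
> set.seed(42)
> train_idx <- sample(nrow(data), round(0.7 * nrow(data)))
> data_train <- data[train_idx, ]
> data_test <- data[-train_idx, ]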
