The reasons for deep learning's popularity

If you've followed machine learning for some time, you may have noticed that many DL algorithms are not new. We dropped some hints about this in the A brief history of contemporary deep learning section, but let's see some more examples now. Multilayer perceptrons have been around for nearly 50 years. Backpropagation was discovered independently several times, but only gained wide recognition in 1986. Yann LeCun, a famous computer scientist, perfected his work on convolutional networks in the 1990s. In 1997, Sepp Hochreiter and Jürgen Schmidhuber invented long short-term memory, a type of recurrent neural network still in use today. In this section, we'll try to understand why we are in an AI summer now, and why we only had AI winters (https://en.wikipedia.org/wiki/AI_winter) before.

The first reason is that we have a lot more data today than in the past. The rise of the internet and the spread of software across industries have generated a lot of computer-accessible data. We also have more benchmark datasets, such as ImageNet. With this comes the desire to extract value from that data by analyzing it. And, as we'll see later, deep learning algorithms work better when they are trained with a lot of data.

The second reason is the increase in computing power. This is most visible in the drastically increased processing capacity of Graphics Processing Units (GPUs). Architecturally, Central Processing Units (CPUs) are composed of a few cores that can handle a few threads at a time, while GPUs are composed of hundreds of cores that can handle thousands of threads in parallel. A GPU is a highly parallel unit, whereas a CPU is mainly a serial unit. Neural networks are organized in a way that lets them take advantage of this parallel architecture. Let's see why.

As we now know, neurons from a network layer are not connected to neurons from the same layer. Therefore, we can compute the activation of each neuron in that layer independently from the others. This means that we can compute their activations in parallel. To better understand this, let's use two sequential fully-connected layers, where the input layer has n neurons and the second layer has m neurons. Leaving aside the bias and the activation function, which is applied element-wise afterward, the value of each neuron in the second layer is the weighted sum $y = \sum_{i=1}^{n} x_i w_i$. If we express it in vector form, we have $y = \mathbf{x} \cdot \mathbf{w}$, where x and w are n-dimensional vectors (because the input size is n). We can combine the weight vectors for all neurons in the second layer in an n by m matrix, W, with one column per neuron. Now, let's recall that we train the network using mini-batches of inputs with an arbitrary size, k. We can represent one mini-batch of input vectors as a k by n matrix, X, with one row per input vector. We'll optimize the execution by propagating the whole mini-batch through the network as a single input. Putting it all together, we can compute all of the neuron values of the second layer, Y, for all input vectors in the mini-batch, as a single matrix multiplication, $Y = XW$, where Y is a k by m matrix. This highly parallelizable operation can fully utilize the advantages of the GPU.
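To make the equivalence concrete, here is a minimal sketch in plain Python (no GPU involved). It computes the second layer's values once neuron by neuron and once as a matrix multiplication Y = XW, then checks that both agree. The function names and the sizes k = 4, n = 3, m = 2 are illustrative assumptions, and the bias and activation function are left out, as in the formula Y = XW.

```python
import random

def neuron_output(x, w):
    """Weighted sum for one neuron: the dot product of input x and weights w."""
    return sum(x_i * w_i for x_i, w_i in zip(x, w))

def layer_output_per_neuron(X, W):
    """Compute Y one entry at a time. No entry depends on any other entry,
    which is what makes the computation parallelizable."""
    k, n, m = len(X), len(W), len(W[0])
    Y = [[0.0] * m for _ in range(k)]
    for row in range(k):          # one input vector of the mini-batch
        for col in range(m):      # one neuron of the second layer
            w_col = [W[i][col] for i in range(n)]
            Y[row][col] = neuron_output(X[row], w_col)
    return Y

def matmul(X, W):
    """Matrix multiplication Y = XW for a (k x n) matrix X and an (n x m) matrix W."""
    return [[sum(a * b for a, b in zip(x_row, w_col)) for w_col in zip(*W)]
            for x_row in X]

# Hypothetical sizes: mini-batch of k inputs, n input neurons, m output neurons.
random.seed(0)
k, n, m = 4, 3, 2
X = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(k)]
W = [[random.uniform(-1, 1) for _ in range(m)] for _ in range(n)]

Y_loop = layer_output_per_neuron(X, W)
Y_mat = matmul(X, W)

# Both approaches produce the same k x m result.
same = all(abs(Y_loop[r][c] - Y_mat[r][c]) < 1e-12
           for r in range(k) for c in range(m))
print(same)
```

On a GPU, a numerical library would dispatch the k × m independent dot products to many threads at once instead of looping over them as this sketch does.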

Furthermore, CPUs are optimized for latency and GPUs are optimized for bandwidth. This means that a CPU can fetch small chunks of memory very quickly, but will be slow to fetch large chunks. The GPU does the opposite. For matrix multiplication in a deep network with a lot of wide layers, bandwidth becomes the bottleneck, not latency. In addition, the fast on-chip memory of a GPU (its registers and L1 cache), summed across all of its cores, offers more capacity and higher aggregate bandwidth than the L1 cache of a CPU. This memory holds the data that the program is likely to use next, so keeping data there avoids slower trips to main memory. Deep neural networks reuse much of their data, which is why this fast memory is important.

But even under these favorable conditions, we still haven't addressed the problems of training deep neural networks, such as vanishing gradients. Thanks to a combination of algorithmic advances, it's now possible to train neural networks of almost arbitrary depth. These advances include better activation functions such as the Rectified Linear Unit (ReLU), better initialization of the network weights before training, new network architectures, and new regularization techniques such as batch normalization.
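As a rough illustration of why the choice of activation function matters for vanishing gradients, the sketch below multiplies the local derivative of an activation function across a number of layers, as the chain rule does during backpropagation. This is a deliberately simplified model and not a full training setup: it assumes every layer sees the same pre-activation value and ignores the weights entirely. The function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid; its maximum value is 0.25, reached at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

def chained_factor(grad_fn, z, depth):
    """Product of `depth` identical local derivatives (simplifying assumption:
    the same pre-activation z at every layer, weights ignored)."""
    return grad_fn(z) ** depth

for depth in (1, 5, 20):
    sig = chained_factor(sigmoid_grad, 0.0, depth)   # best case for the sigmoid
    rel = chained_factor(relu_grad, 1.0, depth)      # an active ReLU unit
    print(depth, sig, rel)
```

Even in the sigmoid's best case the factor shrinks geometrically with depth, while an active ReLU unit passes the gradient through unchanged. In a real network the weights and the actual pre-activation values also affect the gradient, which is where initialization and batch normalization come in.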
