The reasons for deep learning's popularity

If you've followed machine learning for some time, you may have noticed that many DL algorithms are not new. We dropped some hints about this in the "A brief history of contemporary deep learning" section, but let's look at some more examples now. Multilayer perceptrons have been around for nearly 50 years. Backpropagation was discovered several times, but only gained wide recognition in 1986. Yann LeCun, a famous computer scientist, perfected his work on convolutional networks in the 1990s. In 1997, Sepp Hochreiter and Jürgen Schmidhuber invented long short-term memory, a type of recurrent neural network still in use today. In this section, we'll try to understand why we have an AI summer now, and why we only had AI winters (https://en.wikipedia.org/wiki/AI_winter) before.

The first reason is that today we have a lot more data than in the past. The rise of the internet and the spread of software across different industries have generated a lot of computer-accessible data. We also have more benchmark datasets, such as ImageNet. With this comes the desire to extract value from that data by analyzing it. And, as we'll see later, deep learning algorithms work better when they are trained with a lot of data.

The second reason is increased computing power. This is most visible in the drastically increased processing capacity of Graphics Processing Units (GPUs). Architecturally, Central Processing Units (CPUs) are composed of a few cores that can handle a few threads at a time, while GPUs are composed of hundreds of cores that can handle thousands of threads in parallel. A GPU is a highly parallel unit, compared to a CPU, which is mainly a serial unit. Neural networks are organized in a way that takes advantage of this parallel architecture. Let's see why.

As we now know, neurons from a network layer are not connected to neurons from the same layer. Therefore, we can compute the activation of each neuron in that layer independently from the others. This means that we can compute their activations in parallel. To better understand this, let's use two sequential fully-connected layers, where the input layer has n neurons and the second layer has m neurons. Leaving aside the bias and the activation function, which is applied element-wise afterward, the value computed for each neuron of the second layer is $y = \sum_{i=1}^{n} x_i w_i$. If we express it in vector form, we have $y = \mathbf{x} \cdot \mathbf{w}$, where x and w are n-dimensional vectors (because the input size is n). We can combine the weight vectors for all neurons in the second layer in an n by m dimensional matrix, W, with one weight vector per column. Now, let's recall that we train the network using mini-batches of inputs with an arbitrary size, k. We can represent one mini-batch of input vectors as a k by n dimensional matrix, X, with one input vector per row. We'll optimize the execution by propagating the whole mini-batch through the network as a single input. Putting it all together, we can compute all of the neuron values of the second layer for all input vectors in the mini-batch as a single matrix multiplication: $Y = XW$, where Y is a k by m dimensional matrix. This highly parallelizable operation can fully utilize the advantages of the GPU.
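The equivalence between the neuron-by-neuron computation and the single matrix multiplication can be checked with a short NumPy sketch. The sizes k, n, and m below are arbitrary illustrative values, the weights are random, and the bias and activation function are left out to match Y = XW. The code runs on the CPU; it only illustrates the arithmetic, not GPU execution.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Illustrative sizes (assumptions): mini-batch size, input neurons, second-layer neurons
k, n, m = 4, 3, 2

X = rng.normal(size=(k, n))  # one input vector per row
W = rng.normal(size=(n, m))  # one weight vector per column

# Neuron-by-neuron version: every entry is an independent dot product,
# so none of these k * m computations depends on any other.
Y_loop = np.empty((k, m))
for sample in range(k):
    for neuron in range(m):
        Y_loop[sample, neuron] = np.dot(X[sample], W[:, neuron])

# Batched version: the same k * m dot products as one matrix multiplication.
Y = X @ W

print(Y.shape)
print(np.allclose(Y, Y_loop))
```

Because no entry of Y depends on any other entry, a GPU can assign the dot products to different threads and compute them at the same time.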

Furthermore, CPUs are optimized for latency and GPUs are optimized for bandwidth. This means that a CPU can fetch small chunks of memory very quickly, but will be slow to fetch large chunks. The GPU does the opposite. For matrix multiplication in a deep network with a lot of wide layers, bandwidth becomes the bottleneck, not latency. In addition, the L1 cache of the GPU is much faster than the L1 cache of the CPU and is also larger. The L1 cache holds the data that the program is likely to use next, and keeping that data close to the processing cores speeds up execution. Much of the memory gets reused in deep neural networks, which is why L1 cache memory is important.

But even under these favorable conditions, we still haven't addressed the difficulties of training deep neural networks, such as vanishing gradients. Thanks to a combination of algorithmic advances, it's now possible to train neural networks of almost arbitrary depth. These advances include better activation functions, such as the Rectified Linear Unit (ReLU), better initialization of the network weights before training, new network architectures, and new types of regularization techniques, such as batch normalization.
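One way to see why the choice of activation function matters for vanishing gradients is to look at how the activation derivatives multiply across layers during backpropagation. The sketch below is a deliberate simplification and not a full training example: it ignores the weights entirely, assumes a hypothetical depth of 20 layers, and takes the best case for each function (the sigmoid at its steepest point, ReLU with a positive input).

```python
import numpy as np

def sigmoid_grad(z):
    """Derivative of the logistic sigmoid; its maximum is 0.25, at z = 0."""
    s = 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return np.where(np.asarray(z, dtype=float) > 0, 1.0, 0.0)

depth = 20  # hypothetical number of layers

# Backpropagation multiplies one activation derivative per layer.
# Weights are ignored here, so these are only the activation factors.
sigmoid_factor = float(sigmoid_grad(0.0) ** depth)  # best case for sigmoid
relu_factor = float(relu_grad(1.0) ** depth)        # any positive input

print(sigmoid_factor)
print(relu_factor)
```

Even in its best case, the sigmoid factor is 0.25 raised to the power of the depth, which shrinks toward zero as layers are added, while the ReLU factor stays at 1 for active neurons.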
