
The reasons for deep learning's popularity

If you've followed machine learning for some time, you may have noticed that many DL algorithms are not new. We dropped some hints about this in the A brief history of contemporary deep learning section, but let's look at some more examples now. Multilayer perceptrons have been around for nearly 50 years. Backpropagation was discovered more than once, but only gained recognition in 1986. Yann LeCun, a famous computer scientist, perfected his work on convolutional networks in the 1990s. In 1997, Sepp Hochreiter and Jürgen Schmidhuber invented long short-term memory (LSTM), a type of recurrent neural network still in use today. In this section, we'll try to understand why we are in an AI summer now, and why we only had AI winters (https://en.wikipedia.org/wiki/AI_winter) before.

The first reason is that, today, we have a lot more data than in the past. The rise of the internet and the spread of software across industries have generated a lot of computer-accessible data. We also have more benchmark datasets, such as ImageNet. With this comes the desire to extract value from that data by analyzing it. And, as we'll see later, deep learning algorithms work better when they are trained with a lot of data.

The second reason is increased computing power. This is most visible in the drastically increased processing capacity of Graphics Processing Units (GPUs). Architecturally, Central Processing Units (CPUs) are composed of a few cores that can handle a few threads at a time, while GPUs are composed of hundreds or even thousands of cores that can handle thousands of threads in parallel. A GPU is a highly parallel unit, compared to a CPU, which is mainly a serial unit. Neural networks are organized in a way that lets them take advantage of this parallel architecture. Let's see why.

As we now know, neurons in a network layer are not connected to other neurons in the same layer. Therefore, we can compute the activation of each neuron in that layer independently of the others, which means that we can compute them in parallel. To better understand this, let's take two sequential fully-connected layers, where the input layer has n neurons and the second layer has m neurons. Leaving aside the bias and the activation function, the value of each neuron in the second layer is the weighted sum $y = \sum_{i=1}^{n} w_i x_i$. If we express it in vector form, we have $y = \mathbf{x} \cdot \mathbf{w}$, where x and w are n-dimensional vectors (because the input size is n). We can combine the weight vectors of all the neurons in the second layer as the columns of an n by m matrix, W. Now, let's recall that we train the network using mini batches of inputs with an arbitrary size, k. We can represent one mini batch of input vectors as a k by n matrix, X. We'll optimize the execution by propagating the whole mini batch through the network as a single input. Putting it all together, we can compute all of the neuron activations of the second layer, Y, for all input vectors in the mini batch, as a single matrix multiplication: Y = XW, where Y is a k by m matrix. This highly parallelizable operation can take full advantage of the GPU.
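The equivalence between the neuron-by-neuron computation and the single matrix multiplication can be checked with a small sketch. It assumes NumPy is available and runs on the CPU; the sizes k, n, and m and the random values are arbitrary choices made for illustration, and the bias and activation function are left out, as in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): mini-batch size, input neurons, second-layer neurons
k, n, m = 4, 3, 2

X = rng.normal(size=(k, n))  # one input vector per row
W = rng.normal(size=(n, m))  # one weight vector per column (one column per neuron)

# Neuron by neuron: one independent dot product per (sample, neuron) pair.
Y_loop = np.empty((k, m))
for s in range(k):
    for j in range(m):
        Y_loop[s, j] = np.dot(X[s], W[:, j])

# Whole mini batch at once: a single matrix multiplication.
Y = X @ W

print(Y.shape)                 # (4, 2)
print(np.allclose(Y, Y_loop))  # True
```

None of the k * m dot products in the loop depends on any other, which is what allows a parallel device to compute them simultaneously.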

Furthermore, CPUs are optimized for latency and GPUs are optimized for bandwidth. This means that a CPU can fetch small chunks of memory very quickly, but will be slow to fetch large chunks. The GPU does the opposite. For matrix multiplication in a deep network with a lot of wide layers, bandwidth becomes the bottleneck, not latency. In addition, summed across all of its cores, the L1 cache of the GPU is larger than that of the CPU and offers higher aggregate bandwidth. The L1 cache holds data that the program is likely to use next, and keeping this data close to the cores speeds up the computation. Much of the memory gets reused in deep neural networks, which is why L1 cache memory is important.

But even under these favorable conditions, we still haven't addressed the difficulties of training deep neural networks, such as vanishing gradients. Thanks to a combination of algorithmic advances, it's now possible to train neural networks of almost arbitrary depth. These advances include better activation functions such as the Rectified Linear Unit (ReLU), better initialization of the network weights before training, new network architectures, and new types of regularization techniques such as batch normalization.
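To give a rough feel for why the choice of activation function matters for vanishing gradients, here is a deliberately simplified sketch. It assumes a chain of layers in which every weight is 1 and every neuron sees the same hypothetical pre-activation value z, so the backpropagated gradient reduces to a product of identical activation derivatives. Real networks are more complicated; the helper names and the values of depth and z are illustrative only.

```python
import math


def sigmoid_grad(z):
    """Derivative of the logistic sigmoid; its maximum value is 0.25, at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def chained_grad(grad_fn, z, depth):
    """Product of `depth` identical local derivatives (toy model, weights fixed at 1)."""
    return grad_fn(z) ** depth


depth = 20  # hypothetical number of layers
z = 0.5     # hypothetical pre-activation, the same at every layer

sig = chained_grad(sigmoid_grad, z, depth)
rel = chained_grad(relu_grad, z, depth)

print(f"sigmoid: {sig:.3e}")
print(f"ReLU:    {rel:.3e}")
```

Because the sigmoid derivative never exceeds 0.25, the product shrinks geometrically with depth, while the ReLU derivative of 1 for positive inputs leaves it unchanged in this toy setting.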
