- Python Deep Learning
- Ivan Vasilev, Daniel Slater, Gianmario Spacagna, Peter Roelants, Valentino Zocca
The reasons for deep learning's popularity
If you've followed machine learning for some time, you may have noticed that many DL algorithms are not new. We dropped some hints about this in the A brief history of contemporary deep learning section, but let's look at some more examples now. Multilayer perceptrons have been around for nearly 50 years. Backpropagation was discovered a couple of times independently, but finally gained recognition in 1986. Yann LeCun, a famous computer scientist, perfected his work on convolutional networks in the 1990s. In 1997, Sepp Hochreiter and Jürgen Schmidhuber invented long short-term memory (LSTM), a type of recurrent neural network still in use today. In this section, we'll try to understand why we have an AI summer now, and why we previously had only AI winters (https://en.wikipedia.org/wiki/AI_winter).
The first reason is that, today, we have a lot more data than in the past. The rise of the internet and of software in different industries has generated a lot of computer-accessible data. We also have more benchmark datasets, such as ImageNet. With this comes the desire to extract value from that data by analyzing it. And, as we'll see later, deep learning algorithms work better when they are trained with a lot of data.
The second reason is the increased computing power. This is most visible in the drastically increased processing capacity of Graphics Processing Units (GPUs). Architecturally, Central Processing Units (CPUs) are composed of a few cores that can handle a few threads at a time, while GPUs are composed of hundreds of cores that can handle thousands of threads in parallel. A GPU is a highly parallelizable unit, compared to a CPU, which is mainly a serial unit. Neural networks are organized in such a way as to take advantage of this parallel architecture. Let's see why.
As we now know, neurons from a network layer are not connected to neurons from the same layer. Therefore, we can compute the activation of each neuron in that layer independently from the others. This means that we can compute their activations in parallel. To better understand this, let's use two sequential fully-connected layers, where the input layer has n neurons and the second layer has m neurons. The activation value for each neuron is $y = f\left(\sum_{i=1}^{n} x_i w_i\right)$. If we express it in vector form, we have $y = f(\mathbf{x} \cdot \mathbf{w})$, where $\mathbf{x}$ and $\mathbf{w}$ are n-dimensional vectors (because the input size is n). We can combine the weight vectors of all m neurons in the second layer in an n-by-m matrix, W. Now, let's recall that we train the network using mini-batches of inputs with an arbitrary size, k. We can represent one mini-batch of input vectors as a k-by-n matrix, X. We'll optimize the execution by propagating the whole mini-batch through the network as a single input. Putting it all together, we can compute all of the neuron activations of the second layer, Y, for all input vectors in the mini-batch, as a single matrix multiplication, $Y = f(XW)$, where f is applied element-wise. This highly parallelizable operation can fully utilize the advantages of the GPU.
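To make this concrete, here is a minimal NumPy sketch of the computation above. The sizes n, m, and k are arbitrary illustrative values, and we assume the logistic sigmoid as the activation f:

```python
import numpy as np

n, m, k = 784, 128, 32  # input size, second-layer size, mini-batch size

rng = np.random.default_rng(0)
X = rng.standard_normal((k, n))   # one mini-batch of k input vectors
W = rng.standard_normal((n, m))   # weight matrix of the second layer

def f(z):
    """Logistic sigmoid activation, applied element-wise."""
    return 1 / (1 + np.exp(-z))

# All k * m activations come from a single matrix multiplication,
# which a GPU can parallelize across its many cores.
Y = f(X @ W)
print(Y.shape)  # (32, 128): one m-dimensional activation vector per input
```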
Furthermore, CPUs are optimized for latency, while GPUs are optimized for bandwidth. This means that a CPU can fetch small chunks of memory very quickly, but is slow to fetch large chunks; a GPU does the opposite. For the matrix multiplications in a deep network with many wide layers, bandwidth, not latency, becomes the bottleneck. In addition, the L1 cache of the GPU is much faster than the L1 cache of the CPU, and it is also larger. The L1 cache stores the data that the program is likely to use next, and keeping data there speeds up subsequent accesses. Much of the memory gets reused in deep neural networks, which is why L1 cache memory is important.
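As a rough illustration rather than a rigorous benchmark, the following sketch times the same large matrix multiplication on the CPU and, if one is available, on a CUDA GPU. It assumes PyTorch is installed:

```python
import time
import torch

A = torch.randn(4096, 4096)
B = torch.randn(4096, 4096)

start = time.perf_counter()
_ = A @ B  # matrix multiplication on the CPU
print(f"CPU: {time.perf_counter() - start:.3f}s")

if torch.cuda.is_available():
    A_gpu, B_gpu = A.cuda(), B.cuda()
    _ = A_gpu @ B_gpu            # warm-up: triggers CUDA initialization
    torch.cuda.synchronize()
    start = time.perf_counter()
    _ = A_gpu @ B_gpu
    torch.cuda.synchronize()     # GPU calls are asynchronous; wait for the result
    print(f"GPU: {time.perf_counter() - start:.3f}s")
```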
But even under these favorable conditions, we still haven't addressed the issues of training deep neural networks, such as vanishing gradients. Thanks to a combination of algorithmic advances, it's now possible to train neural networks of almost arbitrary depth. These advances include better activation functions, such as the Rectified Linear Unit (ReLU); better initialization of the network weights before training; new network architectures; and new types of regularization techniques, such as batch normalization.
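To see one of these advances at work, here is a minimal NumPy sketch comparing the derivative of the sigmoid with that of ReLU. The sigmoid's small gradients are what shrink layer after layer during backpropagation, causing the vanishing gradient problem:

```python
import numpy as np

def sigmoid_grad(z):
    """Derivative of the logistic sigmoid: at most 0.25, near 0 for large |z|."""
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)

def relu_grad(z):
    """Derivative of ReLU: exactly 1 for all positive inputs."""
    return (z > 0).astype(float)

z = np.array([-5.0, 0.5, 5.0])
print(sigmoid_grad(z))  # ~[0.0066 0.235 0.0066]: gradients shrink at every layer
print(relu_grad(z))     # [0. 1. 1.]: the gradient passes through unchanged when active
```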