
GPU versus CPU

One of the reasons for the popularity of deep learning today is the drastically increased processing capacity of GPUs (Graphics Processing Units). Architecturally, the CPU (Central Processing Unit) is composed of a few cores that can handle a few threads at a time, while a GPU is composed of hundreds of cores that can handle thousands of threads at the same time. A GPU is a highly parallel unit, compared to the CPU, which is mainly a serial unit.

DNNs are composed of several layers, and within each layer every neuron behaves in the same manner. Moreover, we have discussed how the activation value of each neuron is a weighted sum of its inputs, a_i = Σ_j w_ij x_j, or, expressed in matrix form, a = wx, where a and x are vectors and w is a matrix. All activation values are calculated in the same way across the network. CPUs and GPUs have different architectures; in particular, they are optimized differently: CPUs are latency-optimized and GPUs are bandwidth-optimized. In a deep neural network with many layers and a large number of neurons, bandwidth becomes the bottleneck, not latency, and this is why GPUs perform so much better. In addition, the L1 cache of the GPU is both faster and larger than the L1 cache of the CPU.
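Because every neuron in a layer performs the same operation, the forward pass of the whole layer collapses into a single matrix product, which is exactly the kind of workload a GPU parallelizes well. The following is a minimal NumPy sketch of this idea; the layer and batch sizes are illustrative, not taken from any particular network:

```python
import numpy as np

# Illustrative sizes: a layer with 512 inputs and 256 neurons,
# processing a mini-batch of 64 examples.
batch_size, n_in, n_out = 64, 512, 256

x = np.random.randn(batch_size, n_in)   # input activations, one row per example
w = np.random.randn(n_out, n_in)        # weight matrix, one row per neuron

# Each neuron computes the same weighted sum of its inputs, so the whole
# layer is one matrix product (a = wx applied to every example in the batch).
# A GPU executes this product with thousands of threads in parallel.
a = x @ w.T                              # shape: (batch_size, n_out)
print(a.shape)
```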

The L1 cache stores information that the program is likely to use next, and keeping this data close to the processor speeds up computation. Much of the memory gets re-used in deep neural networks, which is why L1 cache memory is important. Using GPUs, your program can run up to an order of magnitude faster than it would on CPUs alone, and this speed-up is also the reason behind much of the recent progress in speech and image processing using deep neural networks: it provides an increase in computing power that was not available a decade ago.

In addition to being faster for DNN training, GPUs are also more efficient at running DNN inference. Inference is the post-training phase in which we deploy our trained DNN. In a whitepaper published by GPU vendor Nvidia, titled GPU-Based Deep Learning Inference: A Performance and Power Analysis and available online at http://www.nvidia.com/content/tegra/embedded-systems/pdf/jetson_tx1_whitepaper.pdf, the efficiency of GPUs and CPUs is compared on the AlexNet network (a DNN with several convolutional layers). The results are summarized in the following table:

| Network: AlexNet       | Batch Size                   | Tegra X1 (FP32) | Tegra X1 (FP16) | Core i7 6700K (FP32) |
|------------------------|------------------------------|-----------------|-----------------|----------------------|
| Inference performance  | 1                            | 47 img/sec      | 67 img/sec      | 62 img/sec           |
| Power                  | 1                            | 5.5 W           | 5.1 W           | 49.7 W               |
| Performance/Watt       | 1                            | 8.6 img/sec/W   | 13.1 img/sec/W  | 1.3 img/sec/W        |
| Inference performance  | 128 (Tegra X1), 48 (Core i7) | 155 img/sec     | 258 img/sec     | 242 img/sec          |
| Power                  | 128 (Tegra X1), 48 (Core i7) | 6.0 W           | 5.7 W           | 62.5 W               |
| Performance/Watt       | 128 (Tegra X1), 48 (Core i7) | 25.8 img/sec/W  | 45 img/sec/W    | 3.9 img/sec/W        |

The results show that inference on the Tegra X1 can be up to an order of magnitude more energy-efficient than CPU-based inference, while achieving comparable performance levels.

Writing code that targets the GPU directly instead of the CPU is not easy, which is why the most popular open source libraries, such as Theano or TensorFlow, allow you to turn on a simple switch in your code to use the GPU rather than the CPU. Using these libraries does not require writing specialized code: the same code can run on both the CPU and the GPU, if one is available. How the switch is made depends on the library, but typically it is done by setting certain environment variables or by creating a specialized resource (.rc) file that is read by the particular open source library chosen.
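As a concrete illustration of both approaches, the sketch below shows how the switch typically looks for the two libraries mentioned above; the flag names and API calls follow common Theano and TensorFlow 1.x conventions and may differ in your installation:

```python
# Theano: the device is chosen outside the code, either through an
# environment variable set before launching the script...
#   THEANO_FLAGS='device=gpu,floatX=float32' python train.py
# ...or through a ~/.theanorc resource file containing:
#   [global]
#   device = gpu
#   floatX = float32

# TensorFlow (1.x API): operations can also be pinned to a device in code.
import tensorflow as tf

with tf.device('/gpu:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0, 1.0], [1.0, 1.0]])
    c = tf.matmul(a, b)

# allow_soft_placement lets the same code fall back to the CPU
# when no GPU is available.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    print(sess.run(c))
```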
