
Summary

This chapter explained how to launch multiple blocks, each containing multiple threads, from a kernel function, and showed how to choose these two launch parameters when a large number of threads is needed. It also described the hierarchical memory architecture available to CUDA programs: the memory nearest to the executing thread is the fastest, and memories get slower the further away they are. When multiple threads need to communicate with each other, CUDA provides shared memory, through which threads within the same block can exchange data. When multiple threads access the same memory location, those accesses must be synchronized; otherwise, the final result will not be as expected. We saw how atomic operations accomplish this synchronization. Parameters that remain constant throughout a kernel's execution can be stored in constant memory for a speedup. When a CUDA program's memory-access pattern exhibits spatial locality, texture memory can be used to improve performance. To summarize, improving the performance of a CUDA program means reducing the traffic to slow memories; done efficiently, this can yield a drastic improvement in performance.
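As a brief recap of the synchronization point above, the following is a minimal sketch (not code from this chapter) of a kernel that counts occurrences of values with and without an atomic operation; the kernel names and array sizes are illustrative assumptions:

```cuda
#include <cstdio>

#define N 1024

// Racy version: multiple threads read-modify-write the same counter,
// so updates can be lost and the final count is unpredictable.
__global__ void count_racy(const int *data, int *counter) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N && data[tid] == 1) {
        *counter = *counter + 1;   // unsynchronized read-modify-write
    }
}

// Correct version: atomicAdd serializes the conflicting updates,
// so every increment is counted exactly once.
__global__ void count_atomic(const int *data, int *counter) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N && data[tid] == 1) {
        atomicAdd(counter, 1);     // synchronized read-modify-write
    }
}

int main() {
    int h_data[N], h_count = 0;
    for (int i = 0; i < N; i++) h_data[i] = 1;   // every element matches

    int *d_data, *d_count;
    cudaMalloc(&d_data, N * sizeof(int));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_data, h_data, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, sizeof(int));

    count_atomic<<<N / 256, 256>>>(d_data, d_count);
    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    printf("atomic count: %d\n", h_count);       // reliably N

    cudaFree(d_data);
    cudaFree(d_count);
    return 0;
}
```

Replacing `count_atomic` with `count_racy` in the launch above typically produces a count far below `N`, which is exactly the lost-update problem that atomic operations exist to prevent.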

The next chapter discusses the concept of CUDA streams, which is similar to multitasking in CPU programs. It also covers how to measure the performance of CUDA programs, and demonstrates the use of CUDA in simple image processing applications.
