Summary

This chapter explained how to launch a kernel with multiple blocks, each containing multiple threads, and showed how to choose these two launch parameters when the total number of threads is large. It also described the hierarchical memory architecture available to CUDA programs: the memory nearest to the executing thread is the fastest, and memory gets slower the farther it is from the thread. When multiple threads need to communicate with each other, CUDA provides shared memory, through which threads within the same block can exchange data. When multiple threads access the same memory location, those accesses must be synchronized; otherwise, the final result will not be as expected. We also saw how atomic operations can provide this synchronization. Parameters that remain constant throughout a kernel's execution can be stored in constant memory for a speedup. When a CUDA program exhibits a particular access pattern, such as spatial locality, texture memory can be used to improve its performance. To summarize, improving the performance of a CUDA program comes down to reducing memory traffic to the slow memories; done efficiently, this can yield a drastic improvement in performance.
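As a brief recap of shared memory, synchronization, and atomic operations working together, consider the minimal sketch below (not taken from the chapter; the kernel name `blockSum` and the sizes are illustrative assumptions). Each block sums its chunk of the input in fast shared memory, uses `__syncthreads()` to synchronize threads within the block, and then contributes its partial sum to a single global total with `atomicAdd`:

```cuda
#include <cstdio>

#define N 1024
#define THREADS_PER_BLOCK 256

// Each block reduces its portion of the input in shared memory,
// then one atomicAdd per block accumulates the global total.
__global__ void blockSum(const int *in, int *total)
{
    __shared__ int cache[THREADS_PER_BLOCK];          // fast, per-block memory
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index

    cache[threadIdx.x] = (tid < N) ? in[tid] : 0;
    __syncthreads();  // every thread must finish writing before any thread reads

    // Tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }

    // Many blocks update the same location, so the access must be atomic
    if (threadIdx.x == 0)
        atomicAdd(total, cache[0]);
}

// Launch with enough blocks to cover N threads:
// blockSum<<<(N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK,
//            THREADS_PER_BLOCK>>>(d_in, d_total);
```

Without the `__syncthreads()` calls, some threads would read `cache` entries before their neighbors had written them; without `atomicAdd`, concurrent updates to `total` from different blocks could be lost.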

The next chapter will discuss the concept of CUDA streams, which is similar to multitasking in CPU programs, and how to measure the performance of CUDA programs. It will also demonstrate the use of CUDA in simple image processing applications.
