
Comparing latency between the CPU and the GPU code 

The programs for CPU and GPU addition are written in a modular way, so you can play around with the value of N. If N is small, you will not notice any significant time difference between the CPU and the GPU code. But if N is sufficiently large, you will notice a significant difference between the CPU execution time and the GPU execution time for the same vector addition. The time taken for the execution of a particular block can be measured by adding the following lines to the existing code:

clock_t start_d = clock();
printf("Doing GPU Vector add\n");
gpuAdd<<<N, 1>>>(d_a, d_b, d_c);
cudaDeviceSynchronize();  // wait for the kernel to finish; cudaThreadSynchronize() is deprecated
clock_t end_d = clock();
double time_d = (double)(end_d - start_d) / CLOCKS_PER_SEC;
printf("No of Elements in Array:%d \n Device time %f seconds \n host time %f Seconds\n", N, time_d, time_h);

Time is measured by counting the total number of clock ticks taken to perform a particular operation. This is done by taking the difference between the starting and the ending clock tick counts, both measured using the clock() function. Dividing this difference by CLOCKS_PER_SEC, the number of clock ticks per second, gives the execution time in seconds. When N is taken as 10,000,000 in the previous vector addition programs for the CPU and the GPU, the output is as follows:

As can be seen from the output, the execution time improves from about 25 milliseconds to almost 1 millisecond when the same function is implemented on the GPU. This confirms what we saw earlier in theory: executing code in parallel on the GPU helps improve throughput. CUDA also provides a more efficient and accurate method for measuring the performance of CUDA programs, using CUDA events, which will be explained in later chapters.
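As a preview, event-based timing looks roughly like the following sketch. This uses the standard CUDA runtime event API rather than any listing from this book; gpuAdd, N, and the device pointers d_a, d_b, and d_c are assumed to be set up exactly as in the vector addition program above:

```cuda
cudaEvent_t start, stop;
float elapsed_ms = 0.0f;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                        // mark start on the default stream
gpuAdd<<<N, 1>>>(d_a, d_b, d_c);                  // kernel to be timed
cudaEventRecord(stop, 0);                         // mark end on the same stream
cudaEventSynchronize(stop);                       // wait until the stop event completes

cudaEventElapsedTime(&elapsed_ms, start, stop);   // elapsed time in milliseconds
printf("Device time %f ms\n", elapsed_ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Because events are recorded on the GPU's own timeline, this avoids the host-side clock() overhead and the need to synchronize before reading the end time.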
