
Comparing latency between the CPU and the GPU code 

The programs for the CPU and the GPU addition are written in a modular way, so you can play around with the value of N. If N is small, then you will not notice any significant time difference between the CPU and the GPU code. But if N is sufficiently large, then you will notice a significant difference between the CPU execution time and the GPU execution time for the same vector addition. The time taken for the execution of a particular block can be measured by adding the following lines to the existing code:

clock_t start_d = clock();
printf("Doing GPU Vector add\n");
gpuAdd<<<N, 1>>>(d_a, d_b, d_c);
cudaDeviceSynchronize();  // wait for the kernel to finish before stopping the clock
clock_t end_d = clock();
double time_d = (double)(end_d - start_d) / CLOCKS_PER_SEC;
printf("No of Elements in Array: %d\nDevice time %f seconds\nHost time %f seconds\n", N, time_d, time_h);

Time is measured by calculating the total number of clock ticks taken to perform a particular operation. This is done by taking the difference between the starting and ending clock tick counts, each measured using the clock() function, and dividing that difference by the number of clock ticks per second (CLOCKS_PER_SEC) to get the execution time. When N is taken as 10,000,000 in the previous vector addition programs of the CPU and the GPU and both are executed, the output is as follows:

As can be seen from the output, the execution time drops from about 25 milliseconds to almost 1 millisecond when the same function is implemented on the GPU. This confirms what we saw in theory earlier: executing code in parallel on the GPU improves throughput. Note that clock() measures host time and can be coarse; CUDA provides an efficient and accurate method for measuring the performance of CUDA programs, using CUDA events, which will be explained in later chapters.
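As a preview, event-based timing of the same kernel launch might look like the following sketch. The kernel name and device pointers are assumed from the code above; this is not the book's later implementation, just an illustration of the CUDA event API:

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);                 // mark the start on the GPU's timeline
gpuAdd<<<N, 1>>>(d_a, d_b, d_c);
cudaEventRecord(stop);                  // mark the end after the kernel launch
cudaEventSynchronize(stop);             // wait until the stop event has occurred

float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop);  // elapsed time in milliseconds
printf("Kernel time %f ms\n", elapsed_ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Unlike clock(), events are recorded on the GPU itself, so they measure kernel time without including host-side overhead.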
