
Comparing latency between the CPU and the GPU code 

The programs for CPU and GPU addition are written in a modular way, so you can play around with the value of N. If N is small, you will not notice any significant time difference between the CPU and the GPU code. But if N is sufficiently large, you will notice a significant difference between the CPU execution time and the GPU execution time for the same vector addition. The time taken for the execution of a particular block can be measured by adding the following lines to the existing code:

clock_t start_d = clock();
printf("Doing GPU Vector add\n");
gpuAdd<<<N, 1>>>(d_a, d_b, d_c);
cudaDeviceSynchronize();
clock_t end_d = clock();
double time_d = (double)(end_d - start_d) / CLOCKS_PER_SEC;
printf("No of Elements in Array:%d \n Device time %f seconds \n host time %f Seconds\n", N, time_d, time_h);

Time is measured by counting the clock ticks taken to perform a particular operation: the starting and ending tick counts are read with the clock() function, and their difference is divided by the number of clock ticks per second (CLOCKS_PER_SEC) to get the execution time in seconds. Note that the kernel launch is asynchronous, so the code must synchronize with the device before reading the ending tick count; otherwise only the launch overhead is measured. When N is taken as 10,000,000 in the previous vector addition programs of the CPU and the GPU, the output is as follows:

As can be seen from the output, the execution time improves from about 25 milliseconds on the CPU to almost 1 millisecond when the same function is implemented on the GPU. This confirms what we saw earlier in theory: executing code in parallel on the GPU helps improve throughput. CUDA also provides an efficient and accurate method for measuring the performance of CUDA programs, using CUDA events, which will be explained in later chapters.
