Hands-On GPU-Accelerated Computer Vision with OpenCV and CUDA
Bhaumik Vaidya
Threads
CUDA uses a hierarchical architecture for parallel execution. A kernel is executed in parallel across multiple blocks, and each block is further divided into multiple threads. In the last chapter, we saw that the CUDA runtime can carry out parallel operations by launching multiple copies of the same kernel. We saw that this can be done in two ways: either by launching multiple blocks in parallel, with one thread per block, or by launching a single block with many threads in parallel. So, two questions you might ask are: which method should I use in my code? And is there any limitation on the number of blocks and threads that can be launched in parallel?
The answers to these questions are pivotal. As we will see later in this chapter, threads in the same block can communicate with each other via shared memory, so there is an advantage to launching one block with many threads in parallel so that they can communicate with each other. In the last chapter, we also saw the maxThreadsPerBlock property, which limits the number of threads that can be launched per block; its value is 512 on older devices and 1,024 on the latest GPUs. Similarly, in the second method, the maximum number of blocks that can be launched in parallel is limited to 65,535.
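These limits are reported by the device itself, so rather than hard-coding them, you can query them at runtime. The following is a minimal sketch (not from the book's listing) that reads the relevant fields of cudaDeviceProp:
#include <stdio.h>
#include <cuda_runtime.h>
int main(void)
{
    cudaDeviceProp prop;
    //Query the properties of device 0
    cudaGetDeviceProperties(&prop, 0);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max blocks (grid x-dimension): %d\n", prop.maxGridSize[0]);
    return 0;
}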
Ideally, instead of launching many threads in a single block or many blocks with a single thread each, we launch multiple blocks, each containing multiple threads (up to maxThreadsPerBlock), in parallel. So, suppose you want to launch N = 50,000 threads in parallel in the vector-add example from the last chapter. The kernel call would be as follows:
gpuAdd<<<(N + 511) / 512, 512>>>(d_a, d_b, d_c);
The maximum number of threads per block is taken as 512, so the total number of blocks is calculated by dividing the total number of threads (N) by 512. But if N is not an exact multiple of 512, integer division of N by 512 rounds down, giving one block fewer than needed. So, to get the next highest integer value for the number of blocks, 511 is added to N before dividing by 512. This is basically a ceiling operation on the division: for example, with N = 1,000, (1,000 + 511)/512 = 2 blocks, which covers all 1,000 threads.
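This rounding-up idiom recurs in almost every launch configuration, so it is often wrapped in a small helper. Here is a minimal sketch (the name divUp is our own, not from the book):
//Ceiling division: the smallest number of blocks that covers n threads
inline int divUp(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}
With this helper, the launch above can be written as gpuAdd<<<divUp(N, 512), 512>>>(d_a, d_b, d_c);.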
Now, the question is: will this work for all values of N? The answer, sadly, is no. From the preceding discussion, the total number of blocks can't go beyond 65,535. So, in the aforementioned kernel call, if (N + 511)/512 exceeds 65,535, the code will again fail. To overcome this, a small fixed number of blocks and threads is launched, with a modification in the kernel code that we will see next by rewriting the kernel for our vector addition program from Chapter 2, Parallel Programming using CUDA C:
#include "stdio.h"
#include<iostream>
#include <cuda.h>
#include <cuda_runtime.h>
//Defining number of elements in array
#define N 50000
__global__ void gpuAdd(int *d_a, int *d_b, int *d_c)
{
//Getting index of current kernel
int tid = threadIdx.x + blockIdx.x * blockDim.x;
while (tid < N)
{
d_c[tid] = d_a[tid] + d_b[tid];
tid += blockDim.x * gridDim.x;
}
}
This kernel code is similar to what we wrote in the last chapter, with two modifications: the calculation of the thread ID, and the inclusion of a while loop in the kernel function. The change in the thread ID calculation is needed because we now launch multiple blocks and multiple threads in parallel. The calculation can be understood by viewing blocks and threads as a two-dimensional matrix, with one row per block and one column per thread within a block. We will take an example of three blocks with three threads per block, as shown in the following table:
            Thread 0    Thread 1    Thread 2
Block 0     tid = 0     tid = 1     tid = 2
Block 1     tid = 3     tid = 4     tid = 5
Block 2     tid = 6     tid = 7*    tid = 8
We can get the ID of each block from blockIdx.x, and the ID of each thread within the current block from threadIdx.x. So, for the thread marked with an asterisk in the table, the block ID will be 2 and the thread ID will be 1. But what if we want a unique index for this thread among all the threads? This can be calculated by multiplying its block ID by the number of threads per block, which is given by blockDim.x, and then adding its thread ID. This can be represented mathematically as follows:
tid = threadIdx.x + blockIdx.x * blockDim.x;
For the marked thread, threadIdx.x = 1, blockIdx.x = 2, and blockDim.x = 3, so tid = 1 + 2 * 3 = 7. This calculation is very important to learn, as it is used widely in CUDA code.
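To see the formula in action, here is a minimal sketch (not from the book's listing; it assumes a device supporting device-side printf, that is, compute capability 2.0 or later) that launches the 3 x 3 configuration from the table and prints each thread's global index:
__global__ void printTid(void)
{
    //Unique index of this thread across all blocks
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    printf("block %d, thread %d -> tid %d\n", blockIdx.x, threadIdx.x, tid);
}
Launching it as printTid<<<3, 3>>>(); followed by cudaDeviceSynchronize(); prints the nine indices from the table, though not necessarily in order, since the blocks may execute in any sequence.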
The while loop is included because, when N is very large, the total number of threads launched can't equal N, owing to the limitation described earlier. So, each thread has to process multiple elements, spaced apart by the total number of threads launched. This offset is calculated by multiplying blockDim.x (the number of threads per block) by gridDim.x (the number of blocks in the grid). Inside the while loop, the thread ID is incremented by this offset until it passes N. Now, this code will work for any value of N. To complete the program, we will write the main function as follows:
int main(void)
{
    //Declare host and device arrays
    int h_a[N], h_b[N], h_c[N];
    int *d_a, *d_b, *d_c;
    //Allocate memory on device
    cudaMalloc((void**)&d_a, N * sizeof(int));
    cudaMalloc((void**)&d_b, N * sizeof(int));
    cudaMalloc((void**)&d_c, N * sizeof(int));
    //Initialize host arrays
    for (int i = 0; i < N; i++)
    {
        h_a[i] = 2 * i * i;
        h_b[i] = i;
    }
    //Copy input arrays from host to device
    cudaMemcpy(d_a, h_a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, N * sizeof(int), cudaMemcpyHostToDevice);
    //Kernel call: 512 blocks of 512 threads each
    gpuAdd<<<512, 512>>>(d_a, d_b, d_c);
    //This ensures that kernel execution finishes before going forward
    cudaDeviceSynchronize();
    //Copy result back from device to host
    cudaMemcpy(h_c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    int Correct = 1;
    printf("Vector addition on GPU \n");
    //Verify the result on the host
    for (int i = 0; i < N; i++)
    {
        if (h_a[i] + h_b[i] != h_c[i])
        {
            Correct = 0;
        }
    }
    if (Correct == 1)
    {
        printf("GPU has computed Sum Correctly\n");
    }
    else
    {
        printf("There is an Error in GPU Computation\n");
    }
    //Free up device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
Again, the main function is very similar to what we wrote last time; the only change is in how the kernel is launched. The kernel is launched with 512 blocks, each containing 512 threads running in parallel, and the while loop inside the kernel handles any elements beyond those 512 x 512 = 262,144 threads, so the program works for large values of N. Instead of printing the sum of a very long vector, the program prints a single statement indicating whether the calculated answer is right or wrong. The output of the code will be as follows:
Vector addition on GPU
GPU has computed Sum Correctly
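Because the kernel now uses a grid-stride while loop, the launch configuration no longer has to match N exactly; any block count up to the hardware limit works. The following is a minimal sketch (not from the book's listing) of an alternative to the fixed 512 x 512 launch in main above, sizing the grid from N but capping it at 65,535 blocks:
int threads = 512;
int blocks = (N + threads - 1) / threads;
//Respect the per-grid block limit described earlier
if (blocks > 65535)
{
    blocks = 65535;
}
gpuAdd<<<blocks, threads>>>(d_a, d_b, d_c);
For small N, this launches just enough blocks to cover every element; for very large N, the grid is capped and the while loop inside the kernel makes up the difference.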
This section explained the hierarchical execution concept in CUDA. The next section takes this concept further by explaining the hierarchical memory architecture.