To understand vector operations on the GPU, we will start by writing a vector addition program on the CPU and then modify it to utilize the parallel structure of the GPU. We will take two arrays of numbers and store the result of their element-wise addition in a third array. The vector addition function for the CPU is shown here:
#include "stdio.h" #include<iostream> //Defining Number of elements in Array #define N 5 //Defining vector addition function for CPU void cpuAdd(int *h_a, int *h_b, int *h_c) { int tid = 0; while (tid < N) { h_c[tid] = h_a[tid] + h_b[tid]; tid += 1; } }
The cpuAdd function should be very simple to understand. The only thing you might find puzzling is the use of tid. It is included to make the program resemble the GPU version, in which tid indicates a particular thread ID. Here too, if you have a multicore CPU, you could initialize tid to 0 on one core and 1 on the other, and then add 2 to it in the loop, so that one core performs the sums on the even-indexed elements and the other on the odd-indexed elements (a small sketch of this idea follows the output below). The main function for the code is shown here:
int main(void)
{
    int h_a[N], h_b[N], h_c[N];
    //Initializing two arrays for addition
    for (int i = 0; i < N; i++)
    {
        h_a[i] = 2 * i * i;
        h_b[i] = i;
    }
    //Calling CPU function for vector addition
    cpuAdd(h_a, h_b, h_c);
    //Printing Answer
    printf("Vector addition on CPU\n");
    for (int i = 0; i < N; i++)
    {
        printf("The sum of %d element is %d + %d = %d\n", i, h_a[i], h_b[i], h_c[i]);
    }
    return 0;
}
There are two functions in the program: main and cpuAdd. In the main function, we start by defining three arrays: two to hold the inputs and one for the result. The input arrays are initialized with simple values (h_a[i] = 2*i*i and h_b[i] = i). Then, we pass these two arrays as inputs to the cpuAdd function, which stores the answer in the third array. Finally, we print this answer on the console, which is shown here:
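With N defined as 5 and the initialization above, the console output should look like this:

Vector addition on CPU
The sum of 0 element is 0 + 0 = 0
The sum of 1 element is 2 + 1 = 3
The sum of 2 element is 8 + 2 = 10
The sum of 3 element is 18 + 3 = 21
The sum of 4 element is 32 + 4 = 36

As mentioned earlier, the tid variable makes it easy to split this work across CPU cores. The following is a minimal sketch (not part of the original listing) of that idea using std::thread: a hypothetical cpuAddStride helper starts at a given index and jumps ahead by the stride, so one thread handles the even-indexed elements and the other handles the odd-indexed elements.

#include <thread>

//Each thread starts at its own ID (0 or 1) and jumps ahead by 2,
//so the two threads cover the even and odd indices respectively
void cpuAddStride(int *h_a, int *h_b, int *h_c, int start, int stride)
{
    for (int tid = start; tid < N; tid += stride)
    {
        h_c[tid] = h_a[tid] + h_b[tid];
    }
}

//Inside main, after initializing h_a and h_b:
//std::thread t0(cpuAddStride, h_a, h_b, h_c, 0, 2);
//std::thread t1(cpuAddStride, h_a, h_b, h_c, 1, 2);
//t0.join();
//t1.join();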
This explanation of tid in the cpuAdd function may give you an idea of how to write the same function for GPU execution, where many cores can run in parallel. If each copy of the add function is initialized with the ID of the core it runs on, then the additions for all the elements can be performed in parallel. The modified kernel function for addition on the GPU is shown here:
#include "stdio.h" #include<iostream> #include <cuda.h> #include <cuda_runtime.h> //Defining number of elements in Array #define N 5 //Defining Kernel function for vector addition __global__ void gpuAdd(int *d_a, int *d_b, int *d_c) { //Getting block index of current kernel int tid = blockIdx.x; // handle the data at this index if (tid < N) d_c[tid] = d_a[tid] + d_b[tid]; }
In the gpuAdd kernel function, tid is initialized with the block ID of the block in which the kernel copy is executing. Each copy of the kernel adds the array element indexed by this block ID. If the number of blocks is equal to the number of elements in the array, then all the addition operations are performed in parallel. How this kernel is called from the main function is explained next. The code for the main function is as follows:
int main(void)
{
    //Defining host arrays
    int h_a[N], h_b[N], h_c[N];
    //Defining device pointers
    int *d_a, *d_b, *d_c;
    //Allocate the memory on the device
    cudaMalloc((void**)&d_a, N * sizeof(int));
    cudaMalloc((void**)&d_b, N * sizeof(int));
    cudaMalloc((void**)&d_c, N * sizeof(int));
    //Initializing Arrays
    for (int i = 0; i < N; i++)
    {
        h_a[i] = 2 * i * i;
        h_b[i] = i;
    }
    //Copy input arrays from host to device memory
    cudaMemcpy(d_a, h_a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, N * sizeof(int), cudaMemcpyHostToDevice);
    //Calling kernel with N blocks and one thread per block, passing device pointers as parameters
    gpuAdd<<<N, 1>>>(d_a, d_b, d_c);
    //Copy result back to host memory from device memory
    cudaMemcpy(h_c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    //Printing result on console
    printf("Vector addition on GPU \n");
    for (int i = 0; i < N; i++)
    {
        printf("The sum of %d element is %d + %d = %d\n", i, h_a[i], h_b[i], h_c[i]);
    }
    //Free up device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
The GPU main function follows the familiar structure explained in the first section of this chapter:
It starts by defining arrays for the host and pointers for the device. The device pointers are allocated memory using the cudaMalloc function.
The input arrays to be passed to the kernel are copied from host memory to device memory using the cudaMemcpy function.
The kernel is launched with the device pointers passed as parameters. The values inside the kernel launch operator are N and 1, which indicate that we are launching N blocks with one thread per block.
The result stored by the kernel in device memory is copied back to host memory, again using cudaMemcpy, but this time with the direction of data transfer set from device to host.
Finally, the memory allocated to the three device pointers is freed using the cudaFree function. The output of the program is as follows:
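Since the arrays are initialized exactly as in the CPU version, the GPU program should print the same sums:

Vector addition on GPU
The sum of 0 element is 0 + 0 = 0
The sum of 1 element is 2 + 1 = 3
The sum of 2 element is 8 + 2 = 10
The sum of 3 element is 18 + 3 = 21
The sum of 4 element is 32 + 4 = 36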
All CUDA programs follow the same pattern shown before. We are launching N blocks in parallel, which means we are launching N copies of the same kernel simultaneously. You can understand this with a real-life example: suppose you want to transfer five big boxes from one place to another. In the first method, you hire one person who carries one box from one place to the other and repeats this five times. This option takes time, and it is similar to how vectors are added on the CPU. Now, suppose you hire five people and each of them carries one box. Each of them also knows the ID of the box they are carrying. This option will be much faster than the previous one. Each person just needs to be told to carry the box with a particular ID from one place to the other.
This is exactly how kernels are defined and executed on the device. Each kernel copy knows its own ID, which it reads from the built-in blockIdx.x variable, and works on the array element indexed by that ID. All copies perform their additions in parallel, which significantly reduces the processing time for the entire array. In this way, we improve throughput by performing operations in parallel rather than sequentially on the CPU. The comparison of throughput between the CPU code and the GPU code is explained in the next section.
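Before comparing throughput, it can be reassuring to check that the two versions agree. The following is a minimal sketch (not part of the original listings) that assumes cpuAdd from the CPU program is compiled into the same file and that h_a, h_b, and h_c hold the inputs and the GPU result, as at the end of the GPU main function; it recomputes the sums on the CPU and compares them element by element.

//Recompute the result on the CPU into a separate array
int cpu_c[N];
cpuAdd(h_a, h_b, cpu_c);
//Compare the CPU result with the result copied back from the GPU
bool match = true;
for (int i = 0; i < N; i++)
{
    if (cpu_c[i] != h_c[i])
        match = false;
}
printf(match ? "GPU result matches CPU result\n"
             : "GPU result does NOT match CPU result\n");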