
Elementwise squaring of vectors in CUDA

Now that we have launched N blocks in parallel with one thread in each block, a natural question is whether we can work the other way around. The answer is yes: we can launch a single block with N threads in parallel. To demonstrate this, and to make you more familiar with working with vectors in CUDA, we take a second example: the element-wise squaring of the numbers in an array. We take one array of numbers and return an array that contains the squares of those numbers. The kernel function to find the element-wise square is shown here:

#include <stdio.h>
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>
//Defining number of elements in Array
#define N 5
//Kernel function for squaring numbers
__global__ void gpuSquare(float *d_in, float *d_out)
{
  //Getting thread index for current kernel
  int tid = threadIdx.x; // handle the data at this index
  float temp = d_in[tid];
  d_out[tid] = temp * temp;
}

The gpuSquare kernel function takes pointers to two arrays as arguments. The first pointer, d_in, points to the memory location where the input array is stored, while the second pointer, d_out, points to the memory location where the output will be stored. In this program, instead of launching multiple blocks in parallel, we launch multiple threads in parallel, so tid is initialized with the ID of the current thread using threadIdx.x. The main function for this program is as follows:

int main(void)
{
  //Defining Arrays for host
  float h_in[N], h_out[N];
  //Defining Pointers for device memory
  float *d_in, *d_out;
  //Allocate the memory on the device
  cudaMalloc((void**)&d_in, N * sizeof(float));
  cudaMalloc((void**)&d_out, N * sizeof(float));
  //Initializing Array
  for (int i = 0; i < N; i++)
  {
    h_in[i] = i;
  }
  //Copy Array from host to device
  cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);
  //Calling square kernel with one block and N threads per block
  gpuSquare<<<1, N>>>(d_in, d_out);
  //Copying result back to host from device memory
  cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
  //Printing result on console
  printf("Square of Number on GPU \n");
  for (int i = 0; i < N; i++)
  {
    printf("The square of %f is %f\n", h_in[i], h_out[i]);
  }
  //Free up device memory
  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}

This main function follows a structure similar to that of the vector addition program. The one difference you will see here is that we are launching a single block with N threads in parallel. The output of the program is as follows:

Square of Number on GPU 
The square of 0.000000 is 0.000000
The square of 1.000000 is 1.000000
The square of 2.000000 is 4.000000
The square of 3.000000 is 9.000000
The square of 4.000000 is 16.000000
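Before choosing N for this style of launch, you can query the per-block thread limit of your device at runtime rather than guessing it. The following is a minimal standalone sketch using the CUDA runtime function cudaGetDeviceProperties; it assumes device 0 is the GPU you will launch kernels on:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    //Query the properties of device 0
    cudaGetDeviceProperties(&prop, 0);
    //maxThreadsPerBlock is typically 512 or 1,024
    printf("Maximum threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```

Any value of N larger than the printed limit must be split across multiple blocks, as discussed next.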

Whenever you launch N threads in parallel in this way, you should take care that the number of threads per block is limited, typically to 512 or 1,024 depending on your device. The value of N should therefore be less than this limit. If N is 2,000 and the maximum number of threads per block for your device is 512, then you can't write <<<1, 2000>>>. Instead, you should use something such as <<<4, 500>>>, which launches four blocks of 500 threads each. The number of blocks and the number of threads per block should be chosen judiciously.
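When the work is split across both blocks and threads like this, the kernel can no longer use threadIdx.x alone; it must combine the block index and the thread index into one global index. The following is a minimal sketch of how the squaring example could be adapted; the kernel name gpuSquareMulti and the bounds guard are our own illustration, not code from the chapter:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

//Number of elements, too many for a single block on many devices
#define N 2000

//Kernel computing a global index from block and thread indices
__global__ void gpuSquareMulti(float *d_in, float *d_out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    //Guard in case the grid contains more threads than elements
    if (tid < N)
    {
        float temp = d_in[tid];
        d_out[tid] = temp * temp;
    }
}

int main(void)
{
    float h_in[N], h_out[N];
    float *d_in, *d_out;
    cudaMalloc((void**)&d_in, N * sizeof(float));
    cudaMalloc((void**)&d_out, N * sizeof(float));
    for (int i = 0; i < N; i++)
    {
        h_in[i] = i;
    }
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);
    //4 blocks of 500 threads cover all 2,000 elements
    gpuSquareMulti<<<4, 500>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("The square of %f is %f\n", h_in[N - 1], h_out[N - 1]);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Here blockDim.x is 500, so thread 3 of block 2, for example, handles element 2 * 500 + 3 = 1003.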

To summarize, we have learned how to work with vectors in CUDA and how to launch multiple blocks and multiple threads in parallel. We have also seen that performing vector operations on the GPU improves throughput compared to performing the same operations on the CPU. In the last section of this chapter, we discuss the various parallel communication patterns that are followed by threads executing in parallel.
