- Hands-On GPU-Accelerated Computer Vision with OpenCV and CUDA
- Bhaumik Vaidya
Element-wise squaring of vectors in CUDA
Now that we have launched N blocks in parallel with one thread in each block, a natural question is whether we can work the other way around. The answer is yes: we can launch a single block with N threads in parallel. To demonstrate this, and to make you more familiar with working with vectors in CUDA, we take a second example: the element-wise squaring of numbers in an array. We take an array of numbers and return an array that contains the squares of those numbers. The kernel function to find the element-wise square is shown here:
#include "stdio.h"
#include<iostream>
#include <cuda.h>
#include <cuda_runtime.h>
//Defining number of elements in Array
#define N 5
//Kernel function for squaring number
__global__ void gpuSquare(float *d_in, float *d_out)
{
//Getting thread index for current kernel
int tid = threadIdx.x; // handle the data at this index
float temp = d_in[tid];
d_out[tid] = temp*temp;
}
The gpuSquare kernel function takes pointers to two arrays as arguments. The first pointer, d_in, points to the memory location where the input array is stored, while the second pointer, d_out, points to the memory location where the output will be stored. In this program, instead of launching multiple blocks in parallel, we launch multiple threads in parallel within a single block, so tid is initialized with the thread ID using threadIdx.x. The main function for this program is as follows:
int main(void)
{
    //Defining arrays for the host
    float h_in[N], h_out[N];
    float *d_in, *d_out;
    //Allocate the memory on the GPU
    cudaMalloc((void**)&d_in, N * sizeof(float));
    cudaMalloc((void**)&d_out, N * sizeof(float));
    //Initializing the host array
    for (int i = 0; i < N; i++)
    {
        h_in[i] = i;
    }
    //Copy the array from host to device
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);
    //Calling the square kernel with one block and N threads per block
    gpuSquare<<<1, N>>>(d_in, d_out);
    //Copying the result back from device memory to the host
    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    //Printing the result on the console
    printf("Square of Number on GPU \n");
    for (int i = 0; i < N; i++)
    {
        printf("The square of %f is %f\n", h_in[i], h_out[i]);
    }
    //Free up device memory
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
This main function follows a structure similar to that of the vector addition program. The one difference you will see here is that we launch a single block with N threads in parallel. The output of the program is as follows:

Square of Number on GPU 
The square of 0.000000 is 0.000000
The square of 1.000000 is 1.000000
The square of 2.000000 is 4.000000
The square of 3.000000 is 9.000000
The square of 4.000000 is 16.000000
Whenever you launch N threads in parallel in this way, you should keep in mind that the maximum number of threads per block is limited, typically to 512 or 1,024 depending on the device, so the value of N must not exceed that limit. If N is 2,000 and the maximum number of threads per block for your device is 512, then you can't write <<<1, 2000>>>; instead, you should use something such as <<<4, 500>>>. The choice of the number of blocks and the number of threads per block should be made judiciously.
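As a concrete illustration, here is a minimal sketch of how the same element-wise square can be written to handle any array size by combining the block and thread indices. The kernel name gpuSquareLarge, the runtime size n, and the launch snippet are our own illustrative additions, not code from the book:

//Kernel that combines the block index and thread index so that the
//array size can exceed the maximum number of threads per block
__global__ void gpuSquareLarge(float *d_in, float *d_out, int n)
{
    //Global index = block index * threads per block + thread index
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    //Guard: extra threads in the last block must not touch the arrays
    if (tid < n)
    {
        float temp = d_in[tid];
        d_out[tid] = temp * temp;
    }
}

//Launch configuration: query the device limit instead of hard-coding it
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int threadsPerBlock = prop.maxThreadsPerBlock; //512 or 1,024 on most devices
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock; //round up
gpuSquareLarge<<<blocks, threadsPerBlock>>>(d_in, d_out, n);

With this indexing scheme, <<<4, 500>>> for N = 2,000 is just one valid decomposition; any combination whose product covers the array size works, because the guard prevents out-of-range threads from reading or writing.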
To summarize, we have learned how to work with vectors in CUDA and how to launch multiple blocks and multiple threads in parallel. We have also seen that performing vector operations on the GPU improves throughput compared to the same operations on the CPU. In the last section of this chapter, we discuss the various parallel communication patterns that are followed by threads executing in parallel.