Shared memory
Shared memory is located on-chip, and hence it is much faster than global memory; its latency is roughly 100 times lower than that of uncached global memory. All the threads of a block can access the same shared memory, which is very useful in the many applications where threads need to share their results with other threads. However, it can also produce incorrect results if access is not synchronized: if one thread reads a memory location before another thread has finished writing to it, the result is wrong. So, memory access has to be controlled or managed properly. This is done with the __syncthreads() directive, which ensures that all write operations to memory are completed before the program moves ahead. This is also called a barrier: every thread reaching this line waits until all the other threads arrive, and only after all threads have reached the barrier can they move further. To demonstrate the use of shared memory and thread synchronization, we take the example of a moving average. The kernel function for it is shown as follows:
#include <stdio.h>

__global__ void gpu_shared_memory(float *d_a)
{
    int i, index = threadIdx.x;
    float average, sum = 0.0f;
    // Defining shared memory
    __shared__ float sh_arr[10];
    sh_arr[index] = d_a[index];
    // This directive ensures all the writes to shared memory have completed
    __syncthreads();
    for (i = 0; i <= index; i++)
    {
        sum += sh_arr[i];
    }
    average = sum / (index + 1.0f);
    d_a[index] = average;
    // This statement is redundant and will have no effect on overall code execution
    sh_arr[index] = average;
}
The moving average operation computes, for each element of an array, the average of all elements up to and including that element. Many threads need the same array data for their calculation, which makes this an ideal case for using shared memory: it provides data faster than global memory and reduces the number of global memory accesses per thread, which in turn reduces the latency of the program. A shared memory location is defined using the __shared__ directive. In this example, shared memory for ten float elements is defined. Normally, the size of the shared array should be equal to the number of threads per block. Here, we are working on an array of 10 elements, and hence we have taken shared memory of this size.
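The shared array here has a size fixed at compile time. When the block size is only decided at run time, CUDA also supports dynamically sized shared memory, declared with the extern keyword and sized through the third parameter of the kernel launch configuration. The following is a minimal sketch of this pattern; the kernel name gpu_dynamic_shared is ours, for illustration only:

__global__ void gpu_dynamic_shared(float *d_a, int n)
{
    // The extern declaration carries no size; the size is supplied
    // as the third kernel-launch parameter instead.
    extern __shared__ float sh_arr[];
    int index = threadIdx.x;
    if (index < n)
    {
        sh_arr[index] = d_a[index];
    }
    __syncthreads();
    // ... sh_arr can now be used exactly as in the kernel above ...
}

// Launched with 10 threads and room for 10 floats of shared memory:
// gpu_dynamic_shared<<<1, 10, 10 * sizeof(float)>>>(d_a, 10);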
The next step is to copy data from global memory to this shared memory. Each thread copies the element indexed by its thread ID into the shared array. This is a shared memory write operation, and in the lines that follow we will read from the same shared array; so, before proceeding, we must ensure that all shared memory write operations have completed. Therefore, we introduce the __syncthreads() barrier.
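To see what can go wrong without the barrier, consider a variant in which each thread reads the value written by its neighbour. The sketch below is ours (gpu_race_demo is an illustrative name, not from the book's code):

__global__ void gpu_race_demo(float *d_a)
{
    __shared__ float sh_arr[10];
    int index = threadIdx.x;
    sh_arr[index] = d_a[index];
    // Without this barrier, the read below could execute before the
    // neighbouring thread's write, yielding an undefined value.
    __syncthreads();
    int next = (index + 1) % 10;
    d_a[index] = sh_arr[next];
}

On current hardware, ten threads in a single warp may happen to execute in lockstep and hide this race, but the CUDA programming model gives no such guarantee, so the barrier is required for correct code.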
Next, the for loop calculates the average of all elements up to the current element using the values in shared memory, and stores the answer in global memory at the location indexed by the current thread ID. The last line also copies the calculated value back into shared memory. This line has no effect on the overall execution of the code: shared memory lives only until the end of the current block's execution, and this is the last line of the kernel, after which the block's execution is complete. It is included just to demonstrate this property of shared memory. Now, we will write the main function for this code as follows:
int main(int argc, char **argv)
{
    float h_a[10];
    float *d_a;
    // Initialize host array
    for (int i = 0; i < 10; i++)
    {
        h_a[i] = i;
    }
    // Allocate global memory on the device
    cudaMalloc((void **)&d_a, sizeof(float) * 10);
    // Copy data from host memory to device memory
    cudaMemcpy((void *)d_a, (void *)h_a, sizeof(float) * 10, cudaMemcpyHostToDevice);
    gpu_shared_memory<<<1, 10>>>(d_a);
    // Copy the modified array back to the host
    cudaMemcpy((void *)h_a, (void *)d_a, sizeof(float) * 10, cudaMemcpyDeviceToHost);
    printf("Use of Shared Memory on GPU: \n");
    for (int i = 0; i < 10; i++)
    {
        printf("The running average after %d element is %f \n", i, h_a[i]);
    }
    // Free device memory
    cudaFree(d_a);
    return 0;
}
In the main function, after memory is allocated for the host and device arrays, the host array is populated with the values zero to nine. It is copied to device memory, where the moving average is calculated and stored back, and the result is then copied from device memory to host memory and printed on the console. The output on the console is shown as follows:
Use of Shared Memory on GPU:
The running average after 0 element is 0.000000
The running average after 1 element is 0.500000
The running average after 2 element is 1.000000
The running average after 3 element is 1.500000
The running average after 4 element is 2.000000
The running average after 5 element is 2.500000
The running average after 6 element is 3.000000
The running average after 7 element is 3.500000
The running average after 8 element is 4.000000
The running average after 9 element is 4.500000
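Note that the cudaMalloc and cudaMemcpy calls above ignore the error codes they return. In real code it is good practice to check every CUDA API call; one common pattern is a checking macro such as the following sketch (CUDA_CHECK is our name for it, not part of the CUDA API):

// Requires <stdlib.h> for exit().
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess)                                   \
        {                                                         \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",          \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMalloc((void **)&d_a, sizeof(float) * 10));
// CUDA_CHECK(cudaGetLastError()); // after a kernel launch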
This section demonstrated the use of shared memory when multiple threads use data from the same memory locations. The next section demonstrates the use of atomic operations, which are very important in read-modify-write operations.