Shared memory
Shared memory is located on-chip, and hence it is much faster than global memory; its latency is roughly 100 times lower than that of uncached global memory. All the threads of a block can access the same shared memory, which is very useful in the many applications where threads need to share their results with other threads. However, it can also produce incorrect results if access is not synchronized: if one thread reads a memory location before another thread has finished writing to it, the result is wrong. So, memory access has to be controlled or managed properly. This is done with the __syncthreads() directive, which ensures that all write operations to memory are completed before the program moves ahead. This is also called a barrier: every thread reaching this line waits until all the other threads arrive, and only after all threads have reached the barrier can they move further. To demonstrate the use of shared memory and thread synchronization, we take the example of a moving average. The kernel function for it is shown as follows:
#include <stdio.h>

__global__ void gpu_shared_memory(float *d_a)
{
    int i, index = threadIdx.x;
    float average, sum = 0.0f;
    // Defining shared memory
    __shared__ float sh_arr[10];
    sh_arr[index] = d_a[index];
    // This directive ensures all the writes to shared memory have completed
    __syncthreads();
    for (i = 0; i <= index; i++)
    {
        sum += sh_arr[i];
    }
    average = sum / (index + 1.0f);
    d_a[index] = average;
    // This statement is redundant and will have no effect on overall code execution
    sh_arr[index] = average;
}
The moving average operation computes, for each element of an array, the average of all elements up to and including that element. Many threads need the same array data for their calculation, which makes this an ideal case for using shared memory: it provides data faster than global memory and reduces the number of global memory accesses per thread, which in turn reduces the latency of the program. A shared memory location is defined using the __shared__ directive. In this example, shared memory for ten float elements is defined. Normally, the size of the shared array should be equal to the number of threads per block. Here, we are working on an array of 10 elements, and hence we have taken shared memory of this size.
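The shared array here has a size fixed at compile time. When the block size is only decided at run time, CUDA also supports dynamically sized shared memory, declared with the extern keyword and sized through the third parameter of the kernel launch configuration. The following is a minimal sketch of this pattern; the kernel name gpu_dynamic_shared is ours, for illustration only:

__global__ void gpu_dynamic_shared(float *d_a, int n)
{
    // The extern declaration carries no size; the size is supplied
    // as the third kernel-launch parameter instead.
    extern __shared__ float sh_arr[];
    int index = threadIdx.x;
    if (index < n)
    {
        sh_arr[index] = d_a[index];
    }
    __syncthreads();
    // ... sh_arr can now be used exactly as in the kernel above ...
}

// Launched with 10 threads and room for 10 floats of shared memory:
// gpu_dynamic_shared<<<1, 10, 10 * sizeof(float)>>>(d_a, 10);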
The next step is to copy data from global memory to this shared memory. Each thread copies the element indexed by its thread ID into the shared array. This is a shared memory write operation, and in the lines that follow we will read from the same shared array; so, before proceeding, we must ensure that all shared memory write operations have completed. Therefore, we introduce the __syncthreads() barrier.
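To see what can go wrong without the barrier, consider a variant in which each thread reads the value written by its neighbour. The sketch below is ours (gpu_race_demo is an illustrative name, not from the book's code):

__global__ void gpu_race_demo(float *d_a)
{
    __shared__ float sh_arr[10];
    int index = threadIdx.x;
    sh_arr[index] = d_a[index];
    // Without this barrier, the read below could execute before the
    // neighbouring thread's write, yielding an undefined value.
    __syncthreads();
    int next = (index + 1) % 10;
    d_a[index] = sh_arr[next];
}

On current hardware, ten threads in a single warp may happen to execute in lockstep and hide this race, but the CUDA programming model gives no such guarantee, so the barrier is required for correct code.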
Next, the for loop calculates the average of all elements up to the current element using the values in shared memory, and stores the answer in global memory at the location indexed by the current thread ID. The last line also copies the calculated value back into shared memory. This line has no effect on the overall execution of the code: shared memory lives only until the end of the current block's execution, and this is the last line of the kernel, after which the block's execution is complete. It is included just to demonstrate this property of shared memory. Now, we will write the main function for this code as follows:
int main(int argc, char **argv)
{
    float h_a[10];
    float *d_a;
    // Initialize host array
    for (int i = 0; i < 10; i++)
    {
        h_a[i] = i;
    }
    // Allocate global memory on the device
    cudaMalloc((void **)&d_a, sizeof(float) * 10);
    // Copy data from host memory to device memory
    cudaMemcpy((void *)d_a, (void *)h_a, sizeof(float) * 10, cudaMemcpyHostToDevice);
    gpu_shared_memory<<<1, 10>>>(d_a);
    // Copy the modified array back to the host
    cudaMemcpy((void *)h_a, (void *)d_a, sizeof(float) * 10, cudaMemcpyDeviceToHost);
    printf("Use of Shared Memory on GPU: \n");
    for (int i = 0; i < 10; i++)
    {
        printf("The running average after %d element is %f \n", i, h_a[i]);
    }
    // Free device memory
    cudaFree(d_a);
    return 0;
}
In the main function, after memory is allocated for the host and device arrays, the host array is populated with the values zero to nine. It is copied to device memory, where the moving average is calculated and stored back, and the result is then copied from device memory to host memory and printed on the console. The output on the console is shown as follows:
Use of Shared Memory on GPU:
The running average after 0 element is 0.000000
The running average after 1 element is 0.500000
The running average after 2 element is 1.000000
The running average after 3 element is 1.500000
The running average after 4 element is 2.000000
The running average after 5 element is 2.500000
The running average after 6 element is 3.000000
The running average after 7 element is 3.500000
The running average after 8 element is 4.000000
The running average after 9 element is 4.500000
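Note that the cudaMalloc and cudaMemcpy calls above ignore the error codes they return. In real code it is good practice to check every CUDA API call; one common pattern is a checking macro such as the following sketch (CUDA_CHECK is our name for it, not part of the CUDA API):

// Requires <stdlib.h> for exit().
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess)                                   \
        {                                                         \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",          \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMalloc((void **)&d_a, sizeof(float) * 10));
// CUDA_CHECK(cudaGetLastError()); // after a kernel launch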
This section demonstrated the use of shared memory when multiple threads use data from the same memory locations. The next section demonstrates the use of atomic operations, which are very important in read-modify-write operations.