- Hands-On GPU-Accelerated Computer Vision with OpenCV and CUDA
- Bhaumik Vaidya
- 2021-08-13 15:48:25
Memory architecture
Code execution on a GPU is divided among streaming multiprocessors, blocks, and threads. The GPU has several memory spaces, each with its own features, uses, speed, and scope. The memory is organized hierarchically into global memory, shared memory, local memory, constant memory, and texture memory, and each of them can be accessed from different points in the program. This memory architecture is shown in the following diagram:

As shown in the diagram, each thread has its own local memory and a register file. Unlike CPUs, GPU cores have a large number of registers for storing local data. When a thread's data does not fit in the register file, local memory is used. Both are private to each thread, and the register file is the fastest memory. Threads in the same block have shared memory that can be accessed by all threads in that block; it is used for communication between threads. Global memory can be accessed by all blocks and all threads, but it has a large access latency. Caching is used to speed up these accesses: L1 and L2 caches are available, as shown in the following table. There is also a read-only constant memory that is used to store constants and kernel parameters. Finally, there is a texture memory that can take advantage of two-dimensional or three-dimensional access patterns.
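The memory spaces described above map onto CUDA qualifiers. The following is a minimal sketch, not code from the book: the kernel `scaleAndReverse`, the helper `runScaleAndReverse`, the symbol `d_scale`, and `BLOCK_SIZE` are hypothetical names chosen for illustration, and the helper assumes the input length is a multiple of `BLOCK_SIZE`. Each block copies a tile from global memory into shared memory, then every thread reads an element loaded by another thread in the same block and scales it by a value kept in constant memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // hypothetical block size chosen for this sketch

// Constant memory: read-only inside kernels, written from the host.
__constant__ float d_scale;

// Each block reverses its own tile of the input and scales every element.
__global__ void scaleAndReverse(const float *in, float *out)
{
    // Shared memory: one tile per block, visible to all threads of that block.
    __shared__ float tile[BLOCK_SIZE];

    // Scalars like these are normally held in registers, private to the thread.
    int local = threadIdx.x;
    int global = blockIdx.x * blockDim.x + local;

    tile[local] = in[global];   // read from global memory
    __syncthreads();            // wait until the whole tile is loaded

    // Read a value that a different thread of the same block loaded.
    out[global] = d_scale * tile[blockDim.x - 1 - local];
}

// Hypothetical host helper; n must be a positive multiple of BLOCK_SIZE.
bool runScaleAndReverse(const float *h_in, float *h_out, int n, float scale)
{
    if (n <= 0 || n % BLOCK_SIZE != 0) return false;

    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }

    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(d_scale, &scale, sizeof(float));

    scaleAndReverse<<<n / BLOCK_SIZE, BLOCK_SIZE>>>(d_in, d_out);
    cudaError_t launchErr = cudaGetLastError();
    cudaError_t copyErr = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return launchErr == cudaSuccess && copyErr == cudaSuccess;
}

int main()
{
    const int n = 2 * BLOCK_SIZE;
    float in[n], out[n];
    for (int i = 0; i < n; ++i) in[i] = static_cast<float>(i);

    if (!runScaleAndReverse(in, out, n, 2.0f)) {
        printf("CUDA call failed\n");
        return 1;
    }
    printf("out[0] = %f, out[%d] = %f\n", out[0], BLOCK_SIZE, out[BLOCK_SIZE]);
    return 0;
}
```

The `__syncthreads()` call matters here: without it, a thread could read a shared-memory slot before the thread responsible for that slot has written it.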
The features of all memories are summarized in the following table:

The preceding table describes the important features of each memory. The scope defines the part of the program that can use the memory, and the lifetime defines how long data in that memory remains visible to the program. In addition, L1 and L2 caches are available to GPU programs for faster memory access.
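The capacity of several of these memories on a particular GPU can be read at runtime. The sketch below uses `cudaGetDeviceProperties` from the CUDA runtime API; the wrapper `queryMemorySpaces` is a hypothetical helper name, and the values printed depend entirely on the installed device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: fills *prop for the given device, returns false on error.
bool queryMemorySpaces(int device, cudaDeviceProp *prop)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || device >= count) return false;
    return cudaGetDeviceProperties(prop, device) == cudaSuccess;
}

int main()
{
    cudaDeviceProp prop;
    if (!queryMemorySpaces(0, &prop)) {
        printf("No usable CUDA device found\n");
        return 1;
    }
    printf("Device:                  %s\n", prop.name);
    printf("Global memory:           %zu MB\n", prop.totalGlobalMem >> 20);
    printf("Shared memory per block: %zu KB\n", prop.sharedMemPerBlock >> 10);
    printf("Registers per block:     %d\n", prop.regsPerBlock);
    printf("Constant memory:         %zu KB\n", prop.totalConstMem >> 10);
    printf("L2 cache:                %d KB\n", prop.l2CacheSize >> 10);
    return 0;
}
```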
To summarize, every thread has its own register file, which is the fastest memory. Threads in the same block have shared memory, which is faster than global memory. All blocks can access global memory, which is the slowest. Constant and texture memory serve special purposes, which will be discussed in the next section. Memory access is often the biggest bottleneck to fast program execution.
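One common way to act on this summary is to stage data in shared memory so that each value is read from global memory only once. The following block-level sum is a sketch under assumptions rather than the book's code: `blockSum`, `sumOnGpu`, and `THREADS` are hypothetical names, and `THREADS` is assumed to be a power of two so that the halving loop covers every element.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define THREADS 256  // hypothetical block size; must be a power of two

// Each block adds up its slice of the input inside shared memory.
__global__ void blockSum(const float *in, float *partial, int n)
{
    __shared__ float cache[THREADS];

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    cache[tid] = (i < n) ? in[i] : 0.0f;   // one global read per thread
    __syncthreads();

    // Tree reduction: all intermediate traffic stays in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) cache[tid] += cache[tid + stride];
        __syncthreads();
    }

    if (tid == 0) partial[blockIdx.x] = cache[0];  // one global write per block
}

// Hypothetical host helper: sums n floats, finishing the last step on the CPU.
bool sumOnGpu(const float *h_in, int n, float *result)
{
    if (n <= 0) return false;
    int blocks = (n + THREADS - 1) / THREADS;

    float *d_in = nullptr, *d_partial = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_partial, blocks * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }

    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    blockSum<<<blocks, THREADS>>>(d_in, d_partial, n);
    cudaError_t launchErr = cudaGetLastError();

    std::vector<float> h_partial(blocks);
    cudaError_t copyErr = cudaMemcpy(h_partial.data(), d_partial,
                                     blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    if (launchErr != cudaSuccess || copyErr != cudaSuccess) return false;

    float total = 0.0f;
    for (float p : h_partial) total += p;  // combine the per-block results
    *result = total;
    return true;
}

int main()
{
    std::vector<float> data(1000, 1.0f);
    float total = 0.0f;
    if (!sumOnGpu(data.data(), static_cast<int>(data.size()), &total)) {
        printf("CUDA call failed\n");
        return 1;
    }
    printf("sum = %f\n", total);
    return 0;
}
```

In this sketch each block performs `THREADS` global reads and a single global write; the repeated additions in between touch only shared memory and registers.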