- Hands-On GPU Programming with Python and CUDA
- Dr. Brian Tuomanen
Using PyCUDA's ElementwiseKernel for performing pointwise computations
We will now see how to program our own point-wise (or, equivalently, element-wise) operations directly on our GPU with the help of PyCUDA's ElementwiseKernel function. This is where our prior knowledge of C/C++ programming becomes useful: we'll have to write a little inline CUDA C, which is compiled externally by NVIDIA's nvcc compiler and then launched at runtime by our code via PyCUDA.
We use the term kernel quite a bit in this text; by kernel, we always mean a function that is launched directly onto the GPU by CUDA. We will use several functions from PyCUDA that generate templates and design patterns for different types of kernels, easing our transition into GPU programming.
Let's dive right in: we'll start by explicitly rewriting, in CUDA C, the code that multiplies each element of a gpuarray object by 2, using the ElementwiseKernel function from PyCUDA to generate our code. You should try typing the following code directly into an IPython console. (The less adventurous can just download it from this text's Git repository, where it has the filename simple_element_kernel_example0.py.)
import numpy as np
import pycuda.autoinit
from pycuda import gpuarray
from time import time
from pycuda.elementwise import ElementwiseKernel
host_data = np.float32(np.random.random(50000000))
gpu_2x_ker = ElementwiseKernel(
    "float *in, float *out",
    "out[i] = 2*in[i];",
    "gpu_2x_ker")
Let's take a look at how this is set up; this is, of course, several lines of inline C. We first set the input and output variables in the first line ( "float *in, float *out" ), which will generally be in the form of C pointers to allocated memory on the GPU. In the second line, we define our element-wise operation with "out[i] = 2*in[i];", which will multiply each point in in by two and place this in the corresponding index of out.
Note that PyCUDA automatically sets up the integer index i for us. When we use i as our index, ElementwiseKernel will automatically parallelize our calculation over i among the many cores in our GPU. Finally, we give our piece of code its internal CUDA C kernel name ( "gpu_2x_ker" ). Since this refers to CUDA C's namespace and not Python's, it's fine (and also convenient) to give this the same name as in Python.
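To make the role of i concrete, here is a serial CPU sketch (pure NumPy, my own illustration rather than anything PyCUDA generates) of what the kernel body computes; on the GPU, the iterations of this loop are what get distributed across cores:

```python
import numpy as np

def cpu_2x_ker(in_arr):
    # Serial analogue of "out[i] = 2*in[i];" -- each loop iteration
    # corresponds to one value of the index i that PyCUDA parallelizes.
    out = np.empty_like(in_arr)
    for i in range(len(in_arr)):
        out[i] = 2 * in_arr[i]
    return out

x = np.float32(np.random.random(8))
print(np.allclose(cpu_2x_ker(x), 2 * x))  # → True
```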
Now, let's do a speed comparison:
def speedcomparison():
    t1 = time()
    host_data_2x = host_data * np.float32(2)
    t2 = time()
    print('total time to compute on CPU: %f' % (t2 - t1))
    device_data = gpuarray.to_gpu(host_data)
    # allocate memory for output
    device_data_2x = gpuarray.empty_like(device_data)
    t1 = time()
    gpu_2x_ker(device_data, device_data_2x)
    t2 = time()
    from_device = device_data_2x.get()
    print('total time to compute on GPU: %f' % (t2 - t1))
    print('Is the host computation the same as the GPU computation? : {}'.format(
        np.allclose(from_device, host_data_2x)))

if __name__ == '__main__':
    speedcomparison()
Now, let's run this program:

Whoa! That doesn't look good. Let's run the speedcomparison() function a few times from IPython:
As we can see, the speed increases dramatically after the first time we use a given GPU function. As with the prior example, this is because PyCUDA compiles our inline CUDA C code with the nvcc compiler the first time a given kernel function is called; once compiled, the code is cached and reused for the remainder of the Python session.
Before we move on, let's cover a subtle but important point. The little kernel function we defined operates on C float pointers; this means that we will have to allocate some empty memory on the GPU that is pointed to by the out variable. Take a look at this portion of code again from the speedcomparison() function:
device_data = gpuarray.to_gpu(host_data)
# allocate memory for output
device_data_2x = gpuarray.empty_like(device_data)
As we did before, we send a NumPy array (host_data) over to the GPU via the gpuarray.to_gpu function, which automatically allocates memory on the GPU and copies the array over from CPU space. We will plug this into the in part of our kernel function. In the next line, we allocate empty memory on the GPU with the gpuarray.empty_like function. This acts like a plain malloc in C, allocating an array of the same size and data type as device_data but without copying anything; we can now use it for the out part of our kernel function. We now look at the next line in speedcomparison() to see how to launch our kernel function onto the GPU (ignoring the lines we use for timing):
gpu_2x_ker(device_data, device_data_2x)
Again, the variables we set correspond directly to the first line we defined with ElementwiseKernel (here being, "float *in, float *out").
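This correspondence extends to kernels with more arguments: each comma-separated entry in the argument string matches one positional argument at call time, and ElementwiseKernel also accepts scalar parameters. Here is a hedged sketch (the names gpu_axpy_ker, x_gpu, and so on are mine, not from the text), with the operation itself verified serially in NumPy:

```python
import numpy as np

# Hypothetical two-input kernel (my own names, not from the text); it would
# need a GPU to run, so it is shown only for the argument correspondence:
#
#   gpu_axpy_ker = ElementwiseKernel(
#       "float a, float *x, float *y, float *out",      # argument string
#       "out[i] = a*x[i] + y[i];",                      # kernel body
#       "gpu_axpy_ker")
#   gpu_axpy_ker(np.float32(2), x_gpu, y_gpu, out_gpu)  # one call argument per entry
#
# The operation itself, computed serially on the CPU with NumPy:
a = np.float32(2)
x = np.float32(np.random.random(8))
y = np.float32(np.random.random(8))
out = a * x + y
print(np.allclose(out, 2 * x + y))  # → True
```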