
Using PyCUDA's ElementwiseKernel for performing pointwise computations

We will now see how to program our own point-wise (or equivalently, element-wise) operations directly onto our GPU with the help of PyCUDA's ElementwiseKernel function. This is where our prior knowledge of C/C++ programming becomes useful: we'll have to write a little bit of inline code in CUDA C, which is compiled externally by NVIDIA's nvcc compiler and then launched at runtime by our code via PyCUDA.

We use the term kernel quite a bit in this text; by kernel, we always mean a function that is launched directly onto the GPU by CUDA. We will use several functions from PyCUDA that generate templates and design patterns for different types of kernels, easing our transition into GPU programming.

Let's dive right in; we're going to start by explicitly rewriting the code to multiply each element of a gpuarray object by 2 in CUDA C; we will use the ElementwiseKernel function from PyCUDA to generate our code. You should try typing the following code directly into an IPython console. (The less adventurous can just download this from this text's Git repository, where it appears as simple_element_kernel_example0.py):

import numpy as np
import pycuda.autoinit
from pycuda import gpuarray
from pycuda.elementwise import ElementwiseKernel
from time import time

host_data = np.float32(np.random.random(50000000))

gpu_2x_ker = ElementwiseKernel(
    "float *in, float *out",
    "out[i] = 2*in[i];",
    "gpu_2x_ker")

Let's take a look at how this is set up; this is, of course, several lines of inline CUDA C. We first declare the input and output variables in the first argument ( "float *in, float *out" ), which will generally be C pointers to allocated memory on the GPU. In the second argument, we define our element-wise operation with "out[i] = 2*in[i];", which will multiply each element of in by two and place the result in the corresponding index of out.

Note that PyCUDA automatically sets up the integer index i for us. When we use i as our index, ElementwiseKernel will automatically parallelize our calculation over i among the many cores in our GPU. Finally, we give our piece of code its internal CUDA C kernel name ( "gpu_2x_ker" ). Since this refers to CUDA C's namespace and not Python's, it's fine (and also convenient) to give this the same name as in Python.
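To make the parallelization semantics concrete, the operation our kernel performs is equivalent to the following serial loop. (This is a CPU sketch in NumPy for illustration only, not the code PyCUDA generates; on the GPU, each iteration would be handled by a separate thread, with PyCUDA supplying the index i.)

```python
import numpy as np

def cpu_2x_ker(in_arr):
    """Serial CPU equivalent of the gpu_2x_ker element-wise kernel."""
    out = np.empty_like(in_arr)
    for i in range(len(in_arr)):
        out[i] = 2 * in_arr[i]   # corresponds to: out[i] = 2*in[i];
    return out

x = np.float32(np.random.random(10))
print(np.allclose(cpu_2x_ker(x), 2 * x))  # True
```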

Now, let's do a speed comparison:

def speedcomparison():
    t1 = time()
    host_data_2x = host_data * np.float32(2)
    t2 = time()
    print('total time to compute on CPU: %f' % (t2 - t1))
    device_data = gpuarray.to_gpu(host_data)
    # allocate memory for output
    device_data_2x = gpuarray.empty_like(device_data)
    t1 = time()
    gpu_2x_ker(device_data, device_data_2x)
    t2 = time()
    from_device = device_data_2x.get()
    print('total time to compute on GPU: %f' % (t2 - t1))
    print('Is the host computation the same as the GPU computation? : {}'.format(np.allclose(from_device, host_data_2x)))

if __name__ == '__main__':
    speedcomparison()

Now, let's run this program. On the first run, the reported GPU time will likely be far worse than the CPU time:

Whoa! That doesn't look good. Let's run the speedcomparison() function a few more times from IPython:

As we can see, the speed increases dramatically after the first time we use a given GPU function. Again, as with the prior example, this is because PyCUDA uses the nvcc compiler to compile our inline CUDA C code the first time a given GPU kernel function is called. After the code is compiled, it is cached and reused for the remainder of a given Python session.
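A common way to keep this one-time compilation cost out of benchmarks is to "warm up" the kernel with a throwaway call before timing it. Here is a minimal sketch of that pattern; the helper name timed is our own, and an ordinary Python function stands in for the GPU kernel so the sketch runs anywhere:

```python
from time import time

def timed(func, *args, warmup=1, reps=3):
    """Time func, discarding warm-up calls that may trigger
    one-time costs such as nvcc/JIT compilation."""
    for _ in range(warmup):
        func(*args)          # throwaway call; compilation happens here
    times = []
    for _ in range(reps):
        t1 = time()
        func(*args)
        times.append(time() - t1)
    return min(times)

# with a real kernel, this would be:
# best = timed(gpu_2x_ker, device_data, device_data_2x)
print(timed(sum, range(100000)) >= 0)  # True
```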

Before we move on, let's cover something else important, which is very subtle. The little kernel function we defined operates on C float pointers; this means that we will have to allocate some empty memory on the GPU that is pointed to by the out variable. Take a look at this portion of code again from the speedcomparison() function:

device_data = gpuarray.to_gpu(host_data)
# allocate memory for output
device_data_2x = gpuarray.empty_like(device_data)

As we did before, we send a NumPy array (host_data) over to the GPU via the gpuarray.to_gpu function, which automatically allocates memory on the GPU and copies the data over from CPU space. We will plug this into the in part of our kernel function. In the next line, we allocate empty memory on the GPU with the gpuarray.empty_like function. This acts like a plain malloc in C, allocating an array of the same size and data type as device_data, but without copying anything. We can now use this for the out part of our kernel function. We now look at the next line in speedcomparison() to see how to launch our kernel function onto the GPU (ignoring the lines we use for timing):

gpu_2x_ker(device_data, device_data_2x)

Again, the variables we pass in correspond directly to the parameter list we defined with ElementwiseKernel (here, "float *in, float *out").
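The allocation pattern above mirrors NumPy's own empty_like: memory of matching shape and data type is reserved, but nothing is initialized or copied. A quick CPU sketch of the analogous NumPy calls:

```python
import numpy as np

# analogous to device_data = gpuarray.to_gpu(host_data)
in_arr = np.float32(np.random.random(8))
# analogous to device_data_2x = gpuarray.empty_like(device_data):
# same shape and dtype, contents uninitialized, no copy performed
out_arr = np.empty_like(in_arr)

print(out_arr.shape == in_arr.shape and out_arr.dtype == in_arr.dtype)  # True
```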
