官术网_书友最值得收藏!

Parallelization and Amdahl's Law

Before we can dive in and unlock the potential of GPUs, we first have to realize where their computational power lies in comparison to a modern Intel/AMD central processing unit (CPU)—the power does not lie in the fact that it has a higher clock speed than a CPU, nor in the complexity or particular design of the individual cores. An individual GPU core is actually quite simplistic, and at a disadvantage when compared to a modern individual CPU core, which use many fancy engineering tricks, such as branch prediction to reduce the latency of computations. Latency refers to the beginning-to-end duration of performing a single computation.

The power of the GPU derives from the fact that there are many, many more cores than in a CPU, which means a huge step forward in throughputThroughput here refers to the number of computations that can be performed simultaneously. Let's use an analogy to get a better understanding of what this means. A GPU is like a very wide city road that is designed to handle many slower-moving cars at once (high throughput, high latency), whereas a CPU is like a narrow highway that can only admit a few cars at once, but can get each individual car to its destination much quicker (low throughput, low latency).

We can get an idea of the increase in throughput by seeing how many cores these new GPUs have. To give you an idea, the average Intel or AMD CPU has only two to eight cores—while an entry-level, consumer-grade NVIDIA GTX 1050 GPU has 640 cores, and a new top-of-the-line NVIDIA RTX 2080 Ti has 4,352 cores! We can exploit this massive throughput, provided we know how properly to parallelize any program or algorithm we wish to speed up. By parallelize, we mean to rewrite a program or algorithm so that we can split up our workload to run in parallel on multiple processors simultaneously. Let's think about an analogy from real-life.

Suppose that you are building a house and that you already have all of the designs and materials in place. You hire a single laborer, and you estimate it will take 100 hours to construct the house. Let's suppose that this particular house can be built in such a way that the work can be perfectly divided between every additional laborer you hire—that is to say, it will take 50 hours for two laborers, 25 hours for four laborers, and 10 hours for ten laborers to construct the house—the number of hours to construct your house will be 100 divided by the number of laborers you hire. This is an example of a parallelizable task.

We notice that this task is twice as fast to complete for two laborers, and ten times as fast for ten laborers to complete together (that is, in parallel) as opposed to one laborer building the house alone (that is, in serial)—that is, if N is the number of laborers, then it will be N times as fast. In this case, N is known as the speedup of parallelizing our task over the serial version of our task.

Before we begin to program a parallel version of a given algorithm, we often start by coming up with an estimate of the potential speedup that parallelization would bring to our task. This can help us determine whether it is worth expending resources and time writing a parallelization of our program or not. Because real life is more complicated than the example we gave here, it's pretty obvious that we won't be able to parallelize every program perfectly, all of the time—most of the time, only a part of our program will be nicely parallelizable, while the rest will have to run in serial.

主站蜘蛛池模板: 五原县| 武平县| 平远县| 连云港市| 乌拉特中旗| 冀州市| 丹江口市| 泗阳县| 康马县| 龙南县| 台东市| 浦县| 苏州市| 德钦县| 沂源县| 德令哈市| 万载县| 论坛| 宜春市| 思茅市| 山阴县| 巢湖市| 鱼台县| 定西市| 灵丘县| 东明县| 仙桃市| 冕宁县| 临高县| 奇台县| 铁岭市| 德惠市| 枣强县| 永登县| 桑日县| 河北区| 章丘市| 东光县| 宜宾县| 锦屏县| 沅江市|