Siu Kwan Lam

CUDA Performance: Maximizing Instruction-Level Parallelism

CUDA programming can be difficult because the hardware architecture is unfamiliar, and the technology is young enough that there is little consensus on the best way to write performant code. In this post, we revisit Vasily Volkov’s talk, Better Performance at Lower Occupancy, to show the importance of instruction-level parallelism (ILP).

Warp Occupancy

CUDA uses the single-instruction multiple-thread (SIMT) execution model. Threads are grouped into “warps”, and at each issuing cycle the scheduler issues an instruction for one warp. All threads in a warp execute the same instruction unless they are disabled due to a divergent code path.

Warp occupancy is the ratio of active warps to the warp capacity of a streaming multiprocessor (SM). High warp occupancy allows the SM to hide latency through context switching, which is fast on the GPU because it is done by the hardware scheduler. An SM may stall when data is not yet available, when the required functional units are busy, or when there are no idle warps to switch to. To ensure high performance, we want to minimize processor idle time. NVIDIA has suggested using high warp occupancy to keep the multiprocessors busy. However, Volkov demonstrated that one can achieve the same performance at low occupancy by using ILP.
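As a concrete example with hypothetical numbers: suppose a kernel’s register usage limits an SM to 32 resident warps out of a 48-warp capacity. A quick back-of-the-envelope calculation:

warp_capacity = 48      # resident-warp capacity per SM (48 on CC 2.x, 64 on CC 3.x)
active_warps = 32       # hypothetical: limited by the kernel's register/shared-memory usage

occupancy = active_warps / float(warp_capacity)
print(occupancy)        # ~0.67, i.e. about 67% occupancy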

Instruction-Level Parallelism

To illustrate ILP, we will use a simple example:

C = A + B
E = C + D
F = A + D

In the pseudo machine code above, instructions issue in order, one per cycle, and each takes 4 clock cycles to complete. Since C + D must wait until C has been computed from A + B, it cannot issue until the 5th clock cycle; A + D then issues on the 6th cycle and completes on the 9th. This sequence therefore takes 9 clock cycles.

Here’s an illustration of the schedule, assuming in-order issue, one instruction per cycle, and a 4-cycle latency:
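instruction   issued on   completed on
C = A + B     cycle 1     cycle 4
E = C + D     cycle 5     cycle 8    (waits for C)
F = A + D     cycle 6     cycle 9    (waits for its turn to issue)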

However, if we swap the last two instructions:

C = A + B
F = A + D
E = C + D

The new code takes only 8 cycles to complete: F = A + D is independent of the first instruction, so it issues on the 2nd cycle while A + B is still in flight, and E = C + D issues on the 5th cycle, as soon as C is ready. The corresponding schedule is:
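instruction   issued on   completed on
C = A + B     cycle 1     cycle 4
F = A + D     cycle 2     cycle 5
E = C + D     cycle 5     cycle 8    (waits for C)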

The effect of ILP is amplified when we have memory load/store instructions, which are on the order of 100x slower than arithmetic instructions. For a CUDA GPU with compute capability 3.0, each SM has 192 arithmetic cores that can be kept busy while memory instructions are outstanding. If we order the code just right, the arithmetic instructions can hide some of the memory latency without any warp switching.

The following sketch illustrates the idea: without ILP, dependent operations execute sequentially; with ILP, independent operations are issued back to back, so their latencies overlap:
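Without ILP, each operation must wait for the previous result:

    op 1 -----(latency)-----> op 2 -----(latency)-----> op 3

With ILP, independent operations are issued back to back and their latencies overlap:

    op 1 -----(latency)----->
    op 2 -----(latency)----->
    op 3 -----(latency)-----> all results ready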

Experiment

We have prepared a script to reproduce the low-occupancy performance experiment. It contains four different implementations of vector addition.

The device function for the work:

from numba import cuda

@cuda.jit('float32(float32, float32)', device=True)
def core(a, b):
    return a + b


The baseline kernel:

@cuda.jit('void(float32[:], float32[:], float32[:])')
def vec_add(a, b, c):
    i = cuda.grid(1)
    c[i] = core(a[i], b[i])


The ILP x2 optimized kernel:

@cuda.jit('void(float32[:], float32[:], float32[:])')
def vec_add_ilp_x2(a, b, c):
    # read
    i = cuda.grid(1)
    ai = a[i]
    bi = b[i]

    bw = cuda.blockDim.x
    gw = cuda.gridDim.x
    stride = gw * bw

    j = i + stride
    aj = a[j]
    bj = b[j]

    # compute
    ci = core(ai, bi)
    cj = core(aj, bj)

    # write
    c[i] = ci
    c[j] = cj


Please refer to the full source code to see the ILP x4 and x8 versions.
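As a rough usage sketch (the array size, block size, and timing harness here are illustrative rather than the exact ones used by the script), the ILP x2 kernel processes two elements per thread, so it is launched with half as many blocks as the baseline:

import numpy as np
from numba import cuda

n = 2 ** 22                                        # illustrative problem size
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
c = np.zeros_like(a)

blocksize = 128                                    # one of the block sizes being swept
griddim = n // blocksize

vec_add[griddim, blocksize](a, b, c)               # baseline: one element per thread
vec_add_ilp_x2[griddim // 2, blocksize](a, b, c)   # ILP x2: two elements per thread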

Vector addition has a very low compute intensity: three memory accesses per floating-point operation (two loads and one store per addition). This is exactly the kind of situation where warp switching can hide some of the memory latency. The script runs all four kernel versions with varying block sizes and measures the number of floats output per second. It is executed on:

  • a Tesla C2075 CC 2.0 GPU
  • a GTX 560 CC 2.1 GPU
  • a GT 650M CC 3.0 GPU

The results are shown in the figures below:

For the Tesla C2075 (CC 2.0) and GTX 560 (CC 2.1) GPUs, the ILP versions reach peak performance at low occupancy (small block sizes). For the GT 650M (CC 3.0) GPU, it is interesting to see that the ILP versions are faster in almost all cases, while block size has little effect on the baseline kernel once it is above 128. The Kepler Tuning Guide explains why:

Furthermore, some degree of ILP in conjunction with TLP is required by Kepler GPUs in order to approach peak performance, since SMX’s warp scheduler issues one or two independent instructions from each of four warps per clock.

Conclusion

Some degree of ILP is beneficial for older GPUs, but it is essential for reaching peak performance on the newer Kepler (CC 3.x) GPUs. It is especially important for low compute intensity kernels, which spend more time in memory operations than in compute operations. However, it seems that writing high-performance code means sacrificing readability.

We, the NumbaPro team, do not think this tradeoff is necessary. Through an optimizing JIT compiler and an intelligent runtime, we believe NumbaPro can address it. Programmers should be able to focus on the correctness of their programs without the distraction of manually unrolling code for better ILP. Our goal is to enable programmers to write high-performance kernels with less code through high-level APIs, such as vectorize.
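For instance, the vector addition above can be expressed as a single scalar function and turned into a GPU ufunc. The sketch below uses the open-source numba spelling of the vectorize decorator (target='cuda'); the NumbaPro spelling of the target may differ slightly, and the function name vadd is just an illustration:

import numpy as np
from numba import vectorize

@vectorize(['float32(float32, float32)'], target='cuda')
def vadd(a, b):
    return a + b

a = np.random.rand(2 ** 22).astype(np.float32)
b = np.random.rand(2 ** 22).astype(np.float32)
c = vadd(a, b)    # the compiler and runtime pick the launch configuration and handle transfers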

NumbaPro is packaged within the Accelerate product, designed specifically to increase performance. Try NumbaPro today and see if you experience similar results.

Tags: NumbaPro CUDA