Performance

In this notebook, we delve into the key performance optimizations that DevitoPRO has to offer. On either end of the optimization spectrum, there are two types:

Implicit: Those automatically enabled by the compiler/run-time – no user intervention required.
Explicit: Those enabled by the user with an API.

In the explicit case, the API is often as straightforward as setting an environment variable!

What we’ve seen in other notebooks

Adapting box: explicit, requires the use of the ABox API.
Data streaming: implicit, with a few APIs for performance tuning.

GPU backends [explicit]

DevitoPRO provides two additional backends – CUDA and HIP. A third one is currently in the pipeline, SYCL.

To enable them, set the following environment variables:

CUDA: DEVITO_PLATFORM=nvidiaX, DEVITO_LANGUAGE=cuda, DEVITO_ARCH=cuda
HIP: DEVITO_PLATFORM=amdgpuX, DEVITO_LANGUAGE=hip, DEVITO_ARCH=hip

By default, for a given 3D problem, the CUDA and HIP backends will produce code that is conceptually similar to that generated by what might be termed their OSS-Devito counterparts, namely OpenACC (for Nvidia GPUs) and OpenMP-offloading (for AMD GPUs):

Two-level parallelism – blocks and threads.
3D blocks.
Each thread computes exactly one point of the iteration space.
The block shape can (and should) be tuned via the par-tile option.

Despite employing a similar parallelization strategy, the CUDA and HIP backends always outperform OpenACC and OpenMP-offloading, respectively. This is due to a mix of more mature compiler technology and inherent limitations of the pragma-based languages.

2.5-blocking

Further, the CUDA and HIP backends can generate, for 3D problems, what is often regarded to as 2.5-blocking, that is a parallelization strategy completely different than the classic 3D blocking outlined above. The key features of 2.5-blocking:

Two-level parallelism – blocks and threads.
2D blocks.
Each thread processes all points along the third sequential dimension.
Shared-memory is explicitly used to maximize data reuse within the 2D blocks.
Registers are “rotated” along the sequential dimension to maximize data reuse.
The block shape can (and should) be tuned via the par-tile option.

2.5-blocking is extremely effective and, in our experience, will outperform 3D blocking in many cases. Currently, this feature is in beta phase. It can be activated in three ways:

Setting the environment variable DEVITO_BETA=1, or
Setting configuration['beta'] = True, or
Supplying a suitable option, Operator(..., opt=('advanced', {'gpu-opt': True}))

Vector types

CUDA and HIP can also benefit from so called “vector types” (float2, float3, …), which typically improve the available bandwidth usage. Like 2.5-blocking, vector types are in beta phase. This optimization can be activated in three ways:

Setting the environment variable DEVITO_BETA=2, or
Setting configuration['beta'] = 2, or
Supplying a suitable option, Operator(..., opt=('advanced', {'gpu-vect': True}))

Note that setting DEVITO_BETA=2 will also enable 2.5-blocking. Instead, to enable both 2.5-blocking and vector types programmatically, use should use both optimization options, that is ... {'gpu-opt': True, 'gpu-vect': True}).

The `par-tile` option

The block shape should always be tuned for maximum performance.

With the default parallelization strategy, the recommended block shapes for a 3D problem are:

(32, 4, 4), (32, 4, 8), (32, 4, 1), (32, 8, 1), (64, 4, 1), (64, 8, 1), (128, 4, 1).
Often the first two are the ones delivering the best performance.

Despite employing 2D blocks, with the 2.5-blocking scheme the tested block shapes still ought to be 3D. The third value is used to logically split the sequential dimension into sub-blocks, which typically improves the GPU utilization factor. The recommended block shapes for a 3D problem are those resulting from the cartesian product of these two axes:

axis 1: (32, 4), (32, 8), (32, 16), (64, 4), (64, 7), (64, 8), (128, 4)
axis 2: 16, 24, 32, 48, 64

Example

Alright, let’s see how to write a simple program to sweep over a set of par-tiles with 2.5-blocking.

# Show target
target = {'language': 'cuda', 'arch': 'cuda', 'platform': 'nvidiaX'}

#NBVAL_IGNORE_OUTPUT
from devito import Eq, Grid, TimeFunction, Operator
from devitopro import *

grid = Grid(shape=(8, 8, 8))

u = TimeFunction(name='u', grid=grid, space_order=4)

eq = Eq(u.forward, u.dx + u.dy.dz + 1.)

operators = []
for thread_block in [(32, 4), (64, 4)]:
    for sub_block in [16, 32]:
        par_tile = thread_block + (sub_block,)
        
        op = Operator(eq,
                      opt=('advanced', {'gpu-opt': True, 'par-tile': par_tile}),
                      **target)

        operators.append(op)
        
# Uncomment to show code
# You shall see that the kernels are identical except for the employed block shapes
# print(operators[0]._func_table['kernel0'].root)
# print(operators[1]._func_table['kernel0'].root)

If an Operator contains N kernels, then N par-tiles may be passed, e.g. 'par-tile': ((32, 4, 16), (32, 4, 32)), here with N=2. To determine N, one can look at op._func_table and count the functions whose name starts with 'kernel', or simply print out the Operator and count the kernels in the generated code.

One-pass vs two-pass implementations [explicit]

For any given problem, DevitoPRO has the capacity to generate one or more implementations. How many exactly is influenced by several factors, with the structure of the cross-derivatives being paramount.

To put it simply, for a second-order cross-derivative, like `u.dx.dy, we have two main alternatives:

One-pass implementation – compute everything on-the-fly:
- Boils down to a nested loop of depth two;
- O(n^2) operations, where n is the diameter of the stencil.
Two-pass implementation – precompute u.dx, assign it to a temporary tmp, then compute tmp.dy.
- Involves two consecutive loops.
- O(n) operations.

Though the two-pass method performs fewer arithmetic operations, it’s significantly more memory-demanding than the one-pass method. However, by splitting the working set across two separate GPU kernels, the two-pass method can be more efficient for intricate mathematics, especially with the presence of multiple cross-derivatives.

Note that, at high level, there exists just one possible one-pass implementation, whereas there could be multiple two-pass implementations (e.g., some inner derivatives may be precomputed, while others may be recomputed on-the-fly).

DevitoPRO employs a cost model to decide whether to opt for a one-pass or a two-pass implementation for a specific Operator. But this model isn’t perfect. Given the potential performance gap between the one-pass and the various two-pass implementations, it’s prudent for users to evaluate different options, especially for critical kernels.

To override the default cost model and produce a specific code variant, the cire-schedule option is made available.

#NBVAL_IGNORE_OUTPUT
# Default (1-pass)
op = Operator(eq, **target)

# Variants (2-pass)
op1 = Operator(eq, opt=('advanced', {'cire-schedule': 1}), **target)
op2 = Operator(eq, opt=('advanced', {'cire-schedule': 2}), **target)

# 1-pass
# print(op)

# 2-pass
# print(op1)  # version 0
# print(op2)  # version 1

## Optimized GPU-aware MPI [explicit]

When executed on CPUs, OSS Devito offers multiple strategies for generating MPI code related to halo exchanges. At one end of the spectrum, there’s (A) a straightforward approach that uses synchronous communication and dynamically allocates communication buffers. At the other end, there’s (B) an enhanced approach that employs asynchronous communication and leverages support threads to overlap computation and communication. There are also a few intermediate variants. The efficiency of each scheme can vary based on the scale of the simulation as well as the physics and the chosen discretization.

However, for multi-GPU runs, OSS Devito only supports scheme (A). In contrast, DevitoPRO offers two additional modes, diag2 (stable) and dual (experimental).

We encourage users to employ DEVITO_MPI=diag2.

Reduced precision [explicit]

Implementation of fixed precision compression by scaling and offset is based on a method described in detail by UCAR. This feature allows the user to trade precision for more memory and memory bandwidth. By reducing the size of selected fields, the effect is twofold: more room for storing wavefield slices during forward-propagation and, most importantly, more efficient propagation code by reducing memory traffic and working set size (e.g., fewer accesses to DRAM, fewer cache lines occupied, etc.).

Here’s an example of how to set float8 as the data type of a generic Function.

import numpy as np

grid = Grid(shape=(4, 4))

float8 = Float8(0, 255)

u = Function(name='u', grid=grid, dtype=float8, space_order=0)

assert u.dtype is np.uint8
assert u.data.dtype is np.uint8().dtype

# These setitems have zero compression/decompression error
u.data[1, 1] = 3
assert u.data[1, 1] == 3
u.data[1, 1] = 255
assert u.data[1, 1] == 255

# NOTE: out-of-range setitem causes an overflow
u.data[1, 1] = 256
assert u.data[1, 1] == 0
u.data[1, 1] = -1
assert u.data[1, 1] == 255

# These setitems have a nonzeror compression/decompression error
u.data[1, 1] = 3.1
assert u.data[1, 1] == 3
assert u.data.decompress[1, 1] == 3

Hyperplanes optimization [implicit]

This performance enhancement involves trimming the (hyper-)corners of an iteration space that remain untouched – often resulting from non-star stencil shapes. These shapes might arise, for instance, from mixed derivatives. The effect of this optimization can vary based on the spatial order of the discretization. As the spatial order increases, so does the optimization’s impact, due to the enlargement of the hyperplanes. On certain CPU architectures, we’ve noted performance boosts ranging from 5-10%.

Enabled automatically.