Features

A brief, possibly not up-to-date, description of what’s in DevitoPRO.

Adaptive box (ABox)

A popular technique to speed up wave equation simulations is the so-called “expanding box”. The technique goes by many names in the literature; here, we call it “adaptive box” to emphasize the nature of the implementation. Key features are:

(i) it builds on the SubDomain infrastructure that Devito users are already familiar with, so using it is as straightforward as defining one or more equations over a subdomain;
(ii) it both expands at source injection and shrinks at receiver interpolation, which maximizes the reduction in iteration space size;
(iii) like any other object in Devito, it coexists naturally and transparently with MPI;
(iv) at each timestep, it expands or shrinks based on the maximum velocity along each side.

To get familiar with ABox, take a look at tests/test_abox.py.

You’ll see that an ABox is a very special SubDomain – a SubDomain that varies in both time and space. In essence, using an ABox boils down to:

...
subdomain = ABox(src, rec, model.vp, model.space_order)
...
eq = Eq(u.forward, solve(pde, u.forward), subdomain=subdomain)
...

If you need to compose it with other subdomains (e.g., for BCs), take a look at tests/test_abox.py::TestCompositeSubdomains.
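
To make the velocity-driven expansion concrete, here is a minimal NumPy sketch of a box that grows by one timestep according to the maximum velocity found along each of its sides. It is an illustration only, not the DevitoPRO implementation: the function name `expand`, its arguments, and the float-valued box representation are all hypothetical, and shrinking at the receivers is left out.

```python
import numpy as np

def expand(box, vp, dt, h):
    """Advance every side of `box` by one timestep (hypothetical helper).

    box : float array of shape (ndim, 2) holding (lo, hi) grid positions per axis
    vp  : velocity model, one value per grid point
    dt  : timestep, in units consistent with vp and h
    h   : grid spacing per axis
    """
    ndim = vp.ndim
    # Integer extent currently covered by the box
    lo = np.floor(box[:, 0]).astype(int)
    hi = np.ceil(box[:, 1]).astype(int)
    new = box.copy()
    for d in range(ndim):
        for side, pos in ((0, lo[d]), (1, hi[d])):
            # Select the face of the box on this side of axis d
            idx = [slice(lo[k], hi[k] + 1) for k in range(ndim)]
            idx[d] = pos
            vmax = vp[tuple(idx)].max()
            # The wavefront advances at most vmax*dt, i.e. vmax*dt/h grid points
            shift = vmax * dt / h[d]
            new[d, side] += shift if side else -shift
    # Never grow beyond the grid
    new[:, 0] = np.maximum(new[:, 0], 0)
    new[:, 1] = np.minimum(new[:, 1], np.array(vp.shape) - 1)
    return new
```

Starting from a degenerate box at the source position and calling `expand` once per timestep yields a box whose sides move faster where the medium is faster, which is the behaviour this section describes in words.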

Reduced precision computation and fixed precision compression

The implementation of fixed precision compression by scaling and offset follows a method described in detail by UCAR. This feature lets the user trade precision for memory capacity and memory bandwidth. Reducing the size of selected fields has a twofold effect: more room for storing wavefield slices during forward propagation and, most importantly, more efficient propagation code thanks to reduced memory traffic and a smaller working set (e.g., fewer accesses to DRAM, fewer cache lines occupied).

Take a look at tests/test_cso.py. You’ll see that, again, the API is pretty straightforward to get familiar with.
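
For readers unfamiliar with scale-and-offset packing, the idea can be sketched in a few lines of NumPy. This is a generic illustration of the technique, not the DevitoPRO API: the names `pack` and `unpack` are hypothetical, and the 16-bit signed target type is an assumption made for the example.

```python
import numpy as np

def pack(data):
    """Pack a float array into int16 using a scale factor and an offset."""
    data = np.asarray(data, dtype=np.float64)
    dmin, dmax = float(data.min()), float(data.max())
    # One integer step covers this much of the data range
    scale = (dmax - dmin) / (2**16 - 1)
    if scale == 0.0:
        scale = 1.0  # constant data: any non-zero scale works
    # Centre the range so that values map onto [-32768, 32767]
    offset = dmin + 2**15 * scale
    packed = np.clip(np.round((data - offset) / scale), -32768, 32767)
    return packed.astype(np.int16), scale, offset

def unpack(packed, scale, offset, dtype=np.float32):
    """Recover approximate floating-point values from packed integers."""
    return (packed * scale + offset).astype(dtype)
```

A float32 field packed this way occupies half the memory, and the reconstruction error stays within about half a quantization step, i.e. `scale / 2`.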

Data streaming to/from devices

Data streaming refers to moving data from the GPU(s) to the host and the other way round. This is useful when, for example, the data volume is too big to fit in one GPU (or in multiple GPUs after domain decomposition via MPI), or when one simply does not want to split the computation of a shot over multiple GPUs for performance reasons. The Devito compiler automatically generates the data streaming code; no effort is required from the user. Some examples: during forward propagation, wavefield slices may be streamed back from the GPU to the host to free up space in GPU memory; during backward propagation, wavefield slices may be prefetched from the host to GPU memory to perform the cross-correlation with the backward-propagating wavefield.

Data streaming is automatically enabled when a TimeFunction is created with save!=None. It can be disabled (e.g., for testing or performance reasons) via:

Operator(eqns, opt=('noop', {'gpu-fit': usave}))

Laziness

Lazy data streaming keeps as much data as possible in device memory during a forward propagation run, rather than streaming it back to the host (and possibly to disk) and then to the device again during the backward propagation run. Aside from a simple API to specify the amount of data to keep in device memory, this feature is fully automated by DevitoPRO.

Take a look at tests/test_layered_funcs.py and tests/test_composite_funcs.py.
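
The placement policy behind lazy streaming can be pictured with a toy model in plain Python: each snapshot goes to the fastest memory tier that still has room, and only spills further when that tier is full. The class below is purely conceptual. `TieredSnapshots`, its slot-based capacities, and its methods are hypothetical names invented for this sketch and say nothing about how DevitoPRO implements or exposes the feature.

```python
class TieredSnapshots:
    """Toy model: store each snapshot in the fastest tier with free space."""

    def __init__(self, device_slots, host_slots):
        # Capacities are counted in snapshots to keep the example simple
        self.capacity = {'device': device_slots, 'host': host_slots}
        self.tiers = {'device': {}, 'host': {}, 'disk': {}}

    def save(self, timestep, snapshot):
        """Store a snapshot and return the name of the tier that received it."""
        for name in ('device', 'host'):
            if len(self.tiers[name]) < self.capacity[name]:
                self.tiers[name][timestep] = snapshot
                return name
        # Both memory tiers are full: fall back to disk (unbounded here)
        self.tiers['disk'][timestep] = snapshot
        return 'disk'

    def load(self, timestep):
        """Return (snapshot, tier name), searching the fastest tier first."""
        for name, tier in self.tiers.items():
            if timestep in tier:
                return tier[timestep], name
        raise KeyError(timestep)
```

With two device slots and one host slot, the first two snapshots stay on the device, the third lands on the host, and only the fourth reaches disk. If the device tier were large enough, nothing would ever leave it.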

Optimized MPI communications on GPUs

When running on CPUs, Devito has several strategies to generate MPI code for halo exchanges. The two extremes are: (A) a simple scheme relying on synchronous communication and communication buffers allocated on the fly, and (B) a more advanced scheme based on asynchronous communication and support threads for computation/communication overlap. A few variants sit in between. Depending on the size of the simulation, and in particular on the number of MPI ranks, one scheme may outperform another. When running on multiple GPUs, however, open-source Devito offers only scheme (A). DevitoPRO instead provides various schemes, in particular one that pre-allocates communication buffers in GPU memory directly from Python. This results in significant performance improvements (we have seen up to 15%) on single-node multi-GPU systems (e.g., 8 MPI ranks, each driving a distinct GPU).

To enable it, simply run with DEVITO_MPI=diag2.

Calls to external C/C++ functions

This feature is a Devito API extension that enables users to place calls to arbitrary C/C++ functions in the generated code. As the feature is integrated seamlessly with the symbolic language provided by Devito, such calls can be placed anywhere, that is, at any point in the generated code and potentially at any loop depth. The user specifies the function call as well as everything needed for compilation (header files and their location, shared objects to link in and their location, etc.).

Take a look at tests/test_ccalls.py.

Hyperplanes optimization

This performance optimization slices away the (hyper-)corners of an iteration space that will never be accessed, a by-product of non-star stencil shapes, which in turn may originate, for example, from mixed derivatives. The impact depends on the spatial order of the discretization: the higher the order, the bigger the hyperplanes and the more significant the gain. Performance improvements in the 5-10% range have been observed on some CPU architectures.

Compression/decompression

Users declare which Functions should be compressed and how:

usave = TimeFunction(name='usave', ..., save=nt, compression='bitcomp')

Today we support one compression library, Bitcomp, developed by NVIDIA. It supports both CPU (Intel, Arm) and GPU (NVIDIA) compression. By default, DevitoPRO performs lossy compression on the GPU, employing an adaptive delta for integer quantization based on the maximum amplitude value at compression time. The Devito compiler will generate code that:

compresses usave once a snapshot has been computed;
decompresses usave right before it is read.

Users may customize the number of bits used for integer quantization by supplying the special keyword nbits, for example:

op = Operator(...)
op.apply(..., nbits=16)
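
To build intuition for what the number of bits controls, the following NumPy sketch quantizes a snapshot to integers using a step size (delta) derived from its maximum amplitude. It is a generic illustration of amplitude-adaptive quantization and makes no claim about what Bitcomp or DevitoPRO do internally; the function names `quantize` and `dequantize` are hypothetical, and the subsequent compression of the integers is omitted.

```python
import numpy as np

def quantize(snapshot, nbits=16):
    """Map floats to signed integers with a delta adapted to the peak amplitude."""
    snapshot = np.asarray(snapshot, dtype=np.float64)
    amax = float(np.abs(snapshot).max())
    # Largest magnitude representable with nbits signed bits
    qmax = 2**(nbits - 1) - 1
    # Adaptive step: the peak amplitude maps to qmax (avoid dividing by zero)
    delta = amax / qmax if amax > 0 else 1.0
    q = np.round(snapshot / delta).astype(np.int32)
    return q, delta

def dequantize(q, delta, dtype=np.float32):
    """Recover approximate floating-point values from quantized integers."""
    return (q * delta).astype(dtype)
```

Fewer bits give a larger delta and therefore a coarser reconstruction, while more bits preserve more of the wavefield at the cost of compressibility. In this sketch the reconstruction error is bounded by roughly `delta / 2`.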

Further compression libraries may be added in the future.

Take a look at tests/test_compressed_funcs.py.

Serialization/deserialization

Serialization is fully integrated with the lazy streaming technology and, aside from APIs for performance tuning (e.g., the number of pthreads performing concurrent disk reads/writes to maximize computation/communication overlap), it is again fully automated. As a result, if there is enough space in device (and host) memory, the streamed snapshots never reach the disk(s).

Take a look at tests/test_composite_funcs.py.

CUDA, HIP, and SYCL backends

To enable them:

CUDA: DEVITO_PLATFORM=nvidiaX, DEVITO_LANGUAGE=cuda, DEVITO_ARCH=cuda
HIP: DEVITO_PLATFORM=amdgpuX, DEVITO_LANGUAGE=hip, DEVITO_ARCH=hip
SYCL: DEVITO_PLATFORM=intelgpuX, DEVITO_LANGUAGE=sycl, DEVITO_ARCH=sycl
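
These settings are ordinary environment variables, so they can be exported in the shell or set from Python. The sketch below takes the second route. The `BACKENDS` table simply repeats the values listed above, and `select_backend` is a hypothetical helper written for this example. It assumes that, as in open-source Devito, the DEVITO_* variables are read when the package is first imported, so the helper should run before `import devito`.

```python
import os

# Values copied verbatim from the list above
BACKENDS = {
    'cuda': {'DEVITO_PLATFORM': 'nvidiaX',
             'DEVITO_LANGUAGE': 'cuda',
             'DEVITO_ARCH': 'cuda'},
    'hip': {'DEVITO_PLATFORM': 'amdgpuX',
            'DEVITO_LANGUAGE': 'hip',
            'DEVITO_ARCH': 'hip'},
    'sycl': {'DEVITO_PLATFORM': 'intelgpuX',
             'DEVITO_LANGUAGE': 'sycl',
             'DEVITO_ARCH': 'sycl'},
}

def select_backend(name, env=os.environ):
    """Write the variables for backend `name` into `env` (hypothetical helper)."""
    env.update(BACKENDS[name])
    return env
```

Calling `select_backend('cuda')` at the very top of a script, before any Devito import, would have the same effect as exporting the three variables in the shell under the assumption stated above.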