The Decoupler

This tutorial provides an overview of the Decoupler, a tool designed to abstract away MPI complexities, and its implementation within Devito. We’ll explore how the Decoupler works, its optimizations, and its performance.

Motivation

Devito programs can be domain-decomposed. Domain decomposition is necessary when the user wants to:

Achieve maximum performance on a multi-NUMA CPU-based node
Distribute a shot over multiple GPUs

Domain decomposition in Devito is implemented with MPI. The exact same program can run without domain decomposition (the default) or with it, via MPI:

python devito_program.py
DEVITO_MPI=diag2 mpirun -n 4 <mpi options> python devito_program.py

The Decoupler further abstracts the MPI layer:

DEVITO_DECOUPLER=1 python devito_program.py

The Decoupler makes integrating Devito with an overarching application straightforward, as data distribution and collection are handled automatically by the Decoupler runtime support.

How It Works

Consider the following code:

from devito import Grid, Function, Eq, Operator

# Operator construction
grid = Grid(shape=(10, 10, 10))
f = Function(name='f', grid=grid)
eq = Eq(f, f + 1)  # a Function has no time dimension, hence no .forward
op = Operator(eq)

# Operator execution
grid1 = Grid(shape=(512, 512, 512))
f1 = Function(name='f', grid=grid1)
op.apply(f=f1)

When run with DEVITO_DECOUPLER=1, op.apply will trigger the decoupling engine. The Decoupler serializes op and all its input objects (f1 in this case), spawns the so-called Fleet, and sends the serialized objects to the Fleet. The Fleet, a team of MPI processes implementing domain decomposition, performs op.apply using the deserialized objects. In this process, the data is domain-decomposed across the Fleet. Finally, the output is sent back to the Decoupler.
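The flow described above (serialize, dispatch, decompose, compute, gather) can be mimicked in a few lines of plain Python. This is only a sketch and not Devito code: run_decoupled, fleet_worker, split_bounds and the dictionary-based operator description are hypothetical names, threads stand in for the MPI processes of the Fleet, and pickle stands in for whatever serialization the Decoupler uses internally.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor


def split_bounds(n, nworkers):
    """Return [start, stop) index ranges that partition range(n)."""
    base, extra = divmod(n, nworkers)
    bounds, start = [], 0
    for rank in range(nworkers):
        stop = start + base + (1 if rank < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def fleet_worker(payload, start, stop):
    """Hypothetical worker: deserialize, then update only the owned slice."""
    op, data = pickle.loads(payload)
    assert op["kind"] == "add"
    return [v + op["value"] for v in data[start:stop]]


def run_decoupled(op, data, nworkers=4):
    """Serialize op and data, fan out to the workers, gather the result."""
    payload = pickle.dumps((op, data))
    bounds = split_bounds(len(data), nworkers)
    with ThreadPoolExecutor(max_workers=nworkers) as fleet:
        parts = fleet.map(lambda b: fleet_worker(payload, *b), bounds)
        # map() preserves the order of bounds, so the parts concatenate correctly
        return [v for part in parts for v in part]


result = run_decoupled({"kind": "add", "value": 1}, [0.0] * 10)
```

For simplicity, every worker in this sketch deserializes the whole payload and then keeps only its own slice. The optimizations described in the next section exist precisely to avoid that kind of redundant data movement.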

Efficiency and Overheads

Distributing data from a centralized point has an inherent cost, whether or not the Decoupler is used. The Decoupler minimizes this overhead through the optimizations described below.

Shared Memory Segments

Shared memory segments are used to minimize data movement overhead. When the Decoupler is enabled, all Function objects use a special allocator that places data in shared memory segments, avoiding unnecessary memcpy operations.
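The idea can be illustrated with Python's standard multiprocessing.shared_memory module. The sketch below shows only the general mechanism, not the allocator Devito uses: a segment is created once, a second handle attaches to it by name, and a write through one handle is visible through the other without any copy. Both handles live in the same process here to keep the example short. In practice the second handle would belong to a worker process.

```python
import struct
from multiprocessing import shared_memory

N = 4                          # number of float64 values in the segment
ITEM = struct.calcsize("d")    # size in bytes of one float64

# The "user side" creates a named segment and writes its data into it.
segment = shared_memory.SharedMemory(create=True, size=N * ITEM)
try:
    for i in range(N):
        struct.pack_into("d", segment.buf, i * ITEM, float(i))

    # The "worker side" attaches to the same segment by name: nothing is copied.
    attached = shared_memory.SharedMemory(name=segment.name)
    try:
        struct.pack_into("d", attached.buf, 0, 42.0)  # write via the second handle
    finally:
        attached.close()

    # The write is visible through the original handle.
    values = [struct.unpack_from("d", segment.buf, i * ITEM)[0] for i in range(N)]
finally:
    segment.close()
    segment.unlink()           # release the segment once nobody needs it
```

After this runs, values is [42.0, 1.0, 2.0, 3.0]: the first entry was modified through the attached handle, while the others are untouched.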

Pull & Push Protocols

Some memcpy operations are unavoidable. For example, the domain-decomposed data in the Fleet needs halo regions, which cannot be present in the user data for contiguity reasons. However, since shared memory segments are in place, each worker process in the Fleet can handle its own memcpy operations. The workers concurrently pull the data segments they logically own and concurrently push back the computed data into the shared memory segment.
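The pull and push steps can be sketched as follows. This is a simplified, hypothetical illustration: a Python list stands in for the shared memory segment, threads stand in for the Fleet workers, the halo width is assumed to be one point, and the halos are simply zero-filled instead of being exchanged with neighbouring workers.

```python
from concurrent.futures import ThreadPoolExecutor

HALO = 1  # halo width on each side (an assumption made for this sketch)


def pull(shared, start, stop):
    """Copy the owned slice into a local buffer padded with zeroed halos."""
    return [0.0] * HALO + shared[start:stop] + [0.0] * HALO


def push(shared, local, start, stop):
    """Copy the interior of the local buffer (halos excluded) back."""
    shared[start:stop] = local[HALO:len(local) - HALO]


def worker(shared, start, stop):
    """Pull the owned data, update the interior points, push the result."""
    local = pull(shared, start, stop)
    for i in range(HALO, len(local) - HALO):
        local[i] += 1.0
    push(shared, local, start, stop)


shared = [float(i) for i in range(8)]   # stands in for the shared memory segment
bounds = [(0, 4), (4, 8)]               # disjoint ownership, one range per worker
with ThreadPoolExecutor(max_workers=len(bounds)) as fleet:
    # list(...) waits for completion and re-raises any worker exception
    list(fleet.map(lambda b: worker(shared, *b), bounds))
```

Because each worker touches only the index range it owns, the copies need no coordination between workers, which is what allows them to proceed concurrently.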

Declaration of transient Functions

A Function is said to be transient if its data may be written to by an Operator, but that data is not needed by the user once control returns to the Python layer. The is_transient keyword can be used to declare such Functions and avoid unnecessary memory copies.

f = Function(name='f', grid=grid, is_transient=True)

Persistency

The Fleet is created upon the first op.apply and remains in the background until the Decoupler terminates. It is reawakened for subsequent op.apply calls.
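The same create-once, reuse-afterwards pattern can be sketched with a lazily created worker pool. The names get_fleet and apply are hypothetical, and a thread pool stands in for the MPI processes. The sketch shows the lifecycle only, not how Devito implements it.

```python
import atexit
from concurrent.futures import ThreadPoolExecutor

_fleet = None  # module-level handle to the persistent worker pool


def get_fleet(nworkers=2):
    """Create the pool on first use, then keep returning the same one."""
    global _fleet
    if _fleet is None:
        _fleet = ThreadPoolExecutor(max_workers=nworkers)
        atexit.register(_fleet.shutdown)  # tear down when the interpreter exits
    return _fleet


def apply(values):
    """Hypothetical stand-in for op.apply: reuse the pool on every call."""
    return list(get_fleet().map(lambda v: v + 1, values))


first = apply([0, 1, 2])          # the first call creates the pool
fleet_after_first = get_fleet()
second = apply(first)             # later calls reuse it
```

The start-up cost is paid only once, so repeated calls, such as one per shot, do not each pay for spawning workers.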

Automated Garbage Collection

The Fleet keeps a mapper of the objects it has already “seen” and automatically garbage-collects them based on the scope of the corresponding objects in the Decoupler.
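One way to tie a worker-side registry to user-side object lifetime is Python's weakref.finalize. The sketch below is an assumption about how such a mechanism could look, not a description of Devito's implementation. The Fleet and Data classes here are hypothetical stand-ins.

```python
import gc
import weakref


class Fleet:
    """Hypothetical worker-side registry of objects already received."""

    def __init__(self):
        self.seen = {}

    def register(self, obj, payload):
        key = id(obj)
        if key not in self.seen:
            self.seen[key] = payload
            # Drop the worker-side copy when the user-side object is collected.
            weakref.finalize(obj, self.seen.pop, key, None)
        return key


class Data:
    """Minimal user-side object; plain classes support weak references."""


fleet = Fleet()
f1 = Data()
key = fleet.register(f1, payload="serialized f1")
seen_while_alive = key in fleet.seen

del f1          # the object goes out of scope on the user side
gc.collect()    # not needed on CPython here, but makes the intent explicit
seen_after_del = key in fleet.seen
```

The registry entry exists while f1 is alive and disappears once f1 is collected, with no explicit clean-up call from the user.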

Performance Testing

In a TTI test on a server, running with the Decoupler showed virtually zero overhead compared with launching the same program directly under MPI:

OMP_NUM_THREADS=20 DEVITO_MPI=diag2 mpirun -n 2 --bind-to numa --map-by numa python …
DEVITO_DECOUPLER=1 python …

Both configurations achieved 1.35 GPts/s.

API

Environment variables

DEVITO_DECOUPLER=1: Enables the Decoupler.
DEVITO_DECOUPLER_WORKERS=N: Use N workers. By default, Devito sets N to the number of underlying NUMA domains or GPUs, depending on the target backend. This is akin to mpirun -n N.
DEVITO_DECOUPLER_MAP_BY=numa: Controls how MPI ranks are mapped (example values include numa, core, hwthread). See your MPI distribution’s documentation for more details (e.g., OpenMPI or MPICH).
DEVITO_DECOUPLER_BIND_TO=numa: Controls the entity to which MPI ranks are bound (example values include numa, core, hwthread). See your MPI distribution’s documentation for more details (e.g., OpenMPI or MPICH).
DEVITO_TOPOLOGY=<value>: Works the same way as it does for MPI in OSS Devito. See the Devito FAQ.
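When Devito is launched from an overarching application, these variables can be assembled programmatically. The helper below is a hypothetical convenience function, not part of Devito. It only builds an environment dictionary from the variable names listed above.

```python
import os


def decoupler_env(workers=None, map_by=None, bind_to=None, base=None):
    """Return a copy of the environment with the Decoupler variables set."""
    env = dict(os.environ if base is None else base)
    env["DEVITO_DECOUPLER"] = "1"
    if workers is not None:
        env["DEVITO_DECOUPLER_WORKERS"] = str(workers)
    if map_by is not None:
        env["DEVITO_DECOUPLER_MAP_BY"] = map_by
    if bind_to is not None:
        env["DEVITO_DECOUPLER_BIND_TO"] = bind_to
    return env


# base={} keeps the example independent of the current shell environment
env = decoupler_env(workers=2, map_by="numa", bind_to="numa", base={})
```

The resulting dictionary can be passed as the env argument of subprocess.run when starting the Devito program. Unset options are left out, so Devito's own defaults apply.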

Keywords

is_transient={True,False}: Used to declare transient Functions, as explained above. Defaults to False.

Implementation Details

The Decoupler and Fleet are implemented using custom extensions of mpi4py and Python’s shared_memory module.

Current Status

The Decoupler is currently in beta testing. However, it already supports all core OSS and PRO functionalities, including SubDomains, Builtins (e.g., norms), and compression.

Known issues

With NVIDIA’s and Intel’s MPI implementations, the process running Devito must be launched with DEVITO_DECOUPLER=1 mpirun -n 1 .... This process does not have to be a Python program; it could be written in C++, Julia, or any other language. The mpirun -n 1 is necessary because these implementations rely on mpirun for various setup steps, including launching daemon processes and setting environment variables. Without mpirun -n 1, the program may hang or crash.

Conclusion

The Decoupler offers an efficient way to manage domain-decomposition complexities in Devito programs, with minimal overhead due to its optimizations. Further testing and improvements are planned to enhance its functionality and compatibility.