The Decoupler
This tutorial provides an overview of the Decoupler, a tool designed to abstract away MPI complexities, and its implementation within Devito. We’ll explore how the Decoupler works, its optimizations, and its performance.
Motivation
Devito programs can be domain-decomposed. Domain decomposition is necessary when the user wants to:
- Achieve maximum performance on a multi-NUMA CPU-based node
- Distribute a shot over multiple GPUs
Domain decomposition in Devito is implemented with MPI. The same exact program can run without (default) or with domain decomposition via MPI:
python devito_program.py
DEVITO_MPI=diag2 mpirun -n 4 <mpi options> python devito_program.py
The Decoupler further abstracts the MPI layer:
DEVITO_DECOUPLER=1 python devito_program.py
The Decoupler makes integrating Devito with an overarching application straightforward, as data distribution and collection are handled automatically by the Decoupler runtime support.
How It Works
Consider the following code:
from devito import Grid, Function, Eq, Operator
# Operator construction
grid = Grid(shape=(10, 10, 10))
f = Function(name='f', grid=grid)
eq = Eq(f, f + 1)  # f is a Function (no time dimension), so update it in place
op = Operator(eq)
# Operator execution
grid1 = Grid(shape=(512, 512, 512))
f1 = Function(name='f', grid=grid1)
op.apply(f=f1)
When run with DEVITO_DECOUPLER=1, op.apply will trigger the decoupling engine. The Decoupler serializes op and all its input objects (f1 in this case), spawns the so-called Fleet, and sends the serialized objects to the Fleet. The Fleet, a team of MPI processes implementing domain decomposition, performs op.apply using the deserialized objects. In this process, the data is domain-decomposed across the Fleet. Finally, the output is sent back to the Decoupler.
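The round trip described above (serialize, dispatch, decompose, compute, collect) can be modelled in a few lines of plain Python. This is only an illustrative sketch, not the Decoupler's implementation: the names split_rows, fleet_apply, and decoupled_apply are hypothetical, pickle stands in for whatever serialization the Decoupler uses, and a sequential loop over row ranges stands in for the MPI ranks of the Fleet.

```python
import pickle


def split_rows(n_rows, n_workers):
    """Return (start, stop) row ranges, one per worker, covering n_rows."""
    base, extra = divmod(n_rows, n_workers)
    ranges, start = [], 0
    for rank in range(n_workers):
        # The first `extra` ranks take one additional row each
        stop = start + base + (1 if rank < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def fleet_apply(payload, n_workers):
    """Hypothetical Fleet side: deserialize, decompose, compute, gather."""
    data = pickle.loads(payload)["f"]
    pieces = []
    for start, stop in split_rows(len(data), n_workers):
        local = data[start:stop]  # the subdomain owned by this rank
        pieces.append([[v + 1 for v in row] for row in local])  # f + 1
    # Gather the subdomains back in rank order and serialize the result
    return pickle.dumps([row for piece in pieces for row in piece])


def decoupled_apply(f, n_workers=4):
    """Hypothetical Decoupler side: serialize, dispatch, collect."""
    payload = pickle.dumps({"f": f})
    return pickle.loads(fleet_apply(payload, n_workers))
```

Calling decoupled_apply on a nested list returns a new nested list with every value incremented, regardless of how many workers the rows were split across.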
Efficiency and Overheads
Distributing data from a centralized point has an inherent cost, whether or not the Decoupler is used. The Decoupler minimizes this overhead through the optimizations described below.
Pull & Push Protocols
Some memcpy operations are unavoidable. For example, the domain-decomposed data in the Fleet needs halo regions, which cannot be present in the user data for contiguity reasons. However, since shared memory segments are in place, each worker process in the Fleet can handle its own memcpy operations. The workers concurrently pull the data segments they logically own and concurrently push back the computed data into the shared memory segment.
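A minimal sketch of the pull and push idea, using Python's standard shared_memory module. Several things here are assumptions made for illustration: the worker and run functions, the one-dimensional layout, and the halo width of 1 are all hypothetical, and threads stand in for the Fleet's MPI worker processes. Each worker attaches to the segment by name, pulls the points it owns into a halo-padded local buffer, computes, and pushes its interior back.

```python
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import shared_memory

HALO = 1  # hypothetical halo width on each side of the local buffer


def worker(shm_name, start, stop):
    """Pull the owned segment, compute on a halo-padded copy, push it back."""
    shm = shared_memory.SharedMemory(name=shm_name)
    view = shm.buf.cast('d')  # view the raw bytes as float64 values
    try:
        # Pull: copy the owned points into a local buffer with halo padding
        local = [0.0] * HALO + list(view[start:stop]) + [0.0] * HALO
        # Compute on the interior only (stand-in for the Operator)
        for i in range(HALO, len(local) - HALO):
            local[i] += 1.0
        # Push: write the interior back into the shared segment
        for k, value in enumerate(local[HALO:len(local) - HALO]):
            view[start + k] = value
    finally:
        view.release()  # must be released before the segment is closed
        shm.close()


def run(n=16, workers=4):
    """Create the shared segment, let the workers pull/push, return the data."""
    shm = shared_memory.SharedMemory(create=True, size=n * 8)
    data = shm.buf.cast('d')
    try:
        for i in range(n):
            data[i] = float(i)
        chunk = n // workers
        bounds = [(w * chunk, n if w == workers - 1 else (w + 1) * chunk)
                  for w in range(workers)]
        # Workers operate concurrently on disjoint slices of the segment
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(lambda b: worker(shm.name, *b), bounds))
        return list(data[:n])
    finally:
        data.release()
        shm.close()
        shm.unlink()
```

Because every worker touches only the slice it owns, no locking is needed and the copies proceed in parallel rather than being funnelled through a single process.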
Declaration of transient Functions
A Function is said to be transient if its data may be written to by an Operator, but such data is not needed by the user once back in the Python layer. The is_transient API can be used to declare such functions and avoid unnecessary memory copies.
f = Function(name='f', grid=grid, is_transient=True)
Persistency
The Fleet is created upon the first op.apply and remains in the background until the Decoupler terminates. It is reawakened for subsequent op.apply calls.
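The create-once, reuse-afterwards behaviour can be sketched with a lazily started worker pool. This is an assumption-laden illustration rather than the Decoupler's code: PersistentFleet is a hypothetical class, and a ThreadPoolExecutor stands in for the team of MPI processes.

```python
import atexit
from concurrent.futures import ThreadPoolExecutor


class PersistentFleet:
    """Hypothetical stand-in: a worker pool created once and then reused."""

    def __init__(self, n_workers=2):
        self.n_workers = n_workers
        self._pool = None
        self.spawn_count = 0  # how many times the pool was actually created

    def _ensure_started(self):
        # Spawn the workers on first use only; later calls reuse them
        if self._pool is None:
            self._pool = ThreadPoolExecutor(max_workers=self.n_workers)
            self.spawn_count += 1
            atexit.register(self.shutdown)  # tear down when the host exits
        return self._pool

    def apply(self, func, chunks):
        """Run func on every chunk using the (possibly reawakened) workers."""
        pool = self._ensure_started()
        return list(pool.map(func, chunks))

    def shutdown(self):
        if self._pool is not None:
            self._pool.shutdown(wait=True)
            self._pool = None
```

In this model, repeated apply calls pay the spawn cost only once, which is the point of keeping the Fleet alive in the background.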
Automated Garbage Collection
The Fleet uses a mapper of “seen” objects and automatically manages garbage collection based on object scope in the Decoupler.
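One way such a mapper could work is sketched below with Python's weakref module. The SeenMapper class and the notion of a "remote handle" are hypothetical and not taken from the Decoupler's source: the idea is simply that an entry is created the first time an object is shipped, reused on later calls, and evicted automatically when the object goes out of scope.

```python
import gc
import weakref


class SeenMapper:
    """Hypothetical cache of objects already shipped to the workers."""

    def __init__(self):
        self._seen = {}

    def register(self, obj, remote_handle):
        """Record obj on first sight; return the handle already on file."""
        key = id(obj)
        if key not in self._seen:
            self._seen[key] = remote_handle
            # Evict the entry once obj is garbage-collected
            weakref.finalize(obj, self._seen.pop, key, None)
        return self._seen[key]

    def __contains__(self, obj):
        return id(obj) in self._seen

    def __len__(self):
        return len(self._seen)
```

The finalizer holds a reference to the dictionary only, not to the tracked object, so the mapper itself never keeps an object alive.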
Performance Testing
In a TTI test on a server, the performance with and without the Decoupler showed virtually zero overhead:
- OMP_NUM_THREADS=20 DEVITO_MPI=diag2 mpirun -n 2 --bind-to numa --map-by numa python …
- DEVITO_DECOUPLER=1 python …
Both configurations achieved 1.35 GPts/s.
API
Environment variables
- DEVITO_DECOUPLER=1: Enables the Decoupler.
- DEVITO_DECOUPLER_WORKERS=N: Use N workers. By default, Devito sets N to the number of underlying NUMA domains or GPUs, depending on the target backend. This is akin to mpirun -n N.
- DEVITO_DECOUPLER_MAP_BY=numa: Controls how MPI ranks are mapped (example values include numa, core, hwthread). See your MPI distribution’s documentation for more details (e.g., OpenMPI or MPICH).
- DEVITO_DECOUPLER_BIND_TO=numa: Controls the entity to which MPI ranks are bound (example values include numa, core, hwthread). See your MPI distribution’s documentation for more details (e.g., OpenMPI or MPICH).
- DEVITO_TOPOLOGY=<value>: Works the same way as it does for MPI in OSS Devito. See the Devito FAQ for details.
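To make the analogy with mpirun concrete, the sketch below translates these variables into an equivalent command-line prefix. It is purely illustrative: decoupler_launch_args is a hypothetical helper, default_workers is a placeholder for the NUMA/GPU count that Devito determines itself, and the --map-by and --bind-to spellings follow Open MPI conventions. How the Decoupler actually consumes these variables is not described here.

```python
import os


def decoupler_launch_args(env=None, default_workers=1):
    """Translate the Decoupler variables into an mpirun-style argument list."""
    env = os.environ if env is None else env
    if env.get("DEVITO_DECOUPLER", "0") != "1":
        return []  # Decoupler disabled: nothing to launch
    # default_workers is a placeholder for the detected NUMA/GPU count
    workers = int(env.get("DEVITO_DECOUPLER_WORKERS", default_workers))
    args = ["mpirun", "-n", str(workers)]
    # Only forward the mapping/binding policies the user explicitly set
    for var, flag in (("DEVITO_DECOUPLER_MAP_BY", "--map-by"),
                      ("DEVITO_DECOUPLER_BIND_TO", "--bind-to")):
        if var in env:
            args += [flag, env[var]]
    return args
```

For example, setting DEVITO_DECOUPLER=1, DEVITO_DECOUPLER_WORKERS=2, and both policies to numa yields the same options as the explicit mpirun invocation used in the performance test.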
Keywords
is_transient={True,False}: Used to declare transient Functions, as explained above. Defaults to False.
Implementation Details
The Decoupler and Fleet are implemented using custom extensions of mpi4py and Python’s shared_memory module.
Current Status
The Decoupler is currently in beta testing. However, it already supports all core OSS and PRO functionalities, including SubDomains, Builtins (e.g., norms), and compression.
Known issues
Nvidia’s and Intel’s MPI implementations require the user to launch the process running Devito (which doesn’t have to be a Python program; it could be C++, Julia, or any other language) with DEVITO_DECOUPLER=1 mpirun -n 1 .... The mpirun -n 1 is necessary because these implementations perform various configurations with mpirun, including launching daemon processes and setting up environment variables. Without mpirun -n 1, the program may hang or crash.
Conclusion
The Decoupler offers an efficient way to manage domain-decomposition complexities in Devito programs, with minimal overhead due to its optimizations. Further testing and improvements are planned to enhance its functionality and compatibility.