Frequently Asked Questions

API

I use a few subdomains to define BCs vs interior domain, so from what I am reading to use the ABox I should intersect each of the subdomains with the ABox and then pass that as a subdomain to the relevant equation. Is this correct?
If the coordinates or sources/receivers or the vp change between forward runs, does the operator need to be re-compiled for the ABox to take these changes into account?
Can you tell me more about compression?
Can you tell me more about serialization?

Usage

How to get started?

Performance

The code runs very slow, what should I do?
How do I enable DevitoPRO for an Operator to achieve maximum performance?

GPU

How do environment variables like CUDA_VISIBLE_DEVICES interact with Operators?
I’ve set CUDA_VISIBLE_DEVICES, but the Operator is running on the wrong device

Domain decomposition

How should I go about parallelizing my code?
How does domain decomposition work?
How do I know if I need domain decomposition for my FWI
Do I need to use mpirun for domain decomposition?
The decoupler is for single-node, but I need multi-node, what should I do?

Errors

Answers

I use a few subdomains to define BCs vs interior domain, so from what I am reading to use the ABox I should intersect each of the subdomains with the ABox and then pass that as a subdomain to the relevant equation. Is this correct?

That’s right. You should look at test_abox.py::TestCompositeSubdomains, where there’s a test that illustrates how you can achieve that with DevitoPRO, e.g. via

abox = abox0.intersection(middle)

top

If the coordinates or sources/receivers or the vp change between forward runs, does the operator need to be re-compiled for the ABox to take these changes into account?

That happens automatically through the underlying op.apply(..., src=new_src) machinery . If you provide different src/rcv/vp, yes, definitely… but if you only change the coordinates, no, that won’t work.

top

Can you tell me more about compression?

Compression is a process used in computing to reduce the size of data files without significantly affecting their quality or usability. This can be beneficial in a variety of situations, such as when saving storage space or transmitting data over a network. In the context of DevitoPRO, we use compression to efficiently manage TimeFunctions.

One of the main advantages of compression in DevitoPRO is the simplicity of its API, which makes it straightforward to integrate it into your application code. The API consists of only two mechanisms:

A mechanism to declare what TimeFunctions should be compressed/decompressed and with which algorithm.

usave = TimeFunction(name='usave', ..., save=nt, compression='bitcomp')

Today we support one compression library, Bitcomp, developed by NVidia. It supports both CPU (Intel, Arm) and GPU (NVidia) compression. By default, DevitoPRO will perform lossy-compression for single- and double-precision floats employing an adaptive delta for integer quantization, based on the maximum amplitude value at compression time.

The compression/decompression rules are as follows:

usave gets compressed once a snapshot has been computed (e.g., in a “forward Operator”);
usave gets decompressed right before it is read (e.g., in an “adjoint Operator”).

A mechanism to customize the number of bits used for integer quantization by supplying the special keyword nbits, for example:

op = Operator(...)
op.apply(..., nbits=16)

For further understanding of how compression is used in DevitoPRO, we recommend looking at the test: tests/test_compressed_funcs.py::TestBasic::test_bitcomp_lossy. This test showcases the key aspects of our compression approach.

Please note that our documentation will be updated in the future with more comprehensive information. Until then, the available tests serve as a good resource for understanding compression in DevitoPRO.

Can you tell me more about serialization?

As of today, the documentation is represented by the available tests. This will change in the future. For now, we suggest to look at the following batch of tests:

Serialisation “in isolation”, that is without compression: tests/test_layered_funcs.py::TestSerialization
Serialisation + data streaming, compression, etc: tests/test_composite_funcs.py::TestCompressedLayered

These tests illustrate all key points concerning serialisation. In particular, there are tests showing how to customise serialization, such as the directory in which the snapshots get stored and subsequently retrieved (e.g., test_customization).

Serialization is not currently available in the demos, simply because it had not been implemented yet when the demos infrastructure was written. A temporary hack to enable serialisation in the demos would consist of removing the layers=... kwarg from the declaration of the wavefields in the various operators.py files, e.g. p0_save = TimeFunction(..., layers=...) in demos/tti_acoustic_self_adjoint/operators.py.

In most cases, the layers kwarg isn’t necessary, and typically the user shouldn’t care about it. One should specify layers only if finer grain control on data storage is desired. For example, to bypass lazy streaming, to systematically push all snapshots to disk, etc. This might come in handy for testing reasons or because of the constraints imposed by the overarching code in which DevitoPRO is used. The tests pointed out above use layers to incrementally check all relevant scenarios:

serialisation in isolation;
serialisation with lazy data streaming;
serialisation when all snapshots get dumped to disk (layers=Disk);
serialisation with compression;
serialisation with compression and data streaming;
etc

top

How to get started?

To run DevitoPRO on NVidia GPUs with the OpenACC backend, you’re just two lines away with Docker.

Build the Docker image:

docker build --file docker/Dockerfile.devitopro --tag devitopro-nvidia-nvc-acc --build-arg base=devitocodes/bases:nvidia-nvc .

Get a bash terminal in a container spawn from such image:

docker run -it --gpus all --rm devitopro-nvidia-nvc-acc:latest

You’re all set!

For more info about Docker images and devito, take a look here.

top

The code runs very slow, what should I do

If you’re experiencing slow performance with your code, it’s important to diagnose the issue with detailed information. It’s also essential to consider whether the code is truly performing at an objectively slow rate, or if it simply appears slower in comparison to other codes you have used.

Here are some steps to follow:

Determine Performance Metrics: Start by understanding the performance metrics, specifically the GPts/s (Giga-points per second) performance. This information can provide valuable insights into where the bottleneck might be occurring. Refer to our documentation on operator performance: Performance of an Operator.
Identify the Hardware Specifications: Knowing the exact architecture you’re using is crucial
Benchmark Your Code Properly: Proper benchmarking is essential for accurate performance assessment. Ensure that your machine is configured correctly for reliable benchmarking. Detailed instructions can be found here: Configuring Your Machine for Reliable Benchmarking.
Tune for peak performance: It’s vital to tune your code and ensure all of the available optimizations are enabled. Please refer to this other FAQ for more details.
Share the generated code: The DevitoPRO team can help in identifying bottlenecks and possible causes.

How do I enable DevitoPRO for an Operator to achieve maximum performance?

Note: In the links below, replace devitopro-<repo_name> with the actual name of your DevitoPRO repository (e.g., devitopro-xyz or devitopro-trial-xyz).

To enable DevitoPRO in your workflow, you should first import it alongside your standard Devito imports:

from devito import ...
from devitopro import *

Once DevitoPRO is enabled, there are several steps you can follow to achieve the best possible performance. We recommend proceeding in the order below:

Run DevitoTuner
Start by running your Operator through DevitoTuner. This process is straightforward and requires only a single command-line call. Detailed instructions are available here:
DevitoTuner Guide
Enable advanced optimizations
Take advantage of higher-order optimizations such as Expanding Box (ABox) and mixed precision. These can significantly improve performance. Note: It is recommended to use a realistic model before experimenting with these optimizations. Note: the recipes allow the user to quickly experiment with these options, see below for more details.
Consult the Performance Handbook
Additional simulation-specific recommendations are documented here:
Performance Handbook
Note: the recipes already integrate many of these optimizations.
Multi-GPU and multi-NUMA support
If you are targeting single-node multi-GPU or single-node multi-NUMA CPUs, see:
Multi-device Tutorial
To check whether your platform has multiple NUMA domains, you can run:
```
lscpu
```
Explore additional tutorials
For a deeper understanding of DevitoPRO features such as compression and streaming, explore the tutorial collection here:
DevitoPRO Tutorials

A note on `recipes`

If your DevitoPRO installation includes the recipes package: - You still need to perform the DevitoTuner step described above. - Many advanced features (e.g., ABox, mixed precision, gradient compression/serialization) can be enabled directly from the command line.
We strongly recommend familiarizing yourself with: bash python run.py --help This command provides a list of available options and quick access to the most important DevitoPRO features.

top

How do environment variables like `CUDA_VISIBLE_DEVICES` interact with Operators?

There are various environment variables which can be set to control the GPUs visible to a system - these include CUDA_VISIBLE_DEVICES with Nvidia GPUs and HIP_VISIBLE_DEVICES/ROCR_VISIBLE_DEVICES for AMD. These variables perform a logical renumbering of GPUs for any program running inside an environment with these variables set. For example, if CUDA_VISIBLE_DEVICES=1,3 is set, then from the perspective of a program running in this environment, there will be two GPUs available, numbered 0 and 1, corresponding to physical GPUs 1 and 3 respectively.

Thus, one must be mindful of the interactions between CUDA_VISIBLE_DEVICES and similar, and other options such as the deviceid parameter which can be passed to Operator.apply() (or alternatively set in DEVITO_DEVICEID). As the operator sees the world from within the environment with logically-reordered GPUs, the device specified corresponds to that within the list of visible devices, rather than over the whole node. As an example, setting CUDA_VISIBLE_DEVICES=1,3 and passing deviceid=1 to Operator.apply() (or setting DEVITO_DEVICEID=1) will result in the operator being run on physical device 3, as this is at logical index 1 within the GPUs exposed to the operator in the environment.

Note also a similar interaction when running with MPI or the decoupler. If deviceid is not manually specified on a per-rank basis, each rank is assigned sequentially to a visible GPU. In the case of CUDA_VISIBLE_DEVICES=1,3 and two MPI ranks, rank 0 will be assigned to physical device 1 (logical device 0), whilst rank 1 will be assigned to physical device 3 (logical device 1).

By default, the first visible device will be used if no deviceid is specified.

top

I’ve set `CUDA_VISIBLE_DEVICES`, but the Operator is running on the wrong device

CUDA_VISIBLE_DEVICES and similar variables are checked at initialization, which occurs when Devito/DevitoPRO is first imported. It is the value of this variable at the point of initialization that determines the device on which the Operator will run. This environment variable should always be set prior to importing Devito/DevitoPRO. Note that setting or changing this variable after initialization will not change the device(s) on which the Operator runs, but may result in unexpected behaviour.

top

How should I go about parallelizing my code?

It is important to distinguish between parallelization across Non-Uniform Memory Access (NUMA) regions and parallelization within them to obtain performant code. A NUMA region consists of a processor and its local memory. This local memory is in contrast to remote memory, which is not directly connected to the processor of concern, and must be accessed via an interconnect with another processor. In most architectures, this NUMA region corresponds to a single socket (although some CPU architectures have multiple NUMA regions per socket) or a single GPU.

As local memory can be accessed much more quickly than remote memory, it is advisable to use shared-memory parallelism within a NUMA region, and distributed-memory parallelism when performing a single computation across NUMA regions. For computations which are too large to fit within a single NUMA region, this results in nested levels of parallelism - an inner shared-memory parallelism within the NUMA region, and an outer distributed-memory parallelism across NUMA regions.

In Devito, shared-memory parallelism is achieved through the use of parallel languages such as OpenMP for CPUs or CUDA/HIP/SYCL for GPUs. Distributed memory parallelism should be handled using MPI or the decoupler (which itself uses MPI under the hood).

Note that whilst distributed memory parallelism with OpenMP is possible, MPI + OpenMP is almost always the more performant approach. Socket and thread pinning is is also important to ensure correct distribution of threads and processes.

For many applications, there is an additional layer of task-parallelism in which a large number of embarassingly parallel tasks must be performed. Whilst one can use MPI for this (and even combine such approaches with domain decomposition via subcommunicators), for fault tolerance it is often better to use a task parallelism framework such as Dask for task orchestration. If tasks are sufficiently small to fit within a single NUMA region, then it is often most efficient to simply farm a task out to each processor as they become available, rather than splitting each task across processors.

top

How does domain decomposition work?

Domain decomposition refers to splitting a single shot across multiple computational resources. The typical configurations are:

For GPUs: - Single-node, multi-GPU - Multi-node, multi-GPU

For CPUs: - Single-node, multi-socket - Multi-node, multi-socket - It’s also possible to decompose across all cores within a single node, but in our experience, this rarely yields significant performance gains.

All of these configurations are supported in DevitoPRO.

The choice of configuration depends on both the problem being solved and the available hardware. Currently, we do not offer a utility that automatically suggests the optimal launch configuration for a given Operator and hardware target—but this may be added in the future.

Devito and DevitoPRO use MPI to implement domain decomposition. As explained in the documentation, great care has been taken to ensure that Devito programs can run with or without domain decomposition with little to no code changes. This is arguably one of the most user-friendly features of the system. In many cases, enabling domain decomposition is as simple as running your Python program with:

DEVITO_MPI=1 mpirun -n 4 <mpi options if any> python ...

Let’s distinguish two common scenarios:

You don’t need multi-node execution
This is often sufficient—even for high-frequency elastic FWI.
- For example, if a single multi-GPU node is enough to handle one shot, domain decomposition across the GPUs of that node is trivially simple. Thanks to the decoupler—an abstraction layer on top of MPI introduced in 2024—no additional work is required on your part (links to documentation below).
- In practice, enabling domain decomposition in this case is as simple as running:
```
DEVITO_DECOUPLER=1 python ...
```
You do need multi-node execution
This applies to both CPU and GPU targets.
- In this case, some additional responsibilities fall on you, the application developer. You are expected to handle:
  - Parallel I/O (e.g., for loading model data),
  - Collective operations (e.g., gathering gradient contributions on a single node).
  - Etc.

Resources: - 📘 MPI Overview Notebook
Explains the role of MPI in Devito, its API, and key implementation details. - ❓ FAQ: Controlling MPI Domain Decomposition - 🔗 The Decoupler
(Please replace /devitopro/ with devitopro-<clientname> as appropriate)

top

How do I know if I need domain decomposition for my FWI

Whether or not you need domain decomposition for your FWI/RTM/… problem – or better, to compute one shot in parallel across multiple GPUs or CPUs – is determined by the size of the problem and the available memory on your target architecture. Given M as the amount of DRAM available on the target architecture (e.g., DRAM on a single GPU, size of a NUMA domain on a CPU, etc), N as the number of fields in your PDEs (model parameters, wavefields), and S as the problem size (number of grid points), you can use a simple formula to determine if domain decomposition is necessary:

S * N * 4 / M

This formula is a reasonable approximation, assuming single precision calculations (hence the 4, which is sizeof(float)). We are assessing the possibility of adding a utility that automatically suggests the optimal launch configuration for a given Operator and hardware target, but this is not available yet.

top

Do I need to use mpirun for domain decomposition?

As of today, yes, you do need to use mpirun to run your code with domain decomposition. This is because DevitoPRO uses MPI to implement domain decomposition, and mpirun is the standard way to launch MPI applications. Even if you are using the Decoupler (see devitopro/demos/tutorials/decoupler.md), you at least need to launch your application with mpirun -n 1, then the other processes will be spawned automatically by the Decoupler. The reason for this is that most MPI implementations require mpirun to launch the initial process in order to set up the MPI environment correctly.

top

The decoupler is for single-node, but I need multi-node, what should I do?

As of today (June 2025), the Decoupler is designed for single-node execution, but we are considering extending it to support multi-node execution in the near future. Regardless, for multi-node execution there are things that will require the overarching application to help out, such as:

Parallel I/O to, e.g., fill in the model parameters (e.g., vp, rho, etc) on each node; this depends, among other things, on the data format used to store the model parameters. We might add support for some relevant formats in the future.
Collective operations to, e.g., gather the gradient contributions from each node. Whether or not support for this will be added in the future via an enhanced Decoupler is still to be determined.

That being said, multi-node support across the Devito/DevitoPRO ecosystem is extremely well supported thanks to its robustness, flexibility, and efficiency – even at extreme scales. This was documented in this paper, which summarizes 7 years of domain decomposition development in open-source Devito. Note that all the performance results in the paper are based on Devito, not DevitoPRO, which guarantees higher performance (especially on GPUs) thanks to more optimizations; however, the underlying principles are the same.

top

When I run DevitoPRO I get the error “Cannot compile an Operator for (TargetPlatform[nvidiaX], ‘advanced’, ‘cuda’)”

For completeness: the error message you see stems from an InvalidOperator exception, whish is raised by Devito (not DevitoPRO) when it is requested to compile an Operator for an unknown target language or architecture. In this case, you’re likely running a “pure Devito” example, that is one that does not import DevitoPRO in Python, so the cuda language (likely set via DEVITO_LANGUAGE=cuda, which is exactly what happens when running from the CUDA Dockerfile) isn’t recognized. The good news is that you don’t need to edit the source code to make your example run with DevitoPRO. We implemented preloading support in DevitoPRO so that you can run all of your “pure” Devito examples on top of DevitoPRO straightforwardly via

python -m devitopro myexample.py

The -m devitopro, in essence, preloads devito and devitopro, back to back, before jumping to myexample.py.

top

Why do I get the error “Cannot use autopadding when overriding Functions with ndarrays”

This error arises because DevitoPRO employs so called “autopadding”, which isn’t compatible with NumPy-array-based overriding of Functions/TimeFunctions at Operator.apply time.

Autopadding adjusts the amount of padding based on the size of the Function domain and its halo, offering a trade-off between a slight increase in memory usage and improved performance. So, for example, if a Function is defined over a Grid(shape=(512, 512, 512)), it might rather allocate the equivalent of (512, 512, 520) points. This happens under-the-hood – the user never sees these extra point:

.shape still returns (512, 512, 512);
.data provides a (512, 512, 512) view of the array;

Thus in practical terms the user doesn’t worry about any of this.

However, an issue occurs when a Function/TimeFunction is overridden with a numpy array at Operator.apply time, since this (legacy) action is incompatible with autopadding and results in the error you’re encountering. For example, the following code triggers this error:

dm = np.float32(initial_vp.data**(-2) - vp.data**(-2))
op.apply(..., dm=dm, ...)

To resolve this, you have two main options:

Disable Autopadding: You can turn off autopadding either through an environment variable or programmatically. This approach is also recommended if you notice performance degradation with autopadding or wish to minimize memory usage for padding. To disable autopadding:
- Using an environment variable:
```
DEVITO_AUTOPADDING=0
```
- Programmatically:
```
configuration['autopadding'] = False
```
Note: If you do experience performance issues with autopadding, please reach out for support.
Use a Function Directly: Instead of using the legacy numpy array overriding API, it’s advisable to utilize a Function directly. This method maintains compatibility with autopadding. For instance, you can rewrite the previous code snippet as follows:
```
dm = Function(name='dm', grid=vp.grid)
dm.data[:] = np.float32(initial_vp.data**(-2) - vp.data**(-2))
op.apply(..., dm=dm, ...)
```

top