The performance optimization handbook

Below is a non-exhaustive list of factors to consider when aiming to maximize performance.

We identify three classes of performance “knobs”: those specific to DevitoPRO, those available in Devito, and those independent of Devito.

DevitoPRO

ABox API – always plug an adapting box.
Compression – use it unless you’re running on CPU and there’s enough DRAM to hold the wavefields.
DEVITO_LANGUAGE={cuda,hip} – CUDA/HIP generate faster code than OpenACC/OpenMP.
Novel compiler pipeline – CUDA/HIP only. Currently in beta. Switch it on by exporting DEVITO_BETA=1.
- To additionally switch on the vector types, export DEVITO_BETA=2.
Tune the block shape – via the par-tile option (keep reading for more info).
DEVITO_MPI=diag2 – more efficient MPI communication scheme.
Reduced precision – whenever possible, for instance float8 to store the material properties.
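
The environment variables above need to be set before Devito is imported, on the assumption that the configuration is read at import time. A minimal sketch of doing so from Python follows; the helper name devitopro_env and its defaults are hypothetical and not part of Devito or DevitoPRO.

```python
import os


def devitopro_env(language="cuda", beta=1, mpi=None):
    """Collect the environment variables for the knobs listed above.

    Hypothetical helper for illustration; not part of Devito or DevitoPRO.
    """
    if beta and language not in ("cuda", "hip"):
        raise ValueError("DEVITO_BETA requires DEVITO_LANGUAGE=cuda or hip")
    env = {"DEVITO_LANGUAGE": language}
    if beta:
        env["DEVITO_BETA"] = str(beta)  # 1: novel pipeline, 2: plus vector types
    if mpi is not None:
        env["DEVITO_MPI"] = mpi  # e.g. "diag2"; only useful when launched under MPI
    return env


# Assumption: this must run before `import devito` for the settings to take effect
os.environ.update(devitopro_env(language="cuda", beta=2, mpi="diag2"))
```

Exporting the same variables in the shell before launching the program is equivalent.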

Devito

Buffer API – If possible, use the save=Buffer(...) API to minimize the working set size (see demos/).
Tune the block shape
- Using the built-in autotuner (op.apply(..., autotune='aggressive'));
- Using DevitoTuner (keep reading for more info);
- Manually, supplying block size overrides to op.apply.
Minimize the halo size
- The statement f = Function(..., space_order=8, ...) will create a Function with a halo of 8 points on each side of each space dimension.
- However, not all of them might be necessary. This is problem-specific!
- Use space_order=(8, 4, 4) to keep the space order at 8 while enforcing a custom number of halo points – in this case 4 – on each side of each space dimension.
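
To get a feel for what a smaller halo buys, the allocated size can be estimated by hand. The sketch below is plain arithmetic, not a Devito call: the function name allocated_points and the 512-cubed grid are illustrative assumptions, and any extra padding Devito may add is ignored.

```python
from math import prod


def allocated_points(shape, halo):
    """Points allocated for one Function: the domain plus `halo` points
    on each side of every space dimension (padding ignored)."""
    return prod(n + 2 * halo for n in shape)


shape = (512, 512, 512)  # illustrative grid shape
full = allocated_points(shape, halo=8)  # as with space_order=8
trim = allocated_points(shape, halo=4)  # as with space_order=(8, 4, 4)

saving = 1 - trim / full
print(f"{full} vs {trim} points, {saving:.1%} fewer with the smaller halo")
```

The relative saving grows as the domain shrinks, so it matters most when each MPI rank owns a small subdomain.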

Mathematics

Optimize the space order – Can you afford lower spatial order in some problem dimensions?
Variable grid spacing
Different physics in different subdomains

Domain decomposition

MPI
- Process binding, typically one rank per NUMA domain.
- On multi-NUMA platforms, use MPI + OpenMP.
- Customize the domain-decomposition (e.g., decompose along the y-axis only).
  - Use the Grid topology API.
Decoupler
- Hides MPI away from the user, but it still uses MPI under the hood.
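
When weighing a custom decomposition against another, it can help to count per-rank subdomain shapes and halo points before running anything. The sketch below is a simple counting model in plain Python; local_shape and exchanged_points are hypothetical names, and the topology tuples assume the Grid topology API takes one rank count per dimension.

```python
from math import prod


def local_shape(shape, topology):
    """Largest per-rank subdomain when each dimension is split into
    `topology[d]` near-equal chunks."""
    return tuple(-(-n // p) for n, p in zip(shape, topology))


def exchanged_points(shape, topology, halo):
    """Upper bound on the halo points one rank sends per field, assuming
    two neighbours along every decomposed dimension (faces only)."""
    local = local_shape(shape, topology)
    total = 0
    for d, p in enumerate(topology):
        if p == 1:
            continue  # dimension not decomposed: nothing to exchange
        face = prod(n for i, n in enumerate(local) if i != d)
        total += 2 * halo * face
    return total


shape, halo = (512, 512, 512), 4  # illustrative values
for topology in [(1, 8, 1), (2, 4, 1)]:  # y-axis only vs x and y
    print(topology, local_shape(shape, topology),
          exchanged_points(shape, topology, halo))
```

The model only counts points. Message count and memory contiguity of the exchanged faces also affect performance, so the candidates still need to be timed on the target machine.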

The DevitoTuner package

Say you have a program with several Operators, and you want to tune one of them to achieve the best performance on your target architecture. You can use DevitoTuner to do that.

For example, if you want to tune Operator(name='MyTTI', ...), you can launch your program mycode.py as:

python -m devitotuner MyTTI mytti_tuning.json mycode.py <options>

This will take some time to complete. Once it finishes, mytti_tuning.json will contain the information needed to achieve the maximum performance. More details are available in devitopro/devitotuner/README.md.
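
If the tuning run has to be launched from a script, the command line above can be assembled programmatically. This is a sketch only: tuner_command is a hypothetical helper, and naming the output file after the Operator is a convention chosen here, not a DevitoTuner requirement.

```python
import sys


def tuner_command(operator, script, options=()):
    """Build the DevitoTuner command line shown above for one Operator.

    Hypothetical helper; the JSON file name is derived from the
    Operator name purely by convention.
    """
    report = f"{operator.lower()}_tuning.json"
    return [sys.executable, "-m", "devitotuner", operator, report, script,
            *options]


cmd = tuner_command("MyTTI", "mycode.py")
print(" ".join(cmd))
# To launch it: subprocess.run(cmd, check=True)
```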

If DevitoTuner feels too cumbersome for you…

Optimize CUDA/HIP – Do the following in the given order:

Tune the par-tile (see the performance notebook);
Try DEVITO_BETA=1 or higher to enable 2.5-blocking;
Tune the par-tile again;
Using the cire-schedule option, try several code variants (one-pass vs two-pass).
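
The par-tile steps above amount to timing a handful of candidate shapes and keeping the fastest. A minimal sketch of such a loop follows. The name best_par_tile and the candidate values are illustrative, and fake_run is a stand-in cost model so that the code runs without a GPU. In real use, run(tile) would build and apply the Operator with that par-tile (presumably via something like opt=('advanced', {'par-tile': tile}), which is an assumption to check against the performance notebook) and return the elapsed time.

```python
from itertools import product


def best_par_tile(run, candidates):
    """Return (tile, seconds) for the fastest candidate.

    `run(tile)` is user-supplied and returns the elapsed time in seconds
    for one execution with the given par-tile.
    """
    timings = {tile: run(tile) for tile in candidates}
    tile = min(timings, key=timings.get)
    return tile, timings[tile]


# Illustrative candidate block shapes
candidates = list(product((16, 32, 64), (4, 8), (4, 8)))


def fake_run(tile):
    # Stand-in cost model, fastest at (32, 4, 8); replace with a real timing
    x, y, z = tile
    return abs(x - 32) + abs(y - 4) + abs(z - 8) + 1.0


tile, seconds = best_par_tile(fake_run, candidates)
print(tile, seconds)
```

Since the best shape may shift once DEVITO_BETA is enabled, the same search is run again after that step.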