The performance optimization handbook
Below is an incomplete list of factors to consider when aiming to maximize performance.
We identify three classes of performance “knobs”: those specific to DevitoPRO, those specific to Devito, and those independent of Devito.
DevitoPRO
- ABox API – always plug in an adapting box.
- Compression – unless you’re running on CPU and believe there’s enough DRAM for the wavefields.
- DEVITO_LANGUAGE={cuda,hip} – CUDA/HIP generate faster code than OpenACC/OpenMP.
- Novel compiler pipeline – CUDA/HIP only. Currently in beta. Switch it on by exporting DEVITO_BETA=1.
  - On top of it, switch on the vector types: DEVITO_BETA=2.
- Tune the block shape – via the par-tile option (keep reading for more info).
- DEVITO_MPI=diag2 – more efficient MPI communication scheme.
- Reduced precision – whenever possible, for instance float8 to store the material properties.
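The environment-variable knobs in the DevitoPRO list can also be set from Python rather than from the shell. The sketch below is one possible way to do that: `devitopro_env` is a hypothetical helper (not part of Devito or DevitoPRO), and it assumes the variables need to be in place before `devito` is imported.

```python
import os


def devitopro_env(language="cuda", beta=None, mpi=None):
    """Collect the chosen knobs as a dict of environment variables.

    Hypothetical helper for illustration; only the variable names
    (DEVITO_LANGUAGE, DEVITO_BETA, DEVITO_MPI) come from the handbook.
    """
    if language not in ("cuda", "hip"):
        raise ValueError("this sketch only covers the cuda and hip backends")
    env = {"DEVITO_LANGUAGE": language}
    if beta is not None:
        env["DEVITO_BETA"] = str(beta)  # 1: novel pipeline, 2: plus vector types
    if mpi is not None:
        env["DEVITO_MPI"] = mpi  # e.g. "diag2"
    return env


settings = devitopro_env(language="cuda", beta=2, mpi="diag2")

# Assumption: apply the settings before `import devito` so they are picked up.
os.environ.update(settings)
```

Keeping the settings in a dict makes it easy to log exactly which knobs were active for a given benchmark run.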
Devito
- Buffer API – If possible, use the save=Buffer(...) API to minimize the working set size (see demos/).
- Tune the block shape
  - Using the built-in autotuner (op.apply(..., autotune='aggressive'));
  - Using DevitoTuner (keep reading for more info);
  - Manually, supplying block size overrides to op.apply.
- Minimize the halo size
  - The statement f = Function(..., space_order=8, ...) will create a Function with 8 points along each space dimension side.
  - However, not all of them might be necessary. This is problem-specific!
  - Use space_order=(8, 4, 4) to enforce a custom number of points – in this case 4 – along each space dimension side.
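To get a feel for how much the Buffer and halo items above change the working set, a back-of-envelope estimate can help. The sketch below is plain arithmetic, not Devito code: `wavefield_bytes` is a hypothetical helper, the grid shape, time-step count and buffer size are made-up values, and any extra padding an implementation may add is ignored.

```python
from math import prod


def wavefield_bytes(shape, halo, time_slots, itemsize=4):
    """Rough size of a wavefield with `halo` points on each side of every
    space dimension and `time_slots` copies in time (float32 by default)."""
    points = prod(n + 2 * halo for n in shape)
    return points * time_slots * itemsize


shape = (512, 512, 512)  # illustrative grid, not from the handbook

# Saving every one of 1000 time steps, halo of 8 points per side
full_history = wavefield_bytes(shape, halo=8, time_slots=1000)

# Same halo, but only a 3-slot buffer in time (assumed buffer size)
buffered = wavefield_bytes(shape, halo=8, time_slots=3)

# Same buffer, halo trimmed to 4 points per side
buffered_small_halo = wavefield_bytes(shape, halo=4, time_slots=3)

for label, size in [("full history", full_history),
                    ("buffered", buffered),
                    ("buffered, halo=4", buffered_small_halo)]:
    print(f"{label:>18}: {size / 2**30:8.2f} GiB")
```

In this toy setting the time buffer dominates the saving, while trimming the halo gives a smaller gain of a few percent per field; the relative weight of the two will differ from problem to problem.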
Mathematics
- Optimize the space order – Can you afford lower spatial order in some problem dimensions?
- Variable grid spacing
- Different physics in different subdomains
Domain decomposition
- MPI
- Process binding, typically one rank per NUMA domain.
- On multi-NUMA platforms, use MPI + OpenMP
- Customize the domain-decomposition (e.g., decompose along the y-axis only).
- Use the Grid topology API.
- Decoupler
- Hides MPI away from the user, but it still uses MPI under the hood.
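When weighing a custom decomposition such as "y-axis only" against a default multi-axis split, a crude cost model can make the trade-off visible. The sketch below is an assumption-laden estimate, not Devito code: both helpers are hypothetical, it counts only face exchanges for an interior rank, and it ignores corners, boundaries and memory-layout effects. The resulting tuple is the kind of value one would presumably hand to the Grid topology API.

```python
from math import ceil, prod


def local_shape(shape, topology):
    """Per-rank subdomain shape when `shape` is split by `topology`."""
    return tuple(ceil(n / p) for n, p in zip(shape, topology))


def halo_points_per_rank(shape, topology, halo):
    """Points an interior rank sends per halo update (two faces per split axis)."""
    local = local_shape(shape, topology)
    total = 0
    for axis, parts in enumerate(topology):
        if parts > 1:
            # Area of the face orthogonal to `axis`
            face = prod(n for i, n in enumerate(local) if i != axis)
            total += 2 * halo * face
    return total


def face_neighbours(topology):
    """Number of face neighbours of an interior rank."""
    return 2 * sum(1 for parts in topology if parts > 1)


shape, halo = (512, 512, 512), 4  # illustrative values
y_only = (1, 8, 1)                # 8 ranks, split along y only
cubic = (2, 2, 2)                 # 8 ranks, split along every axis

for topo in (y_only, cubic):
    print(topo, local_shape(shape, topo),
          halo_points_per_rank(shape, topo, halo), face_neighbours(topo))
```

Under this model the y-only split exchanges more points but talks to fewer neighbours, so which one wins in practice depends on the network and on the problem; measuring both is the only reliable guide.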
The DevitoTuner package
Say you have a program with several Operators, and you want to tune one of them to achieve the best performance on your target architecture. You can use DevitoTuner to do that.
For example, if you want to tune Operator(name='MyTTI', ...), you can launch your program mycode.py as:
python -m devitotuner MyTTI mytti_tuning.json mycode.py <options>
This will take some time to complete. Once it finishes, mytti_tuning.json will contain the information needed to achieve the maximum performance. More details are available in devitopro/devitotuner/README.md.
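If several Operators need tuning, the command line above can be assembled programmatically. The following is a minimal sketch: `tuner_command` is a hypothetical helper, the `<name>_tuning.json` naming simply mirrors the example above, and "MyAcoustic" is a made-up Operator name. The call that would actually launch the tuner is left commented out.

```python
import sys


def tuner_command(operator_name, script, extra_args=()):
    """Build the argv for one DevitoTuner run, following the CLI shape
    shown in the handbook: <operator> <output.json> <script> <options>."""
    output = f"{operator_name.lower()}_tuning.json"  # naming is an assumption
    return [sys.executable, "-m", "devitotuner",
            operator_name, output, script, *extra_args]


cmd = tuner_command("MyTTI", "mycode.py")
print(" ".join(cmd))

# Tune several Operators one after the other (names are illustrative).
commands = [tuner_command(name, "mycode.py") for name in ("MyTTI", "MyAcoustic")]

# Where DevitoTuner is installed, each command could be launched with:
# import subprocess
# for c in commands:
#     subprocess.run(c, check=True)
```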
If DevitoTuner feels too cumbersome for you…
Optimize CUDA/HIP – Do the following in the given order:
- Tune the par-tile (see the performance notebook);
- Try DEVITO_BETA >= 1 to enable 2.5-blocking;
- Tune the par-tile again;
- Using the cire-schedule option, try several code variants (one-pass vs two-pass).
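The manual par-tile tuning steps above amount to a small search over tile shapes. The sketch below shows one way to organise that search. It assumes each par-tile entry maps to a thread-block dimension, so that the product must stay within the 1024 threads-per-block limit. `run_and_time` is a hypothetical callable that you would write yourself, for instance by rebuilding the Operator with the given par-tile and returning the elapsed time. A fake cost model stands in for it here so the example runs without a GPU.

```python
from itertools import product

# CUDA thread-block limit; assumed here to apply to the HIP targets as well.
MAX_THREADS_PER_BLOCK = 1024


def candidate_tiles(sizes=(1, 2, 4, 8, 16, 32, 64), min_threads=64):
    """Yield 3D tile shapes whose thread count lies in a sensible range."""
    for tile in product(sizes, repeat=3):
        threads = tile[0] * tile[1] * tile[2]
        if min_threads <= threads <= MAX_THREADS_PER_BLOCK:
            yield tile


def best_tile(run_and_time, tiles):
    """Time every candidate and return the fastest tile plus all timings."""
    timings = {tile: run_and_time(tile) for tile in tiles}
    return min(timings, key=timings.get), timings


def fake_timer(tile):
    # Stand-in cost model, NOT a real measurement: pretends (32, 4, 4) is ideal.
    return abs(tile[0] - 32) + abs(tile[1] - 4) + abs(tile[2] - 4)


tile, timings = best_tile(fake_timer, candidate_tiles())
print("best tile:", tile, "out of", len(timings), "candidates")
```

Running the same search again after changing DEVITO_BETA matches the "tune, switch, tune again" order of the list, since the best tile for one code variant is not necessarily the best for another.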