DevitoTuner

DevitoTuner is a tool for automatically tuning Operators to achieve optimal performance on given hardware. There are many things to consider when it comes to maximizing application performance, many of which are summarized in demos/tutorials/performance_handbook.ipynb. Still, we strongly recommend giving DevitoTuner a go, as it can find an optimal configuration for your Operators without you having to worry about the details of the hardware or the many optimization options (“opt-options”) available in DevitoPRO.

DevitoTuner was devised with the following goals in mind:

Exceptionally easy to use: there is no need to know anything about the hardware or the opt-options available in DevitoPRO.
It shouldn’t take more than a few hours, typically 1-2 hours, to find the optimal configuration for a given Operator.
Run once, get the result you want, and forget about DevitoTuner until any of the following happens:
- You change the hardware.
- You change the Operator.
- You update from an old version of DevitoPRO.

DevitoTuner can be used with any Operator: it is designed to work seamlessly with compression, serialization, domain decomposition (MPI or Decoupler), and so on. It has reached an acceptable level of maturity, but it is still under development and subject to frequent updates, so please report any issues you may find.

How to use DevitoTuner

Say you have an application with multiple Operators, which you normally launch like this:

python mycode.py <options>

In particular, at some point, this application constructs and runs: Operator(name='MyTTI', ...). Then, if you want to let DevitoTuner tune this Operator (and this Operator only), you should instead execute the following command:

python -m devitotuner MyTTI mytti_tuning.json mycode.py <options>

where mytti_tuning.json is the name of the output file in which DevitoTuner stores the tuning results. The output file is created if it does not exist, or has a timestamp added if it already exists. In addition, the file mytti_tuning.err is created to store any additional information, warnings, or errors produced during the tuning process.

After a successful tuning

The produced mytti_tuning.json is a JSON file that contains the performance achieved by each variant of the designated MyTTI Operator that DevitoTuner has tried, as well as some metadata about the tuning process itself. Let’s see how to interpret this file.

    "target": "DeviceCudaTarget",
    "generated": {
        "0": "e64eeaf36e5b1d0963faa0ed2a66c1883fc09462",
        "1": "36ed39ed099b8a2c0ef6b97cf346dbe9bba3f482",
        "2": "769b60dca1dd66c832897d8daf25e66b290cc117",
        "3": "9ff4388a5915033fec6516a4e35bceaaa2bbb302",
        "4": "c3bf11b4570d960b9d03825eae71a960674af871",
        "5": "289a8f82e0275ee4dc6b8659d3b383431a4c05e0",
        "6": "e6ec4fba00a3d677ba9f7c1752c4ea79277e4f34"
    },
    "('advanced', {'errctl': 'basic', 'gpu-opt': True, 'gpu-vect': 'minimal', 'gpu-prefetch': [False, False], 'par-tile': [(32, 4), (16, 16)]})": {
        "probe": {
            "Runtime": 0.128,
            "GFlops/s": 2023.5201741813714,
            "GPts/s": 5.25589655631525,
            "Breakdown": [
                0.036876,
                0.090807
            ]
        }
    },
    <...>

We identify the following key elements in this JSON file:

target: the tuning strategy used, whose name typically encodes the language of the generated code (CUDA, in this case). We’ll see more about this later.
generated: each entry represents the filename of one variant of MyTTI that DevitoTuner has generated.
Finally, a sequence of entries that represent the performance achieved by each variant.
- The key is the string representation of a tuple that contains the optimization mode ('advanced', in this case) and the opt-options used to generate the variant.
- The value is a dictionary with the performance metrics achieved by this variant.

In this case, we see that the variant with opt-options

{'errctl': 'basic', 'gpu-opt': True, 'gpu-vect': 'minimal', 'gpu-prefetch': [False, False], 'par-tile': [(32, 4), (16, 16)]}

achieved a runtime of 0.128 seconds, 2023.52 GFlops/s, and 5.26 GPts/s. The Breakdown field can contain a varying number of entries, depending on the Operator itself; each entry represents a part of the Runtime, which typically corresponds to one of the tuned kernels (and, more rarely, to a subset of kernels).
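As a rough illustration of the key and value layout just described, the sketch below decodes a single entry using only the Python standard library. It assumes the key is a valid Python literal, as it appears in the excerpt; the metrics are copied from the excerpt and rounded for brevity. It is not part of DevitoTuner.

```python
import ast

# One entry copied from the excerpt (metrics rounded for brevity)
key = ("('advanced', {'errctl': 'basic', 'gpu-opt': True, "
       "'gpu-vect': 'minimal', 'gpu-prefetch': [False, False], "
       "'par-tile': [(32, 4), (16, 16)]})")
value = {"probe": {"Runtime": 0.128, "GFlops/s": 2023.52, "GPts/s": 5.26,
                   "Breakdown": [0.036876, 0.090807]}}

# The key is a string; literal_eval turns it back into (mode, opt-options)
mode, optoptions = ast.literal_eval(key)

# Each Breakdown entry is one part of the overall Runtime
probe = value["probe"]
breakdown_total = sum(probe["Breakdown"])

print(mode)
print(optoptions["par-tile"])
print(round(breakdown_total, 3), probe["Runtime"])
```

Here the two Breakdown entries add up to 0.127683 seconds, which is consistent with the reported Runtime of 0.128.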

More complex Operators may require testing hundreds of variants, so expect this file to be quite large. The ordering of the entries corresponds to the order in which the variants were tested, so the first entry is the first variant that was tested, and so on. DevitoTuner uses heuristics and the results of prior experiments to steer the search towards the most promising opt-options, so as one scrolls down the file, one should expect to see better and better performance metrics. The best variant – the one with the highest GPts/s – usually appears towards the end of the file.
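To make the selection criterion concrete, here is a minimal sketch that scans a results file for the entry with the highest GPts/s. It assumes that every entry other than target and generated follows the "probe" layout shown earlier; real files may contain additional keys or fields. The helper name best_by_gpts and the sample data (including its numbers) are hypothetical and only serve to make the example self-contained.

```python
import ast
import json
import os
import tempfile


def best_by_gpts(path):
    """Return (mode, optoptions, metrics) for the entry with the highest GPts/s."""
    with open(path) as f:
        results = json.load(f)

    best = None
    for key, value in results.items():
        # Skip metadata ("target", "generated") and anything lacking
        # the assumed "probe" layout
        probe = value.get("probe") if isinstance(value, dict) else None
        if not isinstance(probe, dict) or "GPts/s" not in probe:
            continue
        if best is None or probe["GPts/s"] > best[2]["GPts/s"]:
            mode, optoptions = ast.literal_eval(key)
            best = (mode, optoptions, probe)
    return best


# Made-up sample data mimicking the structure of a tuning results file
sample = {
    "target": "DeviceCudaTarget",
    "generated": {"0": "aaa", "1": "bbb"},
    "('advanced', {'par-tile': [(32, 4), (16, 16)]})":
        {"probe": {"Runtime": 0.128, "GPts/s": 5.26}},
    "('advanced', {'par-tile': [(32, 4, 8), (32, 4, 8)]})":
        {"probe": {"Runtime": 0.101, "GPts/s": 6.66}},
}

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(sample, tmp)
    sample_path = tmp.name

mode, optoptions, metrics = best_by_gpts(sample_path)
os.remove(sample_path)

print(mode, optoptions, metrics["GPts/s"])
```

Because the scan compares every entry, it does not rely on the best variant appearing near the end of the file.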

We provide a utility function to easily fetch the optimal variant from the tuning results. This function is called retrieve_best, and it can be used as follows:

from devitotuner.wrapper import retrieve_best

best = retrieve_best('mytti_tuning.json')

print(best)

This prints something along the lines of:

BestOpt(mode='advanced', optoptions={'gpu-opt': True, 'gpu-prefetch': [False, False], 'par-tile': [(32, 4, 8), (32, 4, 8)], 'deriv-unroll': 'inner', 'fact-schedule': 'basic', 'gpu-hoist': 2}, bs={}, defaultrun_speedup=1.8214425825253604)

Finally, all it takes to use the optimal variant in your code is to pass the mode and opt-options returned by retrieve_best to the Operator through its opt argument:

op = Operator(name='MyTTI', opt=(best.mode, best.optoptions), ...)

Environment Variables

DEVITO_TUNER_TARGET: This environment variable can be set to specify the tuning strategy to use, but it can be ignored in most cases, since the default strategy is usually the best one. However, in addition to the default strategy, the following targets are also supported:
- DeviceCudaLegacyTarget: This is only available if the CUDA backend is selected. This target mimics the original DevitoTuner behaviour. It explores a large space of options, disjoint from the ones explored by the default target. Experiments indicate that it rarely finds better configurations than the default target, despite taking much longer to complete.
DEVITO_TUNER_CIREMODE: This environment variable can be set to specify the CIRE mode to use during tuning, which determines the number of kernels (or “passes”) that will be generated. Multi-pass implementations might yield better performance when the computational demands of the 1-pass implementation exceed the hardware capabilities (e.g., register or shared memory pressure causing low occupancy on GPUs). Supported values are:
- 1pass
- 2pass
- automatic (default): the compiler decides the best mode to use based on the Operator characteristics.
- "('1pass', '2pass', '2pass', 'automatic', ...)": One specific mode for each kernel in the 1pass synthesis.
DEVITO_TUNER_EXTRA_OPT: A string representing a dictionary of additional opt-options to be applied to all variants during tuning. For example, setting this variable to "{'fission': 'aggressive:12'}" will ensure that all variants tested during tuning will have the fission option enabled with the given value.
DEVITO_TUNER_OPTIONS: A string representing a dictionary of additional tuner options. Supported keys are:
- timesteps_per_run: Controls the number of timesteps executed by each experiment performed by DevitoTuner (defaults to 20). Replaces the legacy DEVITO_TUNER_TIMESTEPS environment variable.
- blk_size_min: The smallest cache block size that the tuner tests (defaults to 2).
- blk_size_max: The largest cache block size that the tuner tests (defaults to 64).
- blk_size_step: The step between consecutive cache block sizes (defaults to 2).
DEVITO_TUNER_VERBOSE: This environment variable can be set to 0 to disable verbose output during the tuning process. When set to 0, no .err file will be generated.
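The variables above can be exported in the shell before launching the tuner, or set programmatically. The sketch below builds the command line and environment for a tuning run from Python. The helper build_tuner_invocation is hypothetical, and the option values (50 timesteps, a maximum block size of 32) are arbitrary; it also assumes that DEVITO_TUNER_OPTIONS accepts the same Python-dictionary string format shown for DEVITO_TUNER_EXTRA_OPT.

```python
import os
import sys


def build_tuner_invocation(op_name, output_json, script, script_args=(),
                           ciremode=None, extra_opt=None, tuner_options=None):
    """Return (cmd, env) for launching DevitoTuner on a script (hypothetical helper)."""
    env = dict(os.environ)
    if ciremode is not None:
        env["DEVITO_TUNER_CIREMODE"] = ciremode
    if extra_opt is not None:
        # repr() of a dict yields a string such as "{'fission': 'aggressive:12'}"
        env["DEVITO_TUNER_EXTRA_OPT"] = repr(extra_opt)
    if tuner_options is not None:
        # Assumption: same dictionary-as-string format as DEVITO_TUNER_EXTRA_OPT
        env["DEVITO_TUNER_OPTIONS"] = repr(tuner_options)

    cmd = [sys.executable, "-m", "devitotuner",
           op_name, output_json, script, *script_args]
    return cmd, env


cmd, env = build_tuner_invocation(
    "MyTTI", "mytti_tuning.json", "mycode.py",
    ciremode="2pass",
    extra_opt={"fission": "aggressive:12"},
    tuner_options={"timesteps_per_run": 50, "blk_size_max": 32},
)

print(" ".join(cmd[1:]))
print(env["DEVITO_TUNER_EXTRA_OPT"])

# To actually start the tuning (requires DevitoTuner to be installed):
# import subprocess; subprocess.run(cmd, env=env, check=True)
```

Copying os.environ rather than modifying it in place keeps the settings local to the tuning run.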

API

DevitoTuner also provides an API that can be used to programmatically tune Operators. The main function is tuner_enabled, which can be used as follows:

import tempfile
from devito import Grid, TimeFunction, Operator, Eq
from devitopro import *
from devitotuner import tuner_enabled

...

tmp = tempfile.NamedTemporaryFile(suffix='.json')

with tuner_enabled("DummyOp", tmp.name, target="DeviceCudaMaxTarget"):
    grid = Grid(shape=(40, 40, 40))

    u = TimeFunction(name='u', grid=grid)

    op = Operator(Eq(u.forward, u + 1.), name='DummyOp')

    op.apply(time_M=100)

content = eval(tmp.file.read())

# At this point, `content` is a dictionary identical to the one produced by the command line interface.