DevitoPRO Profiling

devitoprofiler runs an application with profiling enabled and folds measurements from different layers of the software stack back into Devito’s performance summary.

The entry point is:

python -m devitoprofiler devitopro/recipes/run.py ...

By default, the launcher enables every profiling layer available for the current platform and runtime. Operator-level collection can be narrowed with --operator or --op when the active layer supports it:

python -m devitoprofiler --op ForwardFletcherTTI devitopro/recipes/run.py ...

Run python -m devitoprofiler --help to see the exact command-line options accepted by the current implementation.
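When the profiler is driven from a script rather than typed by hand, the invocation above can be assembled as an argument list. The following is a minimal sketch: the helper name build_profiler_command and its parameters are hypothetical, and only the -m devitoprofiler entry point and the --op flag come from this page.

```python
import sys


def build_profiler_command(script, script_args=(), operator=None):
    """Return an argv list equivalent to `python -m devitoprofiler ...`.

    `build_profiler_command` is a hypothetical helper, not part of DevitoPRO.
    """
    cmd = [sys.executable, "-m", "devitoprofiler"]
    if operator is not None:
        # --op narrows operator-level collection when the active layer supports it.
        cmd += ["--op", operator]
    cmd.append(script)
    # Arguments for the target script are forwarded unchanged.
    cmd.extend(script_args)
    return cmd


cmd = build_profiler_command(
    "devitopro/recipes/run.py", operator="ForwardFletcherTTI"
)
```

The resulting list can be passed to subprocess.run(cmd, check=True); building a list rather than a shell string avoids quoting problems in the target script's arguments.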

Reading the Output

The output is Devito’s normal performance summary enriched by active profiling layers. A layer may add a compact table, an external report path, raw values on the returned PerformanceSummary, or higher-level application metadata.

When a hardware roofline layer is active, DevitoPRO prints a table like:


  section/kernel  time   | FP         DRAM BW | roofline efficiency
  -------------------------------------------------------------------
  section0        100%   | 3.43 TF/s  2 TB/s  | [███████░░░] 71% DRAM

Performance[mode=advanced] arguments: {'deviceid': -1, 'devicerm': 1}
  Disclaimer: beta summary; the `.ncu-rep` is the source of truth.

Columns:

  • time: the section's local share of time within the profiled run.
  • FP: achieved floating-point throughput.
  • DRAM BW: achieved DRAM bandwidth.
  • DRAM SOL: DRAM speed-of-light percentage (achieved DRAM throughput as a percentage of peak), shown when exposed by the active layer.
  • roofline efficiency: achieved FP throughput divided by the attainable roofline throughput at the measured operational intensity.

Hardware roofline values are also attached to the returned summary:

summary.hardware
summary.hardware_fp_tflopss
summary.hardware_attainable_peak_pct
summary.hardware_dram_tbytess
summary.hardware_dram_sol
summary.hardware_roofline
summary.hardware_sections
summary.hardware_section_fp_tflopss
summary.hardware_section_attainable_peak_pct
summary.hardware_section_dram_tbytess
summary.hardware_section_dram_sol
summary.hardware_section_roofline
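Because a layer may not be active on every run, code that reads these attributes may want to tolerate their absence. The sketch below gathers a few of the run-level fields into a dictionary; the helper hardware_metrics is hypothetical, the stand-in object and its values are purely illustrative, and the attribute types (plain floats here) are an assumption rather than something this page specifies.

```python
from types import SimpleNamespace

# Attribute names taken from the list above; only run-level fields are read.
HARDWARE_FIELDS = (
    "hardware_fp_tflopss",
    "hardware_attainable_peak_pct",
    "hardware_dram_tbytess",
    "hardware_dram_sol",
)


def hardware_metrics(summary):
    """Collect run-level hardware fields, using None for any that are missing.

    Hypothetical helper: it only relies on attribute access via getattr.
    """
    return {name: getattr(summary, name, None) for name in HARDWARE_FIELDS}


# Stand-in for a returned PerformanceSummary, with illustrative values only.
fake_summary = SimpleNamespace(hardware_fp_tflopss=3.43, hardware_dram_tbytess=2.0)
metrics = hardware_metrics(fake_summary)
```

Fields that the active layer did not populate come back as None, so downstream reporting can skip them instead of raising AttributeError.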

Currently Supported Hardware Layers

NVIDIA CUDA / Nsight Compute

The current hardware layer collects Nsight Compute (ncu) data for CUDA Operators with time loops. Every eligible Operator is selected by default, but within one profiled process the launcher collects only the first launch of each matching generated kernel. Pass --operator or --op to restrict which Operator may provide those launches.

The launcher sets DEVITO_PROFILING=ncu, or DEVITO_PROFILING=ncu:OperatorName when narrowed to one Operator. You can also set that environment variable yourself and invoke Nsight Compute manually.
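For manual runs, the environment variable can be prepared in a copy of the current environment before starting the profiled process. This is a minimal sketch: profiling_env is a hypothetical helper, and only the two value forms ncu and ncu:OperatorName come from the description above. How Nsight Compute itself is invoked is left out on purpose.

```python
import os


def profiling_env(operator=None, base_env=None):
    """Return an environment mapping with DEVITO_PROFILING set for NCU runs.

    Hypothetical helper. `base_env` defaults to a copy of os.environ so the
    calling process's own environment is never modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    # "ncu" selects every eligible Operator; "ncu:<name>" narrows to one.
    env["DEVITO_PROFILING"] = "ncu" if operator is None else f"ncu:{operator}"
    return env


env_all = profiling_env(base_env={})
env_one = profiling_env("ForwardFletcherTTI", base_env={})
```

The returned mapping can be handed to subprocess.run(..., env=env_one) when launching the profiled command.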

Limitations:

  • selected Operators must use the CUDA backend;
  • decoupler runs are not supported;
  • counter collection is expensive, so Nsight Compute collects the first launch of each matching generated kernel from the profiled process;
  • Devito section timers for repeated kernel-launch sections skip their first invocation during NCU runs, so replay overhead from the collected launch is not accumulated into the regular section timing;
  • the summary includes Devito’s cumulative no-setup GPts/s before the NCU table; rows without NCU hardware metrics are marked not collected;
  • wall-clock runtime throughput is omitted because Nsight Compute replay makes section time unrepresentative.

The log prints the generated report path:

Operator `ForwardFletcherTTI` Nsight Compute report available in `/app/devito-cache/devito-profiling-uid481129/<operator-soname>.ncu-rep`

By default, source import points Nsight Compute at Devito’s JIT cache under ${TMPDIR:-/tmp}/devito-jitcache-uid$(id -u). Override it only when generated sources live elsewhere:

python -m devitoprofiler \
    --source-folders /app/devito-cache/devito-jitcache-uid481129 \
    your_app.py
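To check where the default source folder resolves on a given machine, the shell expression ${TMPDIR:-/tmp}/devito-jitcache-uid$(id -u) can be mirrored in Python. This is a sketch of that expression only, not of the launcher's internals; default_jitcache_dir is a hypothetical name, and os.getuid is available on POSIX systems only.

```python
import os
from pathlib import Path


def default_jitcache_dir(environ=None, uid=None):
    """Mirror ${TMPDIR:-/tmp}/devito-jitcache-uid$(id -u).

    Hypothetical helper. The shell `:-` form falls back to /tmp when TMPDIR
    is unset or empty, which `or` reproduces here.
    """
    environ = os.environ if environ is None else environ
    uid = os.getuid() if uid is None else uid
    tmpdir = environ.get("TMPDIR") or "/tmp"
    return Path(tmpdir) / f"devito-jitcache-uid{uid}"


example = default_jitcache_dir({"TMPDIR": "/scratch"}, uid=1000)
```

Passing the environment and uid explicitly, as in the example, makes the function easy to check without touching the real process environment.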

Metric notes:

  • FP uses the dominant single-precision or double-precision NCU instruction counters with fp_ops = fadd + fmul + 2*ffma.
  • DRAM BW comes from dram__bytes.sum.per_second.
  • DRAM SOL comes from gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed.
  • roofline efficiency is achieved FP / attainable roofline peak, where the attainable peak is the lower of the compute roof and the DRAM roof.
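The arithmetic behind the FP and roofline efficiency notes can be sketched as follows. The function names and the peak values in the example are hypothetical, and the DRAM roof is computed with the standard roofline model (operational intensity times peak DRAM bandwidth), which is an assumption about how the layer derives it.

```python
def fp_ops(fadd, fmul, ffma):
    """Floating-point operation count: a fused multiply-add counts as two."""
    return fadd + fmul + 2 * ffma


def roofline_efficiency(achieved_flops, achieved_dram_bytes,
                        peak_flops, peak_dram_bytes):
    """Achieved FP throughput divided by the attainable roofline peak.

    All arguments are per-second rates (FLOP/s and byte/s).
    """
    # Operational intensity in FLOP per byte of DRAM traffic.
    intensity = achieved_flops / achieved_dram_bytes
    # DRAM roof: the throughput the memory system allows at this intensity.
    dram_roof = intensity * peak_dram_bytes
    # The attainable peak is the lower of the compute roof and the DRAM roof.
    attainable = min(peak_flops, dram_roof)
    return achieved_flops / attainable


ops = fp_ops(fadd=10, fmul=20, ffma=30)
# Hypothetical device: 20 TFLOP/s compute peak, 2.5 TB/s DRAM peak.
eff = roofline_efficiency(4e12, 2e12, peak_flops=20e12, peak_dram_bytes=2.5e12)
```

In this example the kernel sits under the DRAM roof (2 FLOP/byte times 2.5 TB/s is 5 TFLOP/s, below the compute roof), so the efficiency reduces to achieved DRAM bandwidth over peak DRAM bandwidth.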

The launcher adds this package’s root and Nsight Compute’s extras/python directory to PYTHONPATH before executing the target command. Set NCU_REPORT_PYTHONPATH for custom Nsight Compute installations.
