Data Streaming

The problem

Within a compute node, we recognize three key modules: the host, device, and disk.

Data streaming refers to the efficient movement of data between these modules.

This streaming process is critical in numerous scenarios, especially when the volume of data is too large for a single module. Here are some classic examples:

The data exceeds the capacity of the device memory.
Even with domain decomposition across multiple devices, the data still doesn’t fit into device memory.
While domain decomposition would allow the data to fit in device memory, it’s not feasible because different devices are allocated to handle distinct shots.
Despite compression, the data still surpasses the host memory’s capacity.

“So, what’s the solution?”

You don’t have to sweat it – DevitoPRO has it handled.

“Seriously?”

Well, there are “knobs” you might want to adjust for optimizing performance or for better integration with your application. So, it’s a good idea to read on. But the big picture? DevitoPRO has made the core data streaming challenges essentially non-issues.

How does DevitoPRO ensure high-performance data streaming?

In short: it does a ton. If you’re content taking that on faith, you can skip to the next section. If you’re curious about the details, read on.

The DevitoPRO compiler generates code that can move data between different modules. For instance, during a forward propagation run, a snapshot might be transferred from the device to the host, freeing up some space in device memory. Conversely, in a backward propagation run, snapshots not present in device memory are automatically fetched from the host.

However, that’s just a basic overview. There’s more under the hood.

DevitoPRO’s goals are:

Forward-propagation: keep the snapshots as close as possible to the module performing the computation (host, device). When running out of space, stream to the upper layer of the memory hierarchy.
Backward-propagation: fetch the snapshots from the upper layer of the memory hierarchy as early as possible, if not already present on the module performing the computation (host, device).

For context, let’s say you’re using DEVITO_PLATFORM={nvidiaX, amdgpuX, intelgpuX}. During a forward-propagation run, the aim is to keep as many snapshots on the device as possible. If the device runs out of memory, the subsequent snapshots are sent to the host. And if the host fills up, the following snapshots are dispatched to the disk. Conversely, during a backward-propagation run, the system first checks if the next snapshot is locally available, and if not, it initiates a data transfer either from the host or disk.

There’s significant runtime management ensuring we know which snapshots have been computed and where they’re stored. The situation gets even more intricate with compression because the snapshot size isn’t known statically. This is challenging for several reasons, some of which we discuss below.

Storing in reverse order for maximum back-propagation performance

Assume we’re running an Operator from time_m=1 to time_M=10, computing and compressing a snapshot at every time iteration on a device platform. For simplicity, assume there are only two layers – host and device (no disk). The snapshots at time=[1,...,k-1] are sent back to the host, while the snapshots at time=[k,...,10] remain on the device. This ensures that the backward Operator, which starts at time=10 and proceeds backward in time, finds the first snapshots it needs already on the device, eliminating any communication latency for them.

Crucially, k isn’t statically set! DevitoPRO dynamically reserves a portion of the available device memory to store as many snapshots as possible, thereby inherently minimizing k. In the ideal scenario, no snapshots are sent to upper layers.
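To make the role of k concrete, here is a minimal sketch in plain Python. It is not DevitoPRO's implementation: the function name `forward_placement`, the byte budget, and the "evict the oldest snapshot when space runs out" policy are all assumptions made for illustration. It only shows how a fixed device budget combined with variable (compressed) snapshot sizes yields a k that is known only at runtime.

```python
from collections import deque


def forward_placement(snapshot_sizes, device_budget):
    """Simulate where snapshots end up after a forward run (hypothetical policy).

    snapshot_sizes: list of (time, size_in_bytes), in forward time order.
    device_budget: bytes reserved on the device for snapshots.
    Returns (host_times, device_times).
    """
    device = deque()  # (time, size) pairs, oldest on the left
    host = []
    used = 0
    for t, size in snapshot_sizes:
        # Evict the oldest device-resident snapshots until the new one fits
        while device and used + size > device_budget:
            old_t, old_size = device.popleft()
            used -= old_size
            host.append(old_t)
        if size > device_budget:
            host.append(t)  # a snapshot larger than the whole budget goes straight to the host
        else:
            device.append((t, size))
            used += size
    return host, [t for t, _ in device]


# Ten snapshots of 3 bytes each and a 10-byte budget: the last three stay on the device
host, device = forward_placement([(t, 3) for t in range(1, 11)], device_budget=10)
k = device[0]
```

In this toy run the device ends up holding time=[8, 9, 10], so k=8. With compression, the sizes differ from snapshot to snapshot, which is why k cannot be fixed ahead of time.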

Overlapping data transfers with computation

All data transfers between modules are asynchronous and occur simultaneously with compute kernels.

In a forward-propagation run, the computed snapshots are handled by a POSIX thread and transferred, if necessary, from device to host or, if the host’s memory is full, to disk.
During backward-propagation, the required snapshots are prefetched from the host memory or disk to the device by a POSIX thread. So while computing timestep X, data for the next timestep to be processed might already be en route.

All POSIX threads perform non-blocking tasks and are optimized to minimize interference with the other compute threads.
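The prefetching pattern can be sketched with Python's standard `threading` and `queue` modules. This is only an analogy: DevitoPRO generates its own low-level code, and the names `prefetch`, `backward_sweep` and `depth` are hypothetical. A helper thread fetches snapshots in the order the backward run will need them, while the main thread consumes them.

```python
import queue
import threading


def prefetch(store, times, ready):
    """Producer: fetch snapshots in the order the backward run needs them."""
    for t in times:
        ready.put((t, store[t]))  # blocks while the queue already holds `depth` snapshots
    ready.put(None)  # sentinel: nothing left to fetch


def backward_sweep(store, time_m, time_M, depth=1):
    """Consumer: process snapshots from time_M down to time_m."""
    ready = queue.Queue(maxsize=depth)
    times = range(time_M, time_m - 1, -1)
    worker = threading.Thread(target=prefetch, args=(store, times, ready))
    worker.start()
    processed = []
    while True:
        item = ready.get()
        if item is None:
            break
        t, snapshot = item
        processed.append((t, snapshot))  # stand-in for the compute kernel
    worker.join()
    return processed


# `store` plays the role of host memory or disk
store = {t: f"snap{t}" for t in range(1, 6)}
result = backward_sweep(store, time_m=1, time_M=5)
```

Here `depth` bounds how far ahead the helper thread may run, which in turn bounds the extra memory needed for in-flight snapshots.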

Streamed TimeFunctions

A TimeFunction is streamed across different memory modules iff it’s a “saved” TimeFunction, that is, an object that might require a lot of memory.

from devito import Eq, Grid, TimeFunction, Operator
from devitopro import *

grid = Grid(shape=(4, 4))

# "normal" TimeFunction
u = TimeFunction(name='u', grid=grid)

# streamed TimeFunction
usave = TimeFunction(name='usave', grid=grid, save=5)

# streamed and (dummy-)compressed TimeFunction
ucomp = TimeFunction(name='ucomp', grid=grid, save=5, compression='noop')

usave has save=X and X != None, so it’s subjected to data streaming. We can easily see that u and usave have different types, since under the hood DevitoPRO has exploited the save argument to return special objects.

type(u)

TimeFunctionPro

type(usave)

EnrichedTimeFunction

type(ucomp)

CompressedTimeFunction

print("CompressedTimeFunctions are a special type of EnrichedTimeFunctions:", issubclass(type(ucomp), type(usave)))

CompressedTimeFunctions are a special type of EnrichedTimeFunctions: True

In terms of constructing Eqs, usave and ucomp behave almost identically to u – the only relevant difference is that a streamed TimeFunction (i.e., an EnrichedTimeFunction under-the-hood) cannot be both written and read in the same Operator; however, in practical use cases this is never necessary, so it’s a non-issue.

API

The API is pretty lean. Actually, you’ve already seen most of it! Saved TimeFunctions turn into EnrichedTimeFunctions under-the-hood, and this will trigger DevitoPRO to implement data streaming.

At the end of a forward-propagating Operator, the usave data may have been “auto-magically” split over different modules. For example, 10% of the overall data may reside in device memory, 20% in host memory, and the rest on disk.

The `layers` interface

Sometimes, you might want to push the entire dataset to the host or disk, like:

During testing, because accessing .data becomes more straightforward.
For integration with the overarching application, because for example there’s a lot going on between the forward and backward Operators (perhaps merely for legacy reasons), thus it is more practical to dump the entire streamed TimeFunction to disk.

You can dictate behaviors like “move everything to the host and keep it there” via the layers API. Let’s dive into it.

from devitopro.types.enriched import Host, Disk

udisk = TimeFunction(name='udisk', grid=grid, save=5, layers=Disk)

eqns = [Eq(u.forward, u + 1),
        Eq(udisk, u)]

op = Operator(eqns)

#NBVAL_IGNORE_OUTPUT

summary = op.apply(time_M=4)

Operator `Kernel` ran in 0.01 s

Given all we explained, there should be no surprises at this point if we get an exception when trying to access disk-resident data.

try:
    udisk.data
except ValueError as e:
    print(e)

Cannot access `data` as on disk

If we really want to access disk-resident data, for example for testing purposes, DevitoPRO comes to our rescue with a built-in function.

from devitopro.builtins import get_disk_data

print(get_disk_data(udisk))

[[[0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]]

 [[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]

 [[2. 2. 2. 2.]
  [2. 2. 2. 2.]
  [2. 2. 2. 2.]
  [2. 2. 2. 2.]]

 [[3. 3. 3. 3.]
  [3. 3. 3. 3.]
  [3. 3. 3. 3.]
  [3. 3. 3. 3.]]

 [[4. 4. 4. 4.]
  [4. 4. 4. 4.]
  [4. 4. 4. 4.]
  [4. 4. 4. 4.]]]

The layers keyword supports the following options:

DiskHostDevice - This setting is the default when targeting GPU backends. It prioritizes keeping TimeFunction data on the device as long as possible. If necessary, data then moves to host memory and, in extreme cases, may spill over to disk.
HostDevice - Similar to DiskHostDevice, but without the disk layer: by design, data never spills beyond host memory.
DiskHost - The default choice for CPU backends.
- Not permissible with GPU backends.
Host - Guarantees all data remains in host memory.
- Not allowed with GPU backends.
Device - Ensures all data is kept in device memory.
- Not valid for CPU backends.
Disk - Forces all data to be stored on disk.
NoLayers - Turns off lazy data streaming. Under a GPU backend, this means data is streamed between the device and host asynchronously. Ultimately, when control returns to Python, all data will be situated in host memory.

We want to stress that adjusting the layers keyword from its default setting should primarily be for testing and debugging purposes. If you find yourself needing to change it for other reasons, please reach out to us for assistance.

The `mem-perc` interface

By default, at op.apply-time, once lazy finalization ultimately happens, DevitoPRO reserves at most 70% of the available memory for the streamed TimeFunction. Note that:

It is “at most” because it can be way less; a streamed TimeFunction might be determined to require much less space in the worst-case scenario, taking into account things such as grid size, save value, and compression factor.
The “70%” applies to all modules: device memory, host memory, and disk.

The 70% threshold was chosen as a reasonable compromise between efficiency and resource consumption. All DevitoPRO knows is how much memory is available at that moment in time and how much of it will be required by the Operator, but it doesn’t know anything about the requirements of the overarching application.
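As a rough illustration of the “at most” rule, the reservation can be thought of as the smaller of two numbers: the mem-perc cap and the worst-case footprint of the streamed TimeFunction. The helper below is a hypothetical sketch, not a DevitoPRO function; it ignores compression and assumes 4-byte (float32) values.

```python
import math


def reserved_bytes(available, grid_shape, save, itemsize=4, mem_perc=0.7):
    """Upper bound on the bytes reserved on one module (illustrative only)."""
    # Worst case: every one of the `save` snapshots is stored uncompressed
    worst_case = math.prod(grid_shape) * save * itemsize
    # Cap imposed by the mem-perc threshold on the currently available memory
    cap = int(mem_perc * available)
    return min(cap, worst_case)


# Plenty of memory: the worst-case size (4*4*5*4 = 320 bytes) is the binding limit
small = reserved_bytes(available=1000, grid_shape=(4, 4), save=5)

# Scarce memory: the threshold becomes the binding limit
capped = reserved_bytes(available=100, grid_shape=(4, 4), save=5, mem_perc=0.5)
```

In the first call the result is 320 bytes, well below the cap; in the second it is 50 bytes, so the remaining snapshots would have to live on another layer.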

This threshold can be customized via the so-called mem-perc API, which comes in the form of both environment variables and configuration entries.

The available environment variables are:

DEVITO_LAZY_DEVICE_MEM_PERC
DEVITO_LAZY_HOST_MEM_PERC
DEVITO_LAZY_DISK_MEM_PERC

The corresponding entries in Devito’s global configuration dictionary are:

lazy-device-mem-perc
lazy-host-mem-perc
lazy-disk-mem-perc

The accepted values are numbers in the range [0, 1], with the default being 0.7 as explained above.
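For instance, the thresholds could be set from Python through `os.environ`. The helper `set_mem_perc` below is hypothetical (it is not part of DevitoPRO) and only validates the range before exporting the variables listed above. We assume here that the variables must be set before devito and devitopro are imported; if that doesn't hold in your setup, use the configuration entries instead.

```python
import os


def set_mem_perc(device=None, host=None, disk=None):
    """Export the mem-perc environment variables (hypothetical helper)."""
    values = {
        "DEVITO_LAZY_DEVICE_MEM_PERC": device,
        "DEVITO_LAZY_HOST_MEM_PERC": host,
        "DEVITO_LAZY_DISK_MEM_PERC": disk,
    }
    for name, value in values.items():
        if value is None:
            continue  # leave this module at its default
        if not 0 <= value <= 1:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
        os.environ[name] = str(value)


# Assumption: call this before importing devito/devitopro
set_mem_perc(device=0.5, host=0.6)
```

The disk threshold is left untouched in this example, so it keeps its default of 0.7.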

Streaming on CPU backends

Even when using CPU backends, streaming is still enabled. Clearly, in this case streamed TimeFunctions can only be transferred between the host and disk.

Since data streaming is handled by POSIX threads, and since these threads do not belong to the same pool as the OpenMP threads, it is recommended to tune your Operator by trying different values of nthreads. For example, if you’re running an Operator on 20 physical cores and your setup has two streamed TimeFunctions, you may want to try running with op.apply(..., nthreads=18), that is with a slightly smaller OpenMP thread pool, to avoid oversubscribing the cores.

This is especially true in the case of compression in tandem with streaming. Unlike streaming data to/from disk, compression/decompression aren’t I/O-bound operations, as they consume actual compute cycles to fetch data from DRAM and perform the related operation.

While in the future we expect DevitoPRO to automatically set an appropriate value for nthreads based on the number of loose POSIX threads in the Operator, this is still not the case today.
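Until then, a starting point for the tuning can be derived by hand. The helper below is a hypothetical sketch: it assumes one streaming thread per streamed TimeFunction, which matches the 20-cores example above but is not a documented guarantee, so treat the result as a first guess to refine by benchmarking.

```python
def suggest_nthreads(physical_cores, n_streamed):
    """First guess for the OpenMP pool size, leaving room for streaming threads."""
    # Assumption: one streaming thread per streamed TimeFunction
    return max(1, physical_cores - n_streamed)


nthreads = suggest_nthreads(physical_cores=20, n_streamed=2)
# Then, for example: op.apply(..., nthreads=nthreads)
```

With compression enabled, it may be worth trying even smaller values, since the streaming threads then consume compute cycles as well.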

Streaming and `estimate_memory`

When using Operator.estimate_memory() with DevitoPRO Operators, the output is enriched with an additional 'snapshots' field, showing the maximum size of the snapshots stored. The snapshots may be smaller than this due to compression, but since compression ratios cannot be meaningfully estimated a priori, an upper bound is provided instead. Note that whilst buffers used for snapshotting and streaming are incorporated into the estimates for host or device memory consumption, the snapshots themselves are shown separately, as they are moved between layers as necessary. For example:

op.estimate_memory()

MemoryEstimate(Kernel): {'host': '832 B', 'device': '0 B', 'snapshots': '320 B'}
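The 'snapshots' figure can be cross-checked by hand. Assuming the estimate refers to an Operator with a single saved TimeFunction on the 4x4 grid with save=5, as in the examples above, and assuming 4-byte (float32) values, the uncompressed upper bound works out as follows:

```python
import math

grid_shape = (4, 4)  # grid used throughout this tutorial
save = 5             # number of snapshots saved
itemsize = 4         # bytes per value, assuming float32

# Upper bound: every snapshot stored uncompressed
snapshots_bytes = math.prod(grid_shape) * save * itemsize
print(f"{snapshots_bytes} B")
```

This gives 16 points x 5 snapshots x 4 bytes = 320 B, matching the 'snapshots' entry reported above.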