Compression

There exist a variety of approximation methods to reduce the memory footprint of adjoint-based optimization problems such as FWI. Among the most popular is compression, which is widely used in industry.

In this notebook we show how to compress and decompress TimeFunctions with DevitoPRO.

Support

Currently, DevitoPRO supports the following compression libraries:

Bitcomp. This is the state-of-the-art compression library by NVidia for seismic imaging. It employs an adaptive delta for integer quantization based on the maximum amplitude value at compression time.
CvxCompress. This is the MIT-licensed wavelete-based compression library by Chevron specifically designed for wavefield compression. It emplys a multi-scale multi-dimensional wavelet transform. It employs a relative apmlitude threshold compscale to zero out small wavelet coefficients and local blocks for parallelization.

These libraries can be used with the following combinations of platform, arch, and language:

Library	platform	arch	language
Bitcomp	cpu64	nvc	{C, openmp}
Bitcomp	nvidiaX	cuda	cuda
CvxCompress	cpu64	nvc/gcc/icc	{C, openmp}

In the table above, cpu64 represents any CPU architecture while nvidiaX represents any NVidia GPU. For a comprehensive list of possible platform, arch, and language values, see here.

API

It is up to the user to state what TimeFunctions will be compressed and decompressed. The API is straightforward:

usave = TimeFunction(name='usave', grid=grid, save=nt, compression=<compression library name>)

The compiler will generate code that:

compresses usave upon each write.
decompresses usave before it’s read.

It’s as simple as that!

Bitcomp

usave = TimeFunction(name='usave', grid=grid, save=nt, compression='bitcomp')

We recall from above that Bitcomp uses an adaptive delta for integer quantization based on the maximum amplitude value at compression time. The maximum error between the uncompressed and the original data will not exceed delta. Since

delta = (dataMax - dataMin) / (2^nbits - 1),

the user can control the maximum error via nbits, which, in other words, represents the number of bits a floating point number gets mapped to. After empirical evaluation, we chose nbits=9 as default value, which means delta = dataMax / 511. One can easily try different nbits values by supplying an override at Operator-application time, for example:

op = Operator(...)
op.apply(..., nbits=16)

CvxCompress

usave = TimeFunction(name='usave', grid=grid, save=nt, compression='cvxcompress')

We recall from above that CvxCompress uses an adaptive compscale for thresholding based on the maximum amplitude value at compression time. The user can control the maximum error via compscale, which, in other words, represents the how much values will be thrown out in the avelet domain. After empirical evaluation, we chose compscale=1e-3 as default value. One can easily try different compsclae values by supplying an override at Operator-application time, for example:

op = Operator(...)
op.apply(..., compscale=1e-5)

where higher values will lead to more compression (and more loss) while smaller values will lead to less compression and more accurate reconstruction. Recommended values are between 1e-2 and 1e-6.

Usage

Alright, let’s look at an actual example. We’re gonna use the built-in “dummy” compression library noop, which, instead of actual compression/decompression routines, uses memcpy’s, thus simulating the worst-case scenario for compression efficiency. This makes the example compression-agnostic and readily runnable on any system as long as DevitoPRO is installed.

We start with declaring some symbolic objects.

from devito import Eq, Grid, Operator, TimeFunction
from devitopro import *
from devitopro.types.enriched import NoLayers

nt = 4
grid = Grid(shape=(4, 4))

u = TimeFunction(name='u', grid=grid)
usave = TimeFunction(name='usave', grid=grid, save=nt, compression='noop')

v = TimeFunction(name='v', grid=grid)
vsave = TimeFunction(name='vsave', grid=grid, save=nt, layers=NoLayers)

We now create a trivial forward-marching Operator, which emulates a PDE solve.

eqns0 = [Eq(u.forward, u + 1.),
         Eq(usave, u)]
op0 = Operator(eqns0)

There are some lessons here:

First of all, we observe there are two TimeFunctions: u is used in the PDE, usave is, in essence, for stacking snapshots.
In Eq(usave, u), usave gets written; as explained above, this will trigger data compression after the Eq is performed. This is opaque to the user, and – good news – no other actions are expected on the user front. However, it may help to know when compression takes place.
In this example we’re using compression='noop', so the content of u at a given timestep will be transfered “as is” into usave.
A more complicated Eq could have been used, for example Eq(usave, u.dt2). No issues with that! In this case, first u.dt2 gets calculated, then the result is compressed into usave.

Let’s now proceed with a trivial backward-marching Operator, which emulates an adjoint operator.

eqns1 = [Eq(v.backward, v - 1),
         Eq(vsave, usave)]
op1 = Operator(eqns1)

Some more observations:

In Eq(vsave, usave), usave is read; as explained above, this will trigger data decompression before the Eq is performed. Again, it may help to know when decompression “logically” takes place.
We say “logically” because, under-the-hood, the compiler performs several optimizations to maximize the efficiency of the decompression, among which:
- Anticipating decompression as much as possible, in tandem with prefetching if the data resides on disk or in GPU memory;
- Hiding (overlapping) decompression with computation whenever possible.
Again, these are just low-level implementation details, nothing the user needs to worry about directly.
More complicated Eqs are supported, e.g. Eq(vsave, usave.dx2).

OK, let’s now try running these two Operators.

#NBVAL_IGNORE_OUTPUT

summary0 = op0.apply(time_M=nt-1)
summary1 = op1.apply()

Operator `Kernel` ran in 0.01 s
Operator `Kernel` ran in 0.01 s

Alright, they ran. What do we expect to see now? Well, at the end of op1, vsave should contain the same data as usave. So let’s have a look.

import numpy as np
from devitopro.builtins import decompress

# Access decompressed data
usave_dec, = decompress(usave)

assert np.all(usave_dec.data == vsave.data)
assert all(np.all(vsave.data[i] == i) for i in range(nt))

print(vsave.data)

[[[0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]]

 [[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]

 [[2. 2. 2. 2.]
  [2. 2. 2. 2.]
  [2. 2. 2. 2.]
  [2. 2. 2. 2.]]

 [[3. 3. 3. 3.]
  [3. 3. 3. 3.]
  [3. 3. 3. 3.]
  [3. 3. 3. 3.]]]

Implications

There are some implications on the user front.

The .data attribute will only show a representation of the compressed data.
The .shape attribute will return a dummy value, as the compressed data is inherently unstructured.

However, in real applications there is hardly a need to access the .data field directly: usave will be compressed in a forward modelling operator and seamlessly decompressed in an adjoint operator. However, should the user want to peek at the data in between these two operators, the built-in function decompress is provided:

from devitopro.builtins import decompress
data = decompress(usave)

Limitations

Currently, compressed TimeFunctions are subjected to some limitations.

An Operator can write or read a compressed TimeFunction, but not both.
A compressed TimeFunction is a “one-time write” object: it can only be written by a specific Operator, and only via a single call to op.apply(). Otherwise, an exception will be raised by the runtime.

On the other hand, a compressed TimeFunction can be read by an arbitrary number of Operators.

We introduced these limitations to simplify the implementation.

However, in our experience, their significance in the context of PDE-constrained optimization algorithms is negligible.

FAQs

My code would look neater if I could re-initialize `usave` and use it to compute another shot in my inversion loop.

Fair enough. Compressed TimeFunctions often need a large number of arguments to be instantiated, so we’d like to avoid that. Well, it turns out you can easily create “twin” or “clone” TimeFunctions via the .func API.

usave1 = usave.func(name='usave1')

It doesn’t matter at this point whether usave had already been consumed or not. usave1 is a brand-new object, which at the moment carries no data whatsoever. And it’s structurally identical to usave except for the name. In other words, the above is equivalent to:

usave1 = TimeFunction(name='usave1', <all other arguments used to create `usave`> )

Requirements

All the compression libraries described in this notebook are provided out-of-the-box by DevitoPRO, so they don’t need to be installed separately.

Further, if you’re using our Dockerfiles, you’re already good to go.

Otherwise, you should be aware of the following dependencies:

Bitcomp. The NVidia HPC SDK must be available on the system, even when running on just CPUs. The NVidia HPC SDK ships the CUDA toolkit, which in turn provides header files (e.g., for half-precision types) that will be required by Bitcomp at jit-compilation time. So, yes – CUDA must be around even if you don’t care about NVidia GPUs.