Devito Cluster - the whys and the whats

Being Devito Codes

Featured image

One of Devito’s core missions is performance portability. Our current roadmap targets the matrix of architectures and parallel programming models below.

Architecture CUDA HIP MPI OpenACC OpenMP
AMD/CPU     Yes   Yes
AMD/GPU   Nov’22 Yes   N/A
ARM     Yes N/A Yes
Intel/CPU     Yes N/A Yes
Intel/KNC,KNL     Yes N/A Yes
Intel/GPU     TBA   TBA
NVidia Yes   Yes Yes Yes

To support the delivery of this vision, since January 2020 Devito Codes has maintained a distributed cluster as part of our DevOps infrastructure for both open-source Devito and DevitoPRO. This provides services such as:

While Devito Codes invested directly in hardware at the outset, long term loans and donations now make up the majority of the cluster 3. At the time of writing the cluster is comprised of:

Supplier units CPU GPU Sponsor
Dell 1 AMD 4 X A100 (80G) NVidia & Dell
HP 1 Intel Xeon   Devito Codes
Fujitsu 1 ARM64   Fujitsu
self (custom build) 1 AMD 2 x MI50 Devito Codes (server) & AMD (GPUs)
self (custom build) 4 Intel/AMD PC CPUs 3 x RTX3090, RTX3080 Devito Codes
Supermicro 4 Intel Xeon Gold 4 x V100 NVidia & Supermicro

While we were optimizing for the Intel KNC and KNL, [DUG]{https://dug.com/} provided us with access to nodes for DevOps. Today they are running their own DevOps on KNC and KNL for DUG Wave which depends upon Devito. So far no problems have been experienced so long as we monitor for performance regression on Intel Xeon’s running with OpenMP.

We do all our deployment on the cluster using Docker containers. We do this to ensure that the same environment is used for testing and deployment. This has also been useful for performance debugging on Cloud platforms, such as AWS and Azure, because we can differentiated between Docker related issues and the underlying platform.

Early on we relied on Cloud computing nodes rather than our own hardware. However, we ran into a number of issues:

There is no doubt that we will revisit the Cloud in the future with a smarter strategy to control costs. However, we are committed to continuing with our strategy of maintaining our own hardware so we can always drill down to bare metal when we are looking for maximum performance.

  1. If it is not tested, then it is broken. 

  2. Performance benchmark specification includes code verification for correctness. 

  3. Many thanks to all the vendors that helped make the Devito Cluster happen: AMD, Dell, DUG, NVidia, Supermicro