Devito DevOps Cluster v2

Now with Instinct™

TLDR

Thanks to support from AMD we are happy to announce:

Nitty Gritty

One of Devito’s core missions is performance portability. Our current roadmap targets the matrix of architectures and parallel programming models below.

Architecture CUDA HIP MPI OpenACC OpenMP
AMD/CPU      
AMD/GPU    
ARM      
Intel/CPU      
Intel/KNC,KNL      
Intel/GPU     TBA   TBA
NVidia  

To support the delivery of this vision, since January 2020 Devito Codes has maintained a distributed cluster as part of our DevOps infrastructure for both open-source Devito and DevitoPRO. This provides services such as:

While Devito Codes invested directly in hardware at the outset, long-term loans and donations now make up the majority of the cluster (see note 3). At the time of writing, the cluster comprises:

| Model | Units | CPU | GPU | Sponsor |
| --- | --- | --- | --- | --- |
| AMAX AceleMax DGS-428 4U Server | 1 | AMD EPYC 7643 48-Core Processor | 4 x AMD Radeon Instinct MI210 | AMAX |
| Dell | 1 | AMD | 4 x A100 (80G) | NVidia & Dell |
| HP | 1 | Intel Xeon | | Devito Codes |
| Fujitsu A64FX | 1 | ARM64 | | Fujitsu |
| self (custom build) | 1 | AMD | 2 x MI50 | Devito Codes (server) & AMD (GPUs) |
| self (custom build) | 4 | Intel/AMD PC CPUs | 3 x RTX3090, RTX3080 | Devito Codes |
| Supermicro | 4 | Intel Xeon Gold | 4 x V100 | NVidia & Supermicro |

While we were optimizing for the Intel KNC and KNL, [DUG](https://dug.com/) provided us with access to nodes for DevOps. Today they run their own DevOps on KNC and KNL for DUG Wave, which depends upon Devito. So far we have experienced no problems, so long as we monitor for performance regressions on Intel Xeons running with OpenMP.
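As a rough illustration of what monitoring for performance regressions can look like, the sketch below compares median benchmark runtimes against a stored baseline and flags anything that has slowed down by more than a tolerance. This is not the Devito DevOps implementation: the function name, the benchmark names, the timings and the 10% threshold are all assumptions made for the example.

```python
from statistics import median


def find_regressions(baseline, current, tolerance=0.10):
    """Return benchmarks whose median runtime grew by more than `tolerance`.

    `baseline` and `current` map a benchmark name to a list of runtimes in
    seconds. The 10% default threshold is an illustrative assumption.
    """
    regressions = {}
    for name, base_times in baseline.items():
        if name not in current:
            continue  # benchmark was not run this time; nothing to compare
        base = median(base_times)
        now = median(current[name])
        slowdown = (now - base) / base  # relative change versus baseline
        if slowdown > tolerance:
            regressions[name] = slowdown
    return regressions


# Hypothetical timings (seconds) for two made-up benchmarks.
baseline = {"acoustic_so8": [10.0, 10.2, 9.9], "tti_so8": [30.1, 29.8, 30.0]}
current = {"acoustic_so8": [10.1, 10.0, 10.3], "tti_so8": [36.2, 35.9, 36.0]}

regressions = find_regressions(baseline, current)
for name, slowdown in regressions.items():
    print(f"{name}: {slowdown:.0%} slower than baseline")
```

Using the median rather than the mean keeps a single noisy run from triggering a false alarm, which matters on shared nodes.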

We do all our deployment on the cluster using Docker containers, which ensures that the same environment is used for testing and deployment. This has also been useful for performance debugging on Cloud platforms, such as AWS and Azure, because we can differentiate between Docker-related issues and those of the underlying platform.
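One way to keep the environment identical across machines is to generate the `docker run` command from a single helper, so that only the hardware-specific flags differ between a cluster node and a Cloud instance. The sketch below only builds the command and does not execute it. The image name and test command are hypothetical, and the GPU flags are the commonly used ones (`--gpus all` assumes the NVIDIA Container Toolkit is installed; passing `/dev/kfd` and `/dev/dri` is the usual approach for ROCm) rather than a description of the actual Devito setup.

```python
import shlex


def docker_run_command(image, test_command, gpu_vendor=None, env=None):
    """Build a `docker run` argv list that runs `test_command` inside `image`."""
    argv = ["docker", "run", "--rm"]
    if gpu_vendor == "nvidia":
        # Assumes the NVIDIA Container Toolkit is set up on the host.
        argv += ["--gpus", "all"]
    elif gpu_vendor == "amd":
        # Usual device pass-through for ROCm containers.
        argv += ["--device=/dev/kfd", "--device=/dev/dri"]
    # Sort the environment variables so the command is reproducible.
    for key, value in sorted((env or {}).items()):
        argv += ["-e", f"{key}={value}"]
    argv.append(image)
    argv += shlex.split(test_command)
    return argv


IMAGE = "example/devito-ci:latest"  # hypothetical image name

cpu_cmd = docker_run_command(IMAGE, "pytest tests/", env={"OMP_NUM_THREADS": 8})
gpu_cmd = docker_run_command(IMAGE, "pytest tests/", gpu_vendor="nvidia")

print(shlex.join(cpu_cmd))
print(shlex.join(gpu_cmd))
```

Because the image and test command are the same in both cases, a failure that shows up only on one platform points at the platform rather than at the container.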

Early on we relied on Cloud computing nodes rather than our own hardware. However, we ran into a number of issues:

There is no doubt that we will revisit the Cloud in the future with a smarter strategy to control costs. However, we are committed to continuing with our strategy of maintaining our own hardware so we can always drill down to bare metal when we are looking for maximum performance.

  1. If it is not tested, then it is broken. 

  2. Performance benchmark specification includes code verification for correctness. 

  3. Many thanks to all the vendors that helped make the Devito Cluster happen: AMD, Dell, DUG, NVidia and Supermicro.