9.1 Node-level parallelism and accelerators

Host and device parallelism

Most current accelerated computing nodes use a host-device architecture, where the host is a general-purpose processor to which one or more devices are attached to offload some operations. On classical, non-accelerated nodes, only the host is present.

The most common devices are GPGPUs (General-Purpose Graphics Processing Units), but other types of accelerators, such as NEC's Vector Engine or FPGAs (Field-Programmable Gate Arrays), may also be considered.

Though accelerators can provide greater performance and energy efficiency, leveraging their power cannot be done without adaptations to the programming model.

Memory models

Accelerators usually have dedicated, high-bandwidth memory, which is separate from the host memory. Copying between host and device memory incurs latency, and should be done as infrequently as possible. Mainstream accelerator programming models may provide both separate host and device memory accesses, with explicit exchanges, and "unified shared memory", allowing memory to be accessed both from the host and from devices in an almost transparent manner, so as to improve programmability and maintainability. The underlying implementation is usually based on a paging mechanism, which may incur additional latency if not sufficiently well understood and used (so the associated programming models may provide functions for prefetching or for providing hints as to how the memory is actually used).

Available memory on devices is often more limited than that on the host, so allocating everything explicitly on the device could exhaust the available memory. Using unified shared memory can avoid this, as memory paging may provide the "illusion" of having more memory on the device, though performance can degrade severely when this mechanism kicks in.

Ideally, we could use unified shared memory in all cases, and this might be done in the future, but it seems safer for the present to give the developer control over which type of memory is used. So in code_saturne, when allocating memory which might be needed on an accelerator, the CS_MALLOC_HD macro should be used, specifying the allocation type with a cs_alloc_mode_t argument. In a manner similar to the more basic CS_MALLOC macro, this provides some instrumentation, and serves as a portability layer between several programming models.

Programming models

A few possible programming models

As classical C and current C++ standards cannot express all the possible parallelism, programming for accelerators may be done using:

  • Directive-based approaches, such as:
    • OpenMP 5.x,
    • OpenACC,
    • Specific compiler #pragmas, such as for NEC's Vector Engine.
  • Dedicated mainstream language extensions, such as:
    • CUDA,
    • SYCL.
  • Dedicated libraries, such as Kokkos.
  • Languages designed specifically for HPC, such as Chapel.
  • Domain-specific languages (DSL), which are usually based on a form of preprocessing to generate complex code in a mainstream language from simpler patterns tailored to an application domain.

Note that the mainstream language extensions listed above, as well as Kokkos, and many DSLs are all based on C++, which is the main driver for switching code_saturne from C to C++.

None of the approaches listed above is currently as ubiquitous or portable as the C++ basis with host-based OpenMP directives on which most of code_saturne is built.

  • OpenMP would be expected to be the most portable solution here, but handling of accelerators was quite incomplete up until OpenMP 5.2, and various attempts at using OpenMP offload over the years have stumbled on compiler and toolchain robustness issues and disappointing performance.
  • OpenACC seems to be supported by very few vendors, and experience shared by other codes using OpenACC has shown that compiler and hardware dependence is very strong with this approach, with limited actual portability.
  • Kokkos is well established, but would add a critical and very intrusive dependency on all architectures, so it is avoided for now, though it will be tested in the context of code_saturne.

Selected programming model

Given these constraints, the current strategy regarding accelerator support is the following:

  • Use CUDA for the most common hot-spots on NVIDIA GPUs (sparse linear-system solutions and gradient reconstruction), so as to benefit from previous work on the code, albeit in a non-portable manner.
  • Use a lambda-based parallel_for construct allowing multiple back-ends:

    • Loop-based OpenMP on CPU (always available).
    • CUDA kernel on NVIDIA GPUs.
    • SYCL kernel on all supported systems.
    • OpenMP offload (partial) on supported GPUs.
    • Can be extended.

    This mechanism uses lambda functions to allow generation of code for the appropriate device.

Host-level parallelism

Host-level parallelism is currently based on OpenMP constructs. Parallel loops using threads are used where possible. Though this is not used yet, OpenMP tasks could also be used to benefit from additional parallelization opportunities.

Vectorization may also be used locally to enhance performance, whether handled automatically by the compiler or explicitly through directives. In practice, as code_saturne is mostly memory-bound, the benefits of vectorization are limited, so improving the vectorization of various algorithms is not a priority.

Device-level parallelism

Various devices may be considered, but the main targets are currently GPGPUs.

As mentioned above, exploiting parallelism can be based on CUDA, SYCL, or OpenMP directives.

Note that parallelism on GPGPUs is usually based on massive multi-threading, in which operations on an array may be divided into a series of chunks (blocks), where each block is scheduled to run on available processors (ideally in an unspecified order), and computation of a given block is itself multi-threaded.

Computational kernels launched on a device from the host are usually at least partly asynchronous with respect to the host, so parallelism between the host and device may be exploited when the algorithm allows it.