This section describes commonly used constructs whose use may require specific explanations or recommendations.
In many instances, data such as mesh connectivity requires managing a variable number of entries per data element. This is for example the case of the faces → vertices connectivity. The average number of vertices per face is usually quite low, but the maximum number may be significantly higher, so using an array with a regular stride would be very inefficient for some data sets.
A common solution to this problem is to use indexed arrays, in which an array containing data is supplemented by a second array containing the start indexes of entries in the data array for each element.
In C and C++ code, the natural indexing is zero-based, but one-based indexing may also be used for some arrays as a vestige of prior Fortran code, or for arrays using global numbers.
The recommendations are the following:
Let us consider an array array_data indexed by a zero-based array_index array. The values of array_data associated with element ie are those at indexes ranging from istart = array_index[ie] (included) to iend = array_index[ie+1] (excluded, i.e. a past-the-end index).
The number of values associated with ie is thus given by array_index[ie+1] - array_index[ie], whether the index is zero-based or one-based.
For an indexed array of n elements, the size of the index array should thus be equal to n+1 (and not n as would be the case for regular 1-d or strided arrays), and the total size of array_data is equal to array_index[n] for a zero-based index, or array_index[n] - array_index[0] in general.
Readers familiar with Compressed Sparse Row or similar matrix or graph representations may already have noted the similarity with the indexed arrays described here. In the case of CSR matrix structures, two data arrays are often associated with a single row index: one defining the column indices, and a second defining the associated values.
This is in reality no different than using an indexed array as described here to define a faces → vertices connectivity, and also associating data (for example coordinates) to vertices.
In code_saturne, matrix non-diagonal terms usually correspond to cell faces, and the CSR matrix representation is very similar to that of a cells → faces connectivity, except for the fact that a standard CSR representation uses only "unsigned" column ids, whereas face numbers may be signed in the matching mesh representation so as to convey orientation (an alternative solution would be to use a separate array for orientation, in which case the similarity to CSR would be complete).
We illustrate the use of an indexed array to define a faces → vertices connectivity for a simple surface mesh:
In this example we use 0-based (0 to n-1) numbering, as in most uses of indexed arrays in code_saturne.
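As an illustration, the following sketch defines such a connectivity for a small hypothetical mesh (the face and vertex numbers here are invented for the example; cs_lnum_t is replaced by a plain typedef so the snippet is self-contained):

```cpp
#include <cassert>

typedef long cs_lnum_t;  /* stand-in for code_saturne's local index type */

/* Hypothetical surface mesh: 3 faces (a quadrangle, a triangle,
   and another quadrangle), with 0-based vertex ids. */

static const cs_lnum_t n_faces = 3;

/* Index array, of size n_faces + 1: entry i gives the start of
   face i's vertices in face_vtx, entry i+1 its past-the-end position. */
static const cs_lnum_t face_vtx_idx[] = {0, 4, 7, 11};

/* Concatenated vertex ids of all faces. */
static const cs_lnum_t face_vtx[] = {0, 1, 4, 3,
                                     1, 2, 4,
                                     2, 5, 7, 4};

/* Number of vertices of a given face. */
static cs_lnum_t
n_face_vertices(cs_lnum_t f_id)
{
  return face_vtx_idx[f_id + 1] - face_vtx_idx[f_id];
}
```

Note that face_vtx_idx[n_faces] gives the total size of face_vtx, as described above.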
For mesh connectivities (adjacencies) that may be used across several functions, the cs_adjacency_t structure allows storing and manipulating indexed arrays.
In C++ code using the .cpp extension, the cs_dispatch_context class may be used to run code defined as C++ lambda functions in parallel. The corresponding code can be written in a portable manner, but executed using different runtimes, such as OpenMP or CUDA.
Examples are provided in the source tree, in
- tests/cs_dispatch_test.cpp
- src/alge/cs_benchmark.cpp

Note that this class should not be called from code with the .cxx extension, as when using CUDA, that code will be compiled using a different compiler, and using the class in both cases may confuse those compilers' avoidance of duplicate C++ template code instantiation.
The parallel dispatch is based on using templated parallel_for functions including a loop (on a CPU) or kernel call (on a GPU), in which a functor (usually generated automatically as a lambda function) is called.
We explain the working of the parallel dispatch on a few examples, initially eliding the top-level template aspects for clarity.
Using the code from cs_dispatch.h, the following loop:
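Consider for instance a masked axpy-type operation (the operation and all variable names here are illustrative placeholders, not taken from code_saturne):

```cpp
#include <cassert>

typedef long cs_lnum_t;  /* stand-in for code_saturne's local index type */

/* Hypothetical example: masked y += a*x over n values,
   using a plain OpenMP loop. */
static void
masked_axpy(cs_lnum_t      n,
            const int     *mask,
            double         a,
            const double  *x,
            double        *y)
{
# pragma omp parallel for
  for (cs_lnum_t i = 0; i < n; i++) {
    if (mask[i] == 0)
      continue;
    y[i] += a * x[i];
  }
}
```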
can be replaced by:
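Keeping the same illustrative masked operation, the dispatch form is a sketch along these lines:

```cpp
ctx.parallel_for(n, [=] (cs_lnum_t i) {
  if (mask[i] == 0)
    return;  /* replaces "continue" in the plain loop form */
  y[i] += a * x[i];
});
```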
Note here that the continue statement (skip to next loop element) is replaced by return (exit the called function), as the dispatch mechanism calls an intermediate function.
The previous code block is equivalent to:
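Still using the illustrative masked operation, this amounts to constructing a functor and passing it to a plain function (masked_axpy_body is a hypothetical functor type, sketched further below):

```cpp
masked_axpy_body f(mask, a, x, y);  /* captured variables become members */

parallel_for(n, f);
```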
With the following implementation of the parallel_for method (on a CPU with OpenMP parallelism)
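A simplified sketch of such an implementation, with template parameters expanded to the hypothetical functor type for clarity:

```cpp
void
parallel_for(cs_lnum_t n, const masked_axpy_body& f)
{
# pragma omp parallel for
  for (cs_lnum_t i = 0; i < n; i++)
    f(i);
}
```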
and the following functor definition:
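A sketch of the matching functor, for the same illustrative masked operation (the class name and members are hypothetical): the variables captured by copy become data members set by the constructor, and the lambda body becomes the () operator.

```cpp
#include <cassert>

typedef long cs_lnum_t;  /* stand-in for code_saturne's local index type */

/* Hypothetical functor equivalent to the masked axpy lambda. */
class masked_axpy_body {

public:

  masked_axpy_body(const int *mask, double a, const double *x, double *y)
    : mask_(mask), a_(a), x_(x), y_(y) {}

  void
  operator()(cs_lnum_t i) const
  {
    if (mask_[i] == 0)
      return;
    y_[i] += a_ * x_[i];
  }

private:

  const int     *mask_;
  double         a_;
  const double  *x_;
  double        *y_;

};
```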
A functor is simply a class implementing the () operator.
In the ctx.parallel_for(n, [=] (cs_lnum_t i) {...}); syntax above, the [] syntax is a capture clause indicating that the following code defines a lambda function. Such functions can access variables from their enclosing scope, as specified in the capture clause. [&] means all variables are captured by reference, while [=] means all variables are captured by copy. The capture clause can be more complex, specifying different capture modes for specific variables, but for the purposes of the code_saturne parallel dispatch mechanism, we always use the [=] (capture all variables by copy) syntax, as it is the only one adapted to running functions (kernels) on devices using a different memory space.
Using this lambda capture mechanism, a functor similar to the one described above is generated automatically based on the enclosed code.
The practical interest here is that using functors allows separating the loop parallelism logic from that of the called functor. Combined with template mechanisms, the same code can be used to generate parallel code using different back-ends or scheduling mechanisms. For example, to run on a CUDA-enabled device, the equivalent to the function above can be defined as:
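Keeping the same hypothetical functor, only the call operator's declaration changes (sketch; members elided):

```cpp
class masked_axpy_body {

public:

  /* ... same constructor and data members as before ... */

  __host__ __device__
  void
  operator()(cs_lnum_t i) const
  {
    if (mask_[i] == 0)
      return;
    y_[i] += a_ * x_[i];
  }

};
```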
This function is almost identical to the original one, except for the __host__ __device__ specifiers. To ensure the syntax matches exactly, we use a CS_F_HOST_DEVICE macro which expands to the above specifiers when compiled for CUDA, and is empty for simple CPU code.
So the loop is finally written as :
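For the illustrative masked operation, this gives:

```cpp
ctx.parallel_for(n, [=] CS_F_HOST_DEVICE (cs_lnum_t i) {
  if (mask[i] == 0)
    return;
  y[i] += a * x[i];
});
```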
On a GPU, using CUDA, the parallel_for method is implemented as follows:
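A simplified sketch (the block size, stream member, and functor type here are illustrative):

```cpp
void
parallel_for(cs_lnum_t n, const masked_axpy_body& f)
{
  const unsigned int blocksize = 256;  /* hypothetical fixed block size */
  unsigned int gridsize = (n % blocksize) ? n/blocksize + 1 : n/blocksize;

  cs_cuda_kernel_parallel_for<<<gridsize, blocksize, 0, stream_>>>(n, f);
}
```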
With the following CUDA kernel:
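A sketch of such a kernel, here written for the hypothetical functor type rather than in full template form, using a grid-stride loop so that any n can be handled by a fixed-size grid:

```cpp
__global__ void
cs_cuda_kernel_parallel_for(cs_lnum_t n, masked_axpy_body f)
{
  /* Each thread handles indexes spaced by the total
     number of threads in the grid. */
  for (cs_lnum_t i = blockIdx.x*blockDim.x + threadIdx.x;
       i < n;
       i += blockDim.x*gridDim.x)
    f(i);
}
```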
For readers not familiar with CUDA, note that blockIdx, blockDim, and threadIdx are built-in kernel variables describing where a given thread executes within the launch grid.
The dispatch mechanism is asynchronous, so when the call to parallel_for returns, we are only assured that the computation is scheduled. The associated wait method ensures the call has actually completed. On a CPU with OpenMP, the end of an OpenMP section involves an implicit barrier, so wait is a no-op (does nothing). When using a CUDA back-end, where kernel launches are asynchronous, it contains a call to cudaStreamSynchronize.
For many operations such as balance computations, code_saturne uses *scatter*-based algorithms, where values computed at faces are contributed (assembled) to cell-based arrays. When run on a relatively small number of threads (such as on a CPU), a specific face numbering is used to avoid race conditions (https://doi.org/10.1002/cpe.2852), leading to code similar to the following example.
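A sketch of this pattern, where the per-group, per-thread face ranges in group_index come from the face renumbering (the flux expression and all array names here are illustrative, not code_saturne's actual assembly code):

```cpp
#include <cassert>

typedef long cs_lnum_t;  /* stand-in for code_saturne's local index type */

/* Hypothetical interior-face flux assembly; faces in a given group are
   partitioned across threads so that two threads never update the same
   cell simultaneously. */
static void
assemble_fluxes_g(int                n_groups,
                  int                n_threads,
                  const cs_lnum_t   *group_index,
                  const cs_lnum_t  (*i_face_cells)[2],
                  const double      *face_coef,
                  const double      *pvar,
                  double            *rhs)
{
  for (int g_id = 0; g_id < n_groups; g_id++) {

#   pragma omp parallel for
    for (int t_id = 0; t_id < n_threads; t_id++) {

      for (cs_lnum_t f_id = group_index[(t_id*n_groups + g_id)*2];
           f_id < group_index[(t_id*n_groups + g_id)*2 + 1];
           f_id++) {

        cs_lnum_t c_id0 = i_face_cells[f_id][0];
        cs_lnum_t c_id1 = i_face_cells[f_id][1];

        double flux = face_coef[f_id] * (pvar[c_id0] - pvar[c_id1]);

        rhs[c_id0] -= flux;
        rhs[c_id1] += flux;
      }
    }
  }
}
```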
When the number of threads becomes very large, an alternative option is to simply use atomic sums, though this can incur a significant performance penalty:
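A sketch of the atomic variant, with the same illustrative flux expression and array names as above:

```cpp
#include <cassert>

typedef long cs_lnum_t;  /* stand-in for code_saturne's local index type */

/* Hypothetical interior-face flux assembly using atomic sums,
   so no specific face numbering is required. */
static void
assemble_fluxes(cs_lnum_t          n_i_faces,
                const cs_lnum_t  (*i_face_cells)[2],
                const double      *face_coef,
                const double      *pvar,
                double            *rhs)
{
# pragma omp parallel for
  for (cs_lnum_t f_id = 0; f_id < n_i_faces; f_id++) {

    cs_lnum_t c_id0 = i_face_cells[f_id][0];
    cs_lnum_t c_id1 = i_face_cells[f_id][1];

    double flux = face_coef[f_id] * (pvar[c_id0] - pvar[c_id1]);

#   pragma omp atomic
    rhs[c_id0] -= flux;
#   pragma omp atomic
    rhs[c_id1] += flux;
  }
}
```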
Using the dispatch mechanism, the external loops can be hidden in the implementation, leading to the following code:
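A sketch using the context's parallel_for_i_faces method, where cs_dispatch_sum performs either a simple or an atomic sum depending on the sum type adapted to the current device (these two names follow cs_dispatch.h; the flux expression and surrounding variables remain illustrative):

```cpp
ctx.parallel_for_i_faces(m, [=] CS_F_HOST_DEVICE (cs_lnum_t f_id) {

  cs_lnum_t c_id0 = i_face_cells[f_id][0];
  cs_lnum_t c_id1 = i_face_cells[f_id][1];

  double flux = face_coef[f_id] * (pvar[c_id0] - pvar[c_id1]);

  cs_dispatch_sum(&rhs[c_id0], -flux, i_sum_type);
  cs_dispatch_sum(&rhs[c_id1], flux, i_sum_type);
});
```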
The lambda capture mechanism is the same as described in the previous examples (each different set of captured arguments leading to the generation of a different functor). On a CPU with OpenMP-based parallelism, the implementation expands to:
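A simplified sketch of such an expansion, with the functor type flux_body standing in for the generated one, and assuming the mesh provides the face renumbering information (field names are illustrative):

```cpp
void
parallel_for_i_faces(const cs_mesh_t *m, const flux_body& f)
{
  const int n_groups = m->i_face_numbering->n_groups;
  const int n_threads = m->i_face_numbering->n_threads;
  const cs_lnum_t *group_index = m->i_face_numbering->group_index;

  for (int g_id = 0; g_id < n_groups; g_id++) {

#   pragma omp parallel for
    for (int t_id = 0; t_id < n_threads; t_id++) {

      for (cs_lnum_t f_id = group_index[(t_id*n_groups + g_id)*2];
           f_id < group_index[(t_id*n_groups + g_id)*2 + 1];
           f_id++)
        f(f_id);
    }
  }
}
```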
On a GPU using CUDA, the function body is replaced by:
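A simplified sketch of that body (block size and stream member again illustrative):

```cpp
void
parallel_for_i_faces(const cs_mesh_t *m, const flux_body& f)
{
  const cs_lnum_t n = m->n_i_faces;
  const unsigned int blocksize = 256;
  unsigned int gridsize = (n % blocksize) ? n/blocksize + 1 : n/blocksize;

  cs_cuda_kernel_parallel_for<<<gridsize, blocksize, 0, stream_>>>(n, f);
}
```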
Using the same cs_cuda_kernel_parallel_for kernel template as for the simple example.
For the dispatch mechanism to actually be usable with many different loops, it must be based on templates. For clarity, in the examples above, template parameters were expanded to actual types, so as to focus on the dispatch and loop mechanisms.
The cs_dispatch class also uses a CRTP (curiously recurring template pattern) to optionally allow forcing the compilation of a given section of code only on the CPU (using cs_host_context) or GPU (using cs_device_context). This can be useful when multiple variants of a given algorithm are required for performance reasons, and generating only the variant which is expected to perform best on a given architecture is desired.
A lambda function can be declared independently of a dispatch context, and called later in the enclosing function. This may be useful if the matching code block is to be called multiple times, or for code organization and readability reasons.
So the following example:
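Consider again an illustrative axpy-type loop (names hypothetical):

```cpp
ctx.parallel_for(n, [=] CS_F_HOST_DEVICE (cs_lnum_t i) {
  y[i] += a * x[i];
});
```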
Can also be written using a named lambda, declared as follows:
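For the same illustrative operation, with axpy_body as a hypothetical name:

```cpp
auto axpy_body = [=] CS_F_HOST_DEVICE (cs_lnum_t i) {
  y[i] += a * x[i];
};
```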
Then passed as an argument to the dispatch mechanism:
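Using the hypothetical name from the declaration above:

```cpp
ctx.parallel_for(n, axpy_body);
```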
Note that a named lambda may also use the auto keyword instead of explicit types for its parameters. This is referred to as a generic lambda. Starting with C++20, template arguments may also be used instead of auto.