9.1
general documentation
Common construct types

This section describes commonly used construct types whose use may require specific explanations or recommendations.

Indexed arrays

In many instances, data such as mesh connectivity requires managing a variable number of entries per data element. This is for example the case for faces → vertices connectivity: the average number of vertices per face is usually quite low, but the maximum number may be significantly higher, so using an array with a regular stride would be very inefficient for some data sets.

A common solution to this problem is to use indexed arrays, in which an array containing data is supplemented by a second array containing the start indexes of entries in the data array for each element.

In C and C++ code, the natural indexing is zero-based, but one-based indexing may also be used for some arrays as a vestige of prior Fortran code, or for arrays using global numbers.

The recommendations are the following:

  • Local index arrays should be zero-based.
  • Global index arrays should be one-based. This should only concern indexes read from or written to file.
  • When containing cell, face, or vertex connectivity information, data arrays may be either zero or one-based: zero based arrays are less error-prone so they should be preferred, but where element ids may be signed (so as to convey orientation information), one-based arrays are necessary. In a given structure, consistency is recommended, so if cells → faces connectivity requires one-based face numbers, an associated faces → vertices connectivity may also use one-based vertex numbers, even though vertices have no orientation.

Let us consider an array array_data indexed by a zero-based array_index array. The values of array_data associated with element ie are those at indexes ranging from istart = array_index[ie] (included) to iend = array_index[ie+1] (excluded, i.e. a past-the-end index).

The number of values associated with ie is thus given by array_index[ie+1] - array_index[ie], whether the index is zero-based or one-based.

For an indexed array of n elements, the size of the index array should thus be equal to n+1 (and not n as would be the case for regular 1-d or strided arrays), and the total size of array_data is equal to array_index[n] for a zero-based index, or array_index[n] - array_index[0] in general.

Similar popular data structures

Readers familiar with Compressed Sparse Row (CSR) or similar matrix or graph representations may already have noted the similarity with the indexed arrays described here. In the case of CSR matrix structures, two data arrays are often associated with one row index: one array defining the column indices, and a second one defining the associated values.

This is in reality no different than using an indexed array as described here to define a faces → vertices connectivity, and also associating data (for example coordinates) to vertices.

In code_saturne, matrix non-diagonal terms usually correspond to cell faces, and the CSR matrix representation is very similar to that of a cells → faces connectivity, except for the fact that a standard CSR representation uses only "unsigned" column ids, whereas face numbers may be signed in the matching mesh representation so as to convey orientation (an alternative solution would be to use a separate array for orientation, in which case the similarity to CSR would be complete).

Indexed Array Example

We illustrate the use of an indexed array to define a faces → vertices connectivity for a simple surface mesh:

In this example we use 0-based (0 to n-1) numbering, as in most uses of indexed arrays in code_saturne.

Recommended structure

For mesh connectivities (adjacencies) that may be used across several functions, the cs_adjacency_t structure allows storing and manipulating indexed arrays.

Parallel dispatch

In C++ code using the .cpp extension, the cs_dispatch_context class may be used to run code defined as C++ lambda functions in parallel. The corresponding code can be written in a portable manner, but executed using different runtimes, such as OpenMP or CUDA.

Examples are provided in the source tree, in

Note that this class should not be called from code with the .cxx extension, as when using CUDA, that code will be compiled using a different compiler, and using the class in both cases may interfere with those compilers' avoidance of duplicate C++ template instantiation.

Base mechanism

The parallel dispatch is based on using templated parallel_for functions including a loop (on a CPU) or kernel call (on a GPU), in which a functor (usually generated automatically as a lambda function) is called.

We explain the working of the parallel dispatch on a few examples, initially eliding the top-level template aspects for clarity.

Simple parallel dispatch example

Using the code from cs_dispatch.h, the following loop:

#pragma omp parallel for
for (size_t i = 0; i < n; i++) {
  if (is_disabled[i])
    continue;
  y[i] += a * x[i];
}

can be replaced by:

ctx.parallel_for(n, [=] (cs_lnum_t i) {
  if (is_disabled[i])
    return;
  y[i] += a * x[i];
});

Note here that the continue statement (skip to next loop element) is replaced by return (exit the called function), as the dispatch mechanism calls an intermediate function.

Functors and lambda functions

The previous code block is equivalent to:

ctx.parallel_for(n, a, is_disabled, x, y);

With the following implementation of the parallel_for method (on a CPU with OpenMP parallelism):

void
parallel_for(size_t n,
             double a,
             const bool *is_disabled,
             const double *x,
             double *y)
{
  auto_generated func(a, is_disabled, x, y);

#pragma omp parallel for
  for (size_t i = 0; i < n; i++) {
    func(i);
  }
}

and the following functor definition:

class auto_generated {

private:

  double        a_;
  const double *is_disabled_;
  const double *x_;
  double       *y_;

public:

  // Constructor
  auto_generated(double a,
                 const bool *is_disabled,
                 const double *x,
                 double *y)
    : a_(a), is_disabled_(is_disabled), x_(x), y_(y) {}

  // Call operator
  void operator() (size_t i)
  {
    if (is_disabled_[i])
      return;
    y_[i] += a_ * x_[i];
  }

};

A functor is simply a class implementing the () operator.

In the ctx.parallel_for(n, [=] (cs_lnum_t i) {...}); syntax above, the [] syntax is a capture clause indicating the following code defines a lambda function. Such functions can access variables outside their enclosing scope, as specified in the capture clause. [&] means all variables are captured by reference, while [=] means all variables are captured by copy. The capture clause can be more complex, specifying different capture clauses for specific variables, but for the purposes of the code_saturne parallel dispatch mechanism, we always use the [=] (capture all variables by copy) syntax, as it is the only one adapted to running functions (kernels) on devices using a different memory space.

Using this lambda capture mechanism, a functor similar to the one described above is generated automatically based on the enclosed code.

GPU code generation

The practical interest here is that using functors allows separating the loop parallelism logic from that of the called functor. Combined with template mechanisms, the same code can be used to generate parallel code using different back-ends or scheduling mechanisms. For example, to run on a CUDA-enabled device, the equivalent to the function above can be defined as:

ctx.parallel_for(n, [=] __host__ __device__ (cs_lnum_t i) {
  if (is_disabled[i])
    return;
  y[i] += a * x[i];
});

This function is almost identical to the original one, except for the __host__ __device__ specifiers. To ensure the syntax matches exactly, we use a CS_F_HOST_DEVICE macro which expands to the above specifiers when compiled for CUDA, and is empty for simple CPU code.

So the loop is finally written as:

ctx.parallel_for(n, [=] CS_F_HOST_DEVICE (cs_lnum_t i) {
  if (is_disabled[i])
    return;
  y[i] += a * x[i];
});

CUDA implementation

On a GPU, using CUDA, the parallel_for method is implemented as follows:

void
parallel_for(size_t n,
             double a,
             const bool *is_disabled,
             const double *x,
             double *y)
{
  auto_generated func(a, is_disabled, x, y);

  cs_cuda_kernel_parallel_for<<<l_grid_size, block_size_, 0, stream_>>>
    (n, func);
}

With the following CUDA kernel:

__global__ void
cs_cuda_kernel_parallel_for(cs_lnum_t n,
                            auto_generated f) {
  // grid-stride loop
  for (cs_lnum_t id = blockIdx.x * blockDim.x + threadIdx.x; id < n;
       id += blockDim.x * gridDim.x) {
    f(id);
  }
}

For readers not familiar with CUDA, note that blockIdx, blockDim, and threadIdx are built-in kernel variables describing where a given thread executes.

Asynchronous execution

The dispatch mechanism is asynchronous, so when the call to parallel_for returns, we are only assured that the computation is scheduled. The associated wait method ensures the call has actually completed. On a CPU with OpenMP, the end of an OpenMP section involves an implicit barrier, so wait is a no-op (it does nothing). When using a CUDA back-end, where kernel launches are asynchronous, it contains a call to cudaStreamSynchronize.

Face-based loops with parallel assembly (scatter sum)

For many operations such as balance computations, code_saturne uses *scatter*-based algorithms, where values computed at faces are contributed (assembled) to cell-based arrays. When run on a relatively small number of threads (such as on a CPU), a specific face numbering is used to avoid race conditions (https://doi.org/10.1002/cpe.2852), leading to code similar to the following example:

for (int ig = 0; ig < n_groups; ig++) {
#pragma omp parallel for
  for (int it = 0; it < m->numbering->n_threads; it++) {
    for (cs_lnum_t face_id = m->numbering->index[ig][it][0];
         face_id < m->numbering->index[ig][it][1];
         face_id++) {
      cs_lnum_t i = i_face_cells[face_id][0];
      cs_lnum_t j = i_face_cells[face_id][1];
      y[i] += a[face_id][0]*x[j];
      y[j] += a[face_id][1]*x[i];
    }
  }
}

When the number of threads becomes very large (such as on a GPU), an alternative option is to simply use atomic sums, though these can incur a significant performance penalty:

#pragma omp parallel for
for (cs_lnum_t face_id = 0; face_id < m->n_i_faces; face_id++) {
  cs_lnum_t i = i_face_cells[face_id][0];
  cs_lnum_t j = i_face_cells[face_id][1];
  #pragma omp atomic
  y[i] += a[face_id][0]*x[j];
  #pragma omp atomic
  y[j] += a[face_id][1]*x[i];
}

Using the dispatch mechanism, the external loops can be hidden in the implementation, leading to the following code:

cs_dispatch_sum_type_t i_sum_type = ctx.get_parallel_for_i_faces_sum_type(m);

ctx.parallel_for_i_faces(m, [=] (cs_lnum_t face_id) {
  cs_lnum_t i = i_face_cells[face_id][0];
  cs_lnum_t j = i_face_cells[face_id][1];
  cs_dispatch_sum(&y[i], a[face_id][0]*x[j], i_sum_type);
  cs_dispatch_sum(&y[j], a[face_id][1]*x[i], i_sum_type);
});

Implementation

The lambda capture mechanism is the same as described in the previous examples (each different set of captured arguments leading to the generation of a different functor). On a CPU with OpenMP-based parallelism, the implementation expands to:

void
parallel_for_i_faces(cs_mesh_t *m,
                     const cs_real_2_t *a,
                     const cs_real_t *x,
                     cs_real_t *y,
                     cs_dispatch_sum_type_t i_sum_type)
{
  auto_generated func(m->i_face_cells, a, x, y, i_sum_type);

  for (int ig = 0; ig < n_groups; ig++) {
#pragma omp parallel for
    for (int it = 0; it < m->numbering->n_threads; it++) {
      for (cs_lnum_t face_id = m->numbering->index[ig][it][0];
           face_id < m->numbering->index[ig][it][1];
           face_id++) {
        func(face_id);
      }
    }
  }
}

On a GPU using CUDA, the function body is replaced by:

{
  auto_generated func(m->i_face_cells, a, x, y, i_sum_type);

  cs_cuda_kernel_parallel_for<<<l_grid_size, block_size_, 0, stream_>>>
    (m->n_i_faces, func);
}

This uses the same cs_cuda_kernel_parallel_for kernel template as the simple example.

Actual mechanism

For the dispatch mechanism to actually be usable with many different loops, it must be based on templates. For clarity, in the examples above, template parameters were expanded to actual types, so as to focus on the dispatch and loop mechanisms.

The cs_dispatch_context class also uses a CRTP (curiously recurring template pattern) to optionally allow forcing the compilation of a given section of code only on the CPU (using cs_host_context) or the GPU (using cs_device_context). This can be useful when multiple variants of a given algorithm are required for performance reasons, and generating only the variant expected to perform best on a given architecture is desired.

Named lambdas

A lambda function can be declared independently of a dispatch context, and called later in the enclosing function. This may be useful if the matching code block is to be called multiple times, or for code organization and readability reasons.

So the following example:

ctx.parallel_for(n, [=] (cs_lnum_t i) {
  if (is_disabled[i])
    return;
  y[i] += a * x[i];
});

Can also be written using a named lambda, declared as follows:

auto my_lambda = [=] (cs_lnum_t i) {
  if (is_disabled[i])
    return;
  y[i] += a * x[i];
};

Then passed as an argument to the dispatch mechanism:

ctx.parallel_for(n, my_lambda);

Note that a named lambda may also use the auto keyword instead of explicit types for its parameter arguments. This is referred to as a generic lambda. Starting with C++20, template arguments may also be used instead of auto.