For solvers such as code_saturne, which can run on large resources for long durations, improving performance is always essential, to reduce both user wait times and IT costs (of which a large part is nowadays energy cost).
Performance gains are usually a combination of progress in computing power and in algorithms. For similar software over the last few decades, both factors have been important, with algorithm progress being a major driver.
So although the first step in improving program performance is on the algorithmic (and theory) side, detailed analysis of the behavior of those algorithms on actual hardware is important.
To assist developers and users in performance optimization, code_saturne includes many timers, and logs summary performance information.
This allows comparing the performance of numerical options and checking that no "unexpected" performance bottlenecks are present.
For more detailed analysis, the use of profiling tools is recommended.
To understand the performance behavior of code_saturne, the user should have at least introductory knowledge of several aspects related to hardware and programming models.
When running, the code_saturne solver generates a timer_stats.csv file, which traces the elapsed time for each major operation type (mesh modification, post-processing, gradient reconstructions, linear solvers, and so on). This information can be plotted using a spreadsheet or a visualization tool such as ParaView.
The code also generates a performance.log file, which summarizes timings for various operations, in a manner independent of the number of time steps actually run (so this file is complete only after a successful run).
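As an alternative to a spreadsheet, the per-operation times in timer_stats.csv can be totaled from the command line. The sketch below assumes a comma-separated layout with a header row naming each operation and a first column holding the time step; the sample file and its column names are hypothetical stand-ins, so the layout should be checked against an actual timer_stats.csv before relying on this.

```sh
# Hypothetical sample standing in for a real timer_stats.csv;
# the actual column names and layout may differ.
sample=$(mktemp)
cat > "$sample" <<'EOF'
iteration, mesh, gradients, linear_solvers
1, 0.5, 1.5, 3.0
2, 0.5, 2.5, 4.0
EOF

# Sum every column except the first and print "<name> <total>" per column.
summarize_timers() {
  awk -F',' '
    NR == 1 {
      # Header row: record the column names, trimming surrounding spaces.
      for (i = 2; i <= NF; i++) {
        gsub(/^ +| +$/, "", $i)
        name[i] = $i
      }
      ncols = NF
      next
    }
    {
      # Data rows: accumulate the elapsed time of each operation.
      for (i = 2; i <= NF; i++) total[i] += $i
    }
    END {
      for (i = 2; i <= ncols; i++) printf "%s %.2f\n", name[i], total[i]
    }
  ' "$1"
}

summarize_timers "$sample"
```

On a real run, the function would be called as `summarize_timers timer_stats.csv` from the execution directory.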
To obtain more detailed performance information, use of a profiling tool is needed.
Use --enable-profile to configure builds for profiling.
Several types of tools may be available. We list a few commonly available tools, though the list is far from exhaustive:
The Valgrind tool suite includes several tools which are very useful for profiling. Note that, as usual when running under Valgrind, there is an overhead relative to actual performance, and the timing results obtained may be simulated as much as measured, but the information obtained is very similar to that obtained with less ubiquitous tools.
Combined with the kcachegrind visualization tool, it is extremely easy to use on a Linux workstation. It allows easy visualization of call trees and hot spots, as illustrated below:
Other advanced profiling tools may be provided by various vendors, for example:
Whatever the profiling tool used, the --tool-args or --mpi-tool-args option of the code_saturne run (or code_saturne submit) command may be used to insert a profiling command in the code's launch sequence.
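As an illustration of this mechanism, the sketch below passes a Callgrind invocation (from the Valgrind suite) through --tool-args. The option name comes from the text above; the exact quoting and value syntax shown here are assumptions, so check them against `code_saturne run --help` on your installation. The command is only executed when both programs are found; otherwise it is printed.

```sh
# Profiler command to insert before the solver executable.
# Any other profiler prefix could be substituted here.
tool_args="valgrind --tool=callgrind"

# Run only where both commands are installed; otherwise just report
# what would have been launched.
if command -v code_saturne >/dev/null 2>&1 \
   && command -v valgrind >/dev/null 2>&1; then
  # Assumed syntax: the profiler command is passed as a single quoted value.
  code_saturne run --tool-args="$tool_args"
else
  echo "would run: code_saturne run --tool-args=\"$tool_args\""
fi
```

By default, Callgrind writes its results to files named `callgrind.out.<pid>`, which can then be opened with kcachegrind.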
The profiling can also be prepared in a step by step manner, described here.
- First, initialize the execution directory, for instance with `code_saturne submit --initialize`. The code will then prepare the execution directory, and preprocess the mesh if needed, but not remove the executable and temporary script.
- `cd` to the execution directory, and edit the `run_solver` script as described in the following sections, depending on the profiling tool used.

Once the `run_solver` script has been adapted for profiling, it can be executed.
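- On a cluster or batch system, submit the `run_solver` script rather than running it directly.
  - In that case, copy the batch directives of the `runcase` file to the same position (starting at line 2) in the `run_solver` file.
  - For example, with SLURM, copy all lines of `runcase` starting with `#SBATCH` to `run_solver`.

If Intel's VTune is available, the following procedure may be used after the preparation step described above.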
Edit the run_solver script:
(mpiexec might be replaced by another command, such as srun, depending on the system, but the logic remains the same.)

Once the run_solver code has finished running, simply run
This may require loading an environment module for the VTune installation, as in the run_solver file (intel-Basekit in the previous example).
VTune allows many exploration views, for example:
If NVIDIA's Nsight Systems is available, a similar procedure may be used following the common preparation step described above.
Edit the run_solver script:
Before the cs_solver command, insert the profiling commands. For example, replace

```{.sh}
./cs_solver
```

with:

```{.sh}
nsys profile [options] ./cs_solver
```

or:

```{.sh}
nsys launch ./cs_solver
```

(see the Nsight Systems user documentation for more options).

Once the code has finished running, simply run
and load the profiling output file.
Profiling with an annotated build
When using Nsight Systems, it is often difficult to match CUDA kernels with the calling code. Backtraces are generated at sampling intervals, which is very useful, but NVIDIA's NVTX tools extension also makes it possible to add annotations to the profiling timeline, making analysis much easier.
At code_saturne's configure stage, this requires adding:
The path to a header directory including the nvtx3 subdirectory may depend on the toolkit installation. These may include:
- `$NVHPC_ROOT/cuda/<cuda_version>/targets/x86_64-linux/include` (using the NVHPC toolkit)
- `/opt/cuda/targets/x86_64-linux/include` (using a distribution-based install, here on Arch Linux)

When profiling such a build, annotations then appear in the NVTX view under the nsys-ui visualizer.
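To check which candidate header directory is actually present on a given machine, a small helper such as the following may be used. This is only a sketch: `find_nvtx_include` is a hypothetical helper name, and the candidate paths are the two examples listed above (with a wildcard standing in for `<cuda_version>`), not an exhaustive list.

```sh
# Print the first directory among the arguments that contains an
# nvtx3 subdirectory; return non-zero if none does.
find_nvtx_include() {
  for dir in "$@"; do
    if [ -d "$dir/nvtx3" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
  done
  return 1
}

# Candidate locations taken from the examples above. If NVHPC_ROOT is
# unset or the wildcard matches nothing, that candidate is simply skipped.
nvtx_inc=$(find_nvtx_include \
  "${NVHPC_ROOT:-}"/cuda/*/targets/x86_64-linux/include \
  /opt/cuda/targets/x86_64-linux/include) \
  || echo "no nvtx3 header directory found in the candidate paths" >&2

echo "NVTX include directory: ${nvtx_inc:-not found}"
```

The directory printed, if any, is the one to supply at code_saturne's configure stage.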