Intel oneAPI
In December 2020, Intel released its oneAPI software development suite, which is free of charge and replaces the Intel Parallel Studio XE Cluster Edition that CHPC previously licensed. In our environment, we install the Base and HPC toolkits, which include the following libraries and applications:
Programming languages:
Libraries: these need to be loaded as separate modules, with the compiler module loaded first. Please contact CHPC if a library is not available or if you need a newer version, as we install these only on demand:
Programming tools:
All of the Intel tools have excellent documentation, and we recommend following the tutorials to learn how to use the tools and the documentation to find details. Most of the tools work well out of the box; however, a few have peculiarities in our local installation, which we try to point out in this document.
Intel Compilers
The Intel compilers are described on our compiler page. They provide a wealth of code optimization options and generally produce the fastest code on Intel CPU platforms. Please note that the oneAPI compilers don't include the libraries by default, so if you use a library that was previously part of the Intel compilers (e.g. MKL or TBB), you now need to load the appropriate module, as described below.
Intel Math Kernel Library
MKL is a high performance math library containing full BLAS, LAPACK, ScaLAPACK, transforms and more. It is available via module load intel-oneapi-mkl. For details on MKL use, see our math libraries page.
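For example, a minimal sketch of building a code against MKL, assuming the -qmkl convenience flag of the recent oneAPI compilers (mycode.c is a placeholder):
module load intel-oneapi-compilers intel-oneapi-mkl
icx -qmkl mycode.c -o mycode
For more complex link lines (e.g. ScaLAPACK), Intel's MKL Link Line Advisor is the authoritative source for the explicit linker flags.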
Intel Inspector
Intel Inspector is a thread and memory debugging tool. It helps in finding errors due to multi-threading (thread safety), and due to memory access (segmentation faults), which are difficult to analyze with traditional source code debuggers.
Start by loading the module: module load intel-oneapi-inspector. Then launch the tool GUI with inspxe-gui.
Then either use your own serial code or get examples from the Inspector tutorials.
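Inspector also has a command line interface, inspxe-cl, which is handy inside SLURM jobs. A minimal sketch, assuming a memory-error analysis (analysis type mi2) on a placeholder executable myapp:
module load intel-oneapi-inspector
inspxe-cl -collect mi2 -result-dir ./inspector_results -- ./myapp
The collected result directory can then be examined in the inspxe-gui GUI; the threading analyses use the ti1-ti3 analysis types instead.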
Intel Advisor
Intel Advisor is a thread and vectorization prototyping tool. The vectorization analysis can be very helpful in detecting loops that could take advantage of vectorization with simple code changes, potentially doubling to quadrupling performance on CPUs with higher vectorization capabilities (CHPC's Kingspeak and Notchpeak clusters).
Start by loading the module: module load intel-oneapi-advisor. Then launch the tool GUI with advixe-gui.
Then either use your own serial code or get examples from the Advisor web page.
The thread prototyping process generally involves four steps (a command-line sketch follows the list):
- Create a project and survey (time) the target code
- Annotate the target code to tell Advisor what sections to parallelize
- Analyze the annotated code to predict parallel performance and potential parallel problems
- Add the parallel framework (OpenMP, TBB, Cilk+) to the code based on feedback from step 3
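For batch use, the same workflow is available through the advixe-cl command line tool. A minimal sketch of steps 1 and 3, with ./advi_threads and ./myapp as placeholders (step 2, the annotation, is done in your source code, followed by a rebuild):
module load intel-oneapi-advisor
advixe-cl -collect survey -project-dir ./advi_threads -- ./myapp
advixe-cl -collect suitability -project-dir ./advi_threads -- ./myapp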
The vectorization profiling involves the following steps (see the sketch after this list):
- Target survey to explore where to add vectorization or threading
- Find trip counts to see how many iterations each loop executes
- Check data dependencies in the loop and use the Advisor hints to fix them
- Check memory accesses to identify and fix complex data access patterns
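A command-line sketch of these analyses, again with placeholder names; note that the dependencies and map collections are expensive and are usually restricted to loops selected from the survey results:
module load intel-oneapi-advisor
advixe-cl -collect survey -project-dir ./advi_vec -- ./myapp
advixe-cl -collect tripcounts -project-dir ./advi_vec -- ./myapp
advixe-cl -collect dependencies -project-dir ./advi_vec -- ./myapp
advixe-cl -collect map -project-dir ./advi_vec -- ./myapp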
Intel VTune Profiler
VTune is an advanced performance profiler. Its main appeal is integrated measurement and evaluation of code performance, along with support for multithreading on both CPUs and accelerators (GPUs, Intel Xeon Phis).
For CPU-based profiling on CHPC Linux systems, we highly recommend using a whole compute node for your profiling session; profiling on an interactive node or in a shared SLURM partition has a high likelihood of other users' processes on that node affecting your program's performance and producing an unreliable profiling result.
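For example, one way to obtain a whole node for an interactive profiling session is an exclusive SLURM allocation along these lines (partition, account, and wall time are placeholders to adjust for your situation):
salloc -N 1 --exclusive -p notchpeak -A myaccount -t 2:00:00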
To profile an application in VTune GUI, do the following:
1. Source the VTune environment: module load intel-oneapi-vtune
2. Start the VTune GUI: vtune-gui
3. Follow the GUI instructions to start a new sampling experiment, run it, and then visualize the results.
If you want to use the CPU hardware counters to count CPU events, you also need to load the VTune Linux kernel modules before profiling and unload them once done. To be able to do that, you need to be added to the sudoers group that is allowed to run these commands, which you can request by sending a message to our help desk.
1. Source the VTune environment: module load intel-oneapi-vtune
2. Load the kernel module: sudo $VTUNE_DIR/sepdk/src/insmod-sep -r -g vtune
3. Start the VTune GUI: vtune-gui
4. Follow the GUI instructions to start a new sampling experiment, run it, and then visualize the results.
5. Unload the kernel module: sudo $VTUNE_DIR/sepdk/src/rmmod-sep
To profile a distributed parallel application (e.g. MPI), one has to use the command line interface to VTune. This can be done either in a SLURM job script or in an interactive job. Inside the script, or the interactive job, do the following:
1. Source the VTune environment: module load intel-oneapi-vtune
2. Source the appropriate compiler and MPI, e.g. module load intel intel-oneapi-mpi
3. Run the VTune command line tool, e.g. mpirun -np $SLURM_NTASKS vtune -collect hotspots -result-dir /path/to/directory/with/VTune/result myExecutable. Note that we explicitly state where to put the results, as the default VTune results directory name can cause problems during the job launch.
4. To analyze the results, we recommend using the VTune GUI: on a cluster interactive node, start vtune-gui and then use the "Open Result" option in the main GUI window to find the directory with the result obtained above.
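Putting steps 1-3 together, a minimal sketch of a SLURM batch script for a profiling run (partition, account, task counts, and myExecutable are placeholders); step 4 is then done interactively in the GUI:
#!/bin/bash
#SBATCH -N 2
#SBATCH -n 32
#SBATCH -t 1:00:00
#SBATCH -p notchpeak
#SBATCH -A myaccount
module load intel intel-oneapi-mpi intel-oneapi-vtune
mpirun -np $SLURM_NTASKS vtune -collect hotspots -result-dir $HOME/vtune_results/$SLURM_JOB_ID ./myExecutable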
Finding the right command line parameters for the needed analysis can be cumbersome, which is why VTune provides a button that displays the command line. This button looks like >_ and is located at the bottom center of the VTune GUI window.
If you need hardware counters, you'll need to start the kernel module on all of the job's nodes before running mpirun, as:
srun -n $SLURM_JOB_NUM_NODES --ntasks-per-node=1 sudo $VTUNE_DIR/sepdk/src/insmod-sep -r -g vtune
And similarly unload the kernel module from all of the job's nodes at the end of the job.
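For example, the unload mirroring the command above would be:
srun -n $SLURM_JOB_NUM_NODES --ntasks-per-node=1 sudo $VTUNE_DIR/sepdk/src/rmmod-sep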
Intel MPI
Intel MPI is a high performance MPI library which runs on many different network interfaces. The main reason for having IMPI, though, is its seamless integration with ITAC and its features. It is generally slightly slower than the top-choice MPIs that we use on the clusters; still, there may be applications in which IMPI outperforms our other MPIs, so we recommend including IMPI in performance testing before deciding which MPI to use for production runs. For a quick introduction to Intel MPI, see the Getting Started guide.
Intel MPI by default works with whatever interface it finds on the machine at runtime.
To use it, module load intel-oneapi-mpi. For details, see the MPI Library help page.
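For example, a minimal build-and-run sketch (mycode.c is a placeholder; mpiicc wraps the classic Intel C compiler, and newer Intel MPI versions also provide mpiicx for the LLVM-based compiler):
module load intel-oneapi-compilers intel-oneapi-mpi
mpiicc mycode.c -o mycode
mpirun -np 4 ./mycode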
Intel Trace Analyzer and Collector
ITAC can be used for MPI code checking and for profiling. To use it, module load intel-oneapi-itac.
MPI/OpenMPI Profiling
It is best to run ITAC with the Intel compiler and MPI, since that way one can take advantage of their interoperability. That is, also
module load intel-oneapi-compilers intel-oneapi-mpi intel-oneapi-itac
On existing code that was built with IMPI, just run
mpirun -trace -n 4 ./a.out
This will produce a set of trace files a.out.stf*, which are then loaded into the Trace Analyzer as
traceanalyzer a.out.stf &
To take advantage of additional profiling features, compile with the -trace flag as
mpiicc -trace code.c
The ITAC reference guide is a good resource for more detailed information about other ways to invoke the tracing, instrumentation, etc. A good tutorial on using ITAC to profile an MPI code is named Detecting and Removing Unnecessary Serialization.
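To summarize the profiling workflow in one place, a sketch from build to analysis, with code.c and myprog as placeholders:
module load intel-oneapi-compilers intel-oneapi-mpi intel-oneapi-itac
mpiicc -trace code.c -o myprog
mpirun -trace -n 4 ./myprog
traceanalyzer myprog.stf &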
MPI Correctness Check
To run the correctness checker, the easiest way is to compile with the -check_mpi flag as
mpiicc -check_mpi code.c
and then plainly run as
mpirun -check -n 4 ./a.out
If the executable was built with another MPI of the MPICH2 family, one can specifically invoke the checker library by
mpirun -genv LD_PRELOAD libVTmc.so -genv VT_CHECK_TRACING on -n 4 ./a.out
At least this is what the manual says, but it seems to just create the trace file, so the safest way is to use -check_mpi during compilation. The way to tell that MPI checking is enabled is that the program will write out a lot of output describing what it is doing at runtime, such as:
[0] INFO: CHECK LOCAL:EXIT:SIGNAL ON
[0] INFO: CHECK LOCAL:EXIT:BEFORE_MPI_FINALIZE ON
Once the program is done, if there is no MPI error, it'll say:
[0] INFO: Error checking completed without finding any problems.
We recommend that anyone developing an MPI program run it through the MPI checker before starting any serious use of the program. It can help uncover hidden problems that could be hard to locate during normal runtime.
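A minimal sketch of such a check, combining the commands above (code.c and myprog are placeholders):
module load intel-oneapi-compilers intel-oneapi-mpi intel-oneapi-itac
mpiicc -check_mpi code.c -o myprog
mpirun -check -n 4 ./myprog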
Intel's tutorial on this topic is here.