Single executable on all CHPC platforms
Advances in compiler technology and MPI libraries are starting to allow building single executables optimized for multiple CPU architectures and running over multiple networks. This document summarizes how to achieve this on CHPC machines. Please note that not all compilers and MPIs allow this; we therefore detail how to do this for those that work and what problems to expect with those that don't.
- Short summary
- General remarks
- Multi-architecture CPU optimization
- MPI multiple network interfaces support
- Multi-architecture CPU optimization with multi-network MPI
- Recommendations for different scenarios - skip to here if you don't want to read all the details
Short summary
In this document we show how to build a single executable optimized for multiple CPU architectures that runs in parallel over multiple network types. This should be beneficial both for CHPC staff and for users who need to run their applications optimally on all CHPC clusters.
Moreover, we evaluate the common Application Binary Interface (ABI) for MPI as implemented in several MPI distributions and show how a single executable can be run using several different MPI distributions without the need to recompile.
A result of this is a single parallel executable that can run optimally on all CHPC clusters and on three out of the four MPI distributions that we support.
General remarks
CPU architectures change from generation to generation, affecting data/instruction processing and adding or modifying CPU instructions. A common recent trend has been improving the vectorization capabilities of the CPUs. As of mid 2018, CHPC runs four generations of Intel CPUs -- Nehalem, SandyBridge, Haswell and Skylake -- each of which shows an incremental improvement in vectorization processing power that is significant enough to be worth harnessing optimally.
Starting with AVX, Intel CPUs feature increasingly complex logic for clock speed adjustments depending on how many CPU cores and vector units are being used. If fewer cores and less vectorization are utilized, the CPU may run at a considerably higher clock speed than when all cores/vector units are used. A good review of this behavior is at this Anandtech webpage. For this reason the parallel performance of current CPUs may not scale linearly with increased core utilization, although utilizing all the cores and the highest-performing vector units (e.g. AVX512 in Skylake) should still be the most efficient.
The two commercial compilers that CHPC licenses, Intel and PGI, both support building multiple optimized code paths for different CPU architectures into a single executable, alleviating the need to build separate executables for each CPU type. The open source GNU compiler does not currently support this option.
Parallel programs add another complexity factor in the form of the network interface over which the parallel program (usually using an MPI library) runs. Most CHPC clusters feature a high performance InfiniBand network, but the Lonepeak cluster nodes, as well as user desktops, have only the slower Ethernet network, which in the past required using separate MPIs built for one network or the other.
Most current MPIs allow multiple network channels to be included in a single MPI build, which then allows an executable built with such an MPI to run over several different networks.
We are therefore getting to the point where it is possible to make a single high performance executable that runs optimally on many CPU architectures and over many networks.
That said, due to the simplicity of build and deployment, as well as good performance, we recommend using the Intel compiler and Intel MPI as the first choice for building user applications and we are planning to build most of the applications that we support in this manner.
Multi-architecture CPU optimization
Intel compilers
Intel calls this approach automatic CPU dispatch, and it is invoked with the -ax compiler flag. The highest CPU architecture currently at CHPC includes the AVX512 vectorization instructions and is targeted with -axCORE-AVX512. This option instructs the compiler to produce two binary paths, one for AVX512-compatible CPUs and the other for a generic x86 CPU. The code will therefore run optimally using AVX512 on an AVX512 CPU, but suboptimally, using only SSE vectorization, on any other CPU (including those having AVX and higher SSE vectorization instructions).
In order to build an executable that vectorizes optimally on all CHPC clusters, that is on the Nehalem, SandyBridge, Haswell and Skylake generations of Intel Xeon CPUs, we need to add specifications for those particular architectures, i.e. -axCORE-AVX512,CORE-AVX2,AVX,SSE4.2. To verify this is indeed the case, one can enable detailed compiler reporting via the flags -diag-enable=all -qopt-report and then examine the compile report *.optrpt files, one for each source file. There we should see the following sections reporting optimizations for the given architectures:
Begin optimization report for: main(int, char **) [skylake_avx512]
Begin optimization report for: main(int, char **) [core_4th_gen_avx]
Begin optimization report for: main(int, char **) [core_2nd_gen_avx]
Begin optimization report for: main(int, char **) [core_i7_sse4_2]
Begin optimization report for: main(int, char **) [generic]
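For reference, a compile command along the following lines (using the hello_ser.c example from below) should produce these per-architecture sections in hello_ser.optrpt; the exact report layout may vary with the compiler version:
icc -O3 -ipo -axCORE-AVX512,CORE-AVX2,AVX,SSE4.2 -diag-enable=all -qopt-report hello_ser.c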
In general, the *.optrpt files are a good place to examine how well the compiler vectorized the program, although an even better tool is the graphical Intel Advisor.
Note that using the single optimization option -fast does not build multiple target executables; to produce highly optimized code add the flags -O3 -ipo, e.g.
icc -O3 -ipo -axCORE-AVX512,CORE-AVX2,AVX,SSE4.2 hello_ser.c -vec-report
hello_ser.c(8): (col. 5) remark: LOOP WAS VECTORIZED
hello_ser.c(4): (col. 1) remark: main has been targeted for automatic cpu dispatch
Please also note that we have had trouble building some codes with this complex multi-target option. If a program's compilation fails, remove the SSE4.2 option. This means no optimized code path will be built for Lonepeak, but the program will still run there using the generic code path, most likely without being significantly slower.
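For example, the reduced flag set from this workaround would look as follows (hello_ser.c is just a placeholder source file):
icc -O3 -ipo -axCORE-AVX512,CORE-AVX2,AVX hello_ser.c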
Below are the results of a series of HPL benchmark runs compiled with the Intel 15.0.1 compiler, MKL 11.2, Intel MPI 5.0.1 and the -ax option vs. the -x option, which optimizes for a specified CPU target only. We ran this benchmark on 4 or 5 different nodes, reporting the average with standard deviation, since there are small variances from run to run due to system noise. The values are in GFlops per node.
| | -axCORE-AVX2,AVX,SSE4.2 | -xSSE4.2 | -xAVX | -xCORE-AVX2 | Speedup vs. Ember | Cores/node | Core increase |
|---|---|---|---|---|---|---|---|
| Ember Westmere | 119.85+/-1.34 | 120.23+/-0.59 | | | 1.00 | 12 | 1.00 |
| Kingspeak Sandybridge | 311.63+/-3.56 | | 310.73+/-3.58 | | 2.60 | 16 | 1.33 |
| Kingspeak Haswell | 765.90+/-6.48 | | | 765.48+/-7.49 | 6.39 | 24 | 2.00 |
From the table above, it is evident that the automatic CPU dispatch (-ax) flag produced CPU-specific optimized code and that a single executable runs optimally on all three platforms. Since most of this benchmark's runtime is spent in the LAPACK routines, part of the MKL library, it does not necessarily show the power of compiler optimization for each CPU architecture; nevertheless, the code runs optimally on all the platforms without giving an illegal instruction error, which would be the case if we optimized only for the latest CPU architecture and ran on an earlier one.
It is also worth noticing the increase of GFlops performance across the three generations, roughly doubling per core with the AVX instruction set (2 vs. 4 double precision wide vector units) and becoming ~3x faster per core with AVX2 (which adds the fused multiply-add instruction, i.e. doing a vector multiplication and addition in a single instruction).
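As a quick sanity check against the table, dividing the per-node numbers by the core counts gives roughly 119.85/12 ≈ 10.0 GFlops per core on Ember, 311.63/16 ≈ 19.5 on Kingspeak SandyBridge and 765.90/24 ≈ 31.9 on Kingspeak Haswell, i.e. about 2x and 3.2x the Ember per-core performance.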
NVHPC compilers
The NVHPC (formerly PGI) compilers call this the unified binary. It is achieved by bundling the different CPU architecture names into the -tp compiler flag. For the four CPU architectures on the CHPC clusters, this corresponds to -tp=nehalem,sandybridge,haswell,skylake.
Notice that using the single optimization option -fastsse works fine, e.g.
nvc -fastsse -tp=nehalem,sandybridge,haswell,skylake -Minfo=unified,vect hello_ser.c
main:
8, PGI Unified Binary version for -tp=skylake-64
16, Generated 2 alternate versions of the loop
Generated vector simd code for the loop
main:
8, PGI Unified Binary version for -tp=haswell-64
16, Generated 2 alternate versions of the loop
Generated vector simd code for the loop
main:
8, PGI Unified Binary version for -tp=sandybridge-64
16, Generated 3 alternate versions of the loop
Generated vector simd code for the loop
main:
8, PGI Unified Binary version for -tp=nehalem-64
16, Generated 3 alternate versions of the loop
Generated vector simd code for the loop
However, there are difficulties in building MPI distributions with the unified binary approach, which we detail below.
It is possible to work around all these issues, but it makes PGI builds a little more cumbersome.
GNU compilers
GNU compilers (as far as we know) don't allow for multiple code paths in an executable, so one has to build optimized code for each CPU architecture. Also note that the gcc 8.5.0 series shipped with Rocky Linux 8 has optimization flags for the recent Intel CPUs that are on the Notchpeak cluster, but it does not support optimizations for the AMD CPUs that are on Notchpeak. The AMD CPUs will take advantage of the older AVX2 instructions, equivalent to the CPUs on the Kingspeak cluster. For Kingspeak, use -march=sandybridge; for Notchpeak Intel nodes, use -march=skylake-avx512; for Notchpeak AMD nodes, use -march=broadwell.
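For example, separate per-cluster builds could look like the following (program.c, the -O3 level and the output names are just placeholders):
gcc -O3 -march=sandybridge -o program.kingspeak program.c
gcc -O3 -march=skylake-avx512 -o program.notchpeak program.c
gcc -O3 -march=broadwell -o program.notchpeak-amd program.c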
Newer versions of the GNU compilers are installed as needed to include support for newer CPUs. Version 10.2.0 includes the -march=znver2 flag, which has specific optimizations for the Notchpeak AMD CPUs, for example:
module load gcc/10.2.0
gcc -march=znver2 hello_ser.c
MPI multiple network interfaces support
Intel MPI
Intel MPI supports multiple networks. To use it, load the modules:
module load intel impi
The choice of network is made by the FI_PROVIDER variable, with allowed values being {verbs, tcp}. While the best available fabric should be selected by default, we have seen cases where the application would crash at the start when the fabric was not specified. If this happens, the network can be explicitly specified by:
For Infiniband -- mpirun -genv FI_PROVIDER=verbs -np 2 ./latbw_impi
For ethernet -- mpirun -genv FI_PROVIDER=tcp -np 2 ./latbw_impi
Here's a table of latency/bandwidth. Similar performance can be expected on other clusters.
| | Ash Latency [us] | Ash Bandwidth [MB/s] | KP Latency [us] | KP Bandwidth [MB/s] | NP Latency [us] | NP Bandwidth [MB/s] |
|---|---|---|---|---|---|---|
| verbs | 2.05 | 6254 | 1.68 | 3376.022 | 1.20 | 6191.268 |
| tcp | 23.31 | 2719 | 14.60 | 1748.446 | 13.20 | 2421.430 |
Notice that tcp ran over the InfiniBand using IPoIB, therefore we see fairly decent latencies and bandwidths.
A different way of launching can be chosen with the -bootstrap option. There are many different choices for this option; of value at CHPC are slurm (the default if nothing is specified) and ssh. To use ssh over multiple nodes, one has to prepare a host file (as we used to in the PBS world) and feed it to mpirun:
mpirun -genv FI_PROVIDER=verbs -bootstrap ssh -machinefile nodefile -np 2 ./latbw_impi
Now, the fun part is that, depending on how we write the node names in the host file, tcp will run over the corresponding network. If we use hostnames such as kp001,..., it'll run over the Ethernet; if we use kp001.ipoib,..., it'll run over the InfiniBand using IPoIB.
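For example, a hypothetical host file (nodefile) for two Kingspeak nodes using the IPoIB names could look like:
kp001.ipoib
kp002.ipoib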
Due to the simplicity of build and deployment, and good CPU affinity support, we recommend using Intel MPI in user applications. However, be aware that OpenMPI and MVAPICH2 have better latencies. Applications that send many small messages will likely perform the best with OpenMPI.
Also note that thanks to the common Application Binary Interface (ABI) between Intel MPI (>= 5.0) , MPICH (>= 3.1) and MVAPICH2 (>= 2.0), one can build a dynamic executable (default on our systems) with one MPI, and run with the other. See details about this at https://software.intel.com/en-us/articles/using-intelr-mpi-library-50-with-mpich3-based-applications.
MPICH
MPICH's most commonly used nemesis channel supports several subchannels (called netmods), including tcp, ib, mxm and ofi. The last three use InfiniBand, however ib and mxm are deprecated. We therefore built MPICH from version 3.2.1 with the relatively new ofi OpenFabrics netmod. As such, it does not appear to be as well optimized, since its latencies and bandwidths seem to be worse than those of MVAPICH2 or Intel MPI. We therefore recommend building with those if the program will be run mostly on the InfiniBand clusters.
Since the tcp netmod is the default, to run over InfiniBand set the environment variable MPICH_NEMESIS_NETMOD=ofi, e.g.
mpirun -genv MPICH_NEMESIS_NETMOD ofi -np 2 ./a.out
For a single workstation (running an MPI program with multiple processes on a single node), the network is not accessed and therefore the program works even if IB is not present on the system.
Also note that, unlike Intel MPI or MVAPICH2, MPICH does not set task affinity to the CPUs (i.e., bind tasks to CPUs for better performance) by default. To enable task affinity, use the -bind-to flag, e.g. to bind each MPI task to a core, -bind-to core.
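A minimal sketch of such a run (a.out stands for any MPICH-built program):
mpirun -bind-to core -np 2 ./a.out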
It is possible, however, to run the MPICH-built executable on the clusters with InfiniBand, taking advantage of the common ABI with Intel MPI. Simply add Intel MPI to your environment with
module load impi
and then run with Intel MPI as
mpirun -np 2 ./a.out
OpenMPI
OpenMPI has provided multi-network support for a while, and chooses the best available network automatically.
To force the use of Ethernet, add the --mca btl tcp,self flag to mpirun; the default is to use the InfiniBand.
mpirun --mca btl tcp,self -np $SLURM_NTASKS $EXE # uses Ethernet
mpirun -np $SLURM_NTASKS $EXE # uses InfiniBand
Here's a table of latency/bandwidth on Kingspeak and Notchpeak. Similar performance can be expected on other clusters.
| | KP Latency [us] | KP Bandwidth [MB/s] | NP Latency [us] | NP Bandwidth [MB/s] |
|---|---|---|---|---|
| ib | 1.486 | 3348.124 | 2.962 | 6081.060 |
| tcp | 15.189 | 229.600 | 18.405 | 230.317 |
It looks like the tcp bandwidth is limited to Ethernet speeds, which indicates that OpenMPI chose the Ethernet network to run over rather than IPoIB.
Since OpenMPI does not yet support the common MPI ABI, its executables cannot be interchanged with those built with Intel MPI, MVAPICH2, or MPICH.
Multi-architecture CPU optimization with multi-network MPI
Intel MPI
Intel MPI multi-network support along with the automatic CPU dispatch in the Intel compilers is quite straightforward and works well. During compilation, all one needs to do is include the -axCORE-AVX512,CORE-AVX2,AVX,SSE4.2 flag, plus -O3 -ip for good optimization, and then at runtime select the appropriate network fabric via the I_MPI_FABRICS environment variable.
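Putting this together, a minimal build-and-run sketch might look as follows (program.c, the executable name and the fabric choice are placeholders; pick the fabric that works best for your code):
module load intel impi
mpiicc -O3 -ip -axCORE-AVX512,CORE-AVX2,AVX,SSE4.2 program.c -o program
mpirun -genv I_MPI_FABRICS shm:ofa -np $SLURM_NTASKS ./program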
Intel MPI is thus a good choice for a single executable that runs optimized over all CHPC Linux machines. As such, most of the applications that we support are built that way.
Note that there is also the mpitune utility that allows one to run multiple scenarios automatically and come up with the best Intel MPI runtime parameters (all MPIs feature a number of internal switches that adjust communication parameters based on message sizes, task counts, etc.). Details on the mpitune utility are at Intel's mpitune page.
MVAPICH2
The default network channel of MVAPICH2 is mrail, which only supports InfiniBand, but it is supposed to provide better performance than the nemesis channel inherited from MPICH. However, in reality, based on the (outdated) LAMMPS benchmarks below, MPICH with mxm performs faster. MVAPICH2's strength may lie in specific communication and InfiniBand optimizations which LAMMPS may not be utilizing.
MVAPICH2's good performance on InfiniBand makes it a good choice for a program that will run on the InfiniBand clusters. If using the Intel or PGI compilers, you can build an executable that runs optimally on all the CPU architectures that our clusters have.
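For example, a hypothetical MVAPICH2 build with the Intel compiler's multi-architecture dispatch (program.c is a placeholder; mpicc is the MVAPICH2 compiler wrapper) could look like:
module load intel mvapich2
mpicc -O3 -axCORE-AVX512,CORE-AVX2,AVX,SSE4.2 program.c -o program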
Note that you can still run MVAPICH2 executables on Ethernet-only clusters or on single desktops using the Intel MPI or MPICH libraries thanks to the common ABI. In case the program complains about missing libpmi.so, set
setenv LD_LIBRARY_PATH "/uufs/chpc.utah.edu/sys/pkg/slurm/std/lib:$LD_LIBRARY_PATH"
for tcsh, or
export LD_LIBRARY_PATH="/uufs/chpc.utah.edu/sys/pkg/slurm/std/lib:$LD_LIBRARY_PATH"
for bash, and then run the executable as if using Intel MPI or MPICH.
MPICH
MPICH with multi-network (over MXM) and multi-architecture support works well only with the Intel compilers; GNU does not support multi-architecture, whereas PGI has problems with running unified binary-built MPICH on Sandybridge and newer CPUs. Therefore we only provide MPICH built with the Intel compiler and the -axCORE-AVX512,CORE-AVX2,AVX,SSE4.2 flag, along with PGI and GNU MPICH builds with the lowest common denominator optimization, which is the Nehalem architecture. Since there should be minimal vectorization potential inside the MPI library itself, this should not affect performance radically. You can then build PGI unified binary applications on top of the Nehalem-optimized MPI.
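For instance, a PGI/NVHPC unified binary MPI application could be built roughly like this (mpicc here is the MPICH compiler wrapper; my_mpi_prog.c is a placeholder):
mpicc -fastsse -tp=nehalem,sandybridge,haswell,skylake my_mpi_prog.c -o my_mpi_prog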
Furthermore, as of MPICH 3.2.1, the MXM interface is no longer included in our build; it has been replaced with OpenFabrics (ofi), which does not appear to perform as well as MXM used to.
OpenMPI
OpenMPI with multi-network, multi-architecture support works well with the Intel compilers. GNU does not support multi-architecture. PGI has problems with running unified binary-built OpenMPI on Sandybridge and newer CPUs, which is why we have built OpenMPI optimized for Nehalem, as in the case of MPICH. Build the PGI unified binary on top of the Nehalem-optimized MPI.
LAMMPS benchmark results
Below are benchmark results of the LAMMPS molecular dynamics code for jobs run on four Kingspeak Haswell and SandyBridge nodes, using the Intel compilers and MPI (-axCORE-AVX2) to build the code and running with different Intel MPI runtime options, as well as with MPICH and MVAPICH2. Total runtime and communication time in seconds are reported in the table below, which means a lower number is better.
The runs were performed as follows:
- IMPI default
module load intel impi
set EXE = /uufs/chpc.utah.edu/sys/installdir/lammps/9Dec14/src/lmp_intel2015
mpirun -np $SLURM_NTASKS $EXE < in.spce
- IMPI shm:dapl
module load intel impi
set EXE = /uufs/chpc.utah.edu/sys/installdir/lammps/9Dec14/src/lmp_intel2015
mpirun -genv I_MPI_FABRICS shm:dapl -np $SLURM_NTASKS $EXE < in.spce
- IMPI shm:ofa
module load intel impi
set EXE = /uufs/chpc.utah.edu/sys/installdir/lammps/9Dec14/src/lmp_intel2015
mpirun -genv I_MPI_FABRICS shm:ofa -np $SLURM_NTASKS $EXE < in.spce
- IMPI srun
module load intel impi
setenv I_MPI_PMI_LIBRARY /uufs/kingspeak.peaks/sys/pkg/slurm/std/lib/libpmi.so
set EXE = /uufs/chpc.utah.edu/sys/installdir/lammps/9Dec14/src/lmp_intel2015
srun -n $SLURM_NTASKS $EXE < in.spce
- MPICH default
module load intel mpich2
setenv LD_LIBRARY_PATH "/uufs/chpc.utah.edu/sys/installdir/mpich/3.1.4i/lib:$LD_LIBRARY_PATH"
setenv MPICH_NEMESIS_NETMOD mxm
set EXE = /uufs/chpc.utah.edu/sys/installdir/lammps/9Dec14/src/lmp_intel2015
mpirun -np $SLURM_NTASKS $EXE < in.spce
- MPICH affinity
module load intel mpich2
setenv LD_LIBRARY_PATH "/uufs/chpc.utah.edu/sys/installdir/mpich/3.1.4i/lib:$LD_LIBRARY_PATH"
setenv MPICH_NEMESIS_NETMOD mxm
set EXE = /uufs/chpc.utah.edu/sys/installdir/lammps/9Dec14/src/lmp_intel2015
mpirun -bind-to core -np $SLURM_NTASKS $EXE < in.spce
- MVAPICH2 default
module load intel mvapich2
set EXE = /uufs/chpc.utah.edu/sys/installdir/lammps/9Dec14/src/lmp_intel2015
srun -n $SLURM_NTASKS $EXE < in.spce
LAMMPS LJ
| | Haswell, 96 procs: Total | Haswell, 96 procs: Comm | Sandybridge, 64 procs: Total | Sandybridge, 64 procs: Comm | Affinity |
|---|---|---|---|---|---|
| IMPI default | 41.02 | 10.46 | 73.61 | 23.40 | core affinity |
| IMPI shm:dapl | 40.79 | 10.25 | 73.55 | 23.31 | core affinity |
| IMPI shm:ofa | 42.79 | 13.70 | 70.61 | 22.94 | core affinity |
| IMPI srun | 64.15 | 30.48 | 91.71 | 38.14 | no affinity |
| MPICH mxm | 64.93 | 31.07 | 89.07 | 36.36 | no affinity |
| MPICH mxm bind core | 45.66 | 14.63 | 74.96 | 24.59 | core affinity |
| MVAPICH2 default | 42.61 | 12.30 | 74.67 | 25.19 | core affinity |
LAMMPS SPCE
| | Haswell, 96 procs: Total | Haswell, 96 procs: Comm | Sandybridge, 64 procs: Total | Sandybridge, 64 procs: Comm | Affinity |
|---|---|---|---|---|---|
| IMPI default | 60.39 | 2.61 | 76.91 | 3.71 | core affinity |
| IMPI shm:dapl | 60.16 | 2.60 | 76.78 | 3.72 | core affinity |
| IMPI shm:ofa | 60.68 | 2.77 | 74.60 | 3.41 | core affinity |
| IMPI srun | 84.37 | 4.14 | 100.87 | 4.76 | no affinity |
| MPICH mxm | 85.82 | 5.17 | 106.45 | 5.48 | no affinity |
| MPICH mxm bind core | 60.77 | 3.69 | 75.94 | 4.18 | core affinity |
| MVAPICH2 default | 60.58 | 3.03 | 78.09 | 4.44 | core affinity |
There are several observations that can be made:
- Some MPI distributions set process affinity by default (Intel MPI, MVAPICH2, OpenMPI) while others do not (MPICH). Slurm currently is not set up for task affinity and as such srun with Intel MPI does not perform well (this is to be retested once we roll out Slurm affinity to the clusters). MVAPICH2 overrides the Slurm affinity settings and sets its own.
- Using the process affinity is a good idea for performance reasons.
- Intel MPI's ofa and dapl fabrics' performance is comparable, though in some cases one is slightly better than the other, and vice versa.
- The MPICH mxm netmod is close to being competitive with Intel MPI; it may improve with the MPICH 3.2 release (mxm contribution from Mellanox) and/or our Mellanox OFED stack update in the future.
- MVAPICH2 is not as fast as we had hoped; Intel MPI is faster for LAMMPS.
Recommendations for different scenarios
To summarize the material presented above, here are our recommendations for optimal performance for several common scenarios.
For an optimized application running on all CHPC Linux machines
Use the Intel compilers with Intel MPI and the automatic CPU dispatch with the -axCORE-AVX512,CORE-AVX2,AVX,SSE4.2 compiler flag. It is simple and it works. Test the dapl and ofa fabrics and pick the one that performs better using the I_MPI_FABRICS option to mpirun, as shown below.
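For example, a hypothetical timing comparison of the two fabrics could be run as follows (program and the task count are placeholders from your job script):
mpirun -genv I_MPI_FABRICS shm:dapl -np $SLURM_NTASKS ./program
mpirun -genv I_MPI_FABRICS shm:ofa -np $SLURM_NTASKS ./program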
Alternatively, all other MPI distributions (MVAPICH2, MPICH, OpenMPI) work with the
Intel compiler and the -axCORE-AVX2,AVX,SSE4.2
option as well. If you use MVAPICH2, you can use the same executable to run on the
Ethernet clusters and desktops using MPICH or Intel MPI thanks to the common ABI (see
the Intel MPI section).
For an optimized application using GNU compilers
Since GNU does not allow for multi-architecture optimization, you have to build separate executables, e.g. for the Kingspeak SandyBridge nodes (16, 20 cores) and Haswell nodes (24 cores). Use the -march flag to specify the appropriate CPU architecture (-march=westmere - Lonepeak, -march=sandybridge - Kingspeak 16 and 20 core nodes, -march=haswell - Kingspeak 24 and 28 core nodes and Notchpeak AMD nodes, -march=skylake - Notchpeak Intel nodes).
As for what MPI to use on the clusters, use OpenMPI or MVAPICH2 for single-threaded programs and Intel MPI for multi-threaded programs - the latter provides better CPU affinity control.
Be aware that OpenMPI lacks the ABI compatibility with the other MPI distributions which makes it less flexible.
For an optimized application using NVHPC compilers
The NVHPC unified binary works fairly well with applications; however, we did not have much success building MPI libraries with it, so the MPIs in the chpc.utah.edu branch have been built to the lowest common denominator (Lonepeak). We don't expect any performance impact from using these MPI builds.
To then build an application that will run optimized on all CHPC InfiniBand clusters, use MVAPICH2 with the -tp=nehalem,sandybridge,haswell,skylake compiler flag. For a flexible binary that works both on InfiniBand and on Ethernet, use Intel MPI or MPICH.
OpenMPI would be another good choice.