Node Sharing
CHPC now has the usage accounting structure in place to allow multiple batch jobs to share a single node. We have been using the node-sharing feature of Slurm since the addition of the GPU nodes to kingspeak, as it is typically most efficient to run 1 job per GPU on nodes with multiple GPUs.
More recently, we have offered node sharing to select owner groups for testing, and based on that experience we are making node sharing available for any group that owns nodes. If your group owns nodes and would like to be set up to use node sharing, please contact CHPC via helpdesk@chpc.utah.edu.
Node sharing is enabled on all compute nodes in the general CHPC computing environment. It is the default behavior on granite. It is also available, but not enabled by default, on notchpeak, kingspeak, and lonepeak.
Specifying requested resources
For node sharing on the GPU nodes, see the GPU page.
For node sharing on the non-GPU nodes, users must explicitly request the number of cores and the amount of memory to be allocated to the job. The remaining cores and memory are then available for other jobs. The requested core count and memory amount are used to set up the "cgroup" for the job, which is the mechanism used to enforce these limits.
On the granite cluster, node sharing is enabled by default and there is no separate "shared" partition. On the clusters that predate granite, whole-node jobs and shared jobs use different partitions. On those clusters, the node-sharing partitions follow these patterns (a filled-in example is given after the list):
- For general nodes of a cluster
#SBATCH --partition=cluster-shared
- For guest access on owner nodes of a cluster
#SBATCH --partition=cluster-shared-guest
- For owner nodes
#SBATCH --partition=partitionname-shared-kp
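For example, substituting the notchpeak cluster name into the first two patterns (notchpeak is used here purely for illustration) gives:
#SBATCH --partition=notchpeak-shared
#SBATCH --partition=notchpeak-shared-guest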
In addition, on notchpeak there are two nodes (AMD Epyc processors, 64 cores, 512 GB memory) reserved for short jobs, which can only be used in a shared manner. To use these nodes, both the account and partition should be set to notchpeak-shared-short. See the Notchpeak Cluster Guide for more information.
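For example, a job on these short-job nodes would set both directives to the same value:
#SBATCH --account=notchpeak-shared-short
#SBATCH --partition=notchpeak-shared-short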
The number of cores requested must be specified using the ntasks sbatch directive:
#SBATCH --ntasks=2
will request 2 cores.
The amount of memory requested can be specified with the --mem sbatch directive:
#SBATCH --mem=32G
This can also be specified in MB (which is the assumed unit if none is specified):
#SBATCH --mem=32000
If no memory directive is used, 2 GB per core will be allocated to the job by default.
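Putting these pieces together, a minimal batch script for a shared job might look like the sketch below; the account name, partition, walltime, and program are placeholders to be replaced with your own values:
#!/bin/bash
# the account, partition, walltime, and program below are placeholders; substitute your own
#SBATCH --account=myaccount
#SBATCH --partition=notchpeak-shared
#SBATCH --ntasks=2
#SBATCH --mem=32G
#SBATCH --time=01:00:00

# the job's cgroup confines it to the 2 requested cores and 32 GB of memory
./myprogram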
With node sharing, when using sinfo you will notice an additional state for nodes that are partially allocated: mix.
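For example, listing a shared partition (the partition name here is only illustrative):
sinfo -p notchpeak-shared
Partially allocated nodes appear with mix in the STATE column, while fully allocated nodes show alloc and unused nodes show idle.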
Task affinity
Node sharing automatically sets task-to-CPU-core affinity, allowing the job to run only on as many cores as there are requested tasks. To find out which cores the job was pinned to, run
cat /cgroup/cpuset/slurm/uid_$SLURM_JOB_UID/job_$SLURM_JOB_ID/cpuset.cpus
Note that since we have CPU hyperthreading enabled (two logical cores per physical core), this command will report a pair of logical cores for each physical core. For example, on a 28-core node the pair (2,30) corresponds to the third physical core (core number 2, as numbering starts from 0) and its associated hyperthread. Node core numbering from the system perspective can be obtained by running numactl -H.
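Alternatively, the generic Linux taskset utility can be run from within the job to report the affinity list of the current shell (this is a standard Linux tool, not something specific to the CHPC setup):
taskset -cp $$
The reported list should match the logical cores listed in the cpuset.cpus file above.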
Potential implications for performance
Despite the task affinity, the performance of any job run in a shared manner can be impacted by other jobs run on the same node due to shared infrastructure being used, in particular the I/O to storage and the shared communication paths between the memory and CPUs. If you are doing benchmarking using only a portion of a node, you should not use node sharing but instead request the entire node.
Implications for the amount of allocation used
On the granite cluster, the allocation usage of a shared job is based on the maximum of (a) the fraction of cores used, (b) the fraction of memory used, and (c) the fraction of GPUs used. On clusters predating granite, GPUs are not allocated, so only (a) and (b) apply for this calculation. We calculate usage in this manner because using one resource—cores, memory, or GPUs—may affect other researchers' ability to use other resources. As an extreme example, if a node had available GPUs but no available CPU cores, a job requiring a GPU would be unable to start; in effect, the other job(s) running on the node are tying up resources, even if they are not used. The same principle applies to memory and GPUs.
In the case of owner resources, where the quarterly allocation listed at https://www.chpc.utah.edu/usage/cluster/current-project-general.php is based on the number of cores or GPUs and the number of hours in the quarter, this can lead to scenarios where more core or GPU hours are used than are nominally available for a given quarter. This will not cause any problems, but it can lead to instances where the usage is greater than the allocated amount and therefore a negative balance is shown.
As an example, consider a CPU-only node with 24 cores and 128 GB of memory. Without node sharing, the maximum usage during any 1-hour period would be 24 core hours. With node sharing, however, a serial job that needs a lot of memory might request 1 core and 96 GB of memory, while a second job requests 20 cores and the remaining 32 GB. Because all of the memory has been allocated, the remaining 3 cores stay idle. If both jobs run simultaneously for 1 hour, a total of 38 core hours of allocation is used in that hour: 18 core hours for the 96 GB job (based on 3/4 of the node's memory being used) plus 20 core hours for the 20-core job.
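Expressed as a rough formula (a sketch of the accounting rule described above, not the exact implementation), the charge for a shared job on a CPU node is:
core hours charged = max(requested cores / node cores, requested memory / node memory) x node cores x wallclock hours
On granite, the fraction of GPUs requested enters the max as well. For the 96 GB job above: max(1/24, 96/128) x 24 cores x 1 hour = 0.75 x 24 = 18 core hours.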