Submitting Jobs¶
This section explains how to write and submit a job script using the slurm scheduler. This is perhaps the most important part of the documentation. Whether you understand all of the theory or not, you can always jump here and start running things immediately.
The SLURM Quick Start User Guide is a good place to get familiar with the basic commands for starting, stopping and monitoring jobs. Further information can be found in the official SLURM documentation.
Job script¶
In order to run a job on the cluster you will need to write a submission script. This script tells the cluster what kind of environment you need, what modules must be loaded, and of course, what file needs to be run. An important point to stress here is that, whether you loaded modules on the head node or not (the node you ssh into first), all modules must be added to your submit script. This is because when the cluster actually starts running your jobs, it does so on a completely different node, which has no memory of the modules you loaded manually when you first accessed the cluster.
A job script has the following general structure:
1 #!/bin/bash
2 #SBATCH --time=00:10:00
3 #SBATCH --output %j.stdout
4 #SBATCH --error %j.stderr
5 module load spack/default gcc/12.3.0 cuda/12.3.0 openmpi/4.1.6 \
6 fftw/3.3.10 boost/1.83.0 python/3.12.1
7 source espresso-4.3/venv/bin/activate
8 srun --cpu-bind=cores python3 espresso-4.3/testsuite/python/particle.py
9 deactivate
10 module purge
Let’s break it down line by line:
L1: shebang, which is needed to select the shell interpreter
L2-4: slurm batch options; these can also be provided on the command line
L5-6: load modules; you can use Spack, EasyBuild, EESSI, or custom modules
L7: enter a Python environment, e.g. venv, virtualenv or conda; only relevant to Python users
L8: slurm launcher (depends on the cluster)
L9: leave the Python environment
L10: clear modules
This job script is specific to Ant. On other clusters, you will need to adapt the launcher and the module load command. Use module spider to find out how package names are spelled (lowercase, titlecase, with extra suffixes) and which versions are available. Often, a package version will only be available for a specific C compiler or CUDA release; modulefiles typically do a good job of informing you of incompatible package versions. Some clusters require the srun launcher, while other clusters allow you to use the mpirun launcher.
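As a sketch, the same job adapted for a hypothetical cluster that uses EasyBuild-style module names and the mpirun launcher might look like the following; the module names, versions and script path are placeholders, so check module spider on the target cluster:
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --output %j.stdout
#SBATCH --error %j.stderr
# placeholder EasyBuild-style module names; verify with `module spider`
module load GCC/12.3.0 OpenMPI/4.1.6 Python/3.12.1
source "${HOME}/venv/bin/activate"
# mpirun launcher instead of srun; SLURM_NTASKS is set by slurm from the allocation
mpirun -n "${SLURM_NTASKS}" python3 script.py
deactivate
module purge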
You can find examples of slurm job scripts for all clusters the ICP has access
to in this user guide.
Submit jobs¶
Submit jobs with sbatch:
sbatch --job-name="test" --nodes=1 --ntasks=4 --mem-per-cpu=2GB job.sh
All command line options override the corresponding #SBATCH
variables in the job script.
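For example, with a job name and task count already set in the script, the command line values take precedence at submission time (a minimal sketch; the values are placeholders):
# excerpt from job.sh
#!/bin/bash
#SBATCH --job-name="test"
#SBATCH --ntasks=4

# the command line values override the #SBATCH values:
# the job runs as "production" with 8 tasks
sbatch --job-name="production" --ntasks=8 job.sh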
A “task” is usually understood as a CPU core. There are many options (complete list), but here are the most important ones (a combined example follows the list):
- --time=<duration>: wall clock time, typically provided as colon-separated integers, e.g. “minutes”, “minutes:seconds”, “hours:minutes:seconds”
- --output=<filepath>: output file for stdout, by default “slurm-%j.out”, where the “%j” is replaced by the job ID
- --error=<filepath>: output file for stderr, by default “slurm-%j.out”, where the “%j” is replaced by the job ID
- --job-name=<name>: job name; long names are allowed, but most slurm tools will only show the first 8 characters
- --chdir="${HOME}/hydrogels": set the working directory of the job script
- --gres=<category>[[:type]:count]: request generic resources, e.g. --gres=gpu:l4:2 to request 2x NVIDIA L4
- --licenses=<license>[@db][:count][,license[@db][:count]...]: comma-separated list of licenses for commercial software, e.g. --licenses=nastran@slurmdb:12,matlab to request 12 NASTRAN licenses from slurmdb and 1 MATLAB license
- --partition=<name>: name of the partition (only on clusters that use partitions)
- --exclude=<name>[,name...]: prevent the job from landing on a comma-separated list of nodes
- --exclusive: prevent any other job from landing on the node(s) on which this job is running (only relevant for benchmark jobs!)
- --test-only: dry run: give an estimate of the job start date (but doesn't schedule it!)
- amount of RAM: only one of the following:
  - --mem=<size>[K|M|G|T]: total job memory, default is in megabytes (M)
  - --mem-per-cpu=<size>[K|M|G|T]: job memory per CPU, default is in megabytes (M)
  - --mem-per-gpu=<size>[K|M|G|T]: job memory per GPU, default is in megabytes (M)
- number of tasks: any valid combination of the following:
  - --nodes=<count>: number of nodes
  - --ntasks=<count>: number of tasks
  - --ntasks-per-core=<count>: number of tasks per core, usually this should be 1
  - --ntasks-per-node=<count>: number of tasks per node
  - --ntasks-per-gpu=<count>: number of tasks per GPU
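As an illustration, here is a request combining several of these options for a hypothetical GPU job (the values and the gres name are placeholders and depend on the cluster):
sbatch --job-name="gpu_test" --time=02:00:00 --nodes=1 --ntasks-per-node=8 \
       --mem-per-gpu=16G --gres=gpu:l4:1 job.sh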
Core pinning¶
Some clusters have hyperthreading enabled, in which case you will need to check their online documentation for how to set the MPI binding policy to use logical core #0 in each physical core (i.e. skip logical core #1).
Some clusters provide the hwloc tool to help you troubleshoot the binding policy.
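For instance, assuming the cluster's hwloc module provides the standard hwloc-ls binary, you can print the topology of the node your interactive job landed on:
module load hwloc
# text summary of sockets, physical cores and hyperthreads, without I/O devices
hwloc-ls --no-io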
For example, on HPC Vega, one must use sbatch --hint=nomultithread ... and srun --cpu-bind=cores ... to bind per core. Starting an interactive job with:
srun --partition=dev --time=0:05:00 --job-name=interactive --hint=nomultithread \
--ntasks=32 --ntasks-per-node=32 --mem-per-cpu=400MB --pty /usr/bin/bash
and generating a binding report:
module load Boost/1.72.0-gompi-2020a hwloc
mpiexec -n ${SLURM_NTASKS} --bind-to core --map-by core --report-bindings /bin/true
outputs:
rank 0 bound to socket 1[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../..]
rank 1 bound to socket 1[core 1[hwt 0-1]]: [../BB/../../../../../../../../../../../../../..]
rank 2 bound to socket 1[core 2[hwt 0-1]]: [../../BB/../../../../../../../../../../../../..]
rank 3 bound to socket 1[core 3[hwt 0-1]]: [../../../BB/../../../../../../../../../../../..]
rank 4 bound to socket 1[core 4[hwt 0-1]]: [../../../../BB/../../../../../../../../../../..]
rank 5 bound to socket 1[core 5[hwt 0-1]]: [../../../../../BB/../../../../../../../../../..]
rank 6 bound to socket 1[core 6[hwt 0-1]]: [../../../../../../BB/../../../../../../../../..]
rank 7 bound to socket 1[core 7[hwt 0-1]]: [../../../../../../../BB/../../../../../../../..]
rank 8 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../../BB/../../../../../../..]
rank 9 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../../../BB/../../../../../..]
rank 10 bound to socket 1[core 10[hwt 0-1]]: [../../../../../../../../../../BB/../../../../..]
rank 11 bound to socket 1[core 11[hwt 0-1]]: [../../../../../../../../../../../BB/../../../..]
rank 12 bound to socket 1[core 12[hwt 0-1]]: [../../../../../../../../../../../../BB/../../..]
rank 13 bound to socket 1[core 13[hwt 0-1]]: [../../../../../../../../../../../../../BB/../..]
rank 14 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../../BB/..]
rank 15 bound to socket 1[core 15[hwt 0-1]]: [../../../../../../../../../../../../../../../BB]
Each CPU has 64 physical cores and 2 hyperthreads per core, thus 128 logical cores. In the ASCII diagram, physical cores are separated by a forward slash character, each hyperthread is represented as a dot, and bound hyperthreads are represented as a “B”. This output shows that the CPU is using 2 hyperthreads per core, and that each MPI rank is bound to both logical cores of the same physical core.
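If hwloc is not available, slurm itself can report the binding it applied; adding verbose to the --cpu-bind option makes each task print its CPU mask (the application name is a placeholder):
srun --cpu-bind=verbose,cores ./my_app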
For more details, check these external resources:
- pinning with psslurm: slides with a graphical representation of pinning from srun on a 2-socket mainboard with hyperthreading enabled
- LLview pinning app: GUI to show exactly how the srun pinning options get mapped on the hardware (currently only implemented for hardware at Jülich and partners)
Cancel jobs¶
Cancel jobs with scancel:
# cancel a specific job
scancel 854836
# cancel all my jobs
scancel -u ${USER}
# cancel all my jobs that are still pending
scancel -t PENDING -u ${USER}
Monitor jobs¶
List queued jobs with squeue:
$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
853911 ant hydrogel jgrad PD 0:00 1 (Priority)
853912 ant solvent jgrad R 8:32:40 2 compute[08,12]
853913 ant test jgrad R 01:17 1 compute11
The status column (ST) can take various values (complete list), but here are the most common:
- PD: pending
- R: running
- S: suspended
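You can also filter the list by state, for example to show only your pending jobs:
squeue --me --states=PENDING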
The nodelist/reason column (NODELIST(REASON)) states the node(s) where the job is running or, if it is not running, the reason why. There are many reasons (complete list), but here are the most common:
- Resources: the queue is full and the job is waiting on resources to become available
- Priority: there are pending jobs with a higher priority
- Dependency: the job depends on another job that hasn't completed yet
- Licenses: the job is waiting for a license to become available
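For pending jobs, squeue can also print the scheduler's current estimate of the start time (the estimate may change as other jobs finish early or new jobs are submitted):
squeue --me --start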
Check job status with sacct:
$ sacct -j 854579 # pending job
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
854579 dft_water ant root 12 PENDING 0:0
$ sacct -j 853980 # running job
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
853980 WL2057-0-+ ant root 5 RUNNING 0:0
853980.batch batch root 5 RUNNING 0:0
853980.0 lmp_lmpwl root 5 RUNNING 0:0
$ sacct -j 853080 # completed job
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
853080 capillary+ ant root 1 COMPLETED 0:0
853080.batch batch root 1 COMPLETED 0:0
$ sacct -j 853180 # cancelled job
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
853180 2flowfiel+ ant root 1 CANCELLED+ 0:0
$ sacct # show most recent jobs
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
854836 solvent ant root 4 CANCELLED+ 0:0
854837 mpi_test ant root 2 FAILED 2:0
854837.0 mpi_test root 2 COMPLETED 0:0
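sacct also accepts a custom field list, which is handy for checking the resources a finished job actually used; as a sketch (the job ID is a placeholder, and MaxRSS is reported per job step):
sacct -j 853080 --format=JobID,JobName,Elapsed,MaxRSS,State,ExitCode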
Check job properties with scontrol:
$ scontrol show job 853977
JobId=853977 JobName=WL2051-0-40
UserId=rkajouri(1216) GroupId=rkajouri(1216) MCS_label=N/A
Priority=372 Nice=0 Account=root QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
RunTime=1-02:24:05 TimeLimit=2-00:00:00 TimeMin=N/A
Partition=ant AllocNode:Sid=ant:515436
NodeList=compute11
NumNodes=1 NumCPUs=5 NumTasks=5 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
WorkDir=/work/rkajouri/simulations/confinement/bulk_water
Cluster information¶
List partitions and hardware information with sinfo:
$ sinfo # node list
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
ant* up 2-00:00:00 1 down* compute02
ant* up 2-00:00:00 2 mix compute[08,12]
ant* up 2-00:00:00 14 alloc compute[01,03-07,09-11,13-15,17-18]
ant* up 2-00:00:00 1 down compute16
$ sinfo -s # partition summary
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
ant* up 2-00:00:00 16/0/2/18 compute[01-18]
$ sinfo -o "%20N %10c %10m %20G " # node accelerators
NODELIST CPUS MEMORY GRES
compute[01-18] 64 386000 gpu:l4:2
$ sinfo -R # show down nodes
REASON USER TIMESTAMP NODELIST
Not responding slurm 2024-07-21T07:37:47 compute02
Node unexpectedly re slurm 2024-07-23T16:47:22 compute16
$ sinfo -o "%10R %24N %20H %10U %30E " # show down nodes with more details
PARTITION NODELIST TIMESTAMP USER REASON
ant compute02 2024-07-21T07:37:47 slurm(202) Not responding
ant compute16 2024-07-23T16:47:22 slurm(202) Node unexpectedly rebooted
ant compute[01,03-15,17-18] Unknown Unknown none
The most common node states are:
- alloc: node is fully used and cannot accept new jobs
- mix: node is partially used and can accept small jobs
- idle: node isn't used and is awaiting jobs
- plnd: node isn't used, but a large job is planned and will be allocated as soon as other nodes become idle to satisfy allocation requirements
- down: node is down and awaiting maintenance
Show node properties with scontrol:
$ scontrol show node=compute11
NodeName=compute11 Arch=x86_64 CoresPerSocket=32
Gres=gpu:l4:2
OS=Linux 5.14.0-362.24.1.el9_3.0.1.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Apr 4 22:31:43 UTC 2024
RealMemory=386000 AllocMem=146880 FreeMem=376226 Sockets=2 Boards=1
State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=ant
$ scontrol show nodes
NodeName=compute01 Arch=x86_64 CoresPerSocket=32
[...]
NodeName=compute18 Arch=x86_64 CoresPerSocket=32