Submitting Jobs

This section explains how to write and submit a job script with the SLURM scheduler. It is arguably the most important part of this documentation: whether or not you understand all of the theory, you can always jump here and start running things immediately.

The SLURM Quick Start User Guide is a good place to get familiar with the basic commands for starting, stopping, and monitoring jobs. Further information can be found in the SLURM documentation.

Job script

To run a job on the cluster, you need to write a submission script. This script tells the cluster what kind of environment you need, which modules must be loaded, and of course, which file needs to be run. An important point to stress here is that all modules must be loaded in your submission script, whether or not you already loaded them on the head node (the node you first ssh into). When the cluster actually runs your job, it does so on a completely different node, which has no memory of the modules you loaded manually when you first accessed the cluster.

A job script has the following general structure:

 1 #!/bin/bash
 2 #SBATCH --time=00:10:00
 3 #SBATCH --output %j.stdout
 4 #SBATCH --error  %j.stderr
 5 module load spack/default gcc/12.3.0 cuda/12.3.0 openmpi/4.1.6 \
 6             fftw/3.3.10 boost/1.83.0 python/3.12.1
 7 source espresso-4.3/venv/bin/activate
 8 srun --cpu-bind=cores python3 espresso-4.3/testsuite/python/particle.py
 9 deactivate
10 module purge

Let’s break it down line by line:

  • L1: shebang, which is needed to select the shell interpreter

  • L2-4: slurm batch options; these can also be provided on the command line

  • L5-6: load modules; you can use Spack, EasyBuild, EESSI, or custom modules

  • L7: enter a Python environment, e.g. venv, virtualenv or conda; only relevant to Python users

  • L8: slurm launcher (depends on the cluster)

  • L9: leave the Python environment

  • L10: clear modules

This job script is specific to Ant. On other clusters, you will need to adapt the launcher and the module load command. Use module spider to find out how package names are spelled (lowercase, titlecase, with extra suffixes) and which versions are available. Often, a package version is only available for a specific C compiler or CUDA release; module files typically do a good job of warning you about incompatible package versions. Some clusters require the srun launcher, while others allow you to use the mpirun launcher. You can find examples of slurm job scripts for all clusters the ICP has access to in this user guide.
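For example, to adapt the job script above to a cluster that uses the mpirun launcher, you could first query the module system and then replace the srun line as sketched below (the module names, versions, and the launcher choice are illustrative and depend on the cluster):

# list available spellings and versions of a package
module spider fftw
module spider fftw/3.3.10
# on clusters that use mpirun instead of srun, the launcher line becomes
mpirun -n ${SLURM_NTASKS} python3 espresso-4.3/testsuite/python/particle.py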

Submit jobs

Submit jobs with sbatch:

sbatch --job-name="test" --nodes=1 --ntasks=4 --mem-per-cpu=2GB job.sh

All command line options override the corresponding #SBATCH directives in the job script. A “task” is a process (for MPI jobs, one rank per task) and is allocated one CPU core by default. There are many options (complete list), but here are the most important ones (an example combining several of them follows the list):

  • --time=<duration>: wall clock time, typically provided as colon-separated integers, e.g. “minutes”, “minutes:seconds”, “hours:minutes:seconds”

  • --output=<filepath>: output file for stdout, by default “slurm-%j.out”, where the “%j” is replaced by the job ID

  • --error=<filepath>: output file for stderr, by default “slurm-%j.out”, where the “%j” is replaced by the job ID

  • --job-name=<name>: job name; long names are allowed, but most slurm tools will only show the first 8 characters

  • --chdir="${HOME}/hydrogels": set the working directory of the job script

  • --gres=<category>[[:type]:count]: request generic resources, e.g. --gres=gpu:l4:2 to request 2x NVIDIA L4

  • --licenses=<license>[@db][:count][,license[@db][:count]...]: comma-separated list of licenses for commercial software, e.g. --licenses=nastran@slurmdb:12,matlab to request 12 NASTRAN licenses from slurmdb and 1 MATLAB license

  • --partition=<name>: name of the partition (only on clusters that use partitions)

  • --exclude=<name>[,name...]: prevent job from landing on a comma-separated list of nodes

  • --exclusive: prevent any other job from landing on the node(s) on which this job is running (only relevant for benchmark jobs!)

  • --test-only: dry run that reports an estimate of the job start date without actually submitting the job

  • amount of RAM: only one of the following:

    • --mem=<size>[K|M|G|T]: job memory per node, default is in megabytes (M)

    • --mem-per-cpu=<size>[K|M|G|T]: job memory per CPU, default is in megabytes (M)

    • --mem-per-gpu=<size>[K|M|G|T]: job memory per GPU, default is in megabytes (M)

  • number of tasks: any valid combination of the following:

    • --nodes=<count>: number of nodes

    • --ntasks=<count>: number of tasks

    • --ntasks-per-core=<count>: number of tasks per core, usually this should be 1

    • --ntasks-per-node=<count>: number of tasks per node

    • --ntasks-per-gpu=<count>: number of tasks per GPU
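As an example, several of these options can be combined as #SBATCH directives in the script header (the values below are arbitrary and should be adapted to your job):

#SBATCH --job-name=hydrogel
#SBATCH --time=02:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --mem-per-cpu=2GB
#SBATCH --gres=gpu:l4:1

Passing the same options to sbatch on the command line is equivalent and takes precedence over the directives in the script.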

Core pinning

Some clusters have hyperthreading enabled, in which case you will need to check their online documentation to find out how to set the MPI binding policy so that only logical core #0 in each physical core is used (i.e. logical core #1 is skipped).

Some clusters provide the hwloc tool to help you troubleshoot the binding policy. On HPC Vega, for example, one must use sbatch --hint=nomultithread ... and srun --cpu-bind=cores ... to bind tasks per physical core. Starting an interactive job with:

srun --partition=dev --time=0:05:00 --job-name=interactive --hint=nomultithread \
     --ntasks=32 --ntasks-per-node=32 --mem-per-cpu=400MB --pty /usr/bin/bash

and generating a binding report with:

module load Boost/1.72.0-gompi-2020a hwloc
mpiexec -n ${SLURM_NTASKS} --bind-to core --map-by core --report-bindings /bin/true

outputs:

rank  0 bound to socket 1[core  0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../..]
rank  1 bound to socket 1[core  1[hwt 0-1]]: [../BB/../../../../../../../../../../../../../..]
rank  2 bound to socket 1[core  2[hwt 0-1]]: [../../BB/../../../../../../../../../../../../..]
rank  3 bound to socket 1[core  3[hwt 0-1]]: [../../../BB/../../../../../../../../../../../..]
rank  4 bound to socket 1[core  4[hwt 0-1]]: [../../../../BB/../../../../../../../../../../..]
rank  5 bound to socket 1[core  5[hwt 0-1]]: [../../../../../BB/../../../../../../../../../..]
rank  6 bound to socket 1[core  6[hwt 0-1]]: [../../../../../../BB/../../../../../../../../..]
rank  7 bound to socket 1[core  7[hwt 0-1]]: [../../../../../../../BB/../../../../../../../..]
rank  8 bound to socket 1[core  8[hwt 0-1]]: [../../../../../../../../BB/../../../../../../..]
rank  9 bound to socket 1[core  9[hwt 0-1]]: [../../../../../../../../../BB/../../../../../..]
rank 10 bound to socket 1[core 10[hwt 0-1]]: [../../../../../../../../../../BB/../../../../..]
rank 11 bound to socket 1[core 11[hwt 0-1]]: [../../../../../../../../../../../BB/../../../..]
rank 12 bound to socket 1[core 12[hwt 0-1]]: [../../../../../../../../../../../../BB/../../..]
rank 13 bound to socket 1[core 13[hwt 0-1]]: [../../../../../../../../../../../../../BB/../..]
rank 14 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../../BB/..]
rank 15 bound to socket 1[core 15[hwt 0-1]]: [../../../../../../../../../../../../../../../BB]

Each CPU has 64 physical cores and 2 hyperthreads per core, thus 128 logical cores. In the ASCII diagram, physical cores are separated by a forward slash character; each unbound hyperthread is represented as a dot and each bound hyperthread as a “B”. This output shows that the CPU exposes 2 hyperthreads per core and that each MPI rank is bound to both logical cores of the same physical core, i.e. no two ranks share a physical core.
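If hwloc or the MPI binding report is not available, a quick sanity check is to print each task’s CPU affinity from within a job. This is a minimal sketch, assuming the taskset utility (from util-linux) is installed on the compute nodes:

# print the CPU affinity list of every task in the job step
srun --cpu-bind=cores bash -c 'echo "task ${SLURM_PROCID}: $(taskset -cp $$)"'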

For more details, check these external resources:

  • Processor Affinity

  • pinning with psslurm: slides with a graphical representation of pinning from srun on a 2-socket mainboard with hyperthreading enabled

  • LLview pinning app: GUI to show exactly how the srun pinning options get mapped on the hardware (currently only implemented for hardware at Jülich and partners)

Cancel jobs

Cancel jobs with scancel:

# cancel a specific job
scancel 854836
# cancel all my jobs
scancel -u ${USER}
# cancel all my jobs that are still pending
scancel -t PENDING -u ${USER}

Monitor jobs

List queued jobs with squeue:

$ squeue --me
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  853911       ant hydrogel    jgrad PD       0:00      1 (Priority)
  853912       ant  solvent    jgrad  R    8:32:40      2 compute[08,12]
  853913       ant     test    jgrad  R      01:17      1 compute11

The status column (ST) can take various values (complete list), but here are the most common:

  • PD: pending

  • R: running

  • S: suspended

The nodelist/reason column (NODELIST(REASON)) shows the node(s) where the job is running or, if it isn’t running, the reason why. There are many reasons (complete list), but here are the most common (see the example after this list):

  • Resources: the queue is full and the job is waiting on resources to become available

  • Priority: there are pending jobs with a higher priority

  • Dependency: the job depends on another job that hasn’t completed yet

  • Licenses: the job is waiting for a license to become available
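For pending jobs, squeue can also report the scheduler’s current estimate of the start time (the estimate changes as the queue evolves):

# show the estimated start time of pending jobs
squeue --me --start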

Check job status with sacct:

$ sacct -j 854579 # pending job
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
854579        dft_water        ant       root         12    PENDING      0:0
$ sacct -j 853980 # running job
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
853980       WL2057-0-+        ant       root          5    RUNNING      0:0
853980.batch      batch                  root          5    RUNNING      0:0
853980.0      lmp_lmpwl                  root          5    RUNNING      0:0
$ sacct -j 853080 # completed job
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
853080       capillary+        ant       root          1  COMPLETED      0:0
853080.batch      batch                  root          1  COMPLETED      0:0
$ sacct -j 853180 # cancelled job
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
853180       2flowfiel+        ant       root          1 CANCELLED+      0:0
$ sacct # show most recent jobs
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
854836          solvent        ant       root          4 CANCELLED+      0:0
854837         mpi_test        ant       root          2     FAILED      2:0
854837.0       mpi_test                  root          2  COMPLETED      0:0
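Once a job has finished, sacct can also report accounting fields such as the elapsed time and the peak memory usage; the field selection below is just one possible choice (the job ID is the completed job from above):

# elapsed time, peak memory and exit state of a finished job
sacct -j 853080 --format=JobID,JobName,Elapsed,MaxRSS,State,ExitCode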

Check job properties with scontrol:

$ scontrol show job 853977
JobId=853977 JobName=WL2051-0-40
   UserId=rkajouri(1216) GroupId=rkajouri(1216) MCS_label=N/A
   Priority=372 Nice=0 Account=root QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=1-02:24:05 TimeLimit=2-00:00:00 TimeMin=N/A
   Partition=ant AllocNode:Sid=ant:515436
   NodeList=compute11
   NumNodes=1 NumCPUs=5 NumTasks=5 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   WorkDir=/work/rkajouri/simulations/confinement/bulk_water

Cluster information

List partitions and hardware information with sinfo:

$ sinfo # node list
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
ant*         up 2-00:00:00      1  down* compute02
ant*         up 2-00:00:00      2    mix compute[08,12]
ant*         up 2-00:00:00     14  alloc compute[01,03-07,09-11,13-15,17-18]
ant*         up 2-00:00:00      1   down compute16
$ sinfo -s # partition summary
PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T) NODELIST
ant*         up 2-00:00:00        16/0/2/18 compute[01-18]
$ sinfo -o "%20N  %10c  %10m  %20G " # node accelerators
NODELIST              CPUS        MEMORY      GRES
compute[01-18]        64          386000      gpu:l4:2
$ sinfo -R # show down nodes
REASON               USER      TIMESTAMP           NODELIST
Not responding       slurm     2024-07-21T07:37:47 compute02
Node unexpectedly re slurm     2024-07-23T16:47:22 compute16
$ sinfo -o "%10R %24N %20H %10U %30E " # show down nodes with more details
PARTITION  NODELIST                 TIMESTAMP            USER       REASON
ant        compute02                2024-07-21T07:37:47  slurm(202) Not responding
ant        compute16                2024-07-23T16:47:22  slurm(202) Node unexpectedly rebooted
ant        compute[01,03-15,17-18]  Unknown              Unknown    none

The most common node states are:

  • alloc: node is fully used and cannot accept new jobs

  • mix: node is partially used and can accept small jobs

  • idle: node isn’t used and is awaiting jobs

  • plnd: node isn’t used but a large job is planned and will be allocated as soon as other nodes become idle to satisfy allocation requirements

  • down: node is down and awaiting maintenance
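To find nodes that can still accept jobs, you can filter by state; a minimal example:

# list nodes that are idle or only partially allocated
sinfo --states=idle,mix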

Show node properties with scontrol:

$ scontrol show node=compute11
NodeName=compute11 Arch=x86_64 CoresPerSocket=32
   Gres=gpu:l4:2
   OS=Linux 5.14.0-362.24.1.el9_3.0.1.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Apr 4 22:31:43 UTC 2024
   RealMemory=386000 AllocMem=146880 FreeMem=376226 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=ant
$ scontrol show nodes
NodeName=compute01 Arch=x86_64 CoresPerSocket=32
[...]
NodeName=compute18 Arch=x86_64 CoresPerSocket=32