HTCondor¶
On the ICP workstations, we use HTCondor to efficiently manage high-throughput computing (HTC) jobs that require extended runtimes. HTCondor allows users to submit jobs to a centralized scheduler, which queues and distributes them across available resources.
Things to remember¶
- HTCondor is a system to run many jobs (high-throughput computing) on unused ICP compute resources.
- All workstations, including the CIP pool at the ICP, are part of the Condor cluster.
- You can submit jobs from any workstation except the ssh login nodes.
- Condor is best suited for serial jobs, but parallel jobs with a few cores can also work.
- Make sure your jobs do not write to disk more often than once per minute.
- Don't write too often or in large amounts to the home directory from a Condor job, as this can fill up your home directory. If you need to dump a lot of data, use the /data or /work directories. More information about data storage can be found on the wiki.
- Condor evicts jobs if a user needs more resources on the computer where the job is running, or if another user has a higher priority (i.e., has used Condor less). Therefore, if your jobs have a long runtime, you need to implement checkpointing so a simulation can resume from the point where the job was evicted.
- Condor sends a SIGTERM (15) signal to a job before evicting it. You can catch that signal to perform checkpointing.
A summary of important commands for using HTCondor can be found in this quick start guide.
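Since Condor delivers SIGTERM before evicting a job, the job can install a handler that writes its state to disk and exits cleanly. Below is a minimal Python sketch; the file name checkpoint.json and the state dictionary are placeholders for your simulation's actual state, not part of any Condor API:

```python
import json
import os
import signal

CHECKPOINT = "checkpoint.json"  # hypothetical checkpoint file name
state = {"step": 0}             # placeholder for your simulation state

def write_checkpoint(signum, frame):
    # Called when HTCondor sends SIGTERM before eviction:
    # persist the current state, then exit cleanly.
    with open(CHECKPOINT, "w") as f:
        json.dump(state, f)
    raise SystemExit(0)

signal.signal(signal.SIGTERM, write_checkpoint)

# On startup, resume from a previous checkpoint if one exists.
if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT) as f:
        state = json.load(f)

# Main loop: each iteration stands in for one unit of simulation work.
while state["step"] < 1000:
    state["step"] += 1
```

After an eviction, resubmitting the same job lets it pick up from the saved step instead of starting over.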
Submitting a Job¶
To submit a simulation on Condor, you need a Condor job script, which you submit using the condor_submit <job script> command. You can pass additional options to condor_submit if you wish to customize the process.
Job Script¶
An example job script:
# Simple HTCondor submit description file
# Everything with a leading # is a comment; note that comments must be
# on their own line, not appended after a command

# Path to your executable file
executable = myexe
# Arguments you want to pass on the command line
arguments = arguments
# Path for the output file, which records the console output
output = OUTPUT
# Path for the error file, which records error messages
error = ERROR
# Path for the Condor log file
log = LOG
# Number of CPUs required for your job
request_cpus = 1
# RAM allocated to your job; you can also use G for GB, e.g. 1G instead of 1024M
request_memory = 1024M
# Disk space required for your job
request_disk = 10240K
# Queue the Condor job
queue
The HTCondor website explains this in more detail.
For submitting multiple jobs with varying arguments, you can use the following script:
# Condor submission file for running a program with different arguments
executable = <path_to_executable>
arguments = $(ARG)
request_memory = 1G
request_cpus = 1
# Specify the output, error, and log files
output = output_$(ARG).out
error = error_$(ARG).err
log = log_$(Cluster).log
# Queue jobs with different arguments
queue ARG from args.txt
It loads arguments from the file args.txt, which must be in the directory from which you submit the job; otherwise, you need to specify the full path. The args.txt file looks like this:
argument1
argument2
argument3
.
.
.
argumentN
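If the argument list is short and fixed, the separate args.txt file can be avoided: the queue statement also accepts an inline list (a sketch using the queue ... in syntax from the HTCondor manual; the argument names are placeholders):

```
# Equivalent inline form; each item produces one job
queue ARG in (argument1 argument2 argument3)
```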
Monitoring a Simulation¶
As soon as you submit a job to the cluster, a job ID appears on the screen. You can track your job with condor_q.

- condor_q shows information on jobs submitted by you.
- condor_q -g shows information on jobs submitted by all users.
- condor_q -r shows information on jobs submitted on the current machine.
- condor_q -run shows on which machines your jobs are running.

Refer to condor_q for more details.
Removing a Job¶
To remove a job from the cluster queue, use condor_rm <job_id/user_name>. You can retrieve your <job_id> by running condor_q. condor_rm <user_name> removes all jobs submitted by that user.
Useful Commands¶
- condor_status: shows the status of all machines in the Condor cluster.
- condor_status -constraint 'State == "<State>"': filters machines by their state: Claimed, Unclaimed, or Drained.
- condor_userprio: shows the current resource utilization of each user and gives an idea of the free resources on the Condor cluster.
- condor_drain <machine_name>: drains all Condor jobs on a particular machine. Use it only if Condor jobs are inhibiting tasks on your own machine.
- condor_drain -cancel <machine_name>: un-drains a machine. Only un-drain other machines if you are certain that no one is using them.
- condor_vacate: evicts all Condor jobs on your machine. Unlike draining, this is not permanent and allows other Condor jobs to run immediately.
Additional Notes¶
To run your job on a specific machine on the ICP cluster, add the following to your job script:
requirements = (machine == "<machine_name>.icp.uni-stuttgart.de")
Similarly, to avoid a specific machine on the ICP cluster, add:
requirements = (machine != "<machine_name>.icp.uni-stuttgart.de")
For a list of options, refer to the requirements section in the manual.
For MPI parallel simulations, you can request several processor cores:
executable = /usr/bin/mpirun
arguments = -np <n-process> <path_to_your_program> <arguments_to_your_program>
request_cpus = <n-process>
Note: For jobs with more than ~8 cores, it is better to use Ant, Bee, or one of the other clusters.
To request GPUs for running simulations, add the following line to your job script:
# Some machines also have 2 GPUs
request_GPUs = 1
Refer to the GPU page to get more information.
It is also possible to combine multiple conditions in the job script. Visit the conditionals page for more information.
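Requirements clauses combine with the usual ClassAd boolean operators. A sketch (the machine name is a placeholder, and Memory is the machine's RAM in MiB):

```
# Avoid one machine and require at least 16 GiB of memory
requirements = (machine != "<machine_name>.icp.uni-stuttgart.de") && (Memory >= 16384)
```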