.. _HTCondor:

HTCondor
========

On the ICP workstations, we use HTCondor to efficiently manage high-throughput computing (HTC) jobs that require extended runtimes. HTCondor allows users to submit jobs to a centralized scheduler, which queues and distributes them across the available resources.

Things to remember
------------------

- HTCondor is a system for running many jobs (high-throughput computing) on unused ICP compute resources.
- All workstations at the ICP, including the CIP pool, are part of the Condor cluster.
- You can submit jobs from any workstation except the SSH login nodes.
- Condor is best suited for serial jobs, but parallel jobs with a few cores can also work.
- Make sure your jobs do not write to disk more often than once per minute.
- Do not write too often, or in large amounts, to your home directory from a Condor job, as this can fill it up. If you need to dump a lot of data, use the ``/data`` or ``/work`` directories. More information about data storage can be found `on the wiki `_.
- Condor evicts jobs if a user needs more resources on the computer where the job is running, or if another user has a higher priority (i.e., has used Condor less). Therefore, if your jobs have a long runtime, you need to implement `checkpointing `_ so that the simulation can resume from the point at which the job was evicted.
- Condor sends the ``SIGTERM (15)`` signal to a job before evicting it. You can catch that signal to perform checkpointing.

A summary of the most important commands for using HTCondor can be found in this `quick start guide `_.

Submitting a Job
----------------

To submit a simulation to Condor, you need a Condor job script, which you submit with ``condor_submit <submit_file>``. You can add further options to `condor_submit `_ if you wish to customize the process.

Job Script
----------

An example job script:

.. code-block:: none

    # Simple HTCondor submit description file
    # Everything with a leading # is a comment
    executable = myexe         # Path to your executable file
    arguments = arguments      # Arguments you want to pass on the command line
    output = OUTPUT            # Path for the output file, which records the console output
    error = ERROR              # Path for the error file, which records error signals
    log = LOG                  # Path for the Condor log file
    request_cpus = 1           # Number of CPUs required for your job
    request_memory = 1024M     # RAM allocated to your job; you can also use G for GB, e.g. 1G instead of 1024M
    request_disk = 10240K      # Disk space required for your job
    queue                      # Queues the Condor job

The HTCondor website explains this in more `detail `_.

For submitting multiple jobs with varying arguments, you can use the following script:

.. code-block:: none

    # Condor submission file for running a program with different arguments
    executable = <your_executable>
    arguments = $(ARG)
    request_memory = 1G
    request_cpus = 1

    # Specify the output, error, and log files
    output = output_$(ARG).out
    error = error_$(ARG).err
    log = log_$(Cluster).log

    # Queue jobs with different arguments
    queue ARG from args.txt

It loads the arguments from the file **args.txt**, which must be in the directory from which you submit the job; otherwise, you need to specify its full path. The **args.txt** file looks like this:

.. code-block:: none

    argument1
    argument2
    argument3
    .
    .
    .
    argumentN

Monitoring a Simulation
-----------------------

As soon as you submit a simulation to the cluster, a **job ID** appears on the screen. You can track your job with ``condor_q``:

- ``condor_q`` shows information on jobs submitted by you.
- ``condor_q -g`` shows information on jobs submitted by all users.
- ``condor_q -r`` shows information on jobs submitted from the current machine.
- ``condor_q -run`` shows the machines on which your jobs are running.

Refer to `condor_q `_ for more details.

Removing a Job
--------------

To remove a job from the cluster queue, use ``condor_rm <job_id>``.
You can retrieve your job ID by running ``condor_q``. ``condor_rm <username>`` removes all jobs submitted by that user.

Useful Commands
---------------

- ``condor_status``: shows the status of all machines in the Condor cluster.
- ``condor_status -constraint 'State == "<state>"'``: filters machines by their state: ``Claimed``, ``Unclaimed``, or ``Drained``.
- ``condor_userprio``: shows the current resource utilization of each user and gives an idea of the free resources on the Condor cluster.
- ``condor_drain <machine>``: drains all Condor jobs on a particular machine. Use this only if Condor jobs are inhibiting tasks on your own machine.
- ``condor_drain -cancel <machine>``: un-drains a machine. Only un-drain other machines if you are certain that no one is using them.
- ``condor_vacate``: evicts all Condor jobs on your machine. Unlike draining, this is not permanent and allows other Condor jobs to run immediately.

Additional Notes
----------------

- To run your job on a specific machine of the ICP cluster, add the following to your job script:

  .. code-block:: none

      requirements = (machine == "<hostname>.icp.uni-stuttgart.de")

- Similarly, to avoid a specific machine of the ICP cluster, add:

  .. code-block:: none

      requirements = (machine != "<hostname>.icp.uni-stuttgart.de")

  For a list of options, refer to the `requirements section in the manual `_.

- For MPI-parallel simulations, you can request several processor cores:

  .. code-block:: none

      executable = /usr/bin/mpirun
      arguments = -np <number_of_cores> <your_program>
      request_cpus = <number_of_cores>

  Note: for jobs with more than ~8 cores, it is better to use Ant, Bee, or one of the other clusters.

- To request GPUs for running simulations, add the following line to your job script:

  .. code-block:: none

      request_GPUs = 1    # Some machines also have 2 GPUs

  Refer to the `GPU page `_ for more information.

- It is possible to use multiple conditions in the job script. Visit the `conditionals page `_ for more information.
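The multiple-condition case mentioned in the last note can also be expressed directly in a combined ``requirements`` expression using the ClassAd operators ``&&`` and ``||``. This is a minimal sketch; the hostnames are placeholders, not real ICP machines:

.. code-block:: none

    # Run on either of two machines, but never on a third (hostnames are placeholders)
    requirements = ((machine == "<host1>.icp.uni-stuttgart.de") || (machine == "<host2>.icp.uni-stuttgart.de")) && (machine != "<host3>.icp.uni-stuttgart.de")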
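The checkpointing advice under *Things to remember* can be sketched in Python: Condor sends ``SIGTERM`` before evicting a job, so the job can catch that signal, save its state, and resume from the saved state on the next start. This is a minimal sketch; the file name ``checkpoint.dat`` and the stored state (a single step counter) are illustrative assumptions, not ICP conventions:

.. code-block:: python

    import signal
    import sys

    CHECKPOINT = "checkpoint.dat"  # illustrative checkpoint file name

    def write_checkpoint(step):
        # Dump the minimal state needed to resume (here: just the step number).
        with open(CHECKPOINT, "w") as f:
            f.write(str(step))

    def handle_sigterm(signum, frame):
        # Condor delivers SIGTERM before eviction: save state and exit cleanly.
        write_checkpoint(current_step)
        sys.exit(0)

    signal.signal(signal.SIGTERM, handle_sigterm)

    # Resume from the checkpoint if one exists, otherwise start from step 0.
    current_step = 0
    try:
        with open(CHECKPOINT) as f:
            current_step = int(f.read())
    except FileNotFoundError:
        pass

    for current_step in range(current_step, 100):
        pass  # one simulation step would go here

After eviction and a restart, the loop then continues from the saved step instead of repeating work that was already done.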