HTCondor's parallel universe supports a wide variety of parallel programming environments, and it encompasses the execution of MPI jobs. It supports jobs which need to be co-scheduled. A co-scheduled job has more than one process that must be running at the same time on different machines to work correctly. The parallel universe supersedes the mpi universe. The mpi universe eventually will be removed from HTCondor.
HTCondor must be configured such that resources (machines) running parallel jobs are dedicated. Note that dedicated has a very specific meaning in HTCondor: dedicated machines never vacate their executing HTCondor jobs, should the machine's interactive owner return. This is implemented by running a single dedicated scheduler process on a machine in the pool, which becomes the single machine from which parallel universe jobs are submitted. Once the dedicated scheduler claims a dedicated machine for use, the dedicated scheduler will try to use that machine to satisfy the requirements of the queue of parallel universe or MPI universe jobs. If the dedicated scheduler cannot use a machine for a configurable amount of time, it will release its claim on the machine, making it available again for the opportunistic scheduler.
Since HTCondor does not ordinarily run this way, (HTCondor usually uses opportunistic scheduling), dedicated machines must be specially configured. Section 3.12.8 of the Administrator's Manual describes the necessary configuration and provides detailed examples.
To simplify the scheduling of dedicated resources, a single machine becomes the scheduler of dedicated resources. This leads to a further restriction that jobs submitted to execute under the parallel universe must be submitted from the machine acting as the dedicated scheduler.
Given correct configuration, parallel universe jobs may be submitted from the machine running the dedicated scheduler. The dedicated scheduler claims machines for the parallel universe job, and invokes the job when the correct number of machines of the correct platform (architecture and operating system) are claimed. Note that the job likely consists of more than one process, each to be executed on a separate machine. The first process (machine) invoked is treated different than the others. When this first process exits, HTCondor shuts down all the others, even if they have not yet completed their execution.
An overly simplified submit description file for a parallel universe job appears as
############################################# ## submit description file for a parallel program ############################################# universe = parallel executable = /bin/sleep arguments = 30 machine_count = 8 queue
This job specifies the universe as parallel, letting HTCondor know that dedicated resources are required. The machine_count command identifies the number of machines required by the job.
When submitted, the dedicated scheduler allocates eight machines with the same architecture and operating system as the submit machine. It waits until all eight machines are available before starting the job. When all the machines are ready, it invokes the /bin/sleep command, with a command line argument of 30 on all eight machines more or less simultaneously.
The addition of several related OpSys attributes means that you may specify versions of Linux operating systems to run on in a heterogeneous pool.
If your pool consists of Linux machines installed with the RedHat and Ubuntu operating systems, and you'd like to run on only the RedHat machines, use the following example.
############################################# ## submit description file for a parallel program targeting RedHat machines ############################################# universe = parallel executable = /bin/sleep arguments = 30 machine_count = 8 requirements = (OpSysName == "RedHat") queue
In addition, you may narrow down your machine selection to the version you'd like to run on using the OpSysAndVer attribute.
############################################# ## submit description file for a parallel program targeting RedHat 6 machines ############################################# universe = parallel executable = /bin/sleep arguments = 30 machine_count = 8 requirements = (OpSysAndVer == "RedHat6") queue
A more realistic example of a parallel job utilizes other features.
###################################### ## Parallel example submit description file ###################################### universe = parallel executable = /bin/cat log = logfile input = infile.$(NODE) output = outfile.$(NODE) error = errfile.$(NODE) machine_count = 4 queue
The specification of the input, output, and error files utilize the predefined macro $(NODE). See the condor_submit manual page on page for further description of predefined macros. The $(NODE) macro is given a unique value as processes are assigned to machines. The $(NODE) value is fixed for the entire length of the job. It can therefore be used to identify individual aspects of the computation. In this example, it is used to utilize and assign unique names to input and output files.
This example presumes a shared file system across all the machines claimed for the parallel universe job. Where no shared file system is either available or guaranteed, use HTCondor's file transfer mechanism, as described in section 2.5.4 on page . This example uses the file transfer mechanism.
###################################### ## Parallel example submit description file ## without using a shared file system ###################################### universe = parallel executable = /bin/cat log = logfile input = infile.$(NODE) output = outfile.$(NODE) error = errfile.$(NODE) machine_count = 4 should_transfer_files = yes when_to_transfer_output = on_exit queue
The job requires exactly four machines, and queues four processes. Each of these processes requires a correctly named input file, and produces an output file.
The different machines executing for a parallel universe job may specify different machine requirements. A common example requires that the head node execute on a specific machine. It may be also useful for debugging purposes.
Consider the following example.
###################################### ## Example submit description file ## with multiple procs ###################################### universe = parallel executable = example machine_count = 1 requirements = ( machine == "machine1") queue requirements = ( machine =!= "machine1") machine_count = 3 queue
The dedicated scheduler allocates four machines. All four executing jobs have the same value for $(Cluster) macro. The $(Process) macro takes on two values; the value 0 will be assigned for the single executable that must be executed on machine1, and the value 1 will be assigned for the other three that must be executed anywhere but on machine1.
Carefully consider the ordering and nature of multiple sets of requirements in the same submit description file. The scheduler matches jobs to machines based on the ordering within the submit description file. Mutually exclusive requirements eliminate the dependence on ordering within the submit description file. Without mutually exclusive requirements, the scheduler may be unable to schedule the job. The ordering within the submit description file may preclude the scheduler considering the specific allocation that could satisfy the requirements.
MPI applications utilize a single executable that is invoked in order to execute in parallel on one or more machines. HTCondor's parallel universe provides the environment within which this executable is executed in parallel. However, the various implementations of MPI (for example, LAM or MPICH) require further framework items within a system-wide environment. HTCondor supports this necessary framework through user visible and modifiable scripts. An MPI implementation-dependent script becomes the HTCondor job. The script sets up the extra, necessary framework, and then invokes the MPI application's executable.
HTCondor provides these scripts in the $(RELEASE_DIR)/etc/examples directory. The script for the LAM implementation is lamscript. The script for the MPICH implementation is mp1script. Therefore, an HTCondor submit description file for these implementations would appear similar to:
###################################### ## Example submit description file ## for MPICH 1 MPI ## works with MPICH 1.2.4, 1.2.5 and 1.2.6 ###################################### universe = parallel executable = mp1script arguments = my_mpich_linked_executable arg1 arg2 machine_count = 4 should_transfer_files = yes when_to_transfer_output = on_exit transfer_input_files = my_mpich_linked_executable queue
or
###################################### ## Example submit description file ## for LAM MPI ###################################### universe = parallel executable = lamscript arguments = my_lam_linked_executable arg1 arg2 machine_count = 4 should_transfer_files = yes when_to_transfer_output = on_exit transfer_input_files = my_lam_linked_executable queue
The executable is the MPI implementation-dependent script. The first argument to the script is the MPI application's executable. Further arguments to the script are the MPI application's arguments. HTCondor must transfer this executable; do this with the transfer_input_files command.
For other implementations of MPI, copy and modify one of the given scripts. Most MPI implementations require two system-wide prerequisites. The first prerequisite is the ability to run a command on a remote machine without being prompted for a password. ssh is commonly used, but other commands may be used. The second prerequisite is an ASCII file containing the list of machines that may utilize ssh. These common prerequisites are implemented in a further script called sshd.sh. sshd.sh generates ssh keys (to enable password-less remote execution), and starts an sshd daemon. The machine name and MPI rank are given to the submit machine.
The sshd.sh script requires the definition of two HTCondor configuration variables. Configuration variable CONDOR_SSHD is an absolute path to an implementation of sshd. sshd.sh has been tested with openssh version 3.9, but should work with more recent versions. Configuration variable CONDOR_SSH_KEYGEN points to the corresponding ssh-keygen executable.
Scripts lamscript and mp1script each have their own idiosyncrasies. In mp1script, the PATH to the MPICH installation must be set. The shell variable MPDIR indicates its proper value. This directory contains the MPICH mpirun executable. For LAM, there is a similar path setting, but it is called LAMDIR in the lamscript script. In addition, this path must be part of the path set in the user's .cshrc script. As of this writing, the LAM implementation does not work if the user's login shell is the Bourne or compatible shell.
These MPI jobs operate as all parallel universe jobs do. The default policy is that when the first node exits, the whole job is considered done, and HTCondor kills all other running nodes in that parallel job. Alternatively, a parallel universe job that sets the attribute
+ParallelShutdownPolicy = "WAIT_FOR_ALL"in its submit description file changes the policy, such that HTCondor will wait until every node in the parallel job has completed to consider the job finished.