Submitting a Large Cluster


We will now use some of the more advanced features of the submit description file to submit a large cluster of jobs.

First, change into a directory with an example job:

% cd ~/examples/large/one_dir

This example will use a cluster of 50 jobs, each with a different command line argument and a different output file. First, take a look at the submit file:

% cat one_dir.submit

There are several interesting things in the file:

  1. We are using a single log file for the entire cluster
  2. The use of "$(Process)" in specifying the filename for STDOUT
  3. The use of "$(Process)" in specifying the command-line argument for each job
  4. The use of "Notification = Never" to disable email for this cluster
  5. The "Queue 50" line to specify we want 50 jobs
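Putting those points together, the submit description file probably resembles the following sketch (a hypothetical reconstruction, not the exact file; the executable and file names are inferred from the rest of this example, and the standard universe is assumed because the program is linked with condor_compile):

```text
# one_dir.submit -- hypothetical sketch based on the points above
Universe     = standard
Executable   = large.condor
Arguments    = $(Process)
Output       = large.out.$(Process)
Log          = large.log
Notification = Never
Queue 50
```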

Now, build the large.condor program using make:

% make

This will compile the program and link it with condor_compile, producing large.condor. The program simply prints the command-line argument it is given to STDOUT.

Now, we can submit the cluster:

% condor_submit one_dir.submit

You will see that 50 jobs were submitted to your cluster. Remember the cluster number it tells you; you'll need it in a little while. Now you can monitor the progress of your jobs with condor_q and by viewing the User Log:

% condor_q
% less large.log

Now, we'll raise the job priority of some jobs so they execute sooner. (Note: replace each instance of "cluster" in the command below with the number of the cluster you actually submitted.)

% condor_prio -p 19 cluster.48 cluster.49 

This will give processes 48 and 49 a higher priority, so they should be the next ones to run. You can see the higher priority and watch them execute faster with condor_q:

% condor_q

As jobs complete, you can examine their output in the appropriate large.out.* file. In each one, the data in the file should be the same as the number in the filename.

% cat large.out.48
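To check all of the output files at once, a short shell loop will do it (a hypothetical convenience, not part of the example; it assumes the large.out.* files are in the current directory):

```shell
# Check every large.out.N file: its contents should equal N.
for f in large.out.*; do
    n=${f#large.out.}    # strip the prefix to get the process number
    if [ "$(cat "$f")" = "$n" ]; then
        echo "$f: OK"
    else
        echo "$f: MISMATCH"
    fi
done
```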

Keep running condor_q until all the jobs in your cluster complete. If you're tired of waiting, you can remove the remaining jobs in your cluster with "condor_rm -a".

You might notice there are a lot of files in your directory. This can be hard to manage, particularly with hundreds or thousands of jobs instead of just 50. The next example solves this.


Now, we'll submit another large cluster, this time using a directory hierarchy to make things more manageable, instead of one big directory. Each job will run in its own directory.

First, change into the new example directory:

% cd ~/examples/large/many_dirs

Now, examine the new submit file:

% cat many_dirs.submit

There are several interesting things in the file:

  1. The use of "$(Process)" in specifying the Initialdir, the directory that each job uses when it starts up.
  2. Once again, we are using a single log file for the entire cluster, but notice that we have to use "../" because each job is running in its own subdirectory
  3. We are using the exact same large.condor program from before. Notice that we must use two levels of "../" because Executable is found relative to Initialdir: one "../" gets from a job's subdirectory up to the "many_dirs" directory, and a second gets up to the "large" directory, from which "one_dir/large.condor" is the correct path.
  4. We don't have to specify separate filenames for the output files, since a large.out file will appear in each subdirectory.
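Based on those points, the submit file probably looks something like this (again a hypothetical sketch; the log filename is an assumption, and the standard universe is assumed as before):

```text
# many_dirs.submit -- hypothetical sketch based on the points above
Universe     = standard
Executable   = ../../one_dir/large.condor
Arguments    = $(Process)
Initialdir   = $(Process)
Output       = large.out
Log          = ../large.log
Notification = Never
Queue 50
```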

Next, we must create the directory tree for all the jobs. Someday, Condor will be able to do this for you, but for now, you must do it yourself. There's a simple perl script that will create the 50 directories:

% cat setup_directories
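The script's contents aren't reproduced here, but a shell equivalent (a hypothetical sketch; the actual script is written in Perl) could be as simple as:

```shell
#!/bin/sh
# Hypothetical shell equivalent of setup_directories: create one
# subdirectory per job, named 0 through 49 to match $(Process).
for i in $(seq 0 49); do
    mkdir -p "$i"
done
```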

Now, just run this script:

% ./setup_directories

Finally, we can submit the cluster to Condor:

% condor_submit many_dirs.submit

Again, you can monitor the progress of your cluster with condor_q:

% condor_q

As jobs complete, you can examine the output in each subdirectory:

% cat 0/large.out
% cat 1/large.out
...

You can leave the cluster running while we continue with the presentation, or you can remove all your jobs with "condor_rm -a".

That's it, we're done!