Condor - High Throughput Computing


The Golden Rules

(when submitting LOTS of jobs to Globus)

If you want to use Globus Toolkit 2 to run hundreds or even thousands of jobs:

  1. Submit your jobs using Condor-G.
  2. If you are planning to have many jobs submitted at once (where many > 250 or so), make certain that Linux OS limits on both the Condor-G client (where the gahp runs) and on the Globus gatekeeper machine are set nice and high. See our scalability document for details. The Globus middleware can use a large number of descriptors and IP ports. Several default Linux settings, such as the maximum number of descriptors per process and the local port range, will need to be increased.
  3. Make certain your Gatekeeper machine is running a recent version of Globus as installed by the VDT, as the VDT includes some patches critical to scalability (like the patch in Globus Bugzilla #802).
  4. Make certain that jobmanager-fork (exactly this!) is available on your gatekeeper, so that Condor-G can run a grid monitor process. So for example, if you submit to a gatekeeper at foo.cs.wisc.edu, ensure that
    
        foo.cs.wisc.edu/jobmanager-fork
    
    
    is an available resource. It cannot be foo.cs.wisc.edu/jobmanager or foo.cs.wisc.edu/fork.
  5. Make certain you are running a recent version of Condor-G that includes the grid_monitor (another good reason to keep your VDT installation up to date). Try the command
    
         condor_config_val grid_monitor
    
    
    to check that your version has the grid monitor process. Also, check that the grid_monitor is enabled, which you can do with the following command:
    
        condor_config_val enable_grid_monitor
    
    
    This should print TRUE.

Further Explanation

Normally, for every job submitted to Globus, the gatekeeper will fork off a globus jobmanager. This jobmanager forks multiple Perl plus shell processes every 10 seconds that probe the local scheduler (lsf, Condor, whatever) for the status of the job. If you submit a couple hundred or more jobs, you will notice this architecture :

  1. causes the load average on your gatekeeper to skyrocket
  2. can cause heavy swapping on your gatekeeper machine (because Perl is interpreted, much of the space of a Perl process is private data and thus cannot be shared across many invocations of the same Perl script)
  3. can pound your local scheduler mercilessly with thousands of job status requests per minute.

The above then cascades into many problems, such as timeouts. The time outs are due to the gatekeeper not responding, because it is swapping or because the local scheduler cannot keep up with all the requests. The really nasty part of this is that a jobmanager exists even when a job is pending within the queue. So, if you submit 2000 jobs to Globus and then Globus submits them into LSF/Condor/PBS where they sit idle in a queue, your gatekeeper will still be swamped because 2000 jobmanager processes will result in an amazing number of processes being forked and reaped over and over (more than 30000 per minute?).

To help with these scalability problems at the Globus gatekeeper, recent versions of Condor-G introduced the grid monitor. It significantly helps the scalability problem at the gatekeeper by doing the following:

  1. Condor-G submits one process to the gatekeeper, called the grid_monitor, via the fork jobmanager to watch over all jobs in the local scheduler for a given user.
  2. When Condor-G submits a job, as soon as Globus reports that the job is pending, Condor-G tells the jobmanager to exit. Condor-G will then restart the jobmanager when the grid_monitor informs Condor-G that the job has started running. Thus, if you submit 2000 jobs, but only have 100 running at a time, you will only have 100 jobmanager processes running, instead of 2000. Even better: if you tell Condor-G you do not need real-time streaming of stdout and stderr, then Condor-G will only restart a jobmanager when the job completes (to retrieve the job's output). To do this, in your Condor-G submit file, include:
    
        stream_output = false
        stream_error = false
    
    

In addition, a patch to the Globus jobmanager itself uses the job status information from the grid monitor, if running, instead of forking a bunch of processes every 10 seconds. This patch is submitted to Globus as part of bugzilla #802; see Bugzilla #802 for details. Also, recent versions of the VDT include this patch.

Realize that the Globus GRAM folks are aware of all these gatekeeper scalability issues. Many of the architecture design decisions in Globus Toolkit 3 were made with the hard life lessons above in mind. Globus is working to ensure that GT3 will ultimately not suffer similar problems.


Condor home page   -   condor-admin@cs.wisc.edu