If you want to use Globus Toolkit 2 to run hundreds or even thousands of jobs:
Make sure the system limits on the submit machine (where the gahp runs)
and on the Globus gatekeeper machine are set nice and high.
See our scalability document for details.
The Globus middleware can use a large number of descriptors and
IP ports. Several default Linux settings,
such as the maximum number of descriptors per process and the local port range,
will need to be increased.
Make sure that a jobmanager named jobmanager-fork (exactly this!) is available
on your gatekeeper, so that Condor-G can run a grid monitor process.
So for example, if you submit to a gatekeeper at foo.cs.wisc.edu, ensure that
foo.cs.wisc.edu/jobmanager-fork
is an available resource. It cannot be
foo.cs.wisc.edu/jobmanager or
foo.cs.wisc.edu/fork.
Use a version of Condor-G that includes the grid_monitor
(another good reason to keep your VDT installation up to date).
Try the command
condor_config_val grid_monitor
to check that your version has the grid monitor process. Also,
check that the grid_monitor is enabled, which you
can do with the following command:
condor_config_val enable_grid_monitor
This should print TRUE.
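The checklist above can be sketched as a small preflight script. This is only an illustration: the helper names (parse_port_range, uses_fork_jobmanager, config_says_true, grid_monitor_enabled, preflight) are made up for this example, the /proc path is Linux-specific, and the case-insensitive match on TRUE is an assumption. The name check only inspects the resource string; it does not contact the gatekeeper.

```python
import resource
import shutil
import subprocess
from pathlib import Path

# Linux-specific location of the local (ephemeral) port range.
PORT_RANGE_FILE = Path("/proc/sys/net/ipv4/ip_local_port_range")


def parse_port_range(text):
    """Parse the 'low high' pair stored in ip_local_port_range."""
    low, high = (int(field) for field in text.split())
    return low, high


def uses_fork_jobmanager(resource_name):
    """True only if the service part is exactly 'jobmanager-fork'."""
    host, _, service = resource_name.partition("/")
    return bool(host) and service == "jobmanager-fork"


def config_says_true(output):
    """Interpret condor_config_val output for a boolean setting.

    Assumption: the value may be printed in any letter case.
    """
    return output.strip().upper() == "TRUE"


def grid_monitor_enabled():
    """Run condor_config_val; return None if it is missing or fails."""
    if shutil.which("condor_config_val") is None:
        return None
    result = subprocess.run(
        ["condor_config_val", "enable_grid_monitor"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return None
    return config_says_true(result.stdout)


def preflight(resource_name):
    """Print the settings named in the checklist for this machine."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"descriptors per process: soft={soft} hard={hard}")
    if PORT_RANGE_FILE.exists():
        low, high = parse_port_range(PORT_RANGE_FILE.read_text())
        print(f"local port range: {low}-{high}")
    else:
        print("local port range: not available on this system")
    ok = uses_fork_jobmanager(resource_name)
    print(f"{resource_name} names the fork jobmanager: {ok}")
    print(f"grid monitor enabled: {grid_monitor_enabled()}")


preflight("foo.cs.wisc.edu/jobmanager-fork")
```

The limits it prints apply only to the machine it runs on, so it would need to be run on both the submit machine and the gatekeeper.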
Normally, for every job submitted to Globus, the gatekeeper will fork off a Globus jobmanager. This jobmanager forks multiple Perl plus shell processes every 10 seconds that probe the local scheduler (LSF, Condor, whatever) for the status of the job. If you submit a couple hundred or more jobs, you will notice the cost of this architecture: the gatekeeper has to fork and reap a very large number of short-lived processes.
The above then cascades into many problems, such as timeouts. The timeouts occur because the gatekeeper stops responding, either because it is swapping or because the local scheduler cannot keep up with all the requests. The really nasty part is that a jobmanager exists even while its job is pending in the queue. So, if you submit 2000 jobs to Globus and Globus then submits them into LSF/Condor/PBS, where they sit idle in a queue, your gatekeeper will still be swamped, because 2000 jobmanager processes will result in an amazing number of processes being forked and reaped over and over (more than 30000 per minute?).
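A rough back-of-the-envelope version of that estimate, as a sketch. The text only says "multiple" processes per probe, so processes_per_poll is an assumed parameter, and the function name is invented for this example.

```python
def forks_per_minute(jobmanagers, processes_per_poll, poll_interval_s=10):
    """Estimate how many probe processes the gatekeeper forks per minute.

    processes_per_poll is an assumption; the exact number of Perl and
    shell processes per probe is not stated.
    """
    polls_per_minute = 60 / poll_interval_s
    return jobmanagers * processes_per_poll * polls_per_minute


# 2000 jobmanagers, each probing every 10 seconds.
for per_poll in (2, 3):
    rate = forks_per_minute(2000, per_poll)
    print(f"{per_poll} processes per poll: {rate:.0f} forks per minute")
```

With three processes per probe the model gives 36000 forks per minute, which is in line with the "more than 30000" guess above.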
To help with these scalability problems at the Globus gatekeeper, recent versions of Condor-G introduced the grid monitor. It significantly helps the scalability problem at the gatekeeper by doing the following:
Condor-G starts one process, the grid_monitor, via the fork jobmanager to watch
over all jobs in the local scheduler for a given user.
Condor-G holds off running a jobmanager for a job until the grid_monitor
informs Condor-G that the job has started running.
Thus, if you submit 2000 jobs, but only have 100 running at a time,
you will only have 100 jobmanager processes
running, instead of 2000.
Even better: if you tell Condor-G you do not
need real-time streaming of stdout and stderr,
then Condor-G will only restart
a jobmanager when the job completes (to retrieve the job's output).
To do this, in your Condor-G submit file, include:
stream_output = false
stream_error = false
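The effect of the grid monitor and of the streaming settings can be summarized with a toy model. This is a simplification for illustration, not Condor-G's actual logic: the function name is hypothetical, and it counts only long-lived jobmanagers, ignoring the short-lived ones that run when a job is submitted or when its output is retrieved.

```python
def resident_jobmanagers(submitted, running, grid_monitor=True, streaming=True):
    """Rough count of long-lived jobmanagers on the gatekeeper.

    Simplified model: short-lived jobmanagers (at submission or when
    output is retrieved on completion) are not counted.
    """
    if not grid_monitor:
        # One jobmanager per submitted job, even while it sits idle.
        return submitted
    if streaming:
        # Jobmanagers only for jobs that have started running.
        return running
    # No streaming: a jobmanager is restarted only when a job completes.
    return 0


print(resident_jobmanagers(2000, 100, grid_monitor=False))
print(resident_jobmanagers(2000, 100))
print(resident_jobmanagers(2000, 100, streaming=False))
```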
In addition, a patch to the Globus jobmanager makes it use the job status information from the grid monitor, if running, instead of forking a bunch of processes every 10 seconds. This patch has been submitted to Globus as Bugzilla #802; see that report for details. Recent versions of the VDT also include this patch.
Realize that the Globus GRAM folks are aware of all these gatekeeper scalability issues. Many of the architecture design decisions in Globus Toolkit 3 were made with the hard life lessons above in mind. Globus is working to ensure that GT3 will ultimately not suffer similar problems.