Tricks & Tips for Managing Large Condor Pools

Basic Guidelines for Large Condor Pools

  1. Put central manager (collector + negotiator) on a dedicated machine with no other significant duties.
  2. If possible, use multiple schedds.
  3. Put a busy schedd's spool directory on a fast disk with little else using it.
  4. If running a lot of big standard universe jobs, set up multiple checkpoint servers, rather than doing all checkpointing onto the submit node.
  5. Try to limit the frequency of condor_q hitting the schedd, especially prior to 6.9.3. One good way to do this is to use Quill to offload the work of responding to these queries. If you do use Quill, make sure that you put Quill's database on its own disk--not the same disk as the schedd's spool.
  6. If you are not using strong security (i.e. just host IP authorization) in your Condor pool, then you can turn off security negotiation to reduce overhead:
    SEC_DEFAULT_NEGOTIATION = OPTIONAL
    
  7. If you are not using condor_history (or any other means of reading the job history, including history in Quill), turn it off to reduce overhead:
    HISTORY = 
    
  8. If you do not allow preemption by user priority or machine rank expression in your pool (i.e. not just preventing job killing with MaxJobRetirementTime, but completely disallowing claims from being preempted), then you can reduce overhead in the negotiator:
    NEGOTIATOR_CONSIDER_PREEMPTION = False
    
  9. Be aware that the schedd will only start JOB_START_COUNT jobs every JOB_START_DELAY seconds. The default (as of 6.8.X and 6.9.2) is 1 job every 2 seconds. Increasing JOB_START_DELAY helps spread out job starts so that the jobs do not all finish and stage out files at the same time. However, it also means that it can take a long time to begin running a large batch of jobs. To run jobs as fast as possible, set JOB_START_DELAY=0 (see the example after this list).
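
For example, a minimal sketch of the two extremes (the values here are illustrative, not recommendations):

# Throttle: start at most 1 job every 5 seconds, spreading out job starts
# (and, later, job completions and file stage-out)
JOB_START_COUNT = 1
JOB_START_DELAY = 5

# Or start jobs as fast as possible, with no delay between batches:
# JOB_START_DELAY = 0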

Rough Estimations of System Requirements

Example calculations:

Absolute minimum RAM requirements for schedd with up to 10,000 jobs in the queue and up to 2,000 jobs running: 10000*0.01MB + 2000*0.5MB = 1.1GB

Absolute minimum RAM requirements for central manager machine (collector+negotiator) with 5000 batch slots: 2*5000*0.01MB = 100MB

Realistically, you will want to add in at least another 500MB or so to the above numbers. And if you do have other processes running on your submit or central manager machines, you will need extra resources for those.
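
As a quick sanity check, the same arithmetic can be done from the shell, using the rough per-job figures above (0.01MB per queued job, 0.5MB per running job) plus the 500MB of headroom:

# Rough schedd RAM estimate in MB for 10,000 queued and 2,000 running jobs
$ echo "10000 * 0.01 + 2000 * 0.5 + 500" | bc
1600.00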

Also remember to provision a fast dedicated disk for the spool directory of very busy schedds.

Example Configuration Tuning for 6.8

# Increase various timeouts and housecleaning intervals:
SHADOW_TIMEOUT_MULTIPLIER=4
STARTER_TIMEOUT_MULTIPLIER=4
TOOL_TIMEOUT_MULTIPLIER=4
SCHEDD_MIN_INTERVAL=80
PERIODIC_EXPR_INTERVAL=400

# Turn off the security protocol handshake
# (only do this if you are just using the default host IP authorization)
SEC_DEFAULT_NEGOTIATION = OPTIONAL

# Avoid synchronization of checkpointing
PERIODIC_CHECKPOINT = $(LastCkpt) > ( 2 * $(HOUR) + \
      $RANDOM_CHOICE(0,10,20,30,40,50,60) * $(MINUTE) )

Monitoring Health of the Submit Machine

To see whether you are suffering from timeout tuning problems, search for "timeout reading" or "timeout writing" in your ShadowLog.
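
For example, a quick way to count such errors (assuming you run this from the schedd's log directory):

$ grep -c -e "timeout reading" -e "timeout writing" ShadowLog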

As a general indicator of health on a submit node, you can summarize the condor_shadow exit codes with a command like this:

$ grep 'EXITING WITH STATUS' ShadowLog | cut -d " " -f 8- | sort | uniq -c

  12099 EXITING WITH STATUS 100
     81 EXITING WITH STATUS 102
   2965 EXITING WITH STATUS 107
    239 EXITING WITH STATUS 108
    332 EXITING WITH STATUS 112

Meaning of common exit codes:
100 success (you want to see a majority of these)
101 evicted (checkpointed)
102 job killed
103 job dumped core
104 job exited with an exception
105 shadow out of memory
107 job evicted without checkpointing OR communication with starter failed, so requeue/reconnect job
108 failed to activate claim to startd (e.g. failed to connect)
111 failed to find checkpoint file
112 job should be put on hold; see HoldReason, HoldReasonCode, and HoldReasonSubCode to understand why
113 job should be removed (e.g. periodic_remove)

Monitoring Health of the Negotiator

Check the duration of the negotiation cycle:

$ grep "Negotiation Cycle" NegotiatorLog

5/3 07:37:35 ---------- Started Negotiation Cycle ----------
5/3 07:39:41 ---------- Finished Negotiation Cycle ----------
5/3 07:44:41 ---------- Started Negotiation Cycle ----------
5/3 07:46:59 ---------- Finished Negotiation Cycle ----------

If the cycle is taking a long time (e.g. longer than 5 minutes), then see if it is spending a lot of time on a particular user:

$ grep "Negotiating with" NegotiatorLog

5/3 07:53:12   Negotiating with osg_samgrid@hep.wisc.edu at ...
5/3 07:53:13   Negotiating with jherschleb@lmcg.wisc.edu at ...
5/3 07:53:13   Negotiating with camiller@che.wisc.edu at ...
5/3 07:53:14   Negotiating with malshe@cae.wisc.edu at ...

If a particular user is consuming a lot of time in the negotiator (e.g. job after job being rejected), then look at how well that user's jobs are getting "auto clustered". This auto clustering happens, for the most part, behind the scenes and helps improve the efficiency of negotiation by grouping equivalent jobs together.

You can see how the jobs are getting grouped together by looking at the job attribute AutoClusterID. Example:

$ condor_q -f "%s" AutoClusterID -f " %s" ClusterID -f ".%s\n" ProcID

1  649884.0
1  649885.0
50 650082.0

Jobs with the same AutoClusterID are in the same group for negotiation purposes. If you see that many small groups are being created, take a look at the attribute AutoClusterAttrs. This will tell you what attributes are being used to group jobs together. All jobs in a group have identical values for these attributes. In some cases, it may be necessary to tweak the way a particular attribute is being rounded. See SCHEDD_ROUND_ATTR in the manual for more information on that.
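
For example, a sketch of rounding one attribute (ImageSize is just an illustrative choice; pick whichever attribute is fragmenting your autoclusters):

# Round ImageSize up to the nearest 25% of its value, so jobs with slightly
# different image sizes land in the same autocluster
SCHEDD_ROUND_ATTR_ImageSize = 25%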

Monitoring Health of the Collector

If the collector can't keep up with the ClassAd updates that it is receiving from the Condor daemons in the pool, and you are using UDP updates (the default), then it will "drop" updates. The consequence of dropped updates is stale information about the state of the pool, and possibly machines appearing to be missing from the pool (depending on how many successive updates are lost). If you are using TCP updates and the collector cannot keep up, then Condor daemons (e.g. startds) may block or time out when trying to send updates.

A simple way to see if you have a serious problem with dropped updates is to observe the total number of machines in the pool, from the point of view of the collector (condor_status -total). If this number drops below what it should be, and the missing machines are running Condor and otherwise working fine, then the problem may be dropped updates.

A more direct way to see if your collector is dropping ClassAd updates is to use the tool condor_updates_stats. Example:

condor_status -l | condor_updates_stats

*** Name/Machine = 'vm4@...' MyType = 'Machine' ***
 Type: Main
   Stats: Total=713, Seq=712, Lost=3 (0.42%)
     0: Ok
  ...
   127: Ok

If your problem is simply that UDP updates are coming in uneven bursts, then the solution is to provide enough UDP buffer space. You can see whether this is the problem by watching the receive queue on the collector's UDP port (visible in the Recv-Q column of netstat output under unix). If it fills up now and then but is otherwise empty, then increasing the buffer size should help. However, the default in current versions of Condor is 10MB, which is adequate for most large pools that we have seen. Example:

# 20MB
COLLECTOR_SOCKET_BUFSIZE = 20480000

See the Condor Manual entry for COLLECTOR_SOCKET_BUFSIZE for additional information on how to make sure the OS is cooperating with the requested buffer size.
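
To watch the receive queue mentioned above, something like the following works on Linux (assuming the default collector port of 9618; look at the Recv-Q column):

$ netstat -una | grep 9618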

If you are using strong authentication for the updates to the collector, this may add a lot of overhead and cause the collector not to scale high enough for very large pools. One idea is to have a cluster of collectors that are randomly chosen (via $RANDOM_CHOICE) by your execute nodes. These collectors would receive updates via strong authentication and then forward the updates to another collector over a trusted network where you can afford not to use strong authentication.
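
One way to sketch such a two-tier setup is to use CONDOR_VIEW_HOST to forward ads from the first-tier collectors to the main collector (the hostnames here are placeholders):

# On the execute nodes: send updates to a randomly chosen first-tier collector
COLLECTOR_HOST = $RANDOM_CHOICE(collector1.example.edu, collector2.example.edu)

# On each first-tier collector: forward the ads it receives to the main
# collector over the trusted network
CONDOR_VIEW_HOST = main-collector.example.edu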

If all else fails, you can decrease the frequency of ClassAd updates by tuning UPDATE_INTERVAL and MASTER_UPDATE_INTERVAL.
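
For example, to cut the update rate in half (values are in seconds; check the defaults for your version):

# Send startd and master ClassAd updates every 10 minutes instead of every 5
UPDATE_INTERVAL = 600
MASTER_UPDATE_INTERVAL = 600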

One more tunable parameter in the collector (unix only) is the maximum number of queries that the collector will try to respond to simultaneously, by creating a child process to handle each one. This is controlled by COLLECTOR_QUERY_WORKERS, which defaults to 16.
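
For example, if the collector host has CPU and memory to spare:

# Allow up to 32 concurrent query-handling child processes (default is 16)
COLLECTOR_QUERY_WORKERS = 32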

High Availability

You can set up redundant collector+negotiator instances so that if the central manager machine goes down, the pool can continue to function. All of the HAD collectors run all the time, but only one negotiator may run at a time, so the condor_had component ensures that a new instance of the negotiator is started when the existing one dies. The main restriction is that the HAD negotiator won't help users who are flocking to the Condor pool. More information about HAD can be found in the Condor manual. Tip: if you do frequent condor_status queries for monitoring, you can direct these to one of your secondary collectors in order to offload work from your primary collector.
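
For example (the hostname is a placeholder for one of your secondary collectors):

$ condor_status -pool secondary-collector.example.edu -total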