Information that the condor_collector collects can be used to monitor a pool. The condor_status command can be used to display a snapshot of the current state of the pool. Monitoring systems can be set up to track the state over time, and they might go further, to alert the system administrator about exceptional conditions.
Support for the Ganglia monitoring system (http://ganglia.info/) is integral to HTCondor. Nagios (http://www.nagios.org/) is often used to provide alerts based on data from the Ganglia monitoring system. The condor_gangliad daemon provides an efficient way to take information from an HTCondor pool and supply it to the Ganglia monitoring system.
The condor_gangliad gathers data as specified by its configuration, and it streamlines getting that data to the Ganglia monitoring system. For efficiency, updates are sent to Ganglia through the Ganglia shared libraries.
If Ganglia is already deployed in the pool, monitoring of HTCondor is enabled by running the condor_gangliad daemon on a single machine within the pool. If the machine chosen is the one running Ganglia's gmetad, then the HTCondor configuration consists of adding GANGLIAD to the definition of configuration variable DAEMON_LIST on that machine. It may be advantageous to run the condor_gangliad daemon on the machine that runs the condor_collector daemon, because on a large pool with many ClassAds this is likely to reduce network traffic. If the condor_gangliad daemon is to run on a different machine than the one running Ganglia's gmetad, modify configuration variable GANGLIA_GSTAT_COMMAND to get the list of monitored hosts from the master gmond program.
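As a rough sketch of the two cases described above, the configuration on the machine running condor_gangliad might look like the following. The host name gmond.example.org is a placeholder, and the gstat options and port 8649 are assumptions based on a typical Ganglia installation; verify them against the local gstat before relying on them.

```condor
# Run condor_gangliad on this machine in addition to the daemons already listed.
DAEMON_LIST = $(DAEMON_LIST) GANGLIAD

# Only needed when gmetad runs on a different machine:
# ask the master gmond (placeholder host, assumed port) for the monitored hosts.
GANGLIA_GSTAT_COMMAND = gstat --all --mpifile --gmond_ip=gmond.example.org --gmond_port=8649
```

After changing DAEMON_LIST, the condor_master on that machine needs to reread its configuration before the new daemon is started.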
If the pool does not use Ganglia, the pool can still be monitored by a separate server running Ganglia.
By default, the condor_gangliad will only propagate metrics to hosts that are already monitored by Ganglia. Set configuration variable GANGLIA_SEND_DATA_FOR_ALL_HOSTS to True when setting up a Ganglia host to monitor a pool that is not itself monitored by Ganglia, or when the pool is heterogeneous and some of its hosts are not monitored. In this case, the default graphs that Ganglia provides will not be present for those hosts. However, the HTCondor metrics will appear.
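A minimal sketch of this setting, placed in the configuration of the machine running condor_gangliad:

```condor
# Publish HTCondor metrics even for hosts that Ganglia does not already monitor.
# Ganglia's own default graphs will be missing for such hosts.
GANGLIA_SEND_DATA_FOR_ALL_HOSTS = True
```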
On large pools, setting configuration variable GANGLIAD_PER_EXECUTE_NODE_METRICS to False will reduce the amount of data sent to Ganglia. The execute node data is the least important to monitor. One can also limit the amount of data by setting configuration variable GANGLIAD_REQUIREMENTS. Be aware that aggregate sums over the entire pool will not be accurate if this variable limits the ClassAds queried.
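For illustration, a large pool might combine both settings as sketched below. The GANGLIAD_REQUIREMENTS expression is only an assumed example that restricts monitoring to scheduler and collector ClassAds; any boolean ClassAd expression suited to the pool could be used instead.

```condor
# Skip the per-execute-node metrics to reduce the data sent to Ganglia.
GANGLIAD_PER_EXECUTE_NODE_METRICS = False

# Assumed example: query only scheduler and collector ClassAds.
# Pool-wide aggregate sums no longer cover the ClassAds excluded here.
GANGLIAD_REQUIREMENTS = MyType == "Scheduler" || MyType == "Collector"
```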
Metrics to be sent to Ganglia are specified in all files within the directory specified by configuration variable GANGLIAD_METRICS_CONFIG_DIR. Each file in the directory is read, and the format within each file is that of New ClassAds. Here is an example of a single metric definition given as a New ClassAd:
    [
      Name = "JobsSubmitted";
      Desc = "Number of jobs submitted";
      Units = "jobs";
      TargetType = "Scheduler";
    ]
A useful set of default metrics is provided in the file $(GANGLIAD_METRICS_CONFIG_DIR)/00_default_metrics.
Recognized metric attribute names and their use:

[...] "Machine_slot1" may be specified to monitor the machine ClassAd for slot 1 only. This is useful when monitoring machine-wide attributes. The special value "ANY" matches any type of ClassAd.

[...] "float" type is recommended.

[...] "double", "float", "int32", "uint32", "int16", "uint16", "int8", "uint8", and "string". The default is "string" for string values, "int32" for integer values, "float" for real values, and "int8" for boolean values. Integer values can be coerced to "float" or "double". This is especially important for values stored internally as 64-bit values.

[...] "\\1" is replaced by the first group, "\\2" by the second group, and so on.

[...] "sum", "avg", "max", and "min".

[...] name@hostname, this may indicate that there are multiple instances of HTCondor running on the same machine. To avoid the metrics from these instances overwriting each other, the default machine name is set to the daemon name in this case. For aggregate metrics, the default value of Machine will be the name of the condor_collector host.

[...] "@" sign, the default IP value will be set to the same value as Machine in order to make the IP value unique to each instance of HTCondor running on the same host.
By default, HTCondor assumes that resources are transient: the condor_collector will discard ClassAds older than CLASSAD_LIFETIME seconds. The default value of CLASSAD_LIFETIME is 15 minutes, so the default value for UPDATE_INTERVAL will pass three times before HTCondor forgets about a resource. In some pools, especially those with dedicated resources, this approach may make it unnecessarily difficult to determine what the composition of the pool ought to be, that is, which machines would be in the pool if HTCondor were functioning properly on all of them.
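The relationship between the two variables can be sketched with explicit values. The 900 seconds restates the 15-minute default given above; the 300 seconds for UPDATE_INTERVAL is an assumption, inferred from the statement that the interval passes three times within that lifetime.

```condor
# The condor_collector discards a ClassAd that has not been refreshed for
# this many seconds (15 minutes).
CLASSAD_LIFETIME = 900

# Assumed interval between daemon updates to the condor_collector, in seconds.
# 900 / 300 = 3 missed updates before a resource is forgotten.
UPDATE_INTERVAL = 300
```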
This assumption of transient machines can be modified by the use of absent ClassAds. When a machine ClassAd would otherwise expire, the condor_collector evaluates the configuration variable ABSENT_REQUIREMENTS against the machine ClassAd. If the expression evaluates to True, the machine ClassAd will be saved in a persistent manner and be marked as absent; this causes the machine to appear in the output of condor_status -absent. When the machine returns to the pool, its first update to the condor_collector will invalidate the absent machine ClassAd.
Absent ClassAds, like offline ClassAds, are stored to disk to ensure that they are remembered, even across condor_collector crashes. The configuration variable COLLECTOR_PERSISTENT_AD_LOG defines the file in which the ClassAds are stored, and replaces the no longer used variable OFFLINE_LOG. Absent ClassAds are retained on disk as maintained by the condor_collector for a length of time in seconds defined by the configuration variable ABSENT_EXPIRE_ADS_AFTER. A value of 0 for this variable means that the ClassAds are never discarded, and the default value is thirty days.
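A minimal sketch of a condor_collector configuration that enables absent ClassAds follows. The log file path is a hypothetical location chosen for illustration, and setting ABSENT_REQUIREMENTS to True is only the simplest possible choice; a site would normally write an expression that selects its dedicated machines.

```condor
# Mark every machine ClassAd that would otherwise expire as absent.
ABSENT_REQUIREMENTS = True

# File in which absent (and offline) ClassAds are persisted.
# Hypothetical path, not a documented default.
COLLECTOR_PERSISTENT_AD_LOG = $(SPOOL)/persistent_ads.log

# Keep absent ClassAds for thirty days: 30 * 86400 = 2592000 seconds.
# A value of 0 would keep them indefinitely.
ABSENT_EXPIRE_ADS_AFTER = 2592000
```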
Absent ClassAds are only returned by the condor_collector and displayed when the -absent option to condor_status is specified, or when the absent machine ClassAd attribute is mentioned on the condor_status command line. This renders absent ClassAds invisible to the rest of the HTCondor infrastructure.
A daemon may inform the condor_collector that the daemon's ClassAd should not expire, but should be removed right away; the daemon asks for its ClassAd to be invalidated. It may be useful to place an invalidated ClassAd in the absent state, instead of having it removed as an invalidated ClassAd. An example of a ClassAd that could benefit from being absent is that of a machine with an uninterruptible power supply, which shuts down cleanly but unexpectedly as a result of a power outage. To cause all invalidated ClassAds to become absent instead of being removed, set EXPIRE_INVALIDATED_ADS to True. Invalidated ClassAds will then be treated as if they had expired, including being evaluated against ABSENT_REQUIREMENTS.
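As a sketch, the corresponding condor_collector setting is a single line. It only has a visible effect when ABSENT_REQUIREMENTS is also defined, since an expired machine ClassAd becomes absent only if that expression evaluates to True.

```condor
# Treat invalidated ClassAds as expired, so that they are evaluated against
# ABSENT_REQUIREMENTS instead of being removed immediately.
EXPIRE_INVALIDATED_ADS = True
```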