Next: 8.2 Setting up HTCondor Up: 8. Frequently Asked Questions Previous: 8. Frequently Asked Questions   Contents   Index

8.1 Obtaining & Installing HTCondor


Where can I download HTCondor?

HTCondor can be downloaded from the mirrors listed at http://research.cs.wisc.edu/htcondor/downloads/.

When I click to download HTCondor, it sends me back to the downloads page!

If you are downloading HTCondor through a web proxy, try disabling it. Our web site uses the "referring page" as you navigate through the download menus in order to give you the right version of HTCondor, but proxies sometimes block this information from reaching our web site.

What platforms are supported?

Supported platforms are listed in Section 1.5. There is also platform-specific information in Chapter 7.


Can I get the source code?

For HTCondor version 7.0.0 and later releases, the HTCondor source code is available for public download with the binary distributions.


What is Personal HTCondor?

Personal HTCondor is a term used to describe a specific style of HTCondor installation suited for individual users who do not have their own pool of machines, but want to submit HTCondor jobs to run elsewhere.

A Personal HTCondor is essentially a one-machine, self-contained HTCondor pool which can use flocking to access resources in other HTCondor pools. See Section 5.2 for more information on flocking.

What do I do now? My installation of HTCondor does not work.

What to do to get HTCondor running properly depends on the sort of error that occurs. One common category of error is communication errors. HTCondor daemon log files may report a failure to bind. For example:

(date and time) Failed to bind to command ReliSock

Or, the errors in the various log files may be of the form:

(date and time) Error sending update to collector(s)
(date and time) Can't send end_of_message
(date and time) Error sending UDP update to the collector

(date and time) failed to update central manager

(date and time) Can't send EOM to the collector

This problem can also be observed by running condor_status. It will give a message of the form:

Error:  Could not fetch ads --- error communication error
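The log messages quoted above can also be searched for mechanically. The following is a minimal Python sketch, not part of HTCondor, that scans log lines for the exact message strings listed in this answer. The function name find_comm_errors and the sample lines are illustrative assumptions; real log lines carry a timestamp where the sample shows "(date and time)".

```python
# Message strings copied from the examples in this answer.
COMM_ERRORS = (
    "Failed to bind to command ReliSock",
    "Error sending update to collector(s)",
    "Can't send end_of_message",
    "Error sending UDP update to the collector",
    "failed to update central manager",
    "Can't send EOM to the collector",
)


def find_comm_errors(lines):
    """Return (line_number, line) pairs for lines containing a known message."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(message in line for message in COMM_ERRORS):
            hits.append((number, line.rstrip("\n")))
    return hits


# Placeholder lines in the same form as the examples above.
sample = [
    "(date and time) Failed to bind to command ReliSock\n",
    "(date and time) some unrelated message\n",
    "(date and time) Can't send EOM to the collector\n",
]
for number, line in find_comm_errors(sample):
    print(number, line)
```

To scan a real log, pass an open file object instead of the sample list, since iterating over a file yields its lines.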

To solve this problem, understand that HTCondor uses the first network interface it sees on the machine. Since machines often have more than one interface, this problem usually implies that the wrong network interface is being used. It also may be the case that the system simply has the wrong IP address configured.

It is incorrect to use the localhost network interface, which has IP address 127.0.0.1 on all machines. To check whether this incorrect IP address is being used, look at the contents of the CollectorLog file on the pool's central manager right after it is started. The contents will be of the form:

5/25 15:39:33 ******************************************************
5/25 15:39:33 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
5/25 15:39:33 ** $CondorVersion: 6.2.0 Mar 16 2001 $
5/25 15:39:33 ** $CondorPlatform: INTEL-LINUX-GLIBC21 $
5/25 15:39:33 ** PID = 18658
5/25 15:39:33 ******************************************************
5/25 15:39:33 DaemonCore: Command Socket at <128.105.101.15:9618>

The last line tells the IP address and port the collector has bound to and is listening on. If the IP address is 127.0.0.1, then HTCondor is definitely using the wrong network interface.
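The check described above can be sketched in Python. This is a minimal example, not an HTCondor tool: it assumes the log line has exactly the "Command Socket at <ip:port>" form shown in the excerpt (other HTCondor versions may format the address differently), and the helper names command_socket and uses_loopback are made up for illustration.

```python
import ipaddress
import re

# Matches the "Command Socket at <ip:port>" line from the log excerpt above.
SOCKET_RE = re.compile(r"DaemonCore: Command Socket at <([0-9.]+):(\d+)>")


def command_socket(log_text):
    """Return (ip, port) from the first matching line, or None if absent."""
    match = SOCKET_RE.search(log_text)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def uses_loopback(log_text):
    """True if the daemon bound to a loopback address such as 127.0.0.1."""
    sock = command_socket(log_text)
    return sock is not None and ipaddress.ip_address(sock[0]).is_loopback


sample = "5/25 15:39:33 DaemonCore: Command Socket at <128.105.101.15:9618>"
print(command_socket(sample))  # ('128.105.101.15', 9618)
print(uses_loopback(sample))   # False
```

Using ipaddress.ip_address rather than a string comparison means any address in the loopback range is flagged, not only the literal 127.0.0.1.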

There are two solutions to this problem. One solution changes the order of the network interfaces. The preferred solution sets which network interface HTCondor should use by adding the following parameter to the local HTCondor configuration file:

NETWORK_INTERFACE = machine-ip-address

where machine-ip-address is the IP address of the interface you wish HTCondor to use.
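As a small safeguard when generating this setting, the value can be validated before it is written to the configuration file. The sketch below is illustrative only: network_interface_line is a hypothetical helper, not an HTCondor function, and it simply refuses loopback or malformed addresses.

```python
import ipaddress


def network_interface_line(ip):
    """Build a NETWORK_INTERFACE entry, rejecting unusable addresses."""
    address = ipaddress.ip_address(ip)  # raises ValueError if malformed
    if address.is_loopback:
        raise ValueError(f"{ip} is a loopback address and must not be used")
    return f"NETWORK_INTERFACE = {address}"


# The address is the one from the CollectorLog excerpt earlier in this answer.
print(network_interface_line("128.105.101.15"))
```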

After an installation of HTCondor, why do the daemons refuse to start?

This message appears in the log files:

ERROR "The following configuration macros appear to contain default values 
that must be changed before Condor will run.  These macros are:
hostallow_write 
(found on line 1853 of /scratch/adesmet/TRUNK/work/src/localdir/condor_config)"
at line 217 in file condor_config.C

As of HTCondor 6.8.0, if HTCondor sees the bare keyword YOU_MUST_CHANGE_THIS_INVALID_CONDOR_CONFIGURATION_VALUE as the value of a configuration file entry, the HTCondor daemons log the error message shown above and exit.

By default, an installation of HTCondor 6.8.0 and later releases will have the configuration file entry HOSTALLOW_WRITE set to the above sentinel value. The HTCondor administrator must alter this value to be the correct domain or IP addresses that the administrator desires. The wild card character (*) may be used to define this entry, but that allows anyone, from anywhere, to submit jobs into the pool. A better value will be of the form *.domainname.com.
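Before starting the daemons, a configuration file can be checked for entries that still hold the sentinel value. The following Python sketch is a simplified illustration, not HTCondor's own parser: it handles only plain "NAME = value" lines and ignores features such as line continuations. The function name unconfigured_macros and the host name cm.example.com in the sample are hypothetical.

```python
SENTINEL = "YOU_MUST_CHANGE_THIS_INVALID_CONDOR_CONFIGURATION_VALUE"


def unconfigured_macros(config_text):
    """Return (line_number, macro_name) for entries set to the bare sentinel."""
    found = []
    for number, raw in enumerate(config_text.splitlines(), start=1):
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value.strip() == SENTINEL:
            found.append((number, name.strip()))
    return found


# Hypothetical excerpt of a freshly installed configuration file.
sample = """\
# hypothetical excerpt
CONDOR_HOST = cm.example.com
HOSTALLOW_WRITE = YOU_MUST_CHANGE_THIS_INVALID_CONDOR_CONFIGURATION_VALUE
"""
print(unconfigured_macros(sample))  # [(3, 'HOSTALLOW_WRITE')]
```

Reporting the line number mirrors the daemon's own error message, which names the line on which the macro was found.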


Why do standard universe jobs never run after an upgrade?

Standard universe jobs that remain in the job queue across an upgrade from any HTCondor release previous to 6.7.15 to any HTCondor release of 6.7.15 or more recent cannot run. They are missing a required ClassAd attribute (LastCheckpointPlatform) added for all standard universe jobs as of HTCondor version 6.7.15. This new attribute describes the platform where a job was running when it produced a checkpoint. The attribute is utilized to identify platforms capable of continuing the job (using the checkpoint).

This attribute became necessary because of bugs in some Linux kernels: a standard universe job may be continued on some, but not all, Linux machines, and the CkptOpSys attribute is not specific enough to be used for this purpose.

There are two possible solutions for these standard universe jobs that cannot run, yet are in the queue:

  1. Remove and resubmit the standard universe jobs that remain in the queue across the upgrade. This includes all standard universe jobs that have flocked in to the pool. Note that the resubmitted jobs will start over again from the beginning.

  2. For each standard universe job in the queue, modify its job ClassAd such that it can possibly run within the upgraded pool. If the job has already run and produced a checkpoint on a machine before the upgrade, determine the machine that produced the checkpoint using the LastRemoteHost attribute in the job's ClassAd. Then look at that machine's ClassAd (after the upgrade) to determine and extract the value of the CheckpointPlatform attribute. Add this (using condor_qedit) as the value of the new attribute LastCheckpointPlatform in the job's ClassAd. Note that this operation must also be performed on standard universe jobs flocking in to an upgraded pool. It is recommended that pools that flock between each other upgrade to a post-6.7.15 version of HTCondor.
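The condor_qedit step in the second solution can be sketched as follows. This Python example only builds the argument list and does not run anything. It assumes the usual condor_qedit argument order of job identifier, attribute name, attribute value, and that a string value must be passed as a double-quoted ClassAd string. The helper name qedit_command, the job identifier 123.0, and the platform string are hypothetical; the real platform value must be copied from the CheckpointPlatform attribute of the machine named by LastRemoteHost.

```python
def qedit_command(job_id, checkpoint_platform):
    """Build the argument list that sets LastCheckpointPlatform on one job."""
    if '"' in checkpoint_platform:
        # Keep the sketch simple: do not attempt to escape embedded quotes.
        raise ValueError("platform string must not contain double quotes")
    # ClassAd string values are written inside double quotes.
    quoted_value = '"' + checkpoint_platform + '"'
    return ["condor_qedit", job_id, "LastCheckpointPlatform", quoted_value]


# Hypothetical job identifier and platform string, for illustration only.
command = qedit_command("123.0", "LINUX INTEL 2.6.x normal")
print(command)
```

Passing the list to subprocess.run, rather than joining it into a shell string, avoids a second layer of shell quoting around the ClassAd string.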

Note that if the upgrade to HTCondor takes place at the same time as a platform change (such as booting an upgraded kernel), there is no way to properly set the LastCheckpointPlatform attribute. The only option is to remove and resubmit the standard universe jobs.


htcondor-admin@cs.wisc.edu