Most likely, the HTCondor installation has been misconfigured and HTCondor's access control security functionality is preventing daemons and tools from communicating with each other. Other symptoms of this problem include HTCondor tools (such as condor_status and condor_q) not producing any output, or commands that appear to have no effect (for example, condor_off or condor_on).
The solution is to properly configure the HOSTALLOW_* and HOSTDENY_* settings (for host/IP based authentication) or to configure strong authentication and set ALLOW_* and DENY_* as appropriate. Host-based authentication is described in section 3.6.9. Information about other forms of authentication is provided in section 3.6.1.
If the central manager crashes, jobs that are already running will continue to run unaffected. Queued jobs will remain in the queue unharmed, but cannot begin running until the central manager is restarted and begins matchmaking again. Nothing special needs to be done after the central manager is brought back online.
The condor_schedd daemon receives signal 25, dies, and is restarted when the history file reaches a 2 Gbyte size limit. On 32-bit OSes, HTCondor cannot write log files larger than 2 Gbytes. If you need to keep more than 2 Gbytes of history, you can set a maximum history file size of 2 Gbytes and multiple rotations of the file. For example, to keep 6 Gbytes of history, you would put these lines in your HTCondor configuration file:
ENABLE_HISTORY_ROTATION = True
MAX_HISTORY_LOG = 2000000000
MAX_HISTORY_ROTATIONS = 2
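The arithmetic behind this example can be sketched in a few lines of Python. This is only an illustration: the helper name rotations_needed is hypothetical and not part of HTCondor, and it assumes, consistent with the 6 Gbyte example above, that MAX_HISTORY_ROTATIONS counts the rotated files kept in addition to the current history file.

```python
import math

# Bytes per history file, matching MAX_HISTORY_LOG in the example above.
MAX_HISTORY_LOG = 2_000_000_000


def rotations_needed(total_bytes, max_log_bytes=MAX_HISTORY_LOG):
    """Return a MAX_HISTORY_ROTATIONS value that retains total_bytes of history.

    Assumption: retained history is the current file plus the rotated files,
    so the capacity is max_log_bytes * (rotations + 1).
    """
    files = math.ceil(total_bytes / max_log_bytes)
    return max(files - 1, 0)


# Keeping 6 Gbytes of history with 2 Gbyte files.
print(rotations_needed(6_000_000_000))
```

Under that assumption, 6 Gbytes of history needs three files of 2 Gbytes each: the current file and two rotations.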
Depending on how your policy is set up, HTCondor will track any tty on the machine for the purpose of determining if a job is to be vacated or suspended on the machine. It could be the case that after you ssh there, HTCondor notices activity on the tty allocated to your connection and then vacates the job.
One likely error message within the collector log of the form
DaemonCore: PERMISSION DENIED to host <xxx.xxx.xxx.xxx> for command 0 (UPDATE_STARTD_AD)
indicates a permissions problem. The condor_startd daemons do not have write permission to the condor_collector daemon. This could be because you used domain names in your HOSTALLOW_WRITE and/or HOSTDENY_WRITE configuration macros, but the domain name server (DNS) is not properly configured at your site. Without the proper configuration, HTCondor cannot resolve the IP addresses of your machines into fully-qualified domain names (an inverse look up). If this is the problem, then the solution takes one of two forms:
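Either fix the DNS so that inverse look ups work for your machines, or use numeric IP addresses in place of domain names in these macros. For example, if the setting
HOSTALLOW_WRITE = *.your.domain.com
does not work, use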
HOSTALLOW_WRITE = 192.131.133.*, 192.131.132.*
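To see whether inverse look ups work at your site, a short check like the following may help. It is a minimal sketch using Python's standard socket module, not an HTCondor tool; the function names reverse_lookup and unresolved are hypothetical, and the resolver argument exists only so the look up can be replaced for testing.

```python
import socket


def reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return the host name that ip resolves to, or None if the look up fails."""
    try:
        hostname, _aliases, _addresses = resolver(ip)
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return None
    return hostname


def unresolved(ips, resolver=socket.gethostbyaddr):
    """Return the addresses in ips for which the inverse look up fails."""
    return [ip for ip in ips if reverse_lookup(ip, resolver) is None]
```

Calling unresolved() with the IP addresses of your execute machines lists those that the DNS cannot map back to a name. Any address it returns is a candidate for the numeric form of HOSTALLOW_WRITE shown above.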
Alternatively, this permissions problem may be caused by being too restrictive in the setting of your HOSTALLOW_WRITE and/or HOSTDENY_WRITE configuration macros. If it is, then the solution is to change the macros, for example from
HOSTALLOW_WRITE = condor.your.domain.com
to
HOSTALLOW_WRITE = *.your.domain.com
or possibly
HOSTALLOW_WRITE = condor.your.domain.com, foo.your.domain.com, \
    bar.your.domain.com
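The effect of widening the setting can be illustrated with a rough model of the matching. This sketch approximates the wildcard with Python's fnmatch module; that is an assumption made for illustration, and HTCondor's own matching rules may differ. The function name allowed is hypothetical.

```python
from fnmatch import fnmatchcase


def allowed(host, allow_list):
    """Return True if host matches any comma-separated pattern in allow_list.

    Assumption: '*' is a simple wildcard, approximated here with fnmatch.
    This is not HTCondor's implementation.
    """
    patterns = [p.strip().lower() for p in allow_list.split(",") if p.strip()]
    return any(fnmatchcase(host.lower(), pattern) for pattern in patterns)


# The restrictive setting admits only the named machine.
print(allowed("foo.your.domain.com", "condor.your.domain.com"))
# The wildcard setting admits every machine in the domain.
print(allowed("foo.your.domain.com", "*.your.domain.com"))
```

In this model, a condor_startd on foo.your.domain.com is rejected by the first setting and accepted by the second, which is the situation the error message describes.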
Another likely error message within the collector log of the form
DaemonCore: PERMISSION DENIED to host <xxx.xxx.xxx.xxx> for command 5 (QUERY_STARTD_ADS)
indicates a similar problem as above, but read permission is the problem (as opposed to write permission). Use the solutions given above.
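When the collector log contains many such messages, it can help to summarize which hosts are denied and for which kind of access. The following is a minimal sketch that relies only on the message format quoted above. The mapping of UPDATE_ commands to write access and QUERY_ commands to read access is an assumption generalized from the two messages shown here, and the name denied_commands is hypothetical.

```python
import re

# Matches the message format quoted above.
DENIED = re.compile(
    r"PERMISSION DENIED to host <(?P<host>[^>]+)> "
    r"for command (?P<number>\d+) \((?P<name>\w+)\)"
)


def denied_commands(log_lines):
    """Yield (host, command_name, access) for each PERMISSION DENIED line."""
    for line in log_lines:
        match = DENIED.search(line)
        if match is None:
            continue
        name = match.group("name")
        # Assumption: UPDATE_* needs write access, QUERY_* needs read access.
        if name.startswith("UPDATE_"):
            access = "write"
        elif name.startswith("QUERY_"):
            access = "read"
        else:
            access = "unknown"
        yield match.group("host"), name, access
```

Passing the lines of the collector log to denied_commands() gives one tuple per denial, which shows at a glance whether HOSTALLOW_WRITE or the corresponding read setting needs attention.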
Under the FreeBSD and Mac OS X operating systems, misconfiguration of a system's outgoing mail causes HTCondor to inadvertently leave paused and zombie mail processes around when HTCondor attempts to send notification e-mail. The solution to this problem is to correct the mailer configuration.
Execute the following command as the user under which HTCondor daemons run to determine whether outgoing e-mail works.
$ uname -a | mail -v your@emailaddress.com
If no e-mail arrives, then outgoing e-mail does not work correctly.
Note that this problem does not manifest itself on non-BSD Unix platforms, such as Linux.
Some older Xen kernels had a problem where the kernel's jiffy counter could jump backwards in time. This breaks an assumption made by the condor_procd. This problem can only be worked around by upgrading the Xen kernel to a version that fixes the issue with the jiffy counter. Running HTCondor on an affected Xen kernel often results in failures of the following forms in HTCondor daemon log files:
error: parent process's birthday is later than our own
ERROR: No family with the given PID is registered