next up previous contents index
Next: 8.4 HTCondor on Windows Up: 8. Frequently Asked Questions Previous: 8.2 Setting up HTCondor   Contents   Index

Subsections

8.3 Running HTCondor Jobs

Why aren't any or all of my jobs running?

Please see Section 2.6.5, on page [*] for information on why a job might not be running.

I'm at the University of Wisconsin-Madison Computer Science Dept., and I am having problems!

Please see the web page http://www.cs.wisc.edu/condor/uwcs. As it explains, your home directory is in AFS, which by default has access control restrictions which can prevent HTCondor jobs from running properly. The above URL will explain how to solve the problem.

I'm getting a lot of e-mail from HTCondor. Can I just delete it all?

Generally you shouldn't ignore all of the mail HTCondor sends, but you can reduce the amount you get by telling HTCondor that you don't want to be notified every time a job successfully completes, only when a job experiences an error. To do this, include a line in your submit file like the following:

Notification = Error

See the Notification parameter in the condor_q man page on page [*] of this manual for more information.

Why will my vanilla jobs only run on the machine where I submitted them from?

Check the following:

  1. Did you submit the job from a local file system that other computers can't access?

    See Section 3.3.7, on page [*].

  2. Did you set a special requirements expression for vanilla jobs that's preventing them from running but not other jobs?

    See Section 3.3.7, on page [*].

  3. Is HTCondor running as a non-root user?

    See Section 3.6.13, on page [*].


Why does the requirements expression for the job I submitted
have extra things that I did not put in my submit description file?

There are several extensions to the submitted requirements that are automatically added by HTCondor. Here is a list:

When I use condor_compile to produce a job, I get an error that says, "Internal ld was not invoked!". What does this mean?

condor_compile enforces a specific behavior in the compilers and linkers that it supports (for example gcc, g77, cc, CC, ld) where a special linker script provided by HTCondor must be invoked during the final linking stages of the supplied compiler or linker.

In some rare cases, as with gcc compiled with the options -with-as or -with-ld, the enforcement mechanism we rely upon to have gcc choose our supplied linker script is not honored by the compiler. When this happens, an executable is produced, but the executable is devoid of the HTCondor libraries which both identify it as an HTCondor executable linked for the standard universe and implement the feature sets of remote I/O and transparent process checkpointing and migration.

Often, the only fix in order to use the compiler desired, is to reconfigure and recompile the compiler itself, such that it does not use the errant options mentioned.

With HTCondor's standard universe, we highly recommend that your source files are compiled with the supported compiler for your platform. See section 1.5 for the list of supported compilers. For a Linux platform, the supported compiler is the default compiler that came with the distribution. It is often found in the directory /usr/bin.

Why might my job be preempted (evicted)?

There are four circumstances under which HTCondor may evict a job. They are controlled by different expressions.

Reason number 1 is the user priority: controlled by the PREEMPTION_REQUIREMENTS expression in the configuration file. If there is a job from a higher priority user sitting idle, the condor_negotiator daemon may evict a currently running job submitted from a lower priority user if PREEMPTION_REQUIREMENTS is True. For more on user priorities, see section 2.7 and section 3.4.

Reason number 2 is the owner (machine) policy: controlled by the PREEMPT expression in the configuration file. When a job is running and the PREEMPT expression evaluates to True, the condor_startd will evict the job. The PREEMPT expression should reflect the requirements under which the machine owner will not permit a job to continue to run. For example, a policy to evict a currently running job when a key is hit or when it is the 9:00am work arrival time, would be expressed in the PREEMPT expression and enforced by the condor_startd. For more on the PREEMPT expression, see section 3.5.

Reason number 3 is the owner (machine) preference: controlled by the RANK expression in the configuration file (sometimes called the startd rank or machine rank). The RANK expression is evaluated as a floating point number. When one job is running, a second idle job that evaluates to a higher RANK value tells the condor_startd to prefer the second job over the first. Therefore, the condor_startd will evict the first job so that it can start running the second (preferred) job. For more on RANK, see section 3.5.

Reason number 4 is if HTCondor is to be shutdown: on a machine that is currently running a job. HTCondor evicts the currently running job before proceeding with the shutdown.

What signals get sent to my jobs when HTCondor needs to preempt or kill them, or when I remove them from the queue? Can I tell HTCondor which signals to send?

The answer is dependent on the universe of the jobs.

Under the scheduler universe, the signal jobs get upon condor_rm can be set by the user in the submit description file with the form of

remove_kill_sig = SIGWHATEVER
If this command is not defined, HTCondor further looks for a command in the submit description file with the form
kill_sig = SIGWHATEVER
And, if that command is also not given, HTCondor uses SIGTERM.

For all other universes, the jobs get the value of the submit description file command kill_sig, which is SIGTERM by default.

If a job is killed or evicted, the job is sent a kill_sig, unless it is on the receiving end of a hard kill, in which case it gets SIGKILL.

Under all universes, the signal is sent only to the parent PID of the job, namely, the first child of the condor_starter. If the child itself is forking, the child must catch and forward signals as appropriate. This in turn depends on the user's desired behavior. The exception to this is (again) where the job is receiving a hard kill. HTCondor sends the value SIGKILL to all the PIDs in the family.

Why does my Linux job have an enormous ImageSize and refuse to run anymore?

Sometimes Linux jobs run, are preempted and can not start again because HTCondor thinks the image size of the job is too big. This is because HTCondor has a problem calculating the image size of a program on Linux that uses threads. It is particularly noticeable in the Java universe, but it also happens in the vanilla universe. It is not an issue in the standard universe, because threaded programs are not allowed.

On Linux, each thread appears to consume as much memory as the entire program consumes, so the image size appears to be (number-of-threads * image-size-of-program). If your program uses a lot of threads, your apparent image size balloons. You can see the image size that HTCondor believes your program has by using the -l option to condor_q, and looking at the ImageSize attribute.

When you submit your job, HTCondor creates or extends the requirements for your job. In particular, it adds a requirement that you job must run on a machine with sufficient memory:

Requirements = ... ((Memory * 1024) >= ImageSize) ...

Note that memory is the execution machine's memory in Mbytes, while ImageSize is in Kbytes. ImageSize is not a perfect measure of the memory requirements of a job. It over-counts memory that is shared between processes. It may appear quite large if the job uses mmap() on a large file. It does not account for memory that the job uses indirectly in the operating system's file system cache.

In the Requirements expression above, HTCondor added (Memory * 1024) >= ImageSize on behalf of the job. To prevent HTCondor from doing this, provide your own expression about memory in the submit description file, as in this example:

Requirements = Memory > 1024

You will need to change the value 1024 to a reasonably good estimate of the actual memory requirements of the program, in Mbytes. This example says that the program requires 1 Gbyte of memory. If you underestimate the memory your application needs, you may have bad performance if the job runs on machines that have insufficient memory.

In addition, if you have modified your machine policies to preempt jobs when ImageSize is large, you will need to change those policies.


Why does the time output from condor_status appear as [?????] ?

HTCondor collects timing information for a large variety of uses. Collection of the data relies on accurate times. Being a distributed system, clock skew among machines causes errant timing calculations. Values can be reported too large or too small, with the possibility of calculating negative timing values.

This problem may be seen by the user when looking at the output of condor_status. If the ActivityTime field appears as [?????], then this calculated statistic was negative. condor_status recognizes that a negative amount of time will be nonsense to report, and instead displays this string.

The solution to the problem is to synchronize the clocks on these machines. An administrator can do this using a tool such as ntp.

The user condor's home directory cannot be found. Why?

This problem may be observed after installation, when attempting to execute

~condor/condor/bin/condor_config_val  -tilde
and there is a user named condor. The command prints a message such as
     Error: Specified -tilde but can't find condor's home directory

In this case, the difficulty stems from using NIS, because the HTCondor daemons fail to communicate properly with NIS to get account information. To fix the problem, a dynamically linked version of HTCondor must be installed.


HTCondor commands (including condor_q) are really slow. What is going on?

Some HTCondor programs will react slowly if they expect to find a condor_collector daemon, yet cannot contact one. Notably, condor_q can be very slow. The condor_schedd daemon will also be slow, and it will log lots of harmless messages complaining. If you are not running a condor_collector daemon, it is important that the configuration variable COLLECTOR_HOST be set to nothing. This is typically done by setting CONDOR_HOST with

CONDOR_HOST=
COLLECTOR_HOST=$(CONDOR_HOST)
or
COLLECTOR_HOST=


Where are my missing files? The command when_to_transfer_output = ON_EXIT_OR_EVICT is in the submit description file.

Although it may appear as if files are missing, they are not. The transfer does take place whenever a job is preempted by another job, vacates the machine, or is killed. Look for the files in the directory defined by the SPOOL configuration variable. See section 2.5.4, on page [*] for details on the naming of the intermediate files.


next up previous contents index
Next: 8.4 HTCondor on Windows Up: 8. Frequently Asked Questions Previous: 8.2 Setting up HTCondor   Contents   Index
htcondor-admin@cs.wisc.edu