Submitting Test Jobs and Examining the Logs


Now, we will submit some test jobs, and follow their progress by examining the different log files.

Simple Test Job

The first job is a simple "hello world" program that will run very quickly. First, change into the directory with the example job:

% cd ~/examples/hello

To take advantage of the advanced features of Condor, you need to re-link your executables with Condor's libraries. To do this, we use condor_compile, which re-links object files into a new binary that will work with Condor. If you list the files in this directory, you will see hello.c, a C source file for a simple "Hello World" program. To link a binary for Condor, issue the following command:

% condor_compile gcc hello.c -o hello 
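For reference, hello.c is presumably just a minimal C program along these lines (the actual file in the examples directory may differ slightly):

/* hello.c -- a minimal "Hello World" program (a sketch; the real
 * example may differ) */
#include <stdio.h>

int main(void)
{
    printf("Hello World\n");
    return 0;
}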

Now you should have an executable called hello that is linked for Condor. You can run it outside of Condor just to test it:

% ./hello

To save time, we have already set up the submit file. The only interesting thing to know is that we told Condor to start the job in /tmp, so all of the output files will go there. /tmp is not on a shared file system, but this still works because of remote system calls, one of the features you gain by using condor_compile.
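A submit file that behaves this way would look something like the following (this is only a sketch; the actual hello.submit may differ in its details):

universe        = standard
executable      = hello
initialdir      = /tmp
output          = hello.out
error           = hello.err
log             = hello.log
queue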

Now, just submit the job to Condor:

% condor_submit hello.submit

First, look at the user log for this job:

% cat /tmp/hello.log

You will be able to follow the progress of the job by watching this file.
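For example, you can leave tail running on the log while the job is submitted, starts executing, and completes (use Ctrl-C to exit out of tail):

% tail -f /tmp/hello.log

Each entry in the user log is a short, numbered event record; you should see one event when the job is submitted, another when it starts executing, and another when it terminates.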

You can also watch the state of the Condor pool with condor_status, in particular its "-run" option, which lists only the machines that are currently running jobs:

% condor_status -run
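You can watch the job itself from the submit machine with condor_q; while the hello job is running it shows up with an "R" in the "ST" column, and it disappears from the queue once it completes:

% condor_q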

A Job That Will Never Run Because of Bad Requirements

The next job we will look at will be submitted with a faulty Requirements expression. When users try to define their own Requirements, they often get something wrong, which prevents their jobs from running. First, let's go into the directory with the example:

% cd ~/examples/never

Now, take a look at the submit file:

% cat never.submit

Notice the line for "Requirements":

requirements    = Arch == "intel"

While this is legal syntax, the job will never run. The machines in the pool advertise Arch = "INTEL", not "intel". This is one place where ClassAds are case sensitive: string attribute values. If the submit file had said:

requirements    = ARCH == "INTEL"

or

requirements    = arCH == "INTEL"

either one would have been fine: attribute names are not case sensitive, and in both cases the value "INTEL" is all caps, matching what the machines advertise.
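If you are ever unsure what values the machines actually advertise, you can ask condor_status directly. For example (a quick sanity check, assuming the pool advertises Arch = "INTEL" as described above), the first of these commands should list the machines in the pool and the second should list nothing:

% condor_status -constraint 'Arch == "INTEL"'
% condor_status -constraint 'Arch == "intel"'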

Let's submit this job to Condor and see what happens:

% condor_submit never.submit

The job will be submitted, but it will never run. First, let's look at the condor_schedd's log, "SchedLog". We have already turned on D_FULLDEBUG for the schedds in the NFS pool, so you can begin to see what is going wrong:

% tail -20 ~/log/SchedLog

Near the end, you should see lines that look something like this:

6/28 16:29 Negotiating for owner: condor@infn-corsi06.corsi.infn.it
6/28 16:29 Sent job 3.0
6/28 16:29 Out of servers - 0 jobs matched, 1 jobs idle

This says that the condor_schedd is negotiating for your job, but the negotiator replied that there were 0 machines that matched the requirements.

Another way to see this is with "condor_q -analyze":

% condor_q -analyze

This will print out a lot of output:

003.000:  Run analysis summary.  Of 16 resource offers,
           16 do not satisfy the request's constraints
            0 resource offer constraints are not satisfied by this request
            0 are serving equal or higher priority customers
            0 are serving more preferred customers
            0 cannot preempt because preemption has been held
            0 are available to service your request

WARNING:  Be advised:
   No resources matched request's constraints
   Check the Requirements expression below:

Requirements = (Arch == "intel") && (OpSys == "LINUX-GLIBC") && 
(Disk >= ExecutableSize) && (VirtualMemory >= ImageSize)

Notice that it indicates the problem is with the job's requirements expression, and that no machine in the pool could ever satisfy it.

When users ask "why don't my jobs run?", the first thing to do is have them run condor_q -analyze.

This job will never run, so just remove it from the queue:

% condor_rm -all

The "-all" option tells Condor to remove all the jobs in the queue that you own.

A Job That Will Never Run Because of Permission Errors

Another common source of user problems is permissions on files or directories. There are many possibilities, particularly if the Condor daemons aren't started as root or you are using AFS. We only have time to look at one example. First, go to the example directory:

% cd ~/examples/permissions

The example we will use is a long-running job. About every second, the job prints a number to STDOUT. Every half minute or so, the job checkpoints itself and exits. While the job is running, we'll remove its output file (a common mistake made by users, particularly with very long-running jobs). First, we need to build this new test program:

% condor_compile gcc loop.c -o loop 
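For reference, loop.c is presumably built around a loop like the one below. This is only a sketch: the checkpoint-and-exit behaviour of the real example is described in the comment but not implemented here.

/* A sketch of the core of loop.c: print a number roughly once per
 * second.  The real example also checkpoints itself and exits about
 * every half minute; that part is not shown here. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int i;
    for (i = 0; ; i++) {
        printf("%d\n", i);   /* one number per iteration, to STDOUT */
        fflush(stdout);      /* push the output through right away */
        sleep(1);            /* roughly one number per second */
    }
}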

Now, let's submit it:

% condor_submit loop.submit
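(loop.submit is presumably much like hello.submit, except that this job's files stay in the example directory, something like:

universe        = standard
executable      = loop
output          = loop.out
error           = loop.err
log             = loop.log
queue
)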

First, watch the job with "condor_q". It should be in the running state (with an "R" in the column marked "ST").

If you use "tail -f loop.out", you'll see the program writting a number to the output file faily regularly. You use Ctrl-C to exit out of tail:

% tail -f loop.out

After a little while, remove this file:

% rm -f loop.out

The job will eventually exit. Now, if you look at condor_q, the job will remain "Idle" (with an "I" in the state column). This is what users often see, jobs stuck in the "Idle" state for a long time, and it is what prompts them to ask for help.

Now, look at the ShadowLog. You'll want to use "less" so you can easily jump to the bottom of the file and scroll backwards.

Note: we've already turned on extra debugging in the condor_shadow by enabling D_SYSCALLS and D_FULLDEBUG. It's not a good idea to leave these on all the time, since they produce a lot of overhead and load on your submit machine, but if you're trying to debug a problem, enabling them is a very good first step.
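For reference, enabling this amounts to a line like the following in the condor_config file used on the submit machine (SHADOW_DEBUG is the standard setting that controls the condor_shadow's debug output):

SHADOW_DEBUG    = D_FULLDEBUG D_SYSCALLS

followed by a condor_reconfig so the running daemons re-read their configuration:

% condor_reconfig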

View the ShadowLog:

% less ~/log/ShadowLog

Press "G" to scroll to the bottom of the file (it has to be a capital "G"). Now you can see the error. Near the top of your screen, you'll see a group of messages that look something like:

6/28 16:35 (9.0) (3310):Got request for syscall 66 
6/28 16:35 (9.0) (3310):    flags = 1
6/28 16:35 (9.0) (3310):    mode = 0
6/28 16:35 (9.0) (3310):    rval = -1, errno = 2
6/28 16:35 (9.0) (3310):Read: i=1, filename=/local/condor/home/examples/permissions/loop.out
6/28 16:35 (9.0) (3310):Read: open: No such file or directory

This tells you exactly what is happening: the condor_shadow is trying to open a file that doesn't exist. If there were a permission problem instead (for example, if you had done a 'chmod 444 loop.out' instead of removing the file), you'd see:

...
6/28 16:35 (9.0) (3310):Read: open: Permission denied
...

In general, if a job isn't running, and "condor_q -analyze" doesn't tell you something helpful, try turning on D_SYSCALLS and looking at the ShadowLog on the submit machine.

That's it. We're done. Congratulations!