The first job is a simple "hello world" program that will run very quickly. First, change into a directory with the example job:
% cd ~/examples/hello
To take advantage of Condor's advanced features, you need to re-link your executables with Condor's libraries. To do this, use condor_compile to re-link object files into a new binary that will work with Condor. If you list the files in this directory, you will see hello.c, a C source file for a simple "Hello World" executable. To link a binary for Condor, issue the following command:
% condor_compile gcc hello.c -o hello
Now you should have an executable linked for Condor called hello. You can run it outside of Condor just to test it:
% ./hello
To save time, we have already set up the submit file. The only interesting thing to know is that we told Condor to start the job in /tmp, so all the output files will go there. /tmp is not on a shared file system, but this will still work because of remote system calls, one of the features you gain by using condor_compile.
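A minimal submit file along those lines might read as follows (a sketch, not necessarily the exact contents of hello.submit; initialdir is the setting that makes the job start in /tmp):

universe   = standard
executable = hello
initialdir = /tmp
output     = hello.out
error      = hello.err
log        = hello.log
queue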
Now, just submit the job to Condor:
% condor_submit hello.submit
First, look at the user log for this file:
% cat /tmp/hello.log
You will be able to follow the progress of the job by watching this file.
You can watch the state of the Condor pool with condor_status, in particular, the "-run" option:
% condor_status -run
The next job we will look at will be submitted with a faulty Requirements expression. When users try to define their own Requirements, they often get something wrong, and it prevents their jobs from running. First, lets go into the directory with the example:
% cd ~/examples/never
Now, take a look at the submit file:
% cat never.submit
Notice the line for "Requirements":
requirements = Arch == "intel"
While this is legal syntax, the job will never run. The machines in the pool advertise Arch = "INTEL", not "intel". This is one place where ClassAds are case-sensitive: attribute values. If the user had written:

requirements = ARCH == "INTEL"

or:

requirements = arCH == "INTEL"

it would have been fine, since in both cases the "INTEL" part is all capitals.
Let's submit this job to Condor and see what happens:
% condor_submit never.submit
The job will be submitted, but will never run. First, let's look at the condor_schedd's log, "SchedLog". We have already turned on D_FULLDEBUG for the schedds in the pool, so you can begin to see what's going wrong:
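Turning on full debugging for the schedd is a one-line change in the Condor configuration file (this uses the standard per-daemon debug macro; the daemon must be reconfigured, e.g. with condor_reconfig, for it to take effect):

SCHEDD_DEBUG = D_FULLDEBUG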
% tail -20 ~/log/SchedLog
Near the end, you should see lines that look something like this:
6/28 16:29 Negotiating for owner: condor@infn-corsi06.corsi.infn.it
6/28 16:29 Sent job 3.0
6/28 16:29 Out of servers - 0 jobs matched, 1 jobs idle
This says that the condor_schedd is negotiating for your job, but the negotiator replied that there were 0 machines that matched the requirements.
Another way to see this is with "condor_q -analyze":
% condor_q -analyze
This will print out a lot of output:
003.000:  Run analysis summary.  Of 16 resource offers,
    16 do not satisfy the request's constraints
    0 resource offer constraints are not satisfied by this request
    0 are serving equal or higher priority customers
    0 are serving more preferred customers
    0 cannot preempt because preemption has been held
    0 are available to service your request
WARNING:  Be advised:
    No resources matched request's constraints
    Check the Requirements expression below:

Requirements = (Arch == "intel") && (OpSys == "LINUX-GLIBC") && (Disk >= ExecutableSize) && (VirtualMemory >= ImageSize)
Notice that it indicates the problem is with the requirements expression of the job, and that there are no machines in the pool that could ever possibly satisfy them.
When users ask "why don't my jobs run?", the first thing to do is have them run condor_q -analyze.
This job will never run, so just remove it from the queue:
% condor_rm -all
The "-all" option tells Condor to remove all the jobs in the queue that you own.
Another common source of problems is permissions on files or directories. There are many possibilities, particularly if the Condor daemons aren't started as root or you are using AFS. We only have time to look at one example. First, go to the example directory:
% cd ~/examples/permissions
This example uses a long-running job. About every second, the job prints a number to STDOUT. Every half minute or so, the job checkpoints itself and exits. While the job is running, we'll remove its output file (a common mistake made by users, particularly with very long-running jobs). First, we need to build this new test program:
% condor_compile gcc loop.c -o loop
Now, let's submit it:
% condor_submit loop.submit
First, watch the job with "condor_q". It should be in the running state (with an "R" in the column marked "ST").
If you use "tail -f loop.out", you'll see the program writing a number to the output file fairly regularly. Use Ctrl-C to exit out of tail:
% tail -f loop.out
After a little while, remove this file:
% rm -f loop.out
The job will eventually exit. Now, if you look at condor_q, the job will remain "Idle" (with an "I" in the state column). This is often what users see: their jobs stuck in the "Idle" state for a long time, which prompts them to ask for help.
Now, look at the ShadowLog. You'll want to use "less" so you can easily scroll to the bottom of the file, as well as scrolling backwards.
Note: we've already turned on extra debugging in the condor_shadow by enabling D_SYSCALLS and D_FULLDEBUG. It's not a good idea to leave these on all the time, since they produce a lot of overhead and load on your submit machine, but if you're trying to debug a problem, enabling them is a very good first step.
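In the Condor configuration file, those shadow debug flags are enabled like this (standard config syntax; remember to turn them back off once you're done debugging):

SHADOW_DEBUG = D_SYSCALLS D_FULLDEBUG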
View the ShadowLog:
% less ~/log/ShadowLog
Press "G" to scroll to the bottom of the file (it has to be a capital "G"). Now you can see the error. Near the top of your screen, you'll see a group of messages that look something like:
6/28 16:35 (9.0) (3310):Got request for syscall 66
6/28 16:35 (9.0) (3310): flags = 1
6/28 16:35 (9.0) (3310): mode = 0
6/28 16:35 (9.0) (3310): rval = -1, errno = 2
6/28 16:35 (9.0) (3310):Read: i=1, filename=/local/condor/home/examples/permissions/loop.out
6/28 16:35 (9.0) (3310):Read: open: No such file or directory
This tells you exactly what's happening: the condor_shadow is trying to open a file that doesn't exist. If there were a permission problem instead (for example, if you had done 'chmod 444 loop.out' instead of removing the file), you'd see:
...
6/28 16:35 (9.0) (3310):Read: open: Permission denied
...
In general, if a job isn't running, and "condor_q -analyze" doesn't tell you something helpful, try turning on D_SYSCALLS and looking at the ShadowLog on the submit machine.
That's it. We're done. Congratulations!