3.0 Submitting your first Condor job
3.1 First you need a job
Before you can submit a job to Condor, you need a job. We will quickly
write a small program in C. If you aren't an expert C programmer, fear
not. We will hold your hand throughout this process.
First, create a file called simple.c containing the text below. You
can use your favorite editor and put the file anywhere you like in
your home directory, or copy and paste the commands below, which
create it in a condor-test directory:
% cd
% mkdir -p condor-test
% cd condor-test
% cat > simple.c
#include <stdio.h>
#include <stdlib.h>   /* atoi */
#include <unistd.h>   /* sleep */

int main(int argc, char **argv)
{
    int sleep_time;
    int input;
    int failure;

    if (argc != 3) {
        printf("Usage: simple <sleep-time> <integer>\n");
        failure = 1;
    } else {
        sleep_time = atoi(argv[1]);
        input = atoi(argv[2]);
        printf("Thinking really hard for %d seconds...\n", sleep_time);
        sleep(sleep_time);
        printf("We calculated: %d\n", input * 2);
        failure = 0;
    }
    return failure;
}
type control-d here
Now compile that program:
% gcc -o simple simple.c
% ls -lh simple
-rwxrwxr-x 1 roy roy 5.0K Apr 28 14:13 simple*
Finally, run the program and tell it to sleep for four seconds and
calculate 10 * 2:
% ./simple 4 10
Thinking really hard for 4 seconds...
We calculated: 20
Great! You have a job you can tell Condor to run! Although it clearly
isn't an interesting job, it models some of the aspects of a real
scientific program: it takes a while to run and it does a
calculation.
3.2 Submitting your job
Now that you have a job, you just have to tell Condor to run
it. Put the following text into a file called submit:
Universe = vanilla
Executable = simple
Arguments = 4 10
Log = simple.log
Output = simple.out
Error = simple.error
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
Queue
Let's examine each of these lines:
- Universe: The vanilla universe means a plain old job. Later on,
we'll encounter some special universes.
- Executable: The name of your program.
- Arguments: The arguments to pass to your program. These are the
same arguments we typed above.
- Log: This is the name of a file where Condor will record
information about your job's execution. While it's not required, it is
a really good idea to have a log.
- Output: Where Condor should put the standard output from
your job.
- Error: Where Condor should put the standard error from your
job. Our job isn't likely to have any, but we'll put it there to be
safe.
- should_transfer_files: Tell Condor to transfer files (your executable, the
standard output, etc.) back and forth. We are doing this because your home directories are
not on a shared file system.
- when_to_transfer_output: Tell Condor when to transfer your output back. Don't
worry about the details of this now.
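If you'd like to double-check the submit file before handing it to
Condor, here is a minimal sketch of a pre-flight check. It assumes a
POSIX shell such as sh or bash and the simple "Key = value" layout
shown above. The function names submit_value and check_executable are
our own invention for this example; they are not Condor commands.

```shell
# Hypothetical helper (not a Condor command): print the value of one key
# from a submit file. Key names are compared without regard to case.
# Usage: submit_value KEY FILE
submit_value() {
    awk -v key="$1" '
        index($0, "=") == 0 { next }      # skip lines such as "Queue"
        {
            eq = index($0, "=")
            k = substr($0, 1, eq - 1)
            v = substr($0, eq + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)   # trim spaces around the key
            gsub(/^[ \t]+|[ \t]+$/, "", v)   # trim spaces around the value
            if (tolower(k) == tolower(key)) { print v; exit }
        }' "$2"
}

# Hypothetical helper: succeed only if the file named on the Executable
# line exists and is executable, relative to the current directory.
# Usage: check_executable FILE
check_executable() {
    exe=$(submit_value Executable "$1")
    if [ -n "$exe" ] && [ -x "$exe" ]; then
        echo "ok: $exe"
    else
        echo "missing or not executable: $exe" >&2
        return 1
    fi
}
```

With these functions defined in your shell, check_executable submit
prints "ok: simple" if the simple program you compiled earlier is in
the current directory.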
Next, tell Condor to run your job:
% condor_submit submit
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 45011.
Now, watch your job run:
% condor_q
-- Submitter: osg-edu.cs.wisc.edu : <192.168.0.1:46374> : osg-edu.cs.wisc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
45011.0 roy 4/28 14:39 0+00:00:00 I 0 9.8 simple 4 10
1 jobs; 1 idle, 0 running, 0 held
% condor_q
-- Submitter: osg-edu.cs.wisc.edu : <192.168.0.1:46374> : osg-edu.cs.wisc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
45011.0 roy 4/28 14:39 0+00:00:01 R 0 9.8 simple 4 10
1 jobs; 0 idle, 1 running, 0 held
% condor_q
-- Submitter: osg-edu.cs.wisc.edu : <192.168.0.1:46374> : osg-edu.cs.wisc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
Notice a few things here. In a real pool, condor_q might show a long
list of everyone's jobs. You can tell condor_q to list only your jobs
with the -sub option, which is short for submitter, as in:
% condor_q -sub roy
When my job was done, it was no longer listed. Because I
told Condor to log information about my job, I can see what happened
by looking at simple.log:
000 (45011.000.000) 04/28 14:39:04 Job submitted from host: <192.168.0.1:46374>
...
001 (45011.000.000) 04/28 14:39:09 Job executing on host: <192.168.0.4:33478>
...
005 (45011.000.000) 04/28 14:39:13 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
That looks good. It took a few seconds for the job to start up, though
you will often see slightly slower startups: Condor doesn't optimize
for fast job startup, but for high throughput. The job ran for about
four seconds. If this had been a real Condor pool, the execution
computer would have been different from the submit computer, but
otherwise it would have looked the same. But did our job execute
correctly? Let's check the output:
% cat simple.out
Thinking really hard for 4 seconds...
We calculated: 20
Excellent! We ran our sophisticated scientific job on a Condor pool!
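Reading the log by eye works fine for one job. As a rough sketch of
how the same check could be scripted, the hypothetical helpers below
(our own names, not Condor commands) pull the return values out of the
"Normal termination" lines in a log like the one shown earlier. They
assume a POSIX shell and that the log uses exactly that wording; other
kinds of termination are not handled.

```shell
# Hypothetical helper: print one return value per "Normal termination"
# line found in the log file. Exit status is 1 if no such line exists.
# Usage: job_return_values LOGFILE
job_return_values() {
    sed -n 's/.*Normal termination (return value \([0-9][0-9]*\)).*/\1/p' "$1" |
        grep .
}

# Hypothetical helper: succeed only if the log records at least one
# normal termination and every recorded return value is 0.
# Usage: jobs_all_succeeded LOGFILE
jobs_all_succeeded() {
    values=$(job_return_values "$1") || return 1
    for v in $values; do
        [ "$v" -eq 0 ] || return 1
    done
}
```

For example, jobs_all_succeeded simple.log && echo "all jobs returned 0"
prints the message only when every termination recorded in simple.log
has return value 0. If the log holds events from several submissions,
all of them are checked.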
3.3 Doing a parameter sweep
If you only ever had to run a single job, you probably wouldn't need
Condor. But we would like to have our program calculate a whole set of
values for different inputs. How can we do that? Let's change our
submit file to look like this:
Universe = vanilla
Executable = simple
Arguments = 4 10
Log = simple.log
Output = simple.$(Process).out
Error = simple.$(Process).error
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
Queue
Arguments = 4 11
Queue
Arguments = 4 12
Queue
There are two important differences to notice here. First, the
Output and Error lines have the $(Process) macro in them. This means
that the output and error files will be named according to the process
number of the job. You'll see what this looks like in a
moment. Second, we told Condor to run the same job an extra two times
by adding extra Arguments and Queue statements. We are doing a
parameter sweep on the values 10, 11, and 12. Let's see what happens:
% condor_submit submit
Submitting job(s)...
Logging submit event(s)...
3 job(s) submitted to cluster 45017.
% condor_q
-- Submitter: osg-edu.cs.wisc.edu : <192.168.0.1:46374> : osg-edu.cs.wisc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
45017.0 roy 4/28 14:47 0+00:00:00 I 0 9.8 simple 4 10
45017.1 roy 4/28 14:47 0+00:00:00 I 0 9.8 simple 4 11
45017.2 roy 4/28 14:47 0+00:00:00 I 0 9.8 simple 4 12
3 jobs; 3 idle, 0 running, 0 held
% condor_q
-- Submitter: osg-edu.cs.wisc.edu : <192.168.0.1:46374> : osg-edu.cs.wisc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
45017.0 roy 4/28 14:47 0+00:00:05 R 0 9.8 simple 4 10
45017.1 roy 4/28 14:47 0+00:00:03 R 0 9.8 simple 4 11
45017.2 roy 4/28 14:47 0+00:00:01 R 0 9.8 simple 4 12
3 jobs; 0 idle, 3 running, 0 held
% condor_q
-- Submitter: osg-edu.cs.wisc.edu : <192.168.0.1:46374> : osg-edu.cs.wisc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
% ls simple*out
simple.0.out simple.1.out simple.2.out simple.out
% cat simple.0.out
Thinking really hard for 4 seconds...
We calculated: 20
% cat simple.1.out
Thinking really hard for 4 seconds...
We calculated: 22
% cat simple.2.out
Thinking really hard for 4 seconds...
We calculated: 24
Notice that we had three jobs with the same cluster number, but
different process numbers. They have the same cluster number because
they were all submitted from the same submit file. When the jobs ran,
they created three different output files, each with the desired
output.
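Typing an Arguments and Queue pair for every value gets tedious for
larger sweeps. One option, sketched here under the assumption of a
POSIX shell such as sh or bash, is to generate the submit file with a
loop. gen_sweep is a hypothetical helper of our own, not part of
Condor; it prints a submit file using the same settings as the one
above.

```shell
# Hypothetical helper: print a submit file that queues one job for each
# integer from FIRST to LAST, all with the same sleep time.
# Usage: gen_sweep SLEEP_TIME FIRST LAST
gen_sweep() {
    sleep_time=$1
    first=$2
    last=$3

    # Settings shared by every job. The quoted 'EOF' keeps $(Process)
    # literal so that Condor, not the shell, expands it.
    cat <<'EOF'
Universe = vanilla
Executable = simple
Log = simple.log
Output = simple.$(Process).out
Error = simple.$(Process).error
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
EOF

    # One Arguments/Queue pair per value in the sweep.
    n=$first
    while [ "$n" -le "$last" ]; do
        printf 'Arguments = %s %s\nQueue\n' "$sleep_time" "$n"
        n=$((n + 1))
    done
}
```

Running gen_sweep 4 10 12 > submit writes an equivalent of the
three-job submit file used in this section, which you can then pass to
condor_submit as before.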
You are now ready to submit lots of jobs! Although this example was
simple, Condor has many, many options so you can get a wide variety of
behaviors. You can find many of these if you look at the
documentation for condor_submit.
Extra credit
- What if you want the cluster number to be part of the output
filename?
- Condor sends you email when a job finishes. How can you control
this?
- Make another scientific program that takes its input from a
file. Now submit 3 copies of this program where each input file is in
a separate directory. Use the initialdir option described in the
lecture or in the manual.
- Bonus points: You know that your job should never run for
more than four hours. If it does, then the job should be killed
because there is a problem. How can you tell Condor to do this for
you?
Next: Submitting a standard universe job