Submitting a standard universe job

What is the standard universe?

Your first job was considered a vanilla universe job. This meant that it was a plain old job. Condor also supports standard universe jobs. If you have the source code for your program and if it meets certain requirements, you can re-link your program and Condor will provide two major features for you:

Your job can be checkpointed and restarted. When a job is checkpointed, its complete state is saved. When it is restarted, it restarts from where it was checkpointed. If your job is checkpointed periodically, you can recover if there is some sort of failure or interruption. This is incredibly useful for long-running jobs--wouldn't you hate to lose hours or days of work due to a power outage? When Condor restarts, it can restart your job from the last checkpoint, saving you lots of time.
Your job can use remote I/O. This means that every file operation the job makes is performed on the submission computer, so it appears as though the job is running on the submission computer, not the execution computer. If you have files that are not on a shared filesystem, this can be very useful.

Your tutorial leader will give you more details about the standard universe, or you can read about it online.

Linking a program for standard universe

First, you need a job to run. We'll use the same job as before. In case you don't have it, here it is. Save it in simple.c:

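#include <stdio.h>
#include <stdlib.h>   /* atoi */
#include <unistd.h>   /* sleep */

int main(int argc, char **argv)
{
    int sleep_time;
    int input;
    int failure;

    if (argc != 3) {
        printf("Usage: simple <sleep-time> <integer>\n");
        failure = 1;
    } else {
        /* First argument: seconds to sleep. Second: the number to double. */
        sleep_time = atoi(argv[1]);
        input      = atoi(argv[2]);

        printf("Thinking really hard for %d seconds...\n", sleep_time);
        sleep(sleep_time);
        printf("We calculated: %d\n", input * 2);
        failure = 0;
    }
    /* Exit status: 0 on success, 1 if the arguments were wrong. */
    return failure;
}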

Now compile the program using condor_compile. This doesn't change how the program is compiled, just how it is linked. Take note that the executable is named differently.

% condor_compile gcc -o simple.std simple.c
LINKING FOR CONDOR : /usr/bin/ld -L/unsup/condor/lib 
-Bstatic --eh-frame-hdr -m elf_i386 -dynamic-linker 
/lib/ld-linux.so.2 -o simple.std /unsup/condor/lib/condor_rt0.o 
/usr/lib/crti.o
 /afs/cs.wisc.edu/s/gcc-3.4.1/i386_rh9/bin/../lib/gcc/i686-pc-linux-gnu/3.4.1/crtbeginT.o 
-L/unsup/condor/lib
-L/afs/cs.wisc.edu/s/gcc-3.4.1/i386_rh9/bin/../lib/gcc/i686-pc-linux-gnu/3.4.1 
-L/afs/cs.wisc.edu/s/gcc-3.4.1/i386_rh9/bin/../lib/gcc
-L/s/gcc-3.4.1/i386_rh9/lib/gcc/i686-pc-linux-gnu/3.4.1 
-L/afs/cs.wisc.edu/s/gcc-3.4.1/i386_rh9/bin/../lib/gcc/i686-pc-linux-gnu/3.4.1/../../.. 
-L/s/gcc-3.4.1/i386_rh9/lib/gcc/i686-pc-linux-gnu/3.4.1/../../../tmp/cc6zawpv.o 
/unsup/condor/lib/libcondorsyscall.a 
/unsup/condor/lib/libz.a /unsup/condor/lib/libcomp_libstdc++.a 
/unsup/condor/lib/libcomp_libgcc.a
/unsup/condor/lib/libcomp_libgcc_eh.a 
/unsup/condor/lib/libcomp_libgcc_eh.a -lc -lnss_files -lnss_dns
-lresolv -lc 
-lnss_files -lnss_dns -lresolv -lc /unsup/condor/lib/libcomp_libgcc.a 
/unsup/condor/lib/libcomp_libgcc_eh.a
/unsup/condor/lib/libcomp_libgcc_eh.a 
/afs/cs.wisc.edu/s/gcc-3.4.1/i386_rh9/bin/../lib/gcc/i686-pc-linux-gnu/3.4.1/crtend.o 
/usr/lib/crtn.o
/unsup/condor/lib/libcondorsyscall.a(condor_file_agent.o)(.text+0x250): In function `CondorFileAgent::open(char const*, int, int)':
/home/condor/execute/dir_12578/co

% ls -lh simple.std
-rwxr-x---    1 temp-01  temp-01       12M Mar 15 16:32 simple.std*

The linker prints a lot of warnings; you can safely ignore them. You can also see just how many libraries the program is linked against. It's a lot! And yes, the executable is much bigger now. Partly that's the price of checkpointing, and partly it's because the program is now statically linked. You can make it much smaller by stripping the debugging symbols:

% strip simple.std

% ls -lh simple.std
-rwxr-x---    1 temp-01  temp-01      1.3M Mar 15 16:40 simple.std*

Submitting a standard universe program

Submitting a standard universe job is almost the same as submitting a vanilla universe job: just change the universe to standard. I suggest making the job run for a longer time, so we can experiment with checkpointing while it runs. Also, get rid of the multiple queue commands that we had. Here is the complete submit file; I suggest naming it submit.std.

Universe   = standard
Executable = simple.std
Arguments  = 120 10
Log        = simple.log
Output     = simple.out
Error      = simple.error
Queue

Then submit it as you did before, with condor_submit:

% rm simple.log

% condor_submit submit.std
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 3.

% condor_q


-- Submitter: royal01.cs.wisc.edu : <128.105.112.101:34353> : royal01.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   3.0   temp-01         3/15 16:41   0+00:00:00 I  0   1.3  simple.std 120 10 

1 jobs; 1 idle, 0 running, 0 held
% condor_q


-- Submitter: royal01.cs.wisc.edu : <128.105.112.101:34353> : royal01.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   3.0   temp-01         3/15 16:41   0+00:00:01 R  0   1.3  simple.std 120 10 

1 jobs; 0 idle, 1 running, 0 held

Two minutes pass...

% condor_q


-- Submitter: royal01.cs.wisc.edu : <128.105.112.101:34353> : royal01.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 idle, 0 running, 0 held

% cat simple.log
000 (003.000.000) 03/15 16:41:24 Job submitted from host: <128.105.112.101:34353>
...
001 (003.000.000) 03/15 16:41:26 Job executing on host: <128.105.149.110:45736>
...
005 (003.000.000) 03/15 16:43:26 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        1170  -  Run Bytes Sent By Job
        1326572  -  Run Bytes Received By Job
        1170  -  Total Bytes Sent By Job
        1326572  -  Total Bytes Received By Job

Notice that the log file has a bit more information this time: because the job ran in the standard universe, we can see how much data was transferred to and from it. The remote usage was not very interesting because the job just slept, but a real job would have some interesting numbers there.

Advanced tricks in the standard universe

At this point in the tutorial, I will demonstrate how you can force your job to be checkpointed and what that looks like. We will use a command called condor_checkpoint, which you normally never need to use, for the demonstration.

Warning! This demonstration relies on the condor_checkpoint command, which tells all jobs running on a single computer to checkpoint. Since we have a mini Condor pool with only a single computer, anyone in class who runs condor_checkpoint will cause everyone's jobs to checkpoint. This may be confusing, but it's the best we can do for a simple demo.

Begin by submitting your job, and figuring out where it is running:

% rm simple.log

% condor_submit submit.std
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 5.

% condor_q

-- Submitter: royal01.cs.wisc.edu : <128.105.112.101:34353> : royal01.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   5.0   temp-01         3/15 17:01   0+00:00:03 R  0   1.3  simple.std 120 10 

1 jobs; 0 idle, 1 running, 0 held

% cat simple.log
000 (005.000.000) 03/15 17:02:00 Job submitted from host: <128.105.112.101:34353>
...
001 (005.000.000) 03/15 17:02:02 Job executing on host: <128.105.149.101:46393>
...

% host 128.105.149.101
101.149.105.128.in-addr.arpa domain name pointer f01.cs.wisc.edu.

By looking at the IP address of the job and converting that to a name, we know where the job is running. Move along quickly now, because the job will only run for two minutes. Now let's tell Condor to checkpoint and see what happens.

% condor_checkpoint f01.cs.wisc.edu

% cat simple.log

000 (005.000.000) 03/15 17:02:00 Job submitted from host: <128.105.112.101:34353>
...
001 (005.000.000) 03/15 17:02:02 Job executing on host: <128.105.149.101:46393>
...
006 (005.000.000) 03/15 17:02:27 Image size of job updated: 1972
...
003 (005.000.000) 03/15 17:02:28 Job was checkpointed.
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
...
005 (005.000.000) 03/15 17:02:28 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        694993  -  Run Bytes Sent By Job
        1326902  -  Run Bytes Received By Job
        694993  -  Total Bytes Sent By Job
        1326902  -  Total Bytes Received By Job
...

Voila! We checkpointed our job correctly.

Advanced note: You might notice that the job finished right after it was checkpointed. Why? The job was checkpointed while executing sleep(), then essentially restarted from the checkpoint (though Condor doesn't consider this to be a restart since the job didn't leave the computer). Condor didn't keep track of how much time had elapsed in the sleep call, so the job finished right away. Don't worry--Condor handles other system calls just fine. It's not clear how to handle checkpointing sleep()--if your job is interrupted during the sleep and restarted sometime later, how much time should Condor force the job to sleep for? Do we rely on wall clock time? Run time?

Normally, you never need to use condor_checkpoint; we used it only as a demonstration. Condor checkpoints your jobs periodically (the default is every three hours) and whenever your job is forced to leave a computer to give time to another user.

Extra Credit: You can customize the behavior of the standard universe quite a bit. For instance, you can force some files to be accessed locally instead of via remote I/O. You can change the buffering of remote I/O to get better performance. You can disable checkpointing. You can kill a job that has been restarted from its checkpoint more than three times. How do you do these things? Hint: look at the condor_submit manual page.
