Your first job was considered a vanilla universe job. This meant that it was a plain old job. Condor also supports standard universe jobs. If you have the source code for your program and if it meets certain requirements, you can re-link your program and Condor will provide two major features for you: transparent checkpointing and remote system calls (remote I/O). Here is the program we will use; save it as simple.c:
#include <stdio.h>
#include <stdlib.h>   /* atoi */
#include <unistd.h>   /* sleep */

int main(int argc, char **argv)
{
    int sleep_time;
    int input;
    int failure;

    if (argc != 3) {
        printf("Usage: simple <sleep-time> <integer>\n");
        failure = 1;
    } else {
        sleep_time = atoi(argv[1]);
        input      = atoi(argv[2]);
        printf("Thinking really hard for %d seconds...\n", sleep_time);
        sleep(sleep_time);  /* pretend to work, so we have time to checkpoint */
        printf("We calculated: %d\n", input * 2);
        failure = 0;
    }
    return failure;
}
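Before relinking it for Condor, you can sanity-check the program with an ordinary local build. (This quick test is not part of the original recipe; the output follows directly from the source above.)

% gcc -o simple simple.c
% ./simple 2 4
Thinking really hard for 2 seconds...
We calculated: 8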
Now compile the program using condor_compile. This doesn't change how the program is compiled, just how it is linked. Note that we give the executable a different name, simple.std, so we can tell the two versions apart.
% condor_compile gcc -o simple.std simple.c
LINKING FOR CONDOR : /usr/bin/ld -L/opt/condor-6.7.3/lib -Bstatic --eh-frame-hdr
  -m elf_i386 -dynamic-linker /lib/ld-linux.so.2 -o simple.std
  /opt/condor-6.7.3/lib/condor_rt0.o
  /usr/lib/gcc/i386-redhat-linux/3.4.2/../../../crti.o
  /usr/lib/gcc/i386-redhat-linux/3.4.2/crtbeginT.o
  -L/opt/condor-6.7.3/lib -L/usr/lib/gcc/i386-redhat-linux/3.4.2
  -L/usr/lib/gcc/i386-redhat-linux/3.4.2
  -L/usr/lib/gcc/i386-redhat-linux/3.4.2/../../..
  /tmp/cc83nCSc.o
  /opt/condor-6.7.3/lib/libcondorzsyscall.a
  /opt/condor-6.7.3/lib/libcondor_z.a
  /opt/condor-6.7.3/lib/libcomp_libstdc++.a
  /opt/condor-6.7.3/lib/libcomp_libgcc.a
  /opt/condor-6.7.3/lib/libcomp_libgcc_eh.a
  --as-needed --no-as-needed
  -lcondor_c -lcondor_nss_files -lcondor_nss_dns -lcondor_resolv
  -lcondor_c -lcondor_nss_files -lcondor_nss_dns -lcondor_resolv
  -lcondor_c
  /opt/condor-6.7.3/lib/libcomp_libgcc.a
  /opt/condor-6.7.3/lib/libcomp_libgcc_eh.a
  --as-needed --no-as-needed
  /usr/lib/gcc/i386-redhat-linux/3.4.2/crtend.o
  /usr/lib/gcc/i386-redhat-linux/3.4.2/../../../crtn.o
/opt/condor-6.7.3/lib/libcondorzsyscall.a(condor_file_agent.o)(.text+0x250): In function `CondorFileAgent::open(char const*, int, int)':
: warning: the use of `tmpnam' is dangerous, better use `mkstemp'
% ls -lh simple.std
-rwxrwxr-x  1 roy roy 1.6M Jan 25 12:37 simple.std*
You may see warnings like the tmpnam warning above; you can safely ignore them. You can also see just how many libraries the program is linked against. It's a lot! And yes, the executable is much bigger now: partly that's the price of having checkpointing, and partly it's because the program is now statically linked. You can make it slightly smaller if you want by getting rid of the debugging symbols:
% strip simple.std
% ls -lh simple.std
-rwxrwxr-x  1 roy roy 1.3M Jan 25 12:39 simple.std*
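If you want to confirm that the relinked binary really is statically linked, the standard file utility will say so. The exact output depends on your system, so treat this as a sketch:

% file simple.std
simple.std: ELF 32-bit LSB executable, Intel 80386, ... statically linked ...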
Submitting a standard universe job is almost the same as submitting a vanilla universe job: just change the universe to standard. I suggest making the job run for a longer time, so we can experiment with checkpointing while it runs, and getting rid of the multiple queue commands that we had before. Here is the complete submit file; I suggest naming it submit.std.
Universe   = standard
Executable = simple.std
Arguments  = 120 10
Log        = simple.log
Output     = simple.out
Error      = simple.error
Queue
Then submit it as you did before, with condor_submit:
% rm simple.log
% condor_submit submit.std
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 3.
% condor_q -sub roy

-- Submitter: roy@fnal.gov : <128.105.48.160:32787> : hal.fnal.gov
 ID      OWNER          SUBMITTED     RUN_TIME ST PRI SIZE CMD
   3.0   roy           1/25 13:07   0+00:00:00 R  0   1.3  simple.std 120 10

1 jobs; 0 idle, 1 running, 0 held

Two minutes pass...

% condor_q

-- Submitter: roy@fnal.gov : <128.105.48.160:32787> : hal.fnal.gov
 ID      OWNER          SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held

% cat simple.log
000 (003.000.000) 01/25 13:07:58 Job submitted from host: <128.105.48.160:32787>
...
001 (003.000.000) 01/25 13:08:27 Job executing on host: <128.105.48.160:32786>
...
005 (003.000.000) 01/25 13:10:27 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        1162  -  Run Bytes Sent By Job
        1335470  -  Run Bytes Received By Job
        1162  -  Total Bytes Sent By Job
        1335470  -  Total Bytes Received By Job
...
Notice that the log file has a bit more information this time: because the job ran in the standard universe, we can see how much data was transferred to and from it. The remote usage numbers are not very interesting here because the job just slept, but a real job would show meaningful numbers there.
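If you ever want to pull just those transfer statistics out of a long log file, a simple grep works; the numbers below are the ones from the run above:

% grep Bytes simple.log
        1162  -  Run Bytes Sent By Job
        1335470  -  Run Bytes Received By Job
        1162  -  Total Bytes Sent By Job
        1335470  -  Total Bytes Received By Job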
At this point in the tutorial, I will demonstrate how you can force your job to be checkpointed and what that looks like. We will use a command called condor_checkpoint that you normally never need to use; we use it here purely as a demonstration.
Begin by submitting your job, and figuring out where it is running:
% rm simple.log
% condor_submit submit.std
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 5.
% condor_q -sub roy

-- Submitter: roy@fnal.gov : <132.67.192.133:49346> : hal.fnal.gov
 ID      OWNER          SUBMITTED     RUN_TIME ST PRI SIZE CMD
   5.0   roy           1/25 14:42   0+00:00:00 R  0   1.3  simple.std 120 10

1 jobs; 0 idle, 1 running, 0 held

% cat simple.log
000 (005.000.000) 01/25 14:42:26 Job submitted from host: <128.105.48.160:32787>
...
001 (005.000.000) 01/25 14:42:30 Job executing on host: <128.105.48.160:32786>
...
% host 128.105.48.160
160.48.105.128.in-addr.arpa domain name pointer wireless60.cs.wisc.edu.
By taking the IP address from the "Job executing on host" line and converting it to a name, we know where the job is running. Move along quickly now, because the job will only run for two minutes.
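As an aside, condor_q -run is another way to see which machine each running job landed on. The exact columns vary by Condor version, so treat this output as a sketch rather than what you will see verbatim:

% condor_q -run

-- Submitter: roy@fnal.gov : <128.105.48.160:32787> : hal.fnal.gov
 ID      OWNER          SUBMITTED     RUN_TIME HOST(S)
   5.0   roy           1/25 14:42   0+00:00:30 wireless60.cs.wisc.edu

Now let's tell Condor to checkpoint the job and see what happens.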
% condor_checkpoint wireless60.cs.wisc.edu
% cat simple.log
000 (005.000.000) 01/25 14:42:26 Job submitted from host: <128.105.48.160:32787>
...
001 (005.000.000) 01/25 14:42:30 Job executing on host: <128.105.48.160:32786>
...
006 (005.000.000) 01/25 14:43:00 Image size of job updated: 1977
...
003 (005.000.000) 01/25 14:43:00 Job was checkpointed.
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
...
005 (005.000.000) 01/25 14:43:00 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        690889  -  Run Bytes Sent By Job
        1335788  -  Run Bytes Received By Job
        690889  -  Total Bytes Sent By Job
        1335788  -  Total Bytes Received By Job
...
Voila! We checkpointed our job correctly.
Normally, you never need to run condor_checkpoint yourself; we used it only as a demonstration. Condor checkpoints your jobs periodically (every three hours by default) and whenever a job is forced to leave a computer to give time to another user.
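If you ever do want to trigger that checkpoint-and-leave behavior by hand, for example to clear off a machine, the related condor_vacate command asks the jobs on a host to checkpoint and vacate. Like condor_checkpoint, it takes host names as arguments:

% condor_vacate wireless60.cs.wisc.edu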