Condor-G and DAGMan Hands-On Lab

Part II: Submitting a Simple Grid Job with Condor-G

Now we are ready to submit our first job with Condor-G. Condor-G (the G stands for Grid) is a subset of Condor designed for submitting jobs to external systems, such as Globus. Condor-G does many useful things for the user. For example, it:

finds a remote grid site for the job to run on, based on the user's requirements and preferences
takes care of staging the executable, data, and user credentials in and out of the remote site
keeps track of the job's progress and retries if necessary
stores logging and debugging information about the job
runs in the background; no immediate user presence is required

These features become especially useful when a user has tens or hundreds of Grid jobs to run, and keeping track of them individually becomes a daunting task.

Creating a Condor-G submit file

The basic procedure is to create a Condor job submit description file. This file can tell Condor what executable to run, what resources to use, how to handle failures, where to store the job's output, and many other characteristics of the job submission. Then this file is given to condor_submit.

First, create a scratch directory in your home location.

$ cd ~
$ mkdir scratch

Move to our scratch location:

$ cd ~/scratch

Create a Condor submit file. As you can see from the condor_submit manual page, there are many options that can be specified in a Condor-G submit description file. We will start out with just a few. We'll be sending the job to the computer my-gatekeeper.cs.wisc.edu and running under the "jobmanager-fork" job manager. Other common jobmanagers are: jobmanager-condor, jobmanager-pbs, jobmanager-lsf, etc. They indicate which batch system the gatekeeper will use to run your job on the remote site. You may have to know in advance which jobmanagers the remote site has enabled.

(Feel free to use your favorite editor, but we will demonstrate with 'cat' in the example below. When using cat to create files, press Ctrl-D to close the file -- don't actually type "Ctrl-D" into the file. After you create a file, we suggest you use cat to display the file and confirm that it contains the expected text.)

Create the submit file, then verify that it was entered correctly:

$ cat > myjob.submit
executable=myscript.sh
arguments=TestJob 10
output=results.output
error=results.error
log=results.log
notification=never
universe=globus
globusscheduler=my-gatekeeper.cs.wisc.edu:/jobmanager-fork
queue
<Ctrl-D>

$ cat myjob.submit
executable=myscript.sh
arguments=TestJob 10
output=results.output
error=results.error
log=results.log
notification=never
universe=globus
globusscheduler=my-gatekeeper.cs.wisc.edu:/jobmanager-fork
queue

Notice that we set notification to never to avoid getting email messages about the completion of our job, and that we redirect the job's stdout and stderr back to the submission computer. Notice also the "globusscheduler" parameter: it gives the hostname of the gatekeeper and the jobmanager type used for job execution.
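
The same submit file can also be written non-interactively, which may be convenient when the gatekeeper or jobmanager differs between sites. The following is a minimal sketch, not part of Condor-G itself: the variables GATEKEEPER and JOBMANAGER and the function write_submit_file are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: write a Condor-G submit file for a given gatekeeper
# and jobmanager. Variable and function names are illustrative only.
GATEKEEPER="my-gatekeeper.cs.wisc.edu"
JOBMANAGER="jobmanager-fork"

write_submit_file() {
    # $1 = path of the submit file to create
    cat > "$1" <<EOF
executable=myscript.sh
arguments=TestJob 10
output=results.output
error=results.error
log=results.log
notification=never
universe=globus
globusscheduler=${GATEKEEPER}:/${JOBMANAGER}
queue
EOF
}

write_submit_file myjob.submit
cat myjob.submit
```

Changing JOBMANAGER to, say, jobmanager-pbs would target a different batch system, assuming the remote site has that jobmanager enabled.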

Creating a program to run

Create a little program to run on the grid.

$ cat > myscript.sh
#! /bin/sh

echo "I'm process id $$ on" `hostname`
echo "This is sent to standard error" 1>&2
date
echo "Running as binary $0" "$@"
echo "My name (argument 1) is $1"
echo "My sleep duration (argument 2) is $2"
sleep $2
echo "Sleep of $2 seconds finished.  Exiting"
echo "RESULT: 0 SUCCESS"
<Ctrl-D>

$ cat myscript.sh
#! /bin/sh

echo "I'm process id $$ on" `hostname`
echo "This is sent to standard error" 1>&2
date
echo "Running as binary $0" "$@"
echo "My name (argument 1) is $1"
echo "My sleep duration (argument 2) is $2"
sleep $2
echo "Sleep of $2 seconds finished.  Exiting"
echo "RESULT: 0 SUCCESS"

Make the program executable and test it.

$ chmod a+x myscript.sh
$ ./myscript.sh TEST 1
I'm process id 3428 on uml1.cs.wisc.edu
This is sent to standard error
Thu Jul 10 12:21:11 CDT 2003
Running as binary ./myscript.sh TEST 1
My name (argument 1) is TEST
My sleep duration (argument 2) is 1
Sleep of 1 seconds finished.  Exiting
RESULT: 0 SUCCESS

Submitting to Condor-G

Submit your test job to Condor-G.

$ condor_submit myjob.submit
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 1.

Occasionally run condor_q to watch the progress of your job. You may also want to run "condor_q -globus", which presents Globus-specific status information. Additional documentation is available in the condor_q manual page.

Warning: if your job is in the "H" (held) state, you have probably made a mistake in one of your files; please see the instructor.

$ condor_q
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   user??         7/10 17:28   0+00:00:00 I  0   0.0  myscript.sh TestJo

1 jobs; 1 idle, 0 running, 0 held
$ condor_q -globus


-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
   1.0   user??       UNSUBMITTED fork     my-gatekeeper.cs.wisc.edu   /tmp/username-cond

In a few seconds... (note the job status of "R" indicating that the job is running)

$ condor_q
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   user??         7/10 17:28   0+00:00:27  R  0   0.0  myscript.sh TestJo

1 jobs; 0 idle, 1 running, 0 held

$ condor_q -globus
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
   1.0   user??       ACTIVE fork     my-gatekeeper.cs.wisc.edu   /tmp/username-cond

In a few seconds the job status changes to "C" (complete):

$ condor_q
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:33785> : uml1.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   user??         7/10 17:28   0+00:00:40  C  0   0.0  myscript.sh       

0 jobs; 0 idle, 0 running, 0 held

$ condor_q -globus
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:33785> : uml1.cs.wisc.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
   1.0   user??       DONE fork     my-gatekeeper.cs.wisc.edu   /afs/cs.wisc.edu/u

Ultimately the job completes and is removed from the Condor-G queue:

$ condor_q
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:33785> : uml1.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 idle, 0 running, 0 held
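
When many jobs are queued, it can help to reduce the condor_q listing to just the job IDs and their status letters (I, R, C, H). The sketch below assumes the column layout shown in the listings above, where the status is the sixth whitespace-separated field; job_states is a hypothetical helper name, not a Condor command.

```shell
#!/bin/sh
# Hypothetical helper: print "ID STATUS" for each job row of condor_q
# output read from standard input. Assumes the column layout shown above,
# where the ST letter is the sixth whitespace-separated field.
job_states() {
    awk '$1 ~ /^[0-9]+\.[0-9]+$/ { print $1, $6 }'
}

# Sample input copied from the listing above. With a live queue the
# equivalent would be: condor_q | job_states
job_states <<'EOF'
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   user??         7/10 17:28   0+00:00:27  R  0   0.0  myscript.sh TestJo

1 jobs; 0 idle, 1 running, 0 held
EOF
```

Header and summary lines are skipped because their first field is not of the form cluster.process.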

In another window you can run "tail -f" on the log file for your job to monitor its progress. For the remainder of this tutorial, we suggest you re-run this command whenever you submit one or more jobs. This will allow you to see how typical Condor-G jobs progress. Use "Ctrl-C" to stop watching the file.

In a second window:

$ cd ~/scratch
$ tail -f --lines=500 results.log
000 (001.000.000) 07/10 17:28:48 Job submitted from host: <128.105.185.14:35688>
...
017 (001.000.000) 07/10 17:29:01 Job submitted to Globus
    RM-Contact: my-gatekeeper.cs.wisc.edu:/jobmanager-fork
    JM-Contact: https://my-gatekeeper.cs.wisc.edu:2321/696/1057876132/
    Can-Restart-JM: 1
...
001 (001.000.000) 07/10 17:29:01 Job executing on host: my-gatekeeper.cs.wisc.edu
...
005 (001.000.000) 07/10 17:30:08 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...
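
Each entry in the log starts with a three-digit event code (000, 017, 001, and 005 in the listing above), followed by indented detail lines. To see only the sequence of events, the detail lines can be filtered out. This is a small sketch; log_events is a hypothetical helper name, and sample.log is a stand-in file holding an excerpt of the log shown above.

```shell
#!/bin/sh
# Hypothetical helper: print only the event header lines of a Condor
# user log, i.e. lines that begin with a three-digit event code.
log_events() {
    # $1 = path to a log file such as results.log
    grep '^[0-9][0-9][0-9] (' "$1"
}

# Stand-in log file with an excerpt of the listing above.
cat > sample.log <<'EOF'
000 (001.000.000) 07/10 17:28:48 Job submitted from host: <128.105.185.14:35688>
...
017 (001.000.000) 07/10 17:29:01 Job submitted to Globus
    RM-Contact: my-gatekeeper.cs.wisc.edu:/jobmanager-fork
...
001 (001.000.000) 07/10 17:29:01 Job executing on host: my-gatekeeper.cs.wisc.edu
...
005 (001.000.000) 07/10 17:30:08 Job terminated.
        (1) Normal termination (return value 0)
...
EOF

log_events sample.log
```

For the job in this lab, the call would be "log_events results.log" from ~/scratch.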

When the job is no longer listed by condor_q or when the log file reports "Job terminated," you can see the results in condor_history.

$ condor_history
 ID      OWNER            SUBMITTED     RUN_TIME ST   COMPLETED CMD
   1.0   user??         7/10 10:28   0+00:00:00  C    ???         ...
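
Instead of checking by hand, a script can wait for the "Job terminated" event (code 005 in the log above) to appear in the log file. The following is a minimal polling sketch under that assumption; wait_for_termination is a hypothetical helper, and demo.log is a stand-in file rather than a real job log.

```shell
#!/bin/sh
# Hypothetical helper: poll a Condor user log until a "Job terminated"
# event (code 005) appears, or give up after a timeout.
wait_for_termination() {
    # $1 = log file, $2 = maximum number of seconds to wait
    waited=0
    while ! grep -q '^005 (' "$1" 2>/dev/null; do
        [ "$waited" -ge "$2" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Quick demonstration against a stand-in log file.
printf '005 (001.000.000) 07/10 17:30:08 Job terminated.\n' > demo.log
if wait_for_termination demo.log 5; then
    echo "job finished"
fi
```

For the job in this lab, something like "wait_for_termination results.log 300" run from ~/scratch would wait up to five minutes.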

When the job completes, verify that the output is as expected. (The binary name is different from what you created because of how Globus and Condor-G cooperate to stage your file to the execute machine.)

$ ls
myjob.submit  myscript.sh*  results.error  results.log	results.output
$ cat results.error
This is sent to standard error
$ cat results.output 
I'm process id 733 on pc-26
Thu Jul 10 17:28:57 CDT 2003
Running as binary /home/user??/.globus/.gass_cache/local/md5/28/fcae5001dbcd99cc476984b4151284/md5/af/355c4959dc83a74b18b7c03eb27201/data TestJob 10
My name (argument 1) is TestJob
My sleep duration (argument 2) is 10
Sleep of 10 seconds finished.  Exiting
RESULT: 0 SUCCESS
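
Because myscript.sh prints "RESULT: 0 SUCCESS" as its last line, the output file can also be checked from a script rather than by eye. This is a sketch only; check_result is a hypothetical helper, and demo.output is a stand-in for results.output.

```shell
#!/bin/sh
# Hypothetical helper: report whether a job's standard output file
# contains the success marker printed by myscript.sh.
check_result() {
    # $1 = path to the job's standard output file
    if grep -q '^RESULT: 0 SUCCESS$' "$1"; then
        echo "OK: $1 reports success"
    else
        echo "FAILED: no success marker in $1" >&2
        return 1
    fi
}

# Quick demonstration against a stand-in output file.
printf 'My name (argument 1) is TestJob\nRESULT: 0 SUCCESS\n' > demo.output
check_result demo.output
```

For the job in this lab, the call would be "check_result results.output".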

If you didn't watch the results.log file with tail -f above, you will want to examine the information logged now:

$ cat results.log 
000 (001.000.000) 04/09 17:15:04 Job submitted from host: <198.51.254.123:35604>
...
017 (001.000.000) 04/09 17:15:57 Job submitted to Globus
RM-Contact: my-gatekeeper.cs.wisc.edu:/jobmanager-fork
JM-Contact: https://uml1.cs.wisc.edu:35956/24670/1081548947/
Can-Restart-JM: 1
...
001 (001.000.000) 04/09 17:15:57 Job executing on host: my-gatekeeper.cs.wisc.edu
...
005 (001.000.000) 04/09 17:16:14 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...

For clarity, this example ran only one job. It is certainly possible to submit many jobs, and to remove them from the queue, at any point.
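
As a sketch of what a multi-job submission might look like, the submit file below queues three copies of the same job by giving a count to "queue", and uses the $(Process) macro so that each job writes to its own output and error files. The filename manyjobs.submit is a hypothetical name for illustration. Note the quoted 'EOF' delimiter: without the quotes, the shell would try to run $(Process) as a command while writing the file.

```shell
#!/bin/sh
# Write a submit file that queues three jobs. The quoted 'EOF' keeps the
# shell from expanding $(Process); Condor expands it per job instead.
cat > manyjobs.submit <<'EOF'
executable=myscript.sh
arguments=TestJob 10
output=results.output.$(Process)
error=results.error.$(Process)
log=results.log
notification=never
universe=globus
globusscheduler=my-gatekeeper.cs.wisc.edu:/jobmanager-fork
queue 3
EOF

cat manyjobs.submit
```

The file would then be submitted with condor_submit in the same way as myjob.submit, and queued jobs can be removed with condor_rm followed by the job ID.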

Clean up the results:

$ rm results.*
