Now we are ready to submit our first job with Condor-G. Condor-G (the G stands for Grid) is a subset of Condor designed for submitting jobs to external systems, such as Globus. Condor-G does many useful things for the user.
These features become especially useful when a user has tens or hundreds of Grid jobs to run, and keeping track of them individually becomes a daunting task.
Creating a Condor-G submit file
The basic procedure is to create a Condor job submit description file. This file can tell Condor what executable to run, what resources to use, how to handle failures, where to store the job's output, and many other characteristics of the job submission. Then this file is given to condor_submit.
First, create a scratch directory in your home location.
$ cd ~
$ mkdir scratch
Move to our scratch location:
$ cd ~/scratch
Create a Condor submit file. As you can see from the condor_submit manual page, there are many options that can be specified in a Condor-G submit description file. We will start out with just a few. We'll be sending the job to the computer my-gatekeeper.cs.wisc.edu and running under the "jobmanager-fork" job manager. Other common jobmanagers are: jobmanager-condor, jobmanager-pbs, jobmanager-lsf, etc. They indicate which batch system the gatekeeper will use to run your job on the remote site. You may have to know in advance which jobmanagers the remote site has enabled.
(Feel free to use your favorite editor, but we will demonstrate with 'cat' in the example below. When using cat to create files, press Ctrl-D to close the file -- don't actually type "Ctrl-D" into the file. After you create a file, we suggest you use cat to display the file and confirm that it contains the expected text.)
Create the submit file, then verify that it was entered correctly:
$ cat > myjob.submit
executable=myscript.sh
arguments=TestJob 10
output=results.output
error=results.error
log=results.log
notification=never
universe=globus
globusscheduler=my-gatekeeper.cs.wisc.edu:/jobmanager-fork
queue
<Ctrl-D>
$ cat myjob.submit
executable=myscript.sh
arguments=TestJob 10
output=results.output
error=results.error
log=results.log
notification=never
universe=globus
globusscheduler=my-gatekeeper.cs.wisc.edu:/jobmanager-fork
queue
Notice that we're setting notification to never to avoid getting email messages about the completion of our job, and redirecting the stdout/err of the job back to the submission computer. Notice the "globusscheduler" parameter - it indicates the hostname of the gatekeeper and the jobmanager type used for job execution.
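If you would rather not type the file interactively, the same submit description file can be written in one step with a shell here-document. This is only a sketch: it assumes you are already in ~/scratch and it overwrites any existing myjob.submit in the current directory.

```shell
# Sketch: write the submit description file non-interactively.
# The quoted 'EOF' delimiter stops the shell from expanding anything inside.
# Assumption: the current directory is your scratch directory.
cat > myjob.submit <<'EOF'
executable=myscript.sh
arguments=TestJob 10
output=results.output
error=results.error
log=results.log
notification=never
universe=globus
globusscheduler=my-gatekeeper.cs.wisc.edu:/jobmanager-fork
queue
EOF

# Display the file to confirm that it contains the expected text
cat myjob.submit
```

The contents are identical to the file created with cat above; only the way the text gets into the file differs.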
Creating a program to run
Create a little program to run on the grid.
$ cat > myscript.sh
#! /bin/sh
echo "I'm process id $$ on" `hostname`
echo "This is sent to standard error" 1>&2
date
echo "Running as binary $0" "$@"
echo "My name (argument 1) is $1"
echo "My sleep duration (argument 2) is $2"
sleep $2
echo "Sleep of $2 seconds finished. Exiting"
echo "RESULT: 0 SUCCESS"
<Ctrl-D>
$ cat myscript.sh
#! /bin/sh
echo "I'm process id $$ on" `hostname`
echo "This is sent to standard error" 1>&2
date
echo "Running as binary $0" "$@"
echo "My name (argument 1) is $1"
echo "My sleep duration (argument 2) is $2"
sleep $2
echo "Sleep of $2 seconds finished. Exiting"
echo "RESULT: 0 SUCCESS"
Make the program executable and test it.
$ chmod a+x myscript.sh
$ ./myscript.sh TEST 1
I'm process id 3428 on uml1.cs.wisc.edu
This is sent to standard error
Thu Jul 10 12:21:11 CDT 2003
Running as binary ./myscript.sh TEST 1
My name (argument 1) is TEST
My sleep duration (argument 2) is 1
Sleep of 1 seconds finished. Exiting
RESULT: 0 SUCCESS
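The submit file sends the job's standard output and standard error to separate files. A local run with shell redirection can mimic that split before anything is sent to the grid. In the sketch below, the file names local.output and local.error are hypothetical, and a shortened stand-in script is created only if myscript.sh is missing, so that the example runs on its own.

```shell
# Sketch: run the script locally with stdout and stderr in separate files,
# similar in spirit to output= and error= in the submit file.
# Assumption: myscript.sh is in the current directory. If it is not,
# a shortened stand-in (not the full tutorial script) is created.
if [ ! -f myscript.sh ]; then
    cat > myscript.sh <<'EOF'
#! /bin/sh
echo "This is sent to standard error" 1>&2
echo "My name (argument 1) is $1"
sleep $2
echo "RESULT: 0 SUCCESS"
EOF
fi
chmod a+x myscript.sh

# local.output and local.error are hypothetical names for this local test
./myscript.sh TEST 1 > local.output 2> local.error
echo "exit status: $?"

echo "--- local.output ---"
cat local.output
echo "--- local.error ---"
cat local.error
```

If the line "This is sent to standard error" shows up only in local.error, the script separates its two streams the way the job's results.output and results.error files are expected to.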
Submitting to Condor-G
Submit your test job to Condor-G.
$ condor_submit myjob.submit
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 1.
Run condor_q occasionally to watch the progress of your job. You may also want to run "condor_q -globus", which presents Globus-specific status information. See the condor_q manual page for additional documentation.
Warning: if your job is in the "H" (held) state, you have probably made a mistake in one of your files; please see the instructor.
$ condor_q
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   user??          7/10 17:28   0+00:00:00 I  0   0.0  myscript.sh TestJo
1 jobs; 1 idle, 0 running, 0 held
$ condor_q -globus
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
 ID      OWNER          STATUS      MANAGER  HOST                       EXECUTABLE
   1.0   user??       UNSUBMITTED fork     my-gatekeeper.cs.wisc.edu  /tmp/username-cond
In a few seconds... (note the job status of "R" indicating that the job is running)
$ condor_q
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   user??          7/10 17:28   0+00:00:27 R  0   0.0  myscript.sh TestJo
1 jobs; 0 idle, 1 running, 0 held
$ condor_q -globus
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
 ID      OWNER          STATUS      MANAGER  HOST                       EXECUTABLE
   1.0   user??       ACTIVE      fork     my-gatekeeper.cs.wisc.edu  /tmp/username-cond
In a few seconds the job status changes to "C" - complete
$ condor_q
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:33785> : uml1.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   user??          7/10 17:28   0+00:00:40 C  0   0.0  myscript.sh
0 jobs; 0 idle, 0 running, 0 held
$ condor_q -globus
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:33785> : uml1.cs.wisc.edu
 ID      OWNER          STATUS      MANAGER  HOST                       EXECUTABLE
   1.0   user??       DONE        fork     my-gatekeeper.cs.wisc.edu  /afs/cs.wisc.edu/u
Ultimately the job completes and is removed from the Condor-G queue
$ condor_q
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:33785> : uml1.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
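When many jobs are queued, reading the condor_q summary line by eye gets tedious. The sketch below pulls the idle, running, and held counts out of that line. The helper name queue_counts is hypothetical, and it assumes the summary line has exactly the format shown in the listings above; other Condor versions may print it differently.

```shell
# queue_counts: read condor_q output on stdin and print "idle running held",
# taken from a summary line such as "1 jobs; 0 idle, 1 running, 0 held".
# Hypothetical helper; assumes the summary format shown in this tutorial.
queue_counts() {
    sed -n 's/^[0-9][0-9]* jobs; \([0-9][0-9]*\) idle, \([0-9][0-9]*\) running, \([0-9][0-9]*\) held.*$/\1 \2 \3/p'
}

# Sample summary line copied from the condor_q output above.
# With Condor installed you would use: condor_q | queue_counts
sample='1 jobs; 0 idle, 1 running, 0 held'

# Split the three counts into $1, $2, $3
set -- $(printf '%s\n' "$sample" | queue_counts)
echo "idle=$1 running=$2 held=$3"

# A non-zero held count is the "H" state the warning above refers to
if [ "$3" -gt 0 ]; then
    echo "warning: $3 job(s) held; check your submit file and script"
fi
```

For the sample line this prints idle=0 running=1 held=0 and no warning.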
In another window you can run "tail -f" on your job's log file to monitor its progress. For the remainder of this tutorial, we suggest you re-run this command whenever you submit one or more jobs. This will let you see how typical Condor-G jobs progress. Use "Ctrl-C" to stop watching the file.
In a second window:
$ cd ~/scratch
$ tail -f --lines=500 results.log
000 (001.000.000) 07/10 17:28:48 Job submitted from host: <128.105.185.14:35688>
...
017 (001.000.000) 07/10 17:29:01 Job submitted to Globus
    RM-Contact: my-gatekeeper.cs.wisc.edu:/jobmanager-fork
    JM-Contact: https://my-gatekeeper.cs.wisc.edu:2321/696/1057876132/
    Can-Restart-JM: 1
...
001 (001.000.000) 07/10 17:29:01 Job executing on host: my-gatekeeper.cs.wisc.edu
...
005 (001.000.000) 07/10 17:30:08 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...
When the job is no longer listed by condor_q or when the log file reports "Job terminated," you can see the results in condor_history.
$ condor_history
 ID      OWNER            SUBMITTED     RUN_TIME ST   COMPLETED CMD
   1.0   user??          7/10 10:28   0+00:00:00 C   ???        ...
When the job completes, verify that the output is as expected. (The binary name differs from the one you created because of how Globus and Condor-G cooperate to stage your file to the execute machine.)
$ ls
myjob.submit  myscript.sh*  results.error  results.log  results.output
$ cat results.error
This is sent to standard error
$ cat results.output
I'm process id 733 on pc-26
Thu Jul 10 17:28:57 CDT 2003
Running as binary /home/user??/.globus/.gass_cache/local/md5/28/fcae5001dbcd99cc476984b4151284/md5/af/355c4959dc83a74b18b7c03eb27201/data TestJob 10
My name (argument 1) is TestJob
My sleep duration (argument 2) is 10
Sleep of 10 seconds finished. Exiting
RESULT: 0 SUCCESS
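Checking the output by eye works for one job; for several, a small check can do the comparison. The sketch below is one possible way to do it. The function name check_output is hypothetical, and it tests only the lines that do not depend on the host, the date, or the staged binary path.

```shell
# check_output: succeed if the job output contains the expected argument
# lines and the final RESULT line. Hypothetical helper for this tutorial.
#   $1  output file
#   $2  expected name (argument 1 of the job)
#   $3  expected sleep duration (argument 2 of the job)
check_output() {
    grep -q "^My name (argument 1) is $2\$" "$1" &&
    grep -q "^My sleep duration (argument 2) is $3\$" "$1" &&
    grep -q '^RESULT: 0 SUCCESS$' "$1"
}

# Only check when the job has actually produced its output file
if [ -f results.output ]; then
    if check_output results.output TestJob 10; then
        echo "results.output looks as expected"
    else
        echo "results.output does not match the expected lines"
    fi
else
    echo "results.output not found; has the job finished?"
fi
```

The arguments TestJob and 10 match the arguments= line of the submit file used in this section.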
If you didn't watch the results.log file with tail -f above, you will want to examine the information logged now:
$ cat results.log
000 (001.000.000) 04/09 17:15:04 Job submitted from host: <198.51.254.123:35604>
...
017 (001.000.000) 04/09 17:15:57 Job submitted to Globus
    RM-Contact: my-gatekeeper.cs.wisc.edu:/jobmanager-fork
    JM-Contact: https://uml1.cs.wisc.edu:35956/24670/1081548947/
    Can-Restart-JM: 1
...
001 (001.000.000) 04/09 17:15:57 Job executing on host: my-gatekeeper.cs.wisc.edu
...
005 (001.000.000) 04/09 17:16:14 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...
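The log can also be read by a script. The sketch below looks for the 005 "Job terminated" event and extracts the return value from the "Normal termination" line. The function name job_return_value is hypothetical, the sample log is an excerpt of the listing above, and the pattern assumes the log wording shown in this tutorial.

```shell
# job_return_value: print the return value from the last
# "Normal termination (return value N)" line in a Condor job log.
# Hypothetical helper; assumes the log wording shown in this tutorial.
job_return_value() {
    sed -n 's/^.*Normal termination (return value \([0-9][0-9]*\)).*$/\1/p' "$1" | tail -n 1
}

# Sample log excerpt taken from the results.log listing above
log=$(mktemp)
cat > "$log" <<'EOF'
000 (001.000.000) 04/09 17:15:04 Job submitted from host: <198.51.254.123:35604>
...
001 (001.000.000) 04/09 17:15:57 Job executing on host: my-gatekeeper.cs.wisc.edu
...
005 (001.000.000) 04/09 17:16:14 Job terminated.
        (1) Normal termination (return value 0)
...
EOF

# Event code 005 marks job termination in the log
if grep -q '^005 ' "$log"; then
    echo "job terminated, return value: $(job_return_value "$log")"
else
    echo "job has not terminated yet"
fi
rm -f "$log"
```

To use this on a real job, pass results.log to job_return_value instead of the temporary sample file.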
In this example we only ran one job for clarity. It is certainly possible to submit (and remove) many jobs from the queue at any point.
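As an illustration of submitting several jobs at once, the sketch below writes a submit file that queues five copies of the same job. The file name manyjobs.submit and the job count are assumptions made for this example; "queue 5" and the $(Process) macro are standard submit-file features, so each job gets its own output and error file.

```shell
# Sketch: a submit file that queues five copies of the job.
# manyjobs.submit is a hypothetical file name. The quoted 'EOF' keeps
# the shell from trying to expand $(Process), which Condor fills in
# with 0, 1, 2, 3, 4 for the five jobs.
cat > manyjobs.submit <<'EOF'
executable=myscript.sh
arguments=TestJob 10
output=results.output.$(Process)
error=results.error.$(Process)
log=results.log
notification=never
universe=globus
globusscheduler=my-gatekeeper.cs.wisc.edu:/jobmanager-fork
queue 5
EOF

cat manyjobs.submit

# With Condor-G available, you would then run (not executed here):
#   condor_submit manyjobs.submit
# and could remove all jobs of a cluster from the queue with:
#   condor_rm <cluster id printed by condor_submit>
```

All five jobs share results.log, so the log-watching approach described earlier still works unchanged.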
Clean up the results:
$ rm results.*