Earlier today we learned about some grid technologies, including Condor-G and DAGMan. In this session, we will install these components and submit some simple jobs with them.
These directions were written specifically for a tutorial given on Monday, July 14, 2003, on the pc-##.gs.unina.it computers. The tutorial will work elsewhere, with two important notes. First, use the "Full VDT Install" directions, not the "Using the existing VDT Install" directions. Second, the example jobs are sent to the pc-##.gs.unina.it computers; you will likely need to change the computer names to match a working Globus gatekeeper that you have access to.
The VDT (Virtual Data Toolkit) distributes a variety of grid middleware, making it easy to install Condor-G and DAGMan and get going. (Official Virtual Data Toolkit web site)
First, create a scratch directory to do your work in, another directory for the VDT installation, and a third for your jobs. In the example below, we use /tmp. This is a good choice for the purposes of this tutorial -- but be aware that an installation of the VDT uses about 60 megabytes, and during installation it can use up to 120 megabytes. So be certain /tmp has enough free space. Why does a VDT install use so much space? Because the VDT installs much more than just Condor-G and DAGMan: it bundles many software packages that together provide complete grid solutions for job and data management.
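Before creating anything, you can sanity-check the free space with df (standard on Linux; look at the "Avail" column and make sure there are at least 120 megabytes available):

$ df -h /tmp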
Be sure to create a unique name for your directory in /tmp, since /tmp may be shared with other users. Where the example specifies "username", use your username, your real name, or something else unique.
Let's get going! First make the subdirectories:
$ cd /tmp
$ mkdir username-condor-g-dagman-tutorial
$ cd username-condor-g-dagman-tutorial
$ mkdir submit
You will now need to select a method to access a working VDT installation. If you are attending the Monday, July 14, 2003 tutorial on the unina.it computers, select Using the existing VDT Install unless specifically directed otherwise by the speaker.
This installation technique requires the pc-##.gs.unina.it computers configured for the hands-on tutorial being given on Monday, July 14, 2003. If you are not participating in that tutorial, you will want to use the Full VDT Install directions.
Begin by linking your scratch space to the existing VDT installation:
$ mkdir condor_local
$ CONDOR_LOCAL=`pwd`/condor_local
$ ln -s /opt/vdt ./
$ pushd vdt
Now we'll source the setup.sh script created by the VDT. This will configure your environment so Globus, Condor-G, and related tools are ready for use. (If "echo $GLOBUS_LOCATION" returns nothing, rerun "source setup.sh" and try again.)
(This tutorial assumes you're using a Bourne Shell derived shell, typically /bin/sh or /bin/bash. This is the system default on the pc-##.gs.unina.it computers. If you've changed to a csh derived shell, you'll need to slightly modify the examples. Whenever you are asked to source setup.sh, there will also be a setup.csh for csh users. For simplicity you may want to use /bin/sh or /bin/bash for this tutorial.)
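For example, where the sh examples in this tutorial do the following (a sketch; the csh lines below are the equivalents a csh user would type):

$ source setup.sh
$ CONDOR_LOCAL=`pwd`/condor_local

a csh user would instead type:

% source setup.csh
% setenv CONDOR_LOCAL `pwd`/condor_local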
$ pwd
/tmp/username-condor-g-dagman-tutorial/vdt
$ source setup.sh
$ echo $GLOBUS_LOCATION
/tmp/username-condor-g-dagman-tutorial/vdt/globus
$ popd
We need to make some minor adjustments to Condor's configuration. Normally you could use condor_configure, but in this tutorial we're using an existing VDT installation to which we don't have all necessary permissions. We'll use condor_config_val to determine where the local Condor configuration file is. condor_config_val allows you to query Condor's configuration settings.
$ LOCAL_CONFIG=`condor_config_val LOCAL_CONFIG_FILE`
$ cp $LOCAL_CONFIG $LOCAL_CONFIG.orig
$ echo "LOCAL_DIR=$CONDOR_LOCAL" >> $LOCAL_CONFIG
$ cat >> $LOCAL_CONFIG
DAEMON_LIST = MASTER,SCHEDD
CONDOR_HOST=
Ctrl-D
$ condor_config_val LOCAL_DIR DAEMON_LIST CONDOR_HOST
/tmp/username-condor-g-dagman-tutorial/condor_local
MASTER,SCHEDD
Not defined: CONDOR_HOST
$ condor_init
Creating /tmp/username-condor-g-dagman-tutorial/condor_local/log
Creating /tmp/username-condor-g-dagman-tutorial/condor_local/spool
Creating /tmp/username-condor-g-dagman-tutorial/condor_local/execute
/opt/vdt/condor/condor_home.pc-23/condor_config.local already exists.
Condor has been initialized, but not started.
Now continue on to Finishing Up, skipping over the Full VDT Install section.
The VDT uses the package manager Pacman to do installation (Official Pacman web site). Pacman is written in Python and requires a working Python installation. The VDT site currently recommends version 2.098, but the VDT works fine with 2.108. Pacman 2.108 improves compatibility with recent versions of Python, so we'll use that.
Download Pacman.
$ mkdir vdt
$ cd vdt
$ wget http://physics.bu.edu/~youssef/pacman/sample_cache/tarballs/pacman-2.108.tar.gz
--16:43:43--  http://physics.bu.edu/%7Eyoussef/pacman/sample_cache/tarballs/pacman-2.108.tar.gz
           => `pacman-2.108.tar.gz'
Resolving physics.bu.edu... done.
Connecting to physics.bu.edu[128.197.41.42]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75,946 [application/x-tar]

100%[=============================================================================>] 75,946       291.99K/s    ETA 00:00

16:43:43 (291.99 KB/s) - `pacman-2.108.tar.gz' saved [75946/75946]
For simplicity, we recommend using wget for the tutorial. If you prefer, you can instead download Pacman with your web browser from the Pacman web site mentioned above.
Unpack Pacman and "source setup.sh" to bring Pacman's path and other settings into your shell environment, then use Pacman to install the VDT client. Note that the VDT installation can take about 20 minutes, depending on the speed of your internet connection.
(This tutorial assumes you're using a Bourne Shell derived shell, typically /bin/sh or /bin/bash. This is the system default on the pc-##.gs.unina.it computers. If you've changed to a csh derived shell, you'll need to slightly modify the examples. Whenever you are asked to source setup.sh, there will also be a setup.csh for csh users. For simplicity you may want to use /bin/sh or /bin/bash for this tutorial.)
$ tar xzf pacman-2.108.tar.gz
$ cd pacman-2.108
$ source setup.sh
$ cd ..
$ pacman -get VDT:VDT-Client
Do you want to start a new Pacman installation here: [/afs/cs.wisc.edu/u/a/d/adesmet/miron-condor-g-dagman-talk/vdt]? (y or n): y
Do you want to trust the registered cache [VDT]? (y or n): y
Fetching [GPT] from [http://www.lsc-group.phys.uwm.edu/vdt/vdt_cache/]...
Fetching [VDT-Globus-Wrapper] from [http://www.lsc-group.phys.uwm.edu/vdt/vdt_cache/]...
Fetching [Globus-2-Client] from [http://www.lsc-group.phys.uwm.edu/vdt/vdt_cache/]...
Fetching [Globus-RLS-Client] from [http://www.lsc-group.phys.uwm.edu/vdt/vdt_cache/]...
Fetching [Configure-Globus] from [http://www.lsc-group.phys.uwm.edu/vdt/vdt_cache/]...
Fetching [Fault-Tolerant-Shell] from [http://www.lsc-group.phys.uwm.edu/vdt/vdt_cache/]...
Fetching [DOE-EDG-Certificates] from [http://www.lsc-group.phys.uwm.edu/vdt/vdt_cache/]...
Fetching [Condor-Base] from [http://www.lsc-group.phys.uwm.edu/vdt/vdt_cache/]...
Fetching [Condor] from [http://www.lsc-group.phys.uwm.edu/vdt/vdt_cache/]...
Fetching [KX509-Client] from [http://www.lsc-group.phys.uwm.edu/vdt/vdt_cache/]...
Fetching [MyProxy] from [http://www.lsc-group.phys.uwm.edu/vdt/vdt_cache/]...
Fetching [VDT-Version] from [http://www.lsc-group.phys.uwm.edu/vdt/vdt_cache/]...
Package [VDT-Environment] has been installed.
Package [Globus-Environment] has been installed.
Package [GPT-Environment] has been installed.
Installing GPT 2.2.5...
Package [GPT] has been installed.
Package [VDT-Globus-Wrapper] has been installed.
Installing Globus bundle: globus-all-client-dbg-NMI-VDT-1.1.8-i686-pc-linux-gnu-bin.tar.gz ...
Package [Globus-2-Client] has been installed.
Installing Globus bundle: globus_rls_client-2.0.8.tar.gz ...
Package [Globus-RLS-Client] has been installed.
Package [Configure-Globus] has been installed.
We'll skip GSI setup because you are not root
See /tmp/username-condor-g-dagman-tutorial/vdt/post-install/README for instructions on how to do it later
Package [Configure-Globus-Client] has been installed.
Package [Fault-Tolerant-Shell-Environment] has been installed.
Installing FTSH...
Package [Fault-Tolerant-Shell] has been installed.
Skipping installation of CA signing policies because you are not root.
See post-install/README for more information.
Package [DOE-EDG-Certificates] has been installed.
Package [Condor-Environment] has been installed.
Package [Condor-Base] has been installed.
Condor has been installed into: /tmp/username-condor-g-dagman-tutorial/vdt/condor
Package [Condor] has been installed.
Installing Globus bundle: kx509-client-NMI-VDT-1.1.8-i686-pc-linux-gnu-bin.tar.gz ...
Package [KX509-Client] has been installed.
Installing Globus bundle: myproxy-NMI-VDT-1.1.8-i686-pc-linux-gnu-bin.tar.gz ...
Package [MyProxy] has been installed.
Package [VDT-Version] has been installed.
The VDT Client version 1.1.9 has been installed.
Package [VDT-Client] has been installed.
Done.
It's possible that an error will occur while installing the VDT: the network may briefly fail, the disk may fill, or some other problem may arise. If necessary, just delete the scratch directory we've created and start over.
Only run the next block of commands if you need to start over because of an error.
$ cd ..
$ pwd
/tmp
$ rm -rf username-condor-g-dagman-tutorial
This is a slightly unusual installation, because normally the VDT wants to install the certificates into /etc/grid-security. Because we're installing as a non-root user, this isn't an option. So we'll need to move the certificates into place by hand. We'll also set the X509_CERT_DIR environment variable to ensure that we're using the certificates we expect.
$ mv post-install/certificates globus/share/
$ cp globus/setup/globus/42* globus/share/certificates/
$ X509_CERT_DIR=`pwd`/globus/share/certificates
$ export X509_CERT_DIR
Now we'll source the setup.sh script created by the VDT. This will configure your environment so Globus, Condor-G, and related tools are ready for use. (If "echo $GLOBUS_LOCATION" returns nothing, rerun "source setup.sh" and try again.)
$ pwd
/tmp/username-condor-g-dagman-tutorial/vdt
$ source setup.sh
$ echo $GLOBUS_LOCATION
/tmp/username-condor-g-dagman-tutorial/vdt/globus
We need to configure Condor-G to be a submission-only entrance to the grid. We'll use condor_configure to accomplish this.
$ pwd
/tmp/username-condor-g-dagman-tutorial/vdt
$ cd condor
$ ./condor_configure --type=submit --local-dir=`pwd`/localdir
By default the above configuration will attempt to connect your computer to a local Condor pool. This doesn't cause any problems, but it can slightly slow down job submission. To disable this behavior, we need to clear the CONDOR_HOST setting in the Condor configuration files. The easiest way is to append a blank CONDOR_HOST setting to the end of the local configuration file; we'll use condor_config_val to easily locate that file. You can edit the local configuration file with your favorite editor, or run the following command:
$ echo "CONDOR_HOST=" >> `condor_config_val LOCAL_CONFIG_FILE` $ condor_config_val CONDOR_HOST Not defined: CONDOR_HOST
Continue on to the "Finishing Up" section below.
Create a short-lived proxy for this tutorial. (The default proxy length is 12 hours. For a long-lived task, you might create proxies with lifespans of 24 hours, several days, or even several months.) (The "-verify" option is not required, but it is useful for debugging: it will warn you if an expected Certificate Authority certificate is missing.)
$ grid-proxy-info -all
ERROR: Couldn't find a valid proxy.
Use -debug for further information.
$ grid-proxy-init -hours 4 -verify
Your identity: /C=US/O=Globus/O=University of Wisconsin/OU=Computer Sciences Department/CN=Alan De Smet
Enter GRID pass phrase for this identity: Your pass phrase
Creating proxy ........................................... Done
Proxy Verify OK
Your proxy is valid until Thu Jul 10 16:06:13 2003
$ grid-proxy-info -all
subject  : /C=US/O=Globus/O=University of Wisconsin/OU=Computer Sciences Department/CN=Alan De Smet/CN=proxy
issuer   : /C=US/O=Globus/O=University of Wisconsin/OU=Computer Sciences Department/CN=Alan De Smet
type     : full
strength : 512 bits
timeleft : 3:59:57
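Later, outside this tutorial, you might want a longer-lived proxy. A sketch using the same GT2 tools (the 72-hour lifetime here is illustrative):

$ grid-proxy-init -hours 72 -verify    # a three-day proxy
$ grid-proxy-info -timeleft            # prints the remaining lifetime in seconds
$ grid-proxy-destroy                   # discard the proxy when you are done with it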
Do a quick test with globus-job-run to ensure that everything is working. Note that the example below submits to the fork jobmanager on pc-26.gs.unina.it; you will probably be instructed to submit to a server local to your location instead -- ask your instructor:
$ globus-job-run pc-26.gs.unina.it /bin/date
Wed Jul  9 17:57:49 CDT 2003
Our basic Globus setup appears to be functioning. Now we just need to start up Condor-G by running condor_master. After running condor_master, Condor-G should be running and ready to go. We'll then run ps to observe which processes were started. This is a very minimal Condor configuration: only the master (which ensures that the other daemons remain running) and the schedd (which manages your job queue) are necessary, and only they are running. If you were doing matchmaking to match jobs to available grid resources, you would also run the collector and negotiator.
$ condor_master
$ ps -efwwwww | grep condor_ | grep `id -un` | grep -v grep
adesmet  23046     1  0 18:10 ?        00:00:00 condor_master
adesmet  23050 23046  0 18:10 ?        00:00:00 condor_schedd -f
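(As an aside for later, not a step to run now: when you are completely finished with the tutorial, one way to shut these daemons back down is condor_off, assuming your installation's default security settings permit it:)

$ condor_off -master    # tells the master to shut down, taking the schedd with it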
Now we are ready to submit our first job with Condor-G. The basic procedure is to create a Condor job submit description file. This file can tell Condor what executable to run, what resources to use, how to handle failures, where to store the job's output, and many other characteristics of the job submission. Then this file is given to condor_submit.
First, move to our scratch submission location:
$ cd ../..
$ pwd
/tmp/username-condor-g-dagman-tutorial
$ cd submit
Create a Condor submit file. As you can see from the condor_submit manual page, there are many options that can be specified in a Condor-G submit description file. We will start out with just a few. We'll be sending the job to the computer "pc-26.gs.unina.it" and running under the "jobmanager-fork" job manager. We're setting notification to never to avoid getting email messages about the completion of our job, and redirecting the stdout/err of the job back to the submission computer.
(Feel free to use your favorite editor, but we will demonstrate with 'cat' in the example below. When using cat to create files, press Ctrl-D to close the file -- don't actually type "Ctrl-D" into the file. Whenever you create a file using cat, we suggest you use cat to display the file and confirm that it contains the expected text.)
Create the submit file, then verify that it was entered correctly:
$ cat > myjob.submit
executable=myscript.sh
arguments=TestJob 10
output=results.output
error=results.error
log=results.log
notification=never
universe=globus
globusscheduler=pc-26.gs.unina.it:/jobmanager-fork
queue
Ctrl-D
$ cat myjob.submit
executable=myscript.sh
arguments=TestJob 10
output=results.output
error=results.error
log=results.log
notification=never
universe=globus
globusscheduler=pc-26.gs.unina.it:/jobmanager-fork
queue
Create a little program to run on the grid.
$ cat > myscript.sh
#! /bin/sh
echo "I'm process id $$ on" `hostname`
echo "This is sent to standard error" 1>&2
date
echo "Running as binary $0" "$@"
echo "My name (argument 1) is $1"
echo "My sleep duration (argument 2) is $2"
sleep $2
echo "Sleep of $2 seconds finished. Exiting"
echo "RESULT: 0 SUCCESS"
Ctrl-D
$ cat myscript.sh
#! /bin/sh
echo "I'm process id $$ on" `hostname`
echo "This is sent to standard error" 1>&2
date
echo "Running as binary $0" "$@"
echo "My name (argument 1) is $1"
echo "My sleep duration (argument 2) is $2"
sleep $2
echo "Sleep of $2 seconds finished. Exiting"
echo "RESULT: 0 SUCCESS"
Make the program executable and test it.
$ chmod a+x myscript.sh
$ ./myscript.sh TEST 1
I'm process id 3428 on puffin.cs.wisc.edu
This is sent to standard error
Thu Jul 10 12:21:11 CDT 2003
Running as binary ./myscript.sh TEST 1
My name (argument 1) is TEST
My sleep duration (argument 2) is 1
Sleep of 1 seconds finished. Exiting
RESULT: 0 SUCCESS
Submit your test job to Condor-G.
$ condor_submit myjob.submit
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 1.
Occasionally run condor_q to watch the progress of your job. You may also want to occasionally run "condor_q -globus" which presents Globus specific status information. (Additional documentation on condor_q)
$ condor_q

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   adesmet         7/10 17:28   0+00:00:00 I  0   0.0  myscript.sh TestJo

1 jobs; 1 idle, 0 running, 0 held

$ condor_q -globus

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER          STATUS      MANAGER HOST               EXECUTABLE
   1.0   adesmet       UNSUBMITTED fork    pc-26.gs.unina.it  /tmp/username-cond

$ condor_q

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   adesmet         7/10 17:28   0+00:00:27 R  0   0.0  myscript.sh TestJo

1 jobs; 0 idle, 1 running, 0 held

$ condor_q -globus

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER          STATUS      MANAGER HOST               EXECUTABLE
   1.0   adesmet       ACTIVE      fork    pc-26.gs.unina.it  /tmp/username-cond

$ condor_q

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:33785> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   adesmet         7/10 17:28   0+00:00:40 C  0   0.0  myscript.sh

0 jobs; 0 idle, 0 running, 0 held

$ condor_q -globus

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:33785> : puffin.cs.wisc.edu
 ID      OWNER          STATUS      MANAGER HOST               EXECUTABLE
   1.0   adesmet       DONE        fork    pc-26.gs.unina.it  /afs/cs.wisc.edu/u

$ condor_q

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:33785> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held
In another window you can run "tail -f" to watch the log file for your job and monitor its progress. For the remainder of this tutorial, we suggest you re-run this command whenever you submit one or more jobs. This will allow you to monitor how typical Condor-G jobs progress. Use "Ctrl-C" to stop watching the file.
In a second window:
$ cd /tmp/username-condor-g-dagman-tutorial/submit
$ tail -f --lines=500 results.log
000 (001.000.000) 07/10 17:28:48 Job submitted from host: <128.105.185.14:35688>
...
017 (001.000.000) 07/10 17:29:01 Job submitted to Globus
    RM-Contact: pc-26.gs.unina.it:/jobmanager-fork
    JM-Contact: https://pc-26.gs.unina.it:2321/696/1057876132/
    Can-Restart-JM: 1
...
001 (001.000.000) 07/10 17:29:01 Job executing on host: pc-26.gs.unina.it
...
005 (001.000.000) 07/10 17:30:08 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...
When the job is no longer listed in condor_q or when the log file reports "Job terminated," you can see the results in condor_history.
$ condor_history
 ID      OWNER          SUBMITTED     RUN_TIME ST COMPLETED   CMD
   1.0   adesmet       7/10 10:28   0+00:00:00 C  ???        /afs/cs.wisc.ed
When the job completes, verify that the output is as expected. (The binary name differs from what you created because of how Globus and Condor-G cooperate to stage your file to the execute computer.)
$ ls
myjob.submit  myscript.sh*  results.error  results.log  results.output
$ cat results.error
This is sent to standard error
$ cat results.output
I'm process id 733 on pc-26
Thu Jul 10 17:28:57 CDT 2003
Running as binary /n/uscms_share/home/adesmet2/.globus/.gass_cache/local/md5/28/fcae5001dbcd99cc476984b4151284/md5/af/355c4959dc83a74b18b7c03eb27201/data TestJob 10
My name (argument 1) is TestJob
My sleep duration (argument 2) is 10
Sleep of 10 seconds finished. Exiting
RESULT: 0 SUCCESS
If you didn't watch the results.log file with tail -f above, you will want to examine the information logged now:
$ cat results.log
Clean up the results:
$ rm results.*
When a problem occurs in the middleware, Condor-G will place your job on "Hold". Held jobs remain in the queue, but they wait for user intervention. When you resolve the problem, you can use condor_release to free the job to continue.
You can also place jobs on hold yourself using condor_hold, perhaps if you want to delay a run.
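For example, a quick sketch of holding and then releasing a single job by its ID (the ID 2.0 here is illustrative):

$ condor_hold 2.0       # place cluster 2, process 0 on hold
$ condor_release 2.0    # let it continue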
For this example, we'll make the output file non-writable. The job will be unable to copy the results back and will be placed on hold.
Submit the job again, but this time immediately after submitting it, mark the output file as read-only:
$ condor_submit myjob.submit ; chmod a-w results.output
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 2.
Watch the job with tail. When the job goes on hold, use Ctrl-C to exit tail. Note that condor_q reports that the job is in the "H" or Held state.
$ tail -f --lines=500 results.log
000 (003.000.000) 07/12 22:35:44 Job submitted from host: <192.167.1.23:32864>
...
017 (003.000.000) 07/12 22:35:57 Job submitted to Globus
    RM-Contact: pc-26.gs.unina.it:/jobmanager-fork
    JM-Contact: https://pc-26.gs.unina.it:33178/12497/1058042148/
    Can-Restart-JM: 1
...
001 (003.000.000) 07/12 22:35:57 Job executing on host: pc-26.gs.unina.it
...
012 (003.000.000) 07/12 22:36:52 Job was held.
    Globus error 129: the standard output/error size is different
...
Ctrl-C
$ condor_q

-- Submitter: pc-23.gs.unina.it : <192.167.1.23:32864> : pc-23.gs.unina.it
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   2.0   adesmet         7/12 22:35   0+00:00:55 H  0   0.0  myscript.sh TestJo

1 jobs; 0 idle, 0 running, 1 held
Fix the problem (make the file writable again), then release the job. You can specify the job's ID, or just use "-all" to release all held jobs.
$ chmod u+w results.output
$ condor_release -all
All jobs released.
Again, watch the log until the job finishes:
$ tail -f --lines=500 results.log
000 (003.000.000) 07/12 22:35:44 Job submitted from host: <192.167.1.23:32864>
...
017 (003.000.000) 07/12 22:35:57 Job submitted to Globus
    RM-Contact: pc-26.gs.unina.it:/jobmanager-fork
    JM-Contact: https://pc-26.gs.unina.it:33178/12497/1058042148/
    Can-Restart-JM: 1
...
001 (003.000.000) 07/12 22:35:57 Job executing on host: pc-26.gs.unina.it
...
012 (003.000.000) 07/12 22:36:52 Job was held.
    Globus error 129: the standard output/error size is different
...
013 (003.000.000) 07/12 22:44:33 Job was released.
    via condor_release (by user Todd)
...
001 (003.000.000) 07/12 22:44:46 Job executing on host: pc-26.gs.unina.it
...
005 (003.000.000) 07/12 22:44:51 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...
Ctrl-C
Your job finished, and the results were retrieved successfully:
$ cat results.output
I'm process id 12528 on pc-26.gs.unina.it
Sat Jul 12 22:35:53 CEST 2003
Running as binary /home/home45/Aland/.globus/.gass_cache/local/md5/6d/217f3f7926c06a529143f6129bf269/md5/a7/2af94ba728c69c588e523a99baaefd/data TestJob 10
My name (argument 1) is TestJob
My sleep duration (argument 2) is 10
Sleep of 10 seconds finished. Exiting
RESULT: 0 SUCCESS
Before continuing, clean up the results:
$ rm results.*
Since it will be handy for the rest of the tutorial, create a little shell script to monitor the Condor-G queue:
$ cat > watch_condor_q
#! /bin/sh
while true; do
  condor_q
  condor_q -globus
  sleep 10
done
Ctrl-D
$ cat watch_condor_q
#! /bin/sh
while true; do
  condor_q
  condor_q -globus
  sleep 10
done
$ chmod a+x watch_condor_q
Create a minimal DAG for DAGMan. This DAG will have a single node.
$ cat > mydag.dag
Job HelloWorld myjob.submit
Ctrl-D
$ cat mydag.dag
Job HelloWorld myjob.submit
Submit it with condor_submit_dag, then watch the run. Notice that condor_dagman is running as a job and that condor_dagman submits your real job without your direct intervention. You might happen to catch the "C" (completed) state as your job finishes, but that often goes by too quickly to notice.
Again, you may want to run "tail -f --lines=500 results.log" in a second window to watch the job log file as your job runs. You might also want to watch DAGMan's log file (mydag.dag.dagman.out) in the same way in a third window, with "tail -f --lines=500 mydag.dag.dagman.out". For the remainder of this tutorial, we suggest you re-run these commands when you submit a DAG. This will allow you to see how typical DAGs progress. Use "Ctrl-C" to stop watching the files.
Third window:
$ cd /tmp/username-condor-g-dagman-tutorial/submit
$ tail -f --lines=500 mydag.dag.dagman.out
7/10 10:36:43 ******************************************************
7/10 10:36:43 ** condor_scheduniv_exec.6.0 (CONDOR_DAGMAN) STARTING UP
7/10 10:36:43 ** $CondorVersion: 6.5.1 Apr 22 2003 $
7/10 10:36:43 ** $CondorPlatform: INTEL-LINUX-GLIBC22 $
7/10 10:36:43 ** PID = 26844
7/10 10:36:43 ******************************************************
7/10 10:36:44 DaemonCore: Command Socket at <128.105.185.14:34571>
7/10 10:36:44 argv[0] == "condor_scheduniv_exec.6.0"
7/10 10:36:44 argv[1] == "-Debug"
7/10 10:36:44 argv[2] == "3"
7/10 10:36:44 argv[3] == "-Lockfile"
7/10 10:36:44 argv[4] == "mydag.dag.lock"
7/10 10:36:44 argv[5] == "-Condorlog"
7/10 10:36:44 argv[6] == "results.log"
7/10 10:36:44 argv[7] == "-Dag"
7/10 10:36:44 argv[8] == "mydag.dag"
7/10 10:36:44 argv[9] == "-Rescue"
7/10 10:36:44 argv[10] == "mydag.dag.rescue"
7/10 10:36:44 Condor log will be written to results.log
7/10 10:36:44 DAG Lockfile will be written to mydag.dag.lock
7/10 10:36:44 DAG Input file is mydag.dag
7/10 10:36:44 Rescue DAG will be written to mydag.dag.rescue
7/10 10:36:44 Parsing mydag.dag ...
7/10 10:36:44 Dag contains 1 total jobs
7/10 10:36:44 Bootstrapping...
7/10 10:36:44 Number of pre-completed jobs: 0
7/10 10:36:44 Submitting Job HelloWorld ...
7/10 10:36:44 assigned Condor ID (7.0.0)
7/10 10:36:45 Event: ULOG_SUBMIT for Job HelloWorld (7.0.0)
7/10 10:36:45 0/1 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 10:37:05 Event: ULOG_GLOBUS_SUBMIT for Job HelloWorld (7.0.0)
7/10 10:37:05 Event: ULOG_EXECUTE for Job HelloWorld (7.0.0)
7/10 10:38:10 Event: ULOG_JOB_TERMINATED for Job HelloWorld (7.0.0)
7/10 10:38:10 Job HelloWorld completed successfully.
7/10 10:38:10 1/1 done, 0 failed, 0 submitted, 0 ready, 0 pre, 0 post
7/10 10:38:10 All jobs Completed!
7/10 10:38:10 **** condor_scheduniv_exec.6.0 (condor_DAGMAN) EXITING WITH STATUS 0
First window:
$ condor_submit_dag mydag.dag

Checking your DAG input file and all submit files it references.
This might take a while...
Done.

-----------------------------------------------------------------------
File for submitting this DAG to Condor   : mydag.dag.condor.sub
Log of DAGMan debugging messages         : mydag.dag.dagman.out
Log of Condor library debug messages     : mydag.dag.lib.out
Log of the life of condor_dagman itself  : mydag.dag.dagman.log
Condor Log file for all jobs of this DAG : results.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 2.
-----------------------------------------------------------------------

$ ./watch_condor_q

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   2.0   adesmet         7/10 17:33   0+00:00:03 R  0   2.6  condor_dagman -f -
   3.0   adesmet         7/10 17:33   0+00:00:00 I  0   0.0  myscript.sh TestJo

2 jobs; 1 idle, 1 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER          STATUS      MANAGER HOST               EXECUTABLE
   3.0   adesmet       UNSUBMITTED fork    pc-26.gs.unina.it  /tmp/username-cond

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   2.0   adesmet         7/10 17:33   0+00:00:33 R  0   2.6  condor_dagman -f -
   3.0   adesmet         7/10 17:33   0+00:00:15 R  0   0.0  myscript.sh TestJo

2 jobs; 0 idle, 2 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER          STATUS      MANAGER HOST               EXECUTABLE
   3.0   adesmet       ACTIVE      fork    pc-26.gs.unina.it  /tmp/username-cond

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   2.0   adesmet         7/10 17:33   0+00:01:03 R  0   2.6  condor_dagman -f -
   3.0   adesmet         7/10 17:33   0+00:00:45 R  0   0.0  myscript.sh TestJo

2 jobs; 0 idle, 2 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER          STATUS      MANAGER HOST               EXECUTABLE
   3.0   adesmet       ACTIVE      fork    pc-26.gs.unina.it  /tmp/username-cond

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER          STATUS      MANAGER HOST               EXECUTABLE

Ctrl-C
Verify your results:
$ ls -l
total 12
-rw-r--r--    1 adesmet  adesmet        28 Jul 10 10:35 mydag.dag
-rw-r--r--    1 adesmet  adesmet       523 Jul 10 10:36 mydag.dag.condor.sub
-rw-r--r--    1 adesmet  adesmet       608 Jul 10 10:38 mydag.dag.dagman.log
-rw-r--r--    1 adesmet  adesmet      1860 Jul 10 10:38 mydag.dag.dagman.out
-rw-r--r--    1 adesmet  adesmet        29 Jul 10 10:38 mydag.dag.lib.out
-rw-------    1 adesmet  adesmet         0 Jul 10 10:36 mydag.dag.lock
-rw-r--r--    1 adesmet  adesmet       175 Jul  9 18:13 myjob.submit
-rwxr-xr-x    1 adesmet  adesmet       194 Jul 10 10:36 myscript.sh
-rw-r--r--    1 adesmet  adesmet        31 Jul 10 10:37 results.error
-rw-------    1 adesmet  adesmet       833 Jul 10 10:38 results.log
-rw-r--r--    1 adesmet  adesmet       261 Jul 10 10:37 results.output
-rwxr-xr-x    1 adesmet  adesmet        81 Jul 10 10:35 watch_condor_q
$ cat results.error
This is sent to standard error
$ cat results.output
I'm process id 29149 on pc-26
This is sent to standard error
Thu Jul 10 10:38:44 CDT 2003
Running as binary /n/uscms_share/home/adesmet2/.globus/.gass_cache/local/md5/aa/ceb9e04077256aaa2acf4dff670897/md5/27/2f50da149fc049d07b1c27f30b67df/data TEST 1
My name (argument 1) is TEST
My sleep duration (argument 2) is 1
Sleep of 1 seconds finished. Exiting
RESULT: 0 SUCCESS
Looking at DAGMan's various files, we see that DAGMan itself ran as a Condor job (specifically, a scheduler universe job):
$ ls
mydag.dag             mydag.dag.dagman.log  mydag.dag.lib.out  myjob.submit  results.error  results.output
mydag.dag.condor.sub  mydag.dag.dagman.out  mydag.dag.lock     myscript.sh   results.log    watch_condor_q
$ cat mydag.dag.condor.sub
# Filename: mydag.dag.condor.sub
# Generated by condor_submit_dag mydag.dag
universe        = scheduler
executable      = /afs/cs.wisc.edu/u/a/d/adesmet/miron-condor-g-dagman-talk/vdt/condor/bin/condor_dagman
getenv          = True
output          = mydag.dag.lib.out
error           = mydag.dag.lib.out
log             = mydag.dag.dagman.log
remove_kill_sig = SIGUSR1
arguments       = -f -l . -Debug 3 -Lockfile mydag.dag.lock -Condorlog results.log -Dag mydag.dag -Rescue mydag.dag.rescue
environment     = _CONDOR_DAGMAN_LOG=mydag.dag.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0
queue
$ cat mydag.dag.dagman.log
000 (006.000.000) 07/10 10:36:43 Job submitted from host: <128.105.185.14:33785>
...
001 (006.000.000) 07/10 10:36:44 Job executing on host: <128.105.185.14:33785>
...
005 (006.000.000) 07/10 10:38:10 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...
If you weren't watching the DAGMan output file with tail -f, you can examine the file with the following command:
$ cat mydag.dag.dagman.out
Clean up your results. Be careful when deleting: you want to delete the mydag.dag.* files, but not the mydag.dag file itself.
$ rm mydag.dag.* results.*
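If you're nervous about the wildcard, rm's standard -i option asks before each deletion:

$ rm -i mydag.dag.* results.*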
Typically each node in a DAG will have its own Condor submit file. Create some more submit files by copying our existing file. For simplicity during this tutorial, we'll keep the submit files very similar, notably using the same executable, but your submit files and executables can differ in real-world use.
$ cp myjob.submit job.setup.submit
$ cp myjob.submit job.work1.submit
$ cp myjob.submit job.work2.submit
$ cp myjob.submit job.workfinal.submit
$ cp myjob.submit job.finalize.submit
Edit the various submit files. Change the output and error entries to point to results.NODE.output and results.NODE.error files, where NODE is the middle word in the submit file's name (job.NODE.submit). So job.finalize.submit would include:
output=results.finalize.output
error=results.finalize.error
Here is one possible set of settings for the output entries:
$ grep '^output=' job.*.submit
job.finalize.submit:output=results.finalize.output
job.setup.submit:output=results.setup.output
job.work1.submit:output=results.work1.output
job.work2.submit:output=results.work2.output
job.workfinal.submit:output=results.workfinal.output
This is important so that the various nodes don't overwrite each other's output.
Leave the log entries alone. DAGMan requires that all nodes write their logs to the same location. Condor will ensure that the different jobs do not overwrite each other's entries in the log. (Newer versions of DAGMan lift this requirement and allow each job to use its own log file, but you may want to use one common log file anyway: it's convenient to have all of your job status information in a single place.)
log=results.log
Also change the arguments entries so that the first argument is something unique to each node (perhaps the NODE name).
For node work2, change the second argument to 120 so that it looks something like:
arguments=MyWorkerNode2 120
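If you would rather not edit five files by hand, a loop like the following performs the same edits (a sketch, assuming GNU sed's -i option; double-check the files with grep afterward):

$ for node in setup work1 work2 workfinal finalize; do
>   sed -i -e "s/^output=.*/output=results.$node.output/" \
>          -e "s/^error=.*/error=results.$node.error/" \
>          -e "s/^arguments=.*/arguments=$node 10/" job.$node.submit
> done
$ sed -i 's/^arguments=.*/arguments=MyWorkerNode2 120/' job.work2.submit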
Add the new nodes to your DAG:
$ cat mydag.dag
Job HelloWorld myjob.submit
$ cat >> mydag.dag
Job Setup job.setup.submit
Job WorkerNode_1 job.work1.submit
Job WorkerNode_Two job.work2.submit
Job CollectResults job.workfinal.submit
Job LastNode job.finalize.submit
PARENT Setup CHILD WorkerNode_1 WorkerNode_Two
PARENT WorkerNode_1 WorkerNode_Two CHILD CollectResults
PARENT CollectResults CHILD LastNode
Ctrl-D
$ cat mydag.dag
Job HelloWorld myjob.submit
Job Setup job.setup.submit
Job WorkerNode_1 job.work1.submit
Job WorkerNode_Two job.work2.submit
Job CollectResults job.workfinal.submit
Job LastNode job.finalize.submit
PARENT Setup CHILD WorkerNode_1 WorkerNode_Two
PARENT WorkerNode_1 WorkerNode_Two CHILD CollectResults
PARENT CollectResults CHILD LastNode
condor_q -dag will organize jobs into their associated DAGs. Change watch_condor_q to use this:
$ rm watch_condor_q
$ cat > watch_condor_q
#! /bin/sh
while true; do
  echo ....
  echo .... Output from condor_q
  echo ....
  condor_q
  echo ....
  echo .... Output from condor_q -globus
  echo ....
  condor_q -globus
  echo ....
  echo .... Output from condor_q -dag
  echo ....
  condor_q -dag
  sleep 10
done
Ctrl-D
$ cat watch_condor_q
#! /bin/sh
while true; do
  echo ....
  echo .... Output from condor_q
  echo ....
  condor_q
  echo ....
  echo .... Output from condor_q -globus
  echo ....
  condor_q -globus
  echo ....
  echo .... Output from condor_q -dag
  echo ....
  condor_q -dag
  sleep 10
done
$ chmod a+x watch_condor_q
Submit your new DAG and monitor it.
Again, in separate windows you may want to run "tail -f --lines=500 results.log" and "tail -f --lines=500 mydag.dag.dagman.out" to monitor the job's progress.
$ condor_submit_dag mydag.dag

Checking your DAG input file and all submit files it references.
This might take a while...
Done.

-----------------------------------------------------------------------
File for submitting this DAG to Condor   : mydag.dag.condor.sub
Log of DAGMan debugging messages         : mydag.dag.dagman.out
Log of Condor library debug messages     : mydag.dag.lib.out
Log of the life of condor_dagman itself  : mydag.dag.dagman.log
Condor Log file for all jobs of this DAG : results.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 8.
-----------------------------------------------------------------------

$ ./watch_condor_q

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   4.0   adesmet         7/10 17:45   0+00:00:08 R  0   2.6  condor_dagman -f -
   5.0   adesmet         7/10 17:45   0+00:00:00 I  0   0.0  myscript.sh TestJo
   6.0   adesmet         7/10 17:45   0+00:00:00 I  0   0.0  myscript.sh Setup

3 jobs; 2 idle, 1 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER          STATUS      MANAGER HOST               EXECUTABLE
   5.0   adesmet       UNSUBMITTED fork    pc-26.gs.unina.it  /tmp/username-cond
   6.0   adesmet       UNSUBMITTED fork    pc-26.gs.unina.it  /tmp/username-cond

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   4.0   adesmet         7/10 17:45   0+00:00:08 R  0   2.6  condor_dagman -f -
   5.0    |-HelloWorld   7/10 17:45   0+00:00:00 I  0   0.0  myscript.sh TestJo
   6.0    |-Setup        7/10 17:45   0+00:00:00 I  0   0.0  myscript.sh Setup

3 jobs; 2 idle, 1 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   4.0   adesmet         7/10 17:45   0+00:00:12 R  0   2.6  condor_dagman -f -
   5.0   adesmet         7/10 17:45   0+00:00:00 I  0   0.0  myscript.sh TestJo
   6.0   adesmet         7/10 17:45   0+00:00:00 I  0   0.0  myscript.sh Setup

3 jobs; 2 idle, 1 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER          STATUS      MANAGER HOST               EXECUTABLE
   5.0   adesmet       UNSUBMITTED fork    pc-26.gs.unina.it  /tmp/username-cond
   6.0   adesmet       UNSUBMITTED fork    pc-26.gs.unina.it  /tmp/username-cond

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   4.0   adesmet         7/10 17:45   0+00:00:12 R  0   2.6  condor_dagman -f -
   5.0    |-HelloWorld   7/10 17:45   0+00:00:00 I  0   0.0  myscript.sh TestJo
   6.0    |-Setup        7/10 17:45   0+00:00:00 I  0   0.0  myscript.sh Setup

3 jobs; 2 idle, 1 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   4.0   adesmet         7/10 17:45   0+00:00:42 R  0   2.6  condor_dagman -f -
   5.0   adesmet         7/10 17:45   0+00:00:24 R  0   0.0  myscript.sh TestJo
   6.0   adesmet         7/10 17:45   0+00:00:24 R  0   0.0  myscript.sh Setup

3 jobs; 0 idle, 3 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER          STATUS      MANAGER HOST               EXECUTABLE
   5.0   adesmet       ACTIVE      fork    pc-26.gs.unina.it  /tmp/username-cond
   6.0   adesmet       ACTIVE      fork    pc-26.gs.unina.it  /tmp/username-cond

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   4.0   adesmet         7/10 17:45   0+00:00:42 R  0   2.6  condor_dagman -f -
   5.0    |-HelloWorld   7/10 17:45   0+00:00:24 R  0   0.0  myscript.sh TestJo
   6.0    |-Setup        7/10 17:45   0+00:00:24 R  0   0.0  myscript.sh Setup

3 jobs; 0 idle, 3 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   4.0   adesmet         7/10 17:45   0+00:01:12 R  0   2.6  condor_dagman -f -
   5.0   adesmet         7/10 17:45   0+00:00:54 R  0   0.0  myscript.sh TestJo
   6.0   adesmet         7/10 17:45   0+00:00:54 R  0   0.0  myscript.sh Setup

3 jobs; 0 idle, 3 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER          STATUS      MANAGER HOST               EXECUTABLE
   5.0   adesmet       ACTIVE      fork    pc-26.gs.unina.it  /tmp/username-cond
   6.0   adesmet       ACTIVE      fork    pc-26.gs.unina.it  /tmp/username-cond

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   4.0   adesmet         7/10 17:45   0+00:01:12 R  0   2.6  condor_dagman -f -
   5.0    |-HelloWorld   7/10 17:45   0+00:00:54 R  0   0.0  myscript.sh TestJo
   6.0    |-Setup        7/10 17:45   0+00:00:54 R  0   0.0  myscript.sh Setup

3 jobs; 0 idle, 3 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   4.0   adesmet         7/10 17:45   0+00:01:42 R  0   2.6  condor_dagman -f -
   7.0   adesmet         7/10 17:46   0+00:00:00 I  0   0.0  myscript.sh work1
   8.0   adesmet         7/10 17:46   0+00:00:00 I  0   0.0  myscript.sh Worker

3 jobs; 2 idle, 1 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER          STATUS      MANAGER HOST               EXECUTABLE
   7.0   adesmet       UNSUBMITTED fork    pc-26.gs.unina.it  /tmp/username-cond
   8.0   adesmet       UNSUBMITTED fork    pc-26.gs.unina.it  /tmp/username-cond

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   4.0   adesmet         7/10 17:45   0+00:01:42 R  0   2.6  condor_dagman -f -
   7.0    |-WorkerNode_  7/10 17:46   0+00:00:00 I  0   0.0  myscript.sh work1
   8.0    |-WorkerNode_  7/10 17:46   0+00:00:00 I  0   0.0  myscript.sh Worker

3 jobs; 2 idle, 1 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   4.0   adesmet         7/10 17:45   0+00:02:12 R  0   2.6  condor_dagman -f -
   7.0   adesmet         7/10 17:46   0+00:00:27 R  0   0.0  myscript.sh work1
   8.0   adesmet         7/10 17:46   0+00:00:27 R  0   0.0  myscript.sh Worker

3 jobs; 0 idle, 3 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER          STATUS      MANAGER HOST               EXECUTABLE
   7.0   adesmet       ACTIVE      fork    pc-26.gs.unina.it  /tmp/username-cond
   8.0   adesmet       ACTIVE      fork    pc-26.gs.unina.it  /tmp/username-cond

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   4.0   adesmet         7/10 17:45   0+00:02:12 R  0   2.6  condor_dagman -f -
   7.0    |-WorkerNode_  7/10 17:46   0+00:00:27 R  0   0.0  myscript.sh work1
   8.0    |-WorkerNode_  7/10 17:46   0+00:00:27 R  0   0.0  myscript.sh Worker

3 jobs; 0 idle, 3 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   4.0   adesmet         7/10 17:45   0+00:02:42 R  0   2.6  condor_dagman -f -
   7.0   adesmet         7/10 17:46   0+00:00:57 R  0   0.0  myscript.sh work1
   8.0   adesmet         7/10 17:46   0+00:00:57 R  0   0.0  myscript.sh Worker

3 jobs; 0 idle, 3 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER          STATUS      MANAGER HOST               EXECUTABLE
   7.0   adesmet       ACTIVE      fork    pc-26.gs.unina.it  /tmp/username-cond
   8.0   adesmet       ACTIVE      fork    pc-26.gs.unina.it  /tmp/username-cond

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   4.0   adesmet         7/10 17:45   0+00:02:43 R  0   2.6  condor_dagman -f -
   7.0    |-WorkerNode_  7/10 17:46   0+00:00:58 R  0   0.0  myscript.sh work1
   8.0    |-WorkerNode_  7/10 17:46   0+00:00:58 R  0   0.0  myscript.sh Worker

3 jobs; 0 idle, 3 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   4.0   adesmet         7/10 17:45   0+00:03:13 R  0   2.6  condor_dagman -f -
   8.0   adesmet         7/10 17:46   0+00:01:28 R  0   0.0  myscript.sh Worker

2 jobs; 0 idle, 2 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER          STATUS      MANAGER HOST               EXECUTABLE
   8.0   adesmet       ACTIVE      fork    pc-26.gs.unina.it  /tmp/username-cond

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   4.0   adesmet         7/10 17:45   0+00:03:13 R  0   2.6  condor_dagman -f -
   8.0    |-WorkerNode_  7/10 17:46   0+00:01:28 R  0   0.0  myscript.sh Worker

2 jobs; 0 idle, 2 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   4.0   adesmet         7/10 17:45   0+00:03:43 R  0   2.6  condor_dagman -f -
   8.0   adesmet         7/10 17:46   0+00:01:58 R  0   0.0  myscript.sh Worker

2 jobs; 0 idle, 2 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER          STATUS      MANAGER HOST               EXECUTABLE
   8.0   adesmet       ACTIVE      fork    pc-26.gs.unina.it  /tmp/username-cond

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   4.0   adesmet         7/10 17:45   0+00:03:43 R  0   2.6  condor_dagman -f -
   8.0    |-WorkerNode_  7/10 17:46   0+00:01:58 R  0   0.0  myscript.sh Worker

2 jobs; 0 idle, 2 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   4.0   adesmet         7/10 17:45   0+00:04:13 R  0   2.6  condor_dagman -f -
   9.0   adesmet         7/10 17:49   0+00:00:02 R  0   0.0  myscript.sh workfi

2 jobs; 0 idle, 2 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER          STATUS      MANAGER HOST               EXECUTABLE
   9.0   adesmet       ACTIVE      fork    pc-26.gs.unina.it  /tmp/username-cond

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   4.0   adesmet         7/10 17:45   0+00:04:13 R  0   2.6  condor_dagman -f -
   9.0    |-CollectResu  7/10 17:49   0+00:00:02 R  0   0.0  myscript.sh workfi

2 jobs; 0 idle, 2 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   4.0   adesmet         7/10 17:45   0+00:04:43 R  0   2.6  condor_dagman -f -
   9.0   adesmet         7/10 17:49   0+00:00:32 R  0   0.0  myscript.sh workfi

2 jobs; 0 idle, 2 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER          STATUS      MANAGER HOST               EXECUTABLE
   9.0   adesmet       ACTIVE      fork    pc-26.gs.unina.it  /tmp/username-cond

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   4.0   adesmet         7/10 17:45   0+00:04:43 R  0   2.6  condor_dagman -f -
   9.0    |-CollectResu  7/10 17:49   0+00:00:32 R  0   0.0  myscript.sh workfi

2 jobs; 0 idle, 2 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   4.0   adesmet         7/10 17:45   0+00:05:13 R  0   2.6  condor_dagman -f -
   9.0   adesmet         7/10 17:49   0+00:01:02 R  0   0.0  myscript.sh workfi

2 jobs; 0 idle, 2 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER          STATUS      MANAGER HOST               EXECUTABLE
   9.0   adesmet       DONE        fork    pc-26.gs.unina.it  /tmp/username-cond

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   4.0   adesmet         7/10 17:45   0+00:05:13 R  0   2.6  condor_dagman -f -
   9.0    |-CollectResu  7/10 17:49   0+00:01:02 C  0   0.0  myscript.sh workfi

1 jobs; 0 idle, 1 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   4.0   adesmet         7/10 17:45   0+00:05:43 R  0   2.6  condor_dagman -f -
  10.0   adesmet         7/10 17:50   0+00:00:13 R  0   0.0  myscript.sh Final

2 jobs; 0 idle, 2 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER          STATUS      MANAGER HOST               EXECUTABLE
  10.0   adesmet       ACTIVE      fork    pc-26.gs.unina.it  /tmp/username-cond

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   4.0   adesmet         7/10 17:45   0+00:05:44 R  0   2.6  condor_dagman -f -
  10.0    |-LastNode     7/10 17:50   0+00:00:13 R  0   0.0  myscript.sh Final

2 jobs; 0 idle, 2 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   4.0   adesmet         7/10 17:45   0+00:06:14 R  0   2.6  condor_dagman -f -
  10.0   adesmet         7/10 17:50   0+00:00:43 R  0   0.0  myscript.sh Final

2 jobs; 0 idle, 2 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER          STATUS      MANAGER HOST               EXECUTABLE
  10.0   adesmet       ACTIVE      fork    pc-26.gs.unina.it  /tmp/username-cond

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   4.0   adesmet         7/10 17:45   0+00:06:14 R  0   2.6  condor_dagman -f -
  10.0    |-LastNode     7/10 17:50   0+00:00:43 R  0   0.0  myscript.sh Final

2 jobs; 0 idle, 2 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER          STATUS      MANAGER HOST               EXECUTABLE

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:35688> : puffin.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held
Ctrl-C
Watching the logs or the condor_q output, you'll note that the CollectResults node ("workfinal") wasn't run until both of the WorkerNode nodes ("work1" and "work2") finished.
Examine your results.
$ ls
job.finalize.submit   mydag.dag.condor.sub  myscript.sh              results.setup.error    results.workfinal.error
job.setup.submit      mydag.dag.dagman.log  results.error            results.setup.output   results.workfinal.output
job.work1.submit      mydag.dag.dagman.out  results.finalize.error   results.work1.error    watch_condor_q
job.work2.submit      mydag.dag.lib.out     results.finalize.output  results.work1.output
job.workfinal.submit  mydag.dag.lock        results.log              results.work2.error
mydag.dag             myjob.submit          results.output           results.work2.output
$ tail --lines=500 results.*.error
==> results.finalize.error <==
This is sent to standard error

==> results.setup.error <==
This is sent to standard error

==> results.work1.error <==
This is sent to standard error

==> results.work2.error <==
This is sent to standard error

==> results.workfinal.error <==
This is sent to standard error
$ tail --lines=500 results.*.output
==> results.finalize.output <==
I'm process id 29614 on pc-26
Thu Jul 10 10:53:58 CDT 2003
Running as binary /n/uscms_share/home/adesmet2/.globus/.gass_cache/local/md5/0d/7c60aa10b34817d3ffe467dd116816/md5/de/03c3eb8a20852948a2af53438bbce1/data Finalize 1
My name (argument 1) is Finalize
My sleep duration (argument 2) is 1
Sleep of 1 seconds finished. Exiting

==> results.setup.output <==
I'm process id 29337 on pc-26
Thu Jul 10 10:50:31 CDT 2003
Running as binary /n/uscms_share/home/adesmet2/.globus/.gass_cache/local/md5/a5/fab7b658db65dbfec3ecf0a5414e1c/md5/f4/e9a04ae03bff43f00a10c78ebd60fd/data Setup 1
My name (argument 1) is Setup
My sleep duration (argument 2) is 1
Sleep of 1 seconds finished. Exiting

==> results.work1.output <==
I'm process id 29444 on pc-26
Thu Jul 10 10:51:04 CDT 2003
Running as binary /n/uscms_share/home/adesmet2/.globus/.gass_cache/local/md5/2e/17db42df4e113f813cea7add42e03e/md5/f6/f1bd82a2fec9a3a372a44c009a63ca/data WorkerNode1 1
My name (argument 1) is WorkerNode1
My sleep duration (argument 2) is 1
Sleep of 1 seconds finished. Exiting

==> results.work2.output <==
I'm process id 29432 on pc-26
Thu Jul 10 10:51:03 CDT 2003
Running as binary /n/uscms_share/home/adesmet2/.globus/.gass_cache/local/md5/ea/9a3c8d16346b2fea808cda4b5969fa/md5/f6/f1bd82a2fec9a3a372a44c009a63ca/data WorkerNode2 120
My name (argument 1) is WorkerNode2
My sleep duration (argument 2) is 120
Sleep of 120 seconds finished. Exiting

==> results.workfinal.output <==
I'm process id 29554 on pc-26
Thu Jul 10 10:53:27 CDT 2003
Running as binary /n/uscms_share/home/adesmet2/.globus/.gass_cache/local/md5/c9/7ba5d43acad3d9ebdfa633839e75c3/md5/11/cd84efa75305d54100f0f451b46b35/data WorkFinal 1
My name (argument 1) is WorkFinal
My sleep duration (argument 2) is 1
Sleep of 1 seconds finished. Exiting
Examine your log:
$ cat results.log
000 (005.000.000) 07/10 17:45:24 Job submitted from host: <128.105.185.14:35688>
    DAG Node: HelloWorld
...
000 (006.000.000) 07/10 17:45:24 Job submitted from host: <128.105.185.14:35688>
    DAG Node: Setup
...
017 (006.000.000) 07/10 17:45:42 Job submitted to Globus
    RM-Contact: pc-26.gs.unina.it:/jobmanager-fork
    JM-Contact: https://pc-26.gs.unina.it:2349/914/1057877133/
    Can-Restart-JM: 1
...
001 (006.000.000) 07/10 17:45:42 Job executing on host: pc-26.gs.unina.it
...
017 (005.000.000) 07/10 17:45:42 Job submitted to Globus
    RM-Contact: pc-26.gs.unina.it:/jobmanager-fork
    JM-Contact: https://pc-26.gs.unina.it:2348/915/1057877133/
    Can-Restart-JM: 1
...
001 (005.000.000) 07/10 17:45:42 Job executing on host: pc-26.gs.unina.it
...
005 (005.000.000) 07/10 17:46:50 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...
005 (006.000.000) 07/10 17:46:50 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...
000 (007.000.000) 07/10 17:46:55 Job submitted from host: <128.105.185.14:35688>
    DAG Node: WorkerNode_1
...
000 (008.000.000) 07/10 17:46:56 Job submitted from host: <128.105.185.14:35688>
    DAG Node: WorkerNode_Two
...
017 (008.000.000) 07/10 17:47:09 Job submitted to Globus
    RM-Contact: pc-26.gs.unina.it:/jobmanager-fork
    JM-Contact: https://pc-26.gs.unina.it:2364/1037/1057877219/
    Can-Restart-JM: 1
...
001 (008.000.000) 07/10 17:47:09 Job executing on host: pc-26.gs.unina.it
...
017 (007.000.000) 07/10 17:47:09 Job submitted to Globus
    RM-Contact: pc-26.gs.unina.it:/jobmanager-fork
    JM-Contact: https://pc-26.gs.unina.it:2367/1040/1057877220/
    Can-Restart-JM: 1
...
001 (007.000.000) 07/10 17:47:09 Job executing on host: pc-26.gs.unina.it
...
005 (007.000.000) 07/10 17:48:17 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...
005 (008.000.000) 07/10 17:49:18 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...
000 (009.000.000) 07/10 17:49:22 Job submitted from host: <128.105.185.14:35688>
    DAG Node: CollectResults
...
017 (009.000.000) 07/10 17:49:35 Job submitted to Globus
    RM-Contact: pc-26.gs.unina.it:/jobmanager-fork
    JM-Contact: https://pc-26.gs.unina.it:2383/1185/1057877366/
    Can-Restart-JM: 1
...
001 (009.000.000) 07/10 17:49:35 Job executing on host: pc-26.gs.unina.it
...
005 (009.000.000) 07/10 17:50:42 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...
000 (010.000.000) 07/10 17:50:42 Job submitted from host: <128.105.185.14:35688>
    DAG Node: LastNode
...
017 (010.000.000) 07/10 17:50:55 Job submitted to Globus
    RM-Contact: pc-26.gs.unina.it:/jobmanager-fork
    JM-Contact: https://pc-26.gs.unina.it:2392/1247/1057877446/
    Can-Restart-JM: 1
...
001 (010.000.000) 07/10 17:50:55 Job executing on host: pc-26.gs.unina.it
...
005 (010.000.000) 07/10 17:52:02 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...
Examine the DAGMan log:
$ cat mydag.dag.dagman.out
7/10 17:45:24 ******************************************************
7/10 17:45:24 ** condor_scheduniv_exec.4.0 (CONDOR_DAGMAN) STARTING UP
7/10 17:45:24 ** $CondorVersion: 6.5.1 Apr 22 2003 $
7/10 17:45:24 ** $CondorPlatform: INTEL-LINUX-GLIBC22 $
7/10 17:45:24 ** PID = 18826
7/10 17:45:24 ******************************************************
7/10 17:45:24 DaemonCore: Command Socket at <128.105.185.14:35774>
7/10 17:45:24 argv[0] == "condor_scheduniv_exec.4.0"
7/10 17:45:24 argv[1] == "-Debug"
7/10 17:45:24 argv[2] == "3"
7/10 17:45:24 argv[3] == "-Lockfile"
7/10 17:45:24 argv[4] == "mydag.dag.lock"
7/10 17:45:24 argv[5] == "-Condorlog"
7/10 17:45:24 argv[6] == "results.log"
7/10 17:45:24 argv[7] == "-Dag"
7/10 17:45:24 argv[8] == "mydag.dag"
7/10 17:45:24 argv[9] == "-Rescue"
7/10 17:45:24 argv[10] == "mydag.dag.rescue"
7/10 17:45:24 Condor log will be written to results.log
7/10 17:45:24 DAG Lockfile will be written to mydag.dag.lock
7/10 17:45:24 DAG Input file is mydag.dag
7/10 17:45:24 Rescue DAG will be written to mydag.dag.rescue
7/10 17:45:24 Parsing mydag.dag ...
7/10 17:45:24 Dag contains 6 total jobs
7/10 17:45:24 Bootstrapping...
7/10 17:45:24 Number of pre-completed jobs: 0
7/10 17:45:24 Submitting Job HelloWorld ...
7/10 17:45:24 assigned Condor ID (5.0.0)
7/10 17:45:24 Submitting Job Setup ...
7/10 17:45:24 assigned Condor ID (6.0.0)
7/10 17:45:25 Event: ULOG_SUBMIT for Job HelloWorld (5.0.0)
7/10 17:45:25 Event: ULOG_SUBMIT for Job Setup (6.0.0)
7/10 17:45:25 0/6 done, 0 failed, 2 submitted, 0 ready, 0 pre, 0 post
7/10 17:45:45 Event: ULOG_GLOBUS_SUBMIT for Job Setup (6.0.0)
7/10 17:45:45 Event: ULOG_EXECUTE for Job Setup (6.0.0)
7/10 17:45:45 Event: ULOG_GLOBUS_SUBMIT for Job HelloWorld (5.0.0)
7/10 17:45:45 Event: ULOG_EXECUTE for Job HelloWorld (5.0.0)
7/10 17:46:55 Event: ULOG_JOB_TERMINATED for Job HelloWorld (5.0.0)
7/10 17:46:55 Job HelloWorld completed successfully.
7/10 17:46:55 Event: ULOG_JOB_TERMINATED for Job Setup (6.0.0)
7/10 17:46:55 Job Setup completed successfully.
7/10 17:46:55 Submitting Job WorkerNode_1 ...
7/10 17:46:55 assigned Condor ID (7.0.0)
7/10 17:46:55 Submitting Job WorkerNode_Two ...
7/10 17:46:56 assigned Condor ID (8.0.0)
7/10 17:46:56 Event: ULOG_SUBMIT for Job WorkerNode_1 (7.0.0)
7/10 17:46:56 Event: ULOG_SUBMIT for Job WorkerNode_Two (8.0.0)
7/10 17:46:56 2/6 done, 0 failed, 2 submitted, 0 ready, 0 pre, 0 post
7/10 17:47:11 Event: ULOG_GLOBUS_SUBMIT for Job WorkerNode_Two (8.0.0)
7/10 17:47:11 Event: ULOG_EXECUTE for Job WorkerNode_Two (8.0.0)
7/10 17:47:11 Event: ULOG_GLOBUS_SUBMIT for Job WorkerNode_1 (7.0.0)
7/10 17:47:11 Event: ULOG_EXECUTE for Job WorkerNode_1 (7.0.0)
7/10 17:48:21 Event: ULOG_JOB_TERMINATED for Job WorkerNode_1 (7.0.0)
7/10 17:48:21 Job WorkerNode_1 completed successfully.
7/10 17:48:21 3/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 17:49:21 Event: ULOG_JOB_TERMINATED for Job WorkerNode_Two (8.0.0)
7/10 17:49:21 Job WorkerNode_Two completed successfully.
7/10 17:49:21 Submitting Job CollectResults ...
7/10 17:49:22 assigned Condor ID (9.0.0)
7/10 17:49:22 Event: ULOG_SUBMIT for Job CollectResults (9.0.0)
7/10 17:49:22 4/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 17:49:37 Event: ULOG_GLOBUS_SUBMIT for Job CollectResults (9.0.0)
7/10 17:49:37 Event: ULOG_EXECUTE for Job CollectResults (9.0.0)
7/10 17:50:42 Event: ULOG_JOB_TERMINATED for Job CollectResults (9.0.0)
7/10 17:50:42 Job CollectResults completed successfully.
7/10 17:50:42 Submitting Job LastNode ...
7/10 17:50:42 assigned Condor ID (10.0.0)
7/10 17:50:42 Event: ULOG_SUBMIT for Job LastNode (10.0.0)
7/10 17:50:42 5/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 17:50:57 Event: ULOG_GLOBUS_SUBMIT for Job LastNode (10.0.0)
7/10 17:50:57 Event: ULOG_EXECUTE for Job LastNode (10.0.0)
7/10 17:52:02 Event: ULOG_JOB_TERMINATED for Job LastNode (10.0.0)
7/10 17:52:02 Job LastNode completed successfully.
7/10 17:52:02 6/6 done, 0 failed, 0 submitted, 0 ready, 0 pre, 0 post
7/10 17:52:02 All jobs Completed!
7/10 17:52:02 **** condor_scheduniv_exec.4.0 (condor_DAGMAN) EXITING WITH STATUS 0
Clean up your results. Be careful with the wildcard: you want to delete the generated mydag.dag.* files but keep mydag.dag itself.
$ rm mydag.dag.* results.*
If you're ahead of schedule, you can try redoing this section with other grid sites. Modify some of the globusscheduler entries in your submit files to point at other gatekeepers, such as pc-26.gs.unina.it or pc-27.gs.unina.it. A single DAG can send jobs to a variety of sites, and Condor-G will manage jobs distributed across many different sites simultaneously. The example below shows the kind of change involved.
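For example (a hypothetical edit, not a required step; substitute any gatekeeper you actually have access to), retargeting a node is just a one-line change to its submit file:

$ grep globusscheduler job.work1.submit
globusscheduler=pc-26.gs.unina.it:/jobmanager-fork
  (edit the file in your favorite editor, then check it again)
$ grep globusscheduler job.work1.submit
globusscheduler=pc-27.gs.unina.it:/jobmanager-fork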
DAGMan can also handle the situation where some of the nodes in a DAG fail. DAGMan will run as many nodes as possible, then create a rescue DAG, making it easy to continue the run once the problem is fixed.
Let's create a script that will fail so we can see this:
$ cat > myscript2.sh
#! /bin/sh
echo "I'm process id $$ on" `hostname`
echo "This is sent to standard error" 1>&2
date
echo "Running as binary $0" "$@"
echo "My name (argument 1) is $1"
echo "My sleep duration (argument 2) is $2"
sleep $2
echo "Sleep of $2 seconds finished. Exiting"
echo "RESULT: 1 FAILURE"
exit 1
Ctrl-D
$ cat myscript2.sh
#! /bin/sh
echo "I'm process id $$ on" `hostname`
echo "This is sent to standard error" 1>&2
date
echo "Running as binary $0" "$@"
echo "My name (argument 1) is $1"
echo "My sleep duration (argument 2) is $2"
sleep $2
echo "Sleep of $2 seconds finished. Exiting"
echo "RESULT: 1 FAILURE"
exit 1
$ chmod a+x myscript2.sh
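If you like, you can confirm locally that the new script really exits non-zero before handing it to Condor-G. This is an optional sanity check, not one of the tutorial's submitted jobs; the one-second sleep just keeps it quick:

$ ./myscript2.sh LocalTest 1 > /dev/null 2>&1
$ echo $?
1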
Modify job.work2.submit to run myscript2.sh instead of myscript.sh:
$ rm job.work2.submit
$ cat > job.work2.submit
executable=myscript2.sh
output=results.work2.output
error=results.work2.error
log=results.log
notification=never
universe=globus
globusscheduler=pc-26.gs.unina.it:/jobmanager-fork
arguments=WorkerNode2 60
queue
Ctrl-D
$ cat job.work2.submit
executable=myscript2.sh
output=results.work2.output
error=results.work2.error
log=results.log
notification=never
universe=globus
globusscheduler=pc-26.gs.unina.it:/jobmanager-fork
arguments=WorkerNode2 60
queue
Submit the DAG again:
$ condor_submit_dag mydag.dag
Checking your DAG input file and all submit files it references.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor   : mydag.dag.condor.sub
Log of DAGMan debugging messages         : mydag.dag.dagman.out
Log of Condor library debug messages     : mydag.dag.lib.out
Log of the life of condor_dagman itself  : mydag.dag.dagman.log
Condor Log file for all jobs of this DAG : results.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 15.
-----------------------------------------------------------------------
Use watch_condor_q to watch the jobs until they finish.
In separate windows run "tail -f --lines=500 results.log" and "tail -f --lines=500 mydag.dag.dagman.out" to monitor the job's progress.
$ ./watch_condor_q

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:33785> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  15.0   adesmet         7/10 11:11   0+00:00:04 R  0   2.6  condor_dagman -f -
  16.0   adesmet         7/10 11:11   0+00:00:00 I  0   0.0  myscript.sh
  17.0   adesmet         7/10 11:11   0+00:00:00 I  0   0.0  myscript.sh Setup

3 jobs; 2 idle, 1 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:33785> : puffin.cs.wisc.edu
 ID      OWNER          STATUS       MANAGER  HOST               EXECUTABLE
  16.0   adesmet        UNSUBMITTED  fork     pc-26.gs.unina.it  /afs/cs.wisc.edu/u
  17.0   adesmet        UNSUBMITTED  fork     pc-26.gs.unina.it  /afs/cs.wisc.edu/u

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:33785> : puffin.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
  15.0   adesmet         7/10 11:11   0+00:00:04 R  0   2.6  condor_dagman -f -
  16.0    |-HelloWorld   7/10 11:11   0+00:00:00 I  0   0.0  myscript.sh
  17.0    |-Setup        7/10 11:11   0+00:00:00 I  0   0.0  myscript.sh Setup

3 jobs; 2 idle, 1 running, 0 held

Output of watch_condor_q truncated

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:33785> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:33785> : puffin.cs.wisc.edu
 ID      OWNER          STATUS       MANAGER  HOST               EXECUTABLE

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:33785> : puffin.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held

Ctrl-C
Check your results:
$ ls
job.finalize.submit   mydag.dag.condor.sub  myscript.sh              results.output        results.work2.output
job.setup.submit      mydag.dag.dagman.log  myscript2.sh             results.setup.error   results.workfinal.error
job.work1.submit      mydag.dag.dagman.out  results.error            results.setup.output  results.workfinal.output
job.work2.submit      mydag.dag.lib.out     results.finalize.error   results.work1.error   watch_condor_q
job.workfinal.submit  mydag.dag.lock        results.finalize.output  results.work1.output
mydag.dag             myjob.submit          results.log              results.work2.error

$ cat results.work2.output
I'm process id 29921 on pc-26
Thu Jul 10 11:12:42 CDT 2003
Running as binary /n/uscms_share/home/adesmet2/.globus/.gass_cache/local/md5/87/459c159766cefb36f0d75023de0e35/md5/70/5d82b930ec61460d9c9ca65cbe5a8a/data WorkerNode2 60
My name (argument 1) is WorkerNode2
My sleep duration (argument 2) is 60
Sleep of 60 seconds finished. Exiting
RESULT: 1 FAILURE

$ cat mydag.dag.dagman.out
7/10 11:11:55 ******************************************************
7/10 11:11:55 ** condor_scheduniv_exec.15.0 (CONDOR_DAGMAN) STARTING UP
7/10 11:11:55 ** $CondorVersion: 6.5.1 Apr 22 2003 $
7/10 11:11:55 ** $CondorPlatform: INTEL-LINUX-GLIBC22 $
7/10 11:11:55 ** PID = 27126
7/10 11:11:55 ******************************************************
7/10 11:11:55 DaemonCore: Command Socket at <128.105.185.14:34769>
7/10 11:11:55 argv[0] == "condor_scheduniv_exec.15.0"
7/10 11:11:55 argv[1] == "-Debug"
7/10 11:11:55 argv[2] == "3"
7/10 11:11:55 argv[3] == "-Lockfile"
7/10 11:11:55 argv[4] == "mydag.dag.lock"
7/10 11:11:55 argv[5] == "-Condorlog"
7/10 11:11:55 argv[6] == "results.log"
7/10 11:11:55 argv[7] == "-Dag"
7/10 11:11:55 argv[8] == "mydag.dag"
7/10 11:11:55 argv[9] == "-Rescue"
7/10 11:11:55 argv[10] == "mydag.dag.rescue"
7/10 11:11:55 Condor log will be written to results.log
7/10 11:11:55 DAG Lockfile will be written to mydag.dag.lock
7/10 11:11:55 DAG Input file is mydag.dag
7/10 11:11:55 Rescue DAG will be written to mydag.dag.rescue
7/10 11:11:55 Parsing mydag.dag ...
7/10 11:11:55 Dag contains 6 total jobs
7/10 11:11:55 Bootstrapping...
7/10 11:11:55 Number of pre-completed jobs: 0
7/10 11:11:55 Submitting Job HelloWorld ...
7/10 11:11:55 assigned Condor ID (16.0.0)
7/10 11:11:55 Submitting Job Setup ...
7/10 11:11:55 assigned Condor ID (17.0.0)
7/10 11:11:56 Event: ULOG_SUBMIT for Job HelloWorld (16.0.0)
7/10 11:11:56 Event: ULOG_SUBMIT for Job Setup (17.0.0)
7/10 11:11:56 0/6 done, 0 failed, 2 submitted, 0 ready, 0 pre, 0 post
7/10 11:12:16 Event: ULOG_GLOBUS_SUBMIT for Job HelloWorld (16.0.0)
7/10 11:12:16 Event: ULOG_EXECUTE for Job HelloWorld (16.0.0)
7/10 11:12:16 Event: ULOG_GLOBUS_SUBMIT for Job Setup (17.0.0)
7/10 11:12:16 Event: ULOG_EXECUTE for Job Setup (17.0.0)
7/10 11:12:21 Event: ULOG_JOB_TERMINATED for Job HelloWorld (16.0.0)
7/10 11:12:21 Job HelloWorld completed successfully.
7/10 11:12:21 1/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 11:12:31 Event: ULOG_JOB_TERMINATED for Job Setup (17.0.0)
7/10 11:12:31 Job Setup completed successfully.
7/10 11:12:31 Submitting Job WorkerNode_1 ...
7/10 11:12:32 assigned Condor ID (18.0.0)
7/10 11:12:32 Submitting Job WorkerNode_Two ...
7/10 11:12:32 assigned Condor ID (19.0.0)
7/10 11:12:32 Event: ULOG_SUBMIT for Job WorkerNode_1 (18.0.0)
7/10 11:12:32 Event: ULOG_SUBMIT for Job WorkerNode_Two (19.0.0)
7/10 11:12:32 2/6 done, 0 failed, 2 submitted, 0 ready, 0 pre, 0 post
7/10 11:12:47 Event: ULOG_GLOBUS_SUBMIT for Job WorkerNode_Two (19.0.0)
7/10 11:12:47 Event: ULOG_EXECUTE for Job WorkerNode_Two (19.0.0)
7/10 11:12:47 Event: ULOG_GLOBUS_SUBMIT for Job WorkerNode_1 (18.0.0)
7/10 11:12:47 Event: ULOG_EXECUTE for Job WorkerNode_1 (18.0.0)
7/10 11:13:07 Event: ULOG_JOB_TERMINATED for Job WorkerNode_1 (18.0.0)
7/10 11:13:07 Job WorkerNode_1 completed successfully.
7/10 11:13:07 3/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 11:13:57 Event: ULOG_JOB_TERMINATED for Job WorkerNode_Two (19.0.0)
7/10 11:13:57 Job WorkerNode_Two completed successfully.
7/10 11:13:57 Submitting Job CollectResults ...
7/10 11:13:57 assigned Condor ID (20.0.0)
7/10 11:13:57 Event: ULOG_SUBMIT for Job CollectResults (20.0.0)
7/10 11:13:57 4/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 11:14:12 Event: ULOG_GLOBUS_SUBMIT for Job CollectResults (20.0.0)
7/10 11:14:12 Event: ULOG_EXECUTE for Job CollectResults (20.0.0)
7/10 11:14:32 Event: ULOG_JOB_TERMINATED for Job CollectResults (20.0.0)
7/10 11:14:32 Job CollectResults completed successfully.
7/10 11:14:32 Submitting Job LastNode ...
7/10 11:14:32 assigned Condor ID (21.0.0)
7/10 11:14:32 Event: ULOG_SUBMIT for Job LastNode (21.0.0)
7/10 11:14:32 5/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 11:14:47 Event: ULOG_GLOBUS_SUBMIT for Job LastNode (21.0.0)
7/10 11:14:47 Event: ULOG_EXECUTE for Job LastNode (21.0.0)
7/10 11:15:02 Event: ULOG_JOB_TERMINATED for Job LastNode (21.0.0)
7/10 11:15:02 Job LastNode completed successfully.
7/10 11:15:02 6/6 done, 0 failed, 0 submitted, 0 ready, 0 pre, 0 post
7/10 11:15:02 All jobs Completed!
7/10 11:15:02 **** condor_scheduniv_exec.15.0 (condor_DAGMAN) EXITING WITH STATUS 0
Uh oh. DAGMan ran the remaining nodes using the bad data from node work2. Normally DAGMan checks each job's return code and considers anything non-zero a failure, and we did modify myscript2.sh to exit with a non-zero code. That would normally be enough, but we're using Condor-G, not plain Condor. Condor-G submits jobs through Globus, and Globus does not report the job's exit code back to the submit machine, so as far as DAGMan could tell the job terminated normally.
If you want DAGMan to notice a failed job and stop the DAG at that point, you'll need a POST script to detect the problem. One solution is to wrap your executable in a script that writes the executable's return code to stdout, and have the POST script scan the stdout for that status; a sketch of this follows. Or perhaps your executable's normal output already contains enough information to make the decision.
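Here is a minimal sketch of that wrapper idea. It is hypothetical and not used anywhere in this tutorial: myprogram stands in for your real executable, and you would point your submit file's executable at the wrapper instead. The wrapper reports the real exit code on stdout, where a POST script can find it even though Globus discards the code itself:

$ cat > wrapper.sh
#! /bin/sh
# Hypothetical wrapper: run the real executable (myprogram is a
# placeholder), then report its exit code on stdout, since Globus
# will not pass the code back to the submit machine.
./myprogram "$@"
status=$?
echo "WRAPPER EXIT STATUS: $status"
exit $status
Ctrl-D

A matching POST script would then succeed only if it finds "WRAPPER EXIT STATUS: 0" in the job's output file.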
In this case, our executable already emits a well-known message ("RESULT: 0 SUCCESS" on success). Let's add a POST script that checks for it.
First, clean up your results. As before, be careful to delete only the generated mydag.dag.* files, not mydag.dag itself.
$ rm mydag.dag.* results.*
Now create a script to check the output.
$ cat > postscript_checker
#! /bin/sh
grep 'RESULT: 0 SUCCESS' $1 > /dev/null 2>/dev/null
Ctrl-D
$ cat postscript_checker
#! /bin/sh
grep 'RESULT: 0 SUCCESS' $1 > /dev/null 2>/dev/null
$ chmod a+x postscript_checker
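DAGMan will treat this script's exit code as the node's verdict, and grep exits 0 only when the pattern is found. You can verify the behavior by hand; testoutput here is just a throwaway file for this check, not one of the tutorial's job files:

$ echo "RESULT: 0 SUCCESS" > testoutput
$ ./postscript_checker testoutput; echo $?
0
$ echo "RESULT: 1 FAILURE" > testoutput
$ ./postscript_checker testoutput; echo $?
1
$ rm testoutput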
Modify your mydag.dag to use the new script for each node.
$ cat >> mydag.dag
Script POST Setup postscript_checker results.setup.output
Script POST WorkerNode_1 postscript_checker results.work1.output
Script POST WorkerNode_Two postscript_checker results.work2.output
Script POST CollectResults postscript_checker results.workfinal.output
Script POST LastNode postscript_checker results.finalize.output
Ctrl-D
$ cat mydag.dag
Job HelloWorld myjob.submit
Job Setup job.setup.submit
Job WorkerNode_1 job.work1.submit
Job WorkerNode_Two job.work2.submit
Job CollectResults job.workfinal.submit
Job LastNode job.finalize.submit
PARENT Setup CHILD WorkerNode_1 WorkerNode_Two
PARENT WorkerNode_1 WorkerNode_Two CHILD CollectResults
PARENT CollectResults CHILD LastNode
Script POST Setup postscript_checker results.setup.output
Script POST WorkerNode_1 postscript_checker results.work1.output
Script POST WorkerNode_Two postscript_checker results.work2.output
Script POST CollectResults postscript_checker results.workfinal.output
Script POST LastNode postscript_checker results.finalize.output
$ ls
job.finalize.submit  job.work1.submit  job.workfinal.submit  myjob.submit  myscript2.sh        watch_condor_q
job.setup.submit     job.work2.submit  mydag.dag             myscript.sh   postscript_checker
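These lines use DAGMan's general script syntax. For reference, a PRE variant also exists, which runs on the submit machine before the node's job is submitted:

Script PRE <JobName> <executable> [arguments]
Script POST <JobName> <executable> [arguments]

If a PRE script fails, the node's job is never submitted; if a POST script fails, the node is marked as failed, which is exactly what we're counting on here.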
Submit the DAG again, now with the POST scripts in place:
$ condor_submit_dag mydag.dag
Checking your DAG input file and all submit files it references.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor   : mydag.dag.condor.sub
Log of DAGMan debugging messages         : mydag.dag.dagman.out
Log of Condor library debug messages     : mydag.dag.lib.out
Log of the life of condor_dagman itself  : mydag.dag.dagman.log
Condor Log file for all jobs of this DAG : results.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 22.
-----------------------------------------------------------------------
Again, watch the job with watch_condor_q. In separate windows run "tail -f --lines=500 results.log" and "tail -f --lines=500 mydag.dag.dagman.out" to monitor the job's progress.
$ ./watch_condor_q

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:33785> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  22.0   adesmet         7/10 11:25   0+00:00:03 R  0   2.6  condor_dagman -f -
  23.0   adesmet         7/10 11:25   0+00:00:00 I  0   0.0  myscript.sh
  24.0   adesmet         7/10 11:25   0+00:00:00 I  0   0.0  myscript.sh Setup

3 jobs; 2 idle, 1 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:33785> : puffin.cs.wisc.edu
 ID      OWNER          STATUS       MANAGER  HOST               EXECUTABLE
  23.0   adesmet        UNSUBMITTED  fork     pc-26.gs.unina.it  /afs/cs.wisc.edu/u
  24.0   adesmet        UNSUBMITTED  fork     pc-26.gs.unina.it  /afs/cs.wisc.edu/u

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:33785> : puffin.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
  22.0   adesmet         7/10 11:25   0+00:00:03 R  0   2.6  condor_dagman -f -
  23.0    |-HelloWorld   7/10 11:25   0+00:00:00 I  0   0.0  myscript.sh
  24.0    |-Setup        7/10 11:25   0+00:00:00 I  0   0.0  myscript.sh Setup

3 jobs; 2 idle, 1 running, 0 held

Output of watch_condor_q truncated

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:33785> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:33785> : puffin.cs.wisc.edu
 ID      OWNER          STATUS       MANAGER  HOST               EXECUTABLE

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:33785> : puffin.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held

Ctrl-C
Check your results:
$ ls
job.finalize.submit   mydag.dag             mydag.dag.rescue    results.error         results.work1.error
job.setup.submit      mydag.dag.condor.sub  myjob.submit        results.log           results.work1.output
job.work1.submit      mydag.dag.dagman.log  myscript.sh         results.output        results.work2.error
job.work2.submit      mydag.dag.dagman.out  myscript2.sh        results.setup.error   results.work2.output
job.workfinal.submit  mydag.dag.lib.out     postscript_checker  results.setup.output  watch_condor_q

$ cat mydag.dag.dagman.out
7/10 11:25:35 ******************************************************
7/10 11:25:35 ** condor_scheduniv_exec.22.0 (CONDOR_DAGMAN) STARTING UP
7/10 11:25:35 ** $CondorVersion: 6.5.1 Apr 22 2003 $
7/10 11:25:35 ** $CondorPlatform: INTEL-LINUX-GLIBC22 $
7/10 11:25:35 ** PID = 27251
7/10 11:25:35 ******************************************************
7/10 11:25:35 DaemonCore: Command Socket at <128.105.185.14:34913>
7/10 11:25:35 argv[0] == "condor_scheduniv_exec.22.0"
7/10 11:25:35 argv[1] == "-Debug"
7/10 11:25:35 argv[2] == "3"
7/10 11:25:35 argv[3] == "-Lockfile"
7/10 11:25:35 argv[4] == "mydag.dag.lock"
7/10 11:25:35 argv[5] == "-Condorlog"
7/10 11:25:35 argv[6] == "results.log"
7/10 11:25:35 argv[7] == "-Dag"
7/10 11:25:35 argv[8] == "mydag.dag"
7/10 11:25:35 argv[9] == "-Rescue"
7/10 11:25:35 argv[10] == "mydag.dag.rescue"
7/10 11:25:35 Condor log will be written to results.log
7/10 11:25:35 DAG Lockfile will be written to mydag.dag.lock
7/10 11:25:35 DAG Input file is mydag.dag
7/10 11:25:35 Rescue DAG will be written to mydag.dag.rescue
7/10 11:25:35 Parsing mydag.dag ...
7/10 11:25:35 jobName: Setup
7/10 11:25:35 jobName: WorkerNode_1
7/10 11:25:35 jobName: WorkerNode_Two
7/10 11:25:35 jobName: CollectResults
7/10 11:25:35 jobName: LastNode
7/10 11:25:35 Dag contains 6 total jobs
7/10 11:25:35 Bootstrapping...
7/10 11:25:35 Number of pre-completed jobs: 0
7/10 11:25:35 Submitting Job HelloWorld ...
7/10 11:25:35 assigned Condor ID (23.0.0)
7/10 11:25:35 Submitting Job Setup ...
7/10 11:25:35 assigned Condor ID (24.0.0)
7/10 11:25:36 Event: ULOG_SUBMIT for Job HelloWorld (23.0.0)
7/10 11:25:36 Event: ULOG_SUBMIT for Job Setup (24.0.0)
7/10 11:25:36 0/6 done, 0 failed, 2 submitted, 0 ready, 0 pre, 0 post
7/10 11:25:56 Event: ULOG_GLOBUS_SUBMIT for Job HelloWorld (23.0.0)
7/10 11:25:56 Event: ULOG_EXECUTE for Job HelloWorld (23.0.0)
7/10 11:25:56 Event: ULOG_GLOBUS_SUBMIT for Job Setup (24.0.0)
7/10 11:25:56 Event: ULOG_EXECUTE for Job Setup (24.0.0)
7/10 11:26:01 Event: ULOG_JOB_TERMINATED for Job HelloWorld (23.0.0)
7/10 11:26:01 Job HelloWorld completed successfully.
7/10 11:26:01 1/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 11:26:11 Event: ULOG_JOB_TERMINATED for Job Setup (24.0.0)
7/10 11:26:11 Job Setup completed successfully.
7/10 11:26:11 Running POST script of Job Setup...
7/10 11:26:11 1/6 done, 0 failed, 0 submitted, 0 ready, 0 pre, 1 post
7/10 11:26:16 Event: ULOG_POST_SCRIPT_TERMINATED for Job Setup (24.0.0)
7/10 11:26:16 POST Script of Job Setup completed successfully.
7/10 11:26:16 Submitting Job WorkerNode_1 ...
7/10 11:26:16 assigned Condor ID (25.0.0)
7/10 11:26:16 Submitting Job WorkerNode_Two ...
7/10 11:26:17 assigned Condor ID (26.0.0)
7/10 11:26:17 Event: ULOG_SUBMIT for Job WorkerNode_1 (25.0.0)
7/10 11:26:17 Event: ULOG_SUBMIT for Job WorkerNode_Two (26.0.0)
7/10 11:26:17 2/6 done, 0 failed, 2 submitted, 0 ready, 0 pre, 0 post
7/10 11:26:32 Event: ULOG_GLOBUS_SUBMIT for Job WorkerNode_1 (25.0.0)
7/10 11:26:32 Event: ULOG_EXECUTE for Job WorkerNode_1 (25.0.0)
7/10 11:26:32 Event: ULOG_GLOBUS_SUBMIT for Job WorkerNode_Two (26.0.0)
7/10 11:26:32 Event: ULOG_EXECUTE for Job WorkerNode_Two (26.0.0)
7/10 11:26:52 Event: ULOG_JOB_TERMINATED for Job WorkerNode_1 (25.0.0)
7/10 11:26:52 Job WorkerNode_1 completed successfully.
7/10 11:26:52 Running POST script of Job WorkerNode_1...
7/10 11:26:52 2/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 1 post
7/10 11:26:57 Event: ULOG_POST_SCRIPT_TERMINATED for Job WorkerNode_1 (25.0.0)
7/10 11:26:57 POST Script of Job WorkerNode_1 completed successfully.
7/10 11:26:57 3/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 11:27:42 Event: ULOG_JOB_TERMINATED for Job WorkerNode_Two (26.0.0)
7/10 11:27:42 Job WorkerNode_Two completed successfully.
7/10 11:27:42 Running POST script of Job WorkerNode_Two...
7/10 11:27:42 3/6 done, 0 failed, 0 submitted, 0 ready, 0 pre, 1 post
7/10 11:27:47 Event: ULOG_POST_SCRIPT_TERMINATED for Job WorkerNode_Two (26.0.0)
7/10 11:27:47 POST Script of Job WorkerNode_Two failed with status 1
7/10 11:27:47 3/6 done, 1 failed, 0 submitted, 0 ready, 0 pre, 0 post
7/10 11:27:47 ERROR: the following job(s) failed:
7/10 11:27:47 ---------------------- Job ----------------------
7/10 11:27:47       Node Name: WorkerNode_Two
7/10 11:27:47          NodeID: 3
7/10 11:27:47     Node Status: STATUS_ERROR
7/10 11:27:47           Error: POST Script failed with status 1
7/10 11:27:47 Job Submit File: job.work2.submit
7/10 11:27:47     POST Script: postscript_checker results.work2.output
7/10 11:27:47   Condor Job ID: (26.0.0)
7/10 11:27:47       Q_PARENTS: 1, <END>
7/10 11:27:47       Q_WAITING: <END>
7/10 11:27:47      Q_CHILDREN: 4, <END>
7/10 11:27:47 --------------------------------------- <END>
7/10 11:27:47 Writing Rescue DAG file...
7/10 11:27:47 **** condor_scheduniv_exec.22.0 (condor_DAGMAN) EXITING WITH STATUS 1
This time DAGMan noticed that one of the jobs failed. It ran as much of the DAG as it could, then logged enough information to continue the run once the situation is resolved.
Look at the rescue DAG. It's structurally the same as your original DAG, but the nodes that finished are marked DONE. (DAGMan also reorganized the file.) When you submit the rescue DAG, the DONE nodes will be skipped.
$ cat mydag.dag.rescue
# Rescue DAG file, created after running
#   the mydag.dag DAG file
#
# Total number of Nodes: 6
# Nodes premarked DONE: 3
# Nodes that failed: 1
#   WorkerNode_Two,<ENDLIST>
JOB HelloWorld myjob.submit DONE
JOB Setup job.setup.submit DONE
SCRIPT POST Setup postscript_checker results.setup.output
JOB WorkerNode_1 job.work1.submit DONE
SCRIPT POST WorkerNode_1 postscript_checker results.work1.output
JOB WorkerNode_Two job.work2.submit
SCRIPT POST WorkerNode_Two postscript_checker results.work2.output
JOB CollectResults job.workfinal.submit
SCRIPT POST CollectResults postscript_checker results.workfinal.output
JOB LastNode job.finalize.submit
SCRIPT POST LastNode postscript_checker results.finalize.output
PARENT Setup CHILD WorkerNode_1 WorkerNode_Two
PARENT WorkerNode_1 CHILD CollectResults
PARENT WorkerNode_Two CHILD CollectResults
PARENT CollectResults CHILD LastNode
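Notice the DONE keyword on the nodes that already completed. This keyword is ordinary DAGMan syntax, not something special to rescue DAGs: you can append it to any JOB line by hand if you want DAGMan to treat that node as already complete, for example:

JOB Setup job.setup.submit DONE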
So we know there is a problem with the work2 step. Let's "fix" it.
$ rm myscript2.sh
$ cp myscript.sh myscript2.sh
Now we can submit our rescue DAG. (If you don't fix the problem, DAGMan will simply generate another rescue DAG, this time named "mydag.dag.rescue.rescue".) In separate windows run "tail -f --lines=500 results.log" and "tail -f --lines=500 mydag.dag.rescue.dagman.out" to monitor the job's progress.
$ condor_submit_dag mydag.dag.rescue
Checking your DAG input file and all submit files it references.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor   : mydag.dag.rescue.condor.sub
Log of DAGMan debugging messages         : mydag.dag.rescue.dagman.out
Log of Condor library debug messages     : mydag.dag.rescue.lib.out
Log of the life of condor_dagman itself  : mydag.dag.rescue.dagman.log
Condor Log file for all jobs of this DAG : results.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 27.
-----------------------------------------------------------------------

$ ./watch_condor_q

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:33785> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  27.0   adesmet         7/10 11:34   0+00:00:01 R  0   2.6  condor_dagman -f -
  28.0   adesmet         7/10 11:34   0+00:00:00 I  0   0.0  myscript2.sh Worke

2 jobs; 1 idle, 1 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:33785> : puffin.cs.wisc.edu
 ID      OWNER          STATUS       MANAGER  HOST               EXECUTABLE
  28.0   adesmet        UNSUBMITTED  fork     pc-26.gs.unina.it  /afs/cs.wisc.edu/u

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:33785> : puffin.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
  27.0   adesmet         7/10 11:34   0+00:00:01 R  0   2.6  condor_dagman -f -
  28.0    |-WorkerNode_  7/10 11:34   0+00:00:00 I  0   0.0  myscript2.sh Worke

2 jobs; 1 idle, 1 running, 0 held

Output of watch_condor_q truncated

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:33785> : puffin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:33785> : puffin.cs.wisc.edu
 ID      OWNER          STATUS       MANAGER  HOST               EXECUTABLE

-- Submitter: puffin.cs.wisc.edu : <128.105.185.14:33785> : puffin.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held

Ctrl-C
Check your results.
$ ls
job.finalize.submit   mydag.dag.lib.out            myscript2.sh             results.work1.error
job.setup.submit      mydag.dag.rescue             postscript_checker       results.work1.output
job.work1.submit      mydag.dag.rescue.condor.sub  results.error            results.work2.error
job.work2.submit      mydag.dag.rescue.dagman.log  results.finalize.error   results.work2.output
job.workfinal.submit  mydag.dag.rescue.dagman.out  results.finalize.output  results.workfinal.error
mydag.dag             mydag.dag.rescue.lib.out     results.log              results.workfinal.output
mydag.dag.condor.sub  mydag.dag.rescue.lock        results.output           watch_condor_q
mydag.dag.dagman.log  myjob.submit                 results.setup.error
mydag.dag.dagman.out  myscript.sh                  results.setup.output

$ cat mydag.dag.rescue.dagman.out
7/10 11:34:33 ******************************************************
7/10 11:34:33 ** condor_scheduniv_exec.27.0 (CONDOR_DAGMAN) STARTING UP
7/10 11:34:33 ** $CondorVersion: 6.5.1 Apr 22 2003 $
7/10 11:34:33 ** $CondorPlatform: INTEL-LINUX-GLIBC22 $
7/10 11:34:33 ** PID = 27317
7/10 11:34:33 ******************************************************
7/10 11:34:33 DaemonCore: Command Socket at <128.105.185.14:35032>
7/10 11:34:33 argv[0] == "condor_scheduniv_exec.27.0"
7/10 11:34:33 argv[1] == "-Debug"
7/10 11:34:33 argv[2] == "3"
7/10 11:34:33 argv[3] == "-Lockfile"
7/10 11:34:33 argv[4] == "mydag.dag.rescue.lock"
7/10 11:34:33 argv[5] == "-Condorlog"
7/10 11:34:33 argv[6] == "results.log"
7/10 11:34:33 argv[7] == "-Dag"
7/10 11:34:33 argv[8] == "mydag.dag.rescue"
7/10 11:34:33 argv[9] == "-Rescue"
7/10 11:34:33 argv[10] == "mydag.dag.rescue.rescue"
7/10 11:34:33 Condor log will be written to results.log
7/10 11:34:33 DAG Lockfile will be written to mydag.dag.rescue.lock
7/10 11:34:33 DAG Input file is mydag.dag.rescue
7/10 11:34:33 Rescue DAG will be written to mydag.dag.rescue.rescue
7/10 11:34:33 Parsing mydag.dag.rescue ...
7/10 11:34:33 jobName: Setup
7/10 11:34:33 jobName: WorkerNode_1
7/10 11:34:33 jobName: WorkerNode_Two
7/10 11:34:33 jobName: CollectResults
7/10 11:34:33 jobName: LastNode
7/10 11:34:33 Dag contains 6 total jobs
7/10 11:34:33 Deleting older version of results.log
7/10 11:34:33 Bootstrapping...
7/10 11:34:33 Number of pre-completed jobs: 3
7/10 11:34:33 Submitting Job WorkerNode_Two ...
7/10 11:34:33 assigned Condor ID (28.0.0)
7/10 11:34:34 Event: ULOG_SUBMIT for Job WorkerNode_Two (28.0.0)
7/10 11:34:34 3/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 11:34:54 Event: ULOG_GLOBUS_SUBMIT for Job WorkerNode_Two (28.0.0)
7/10 11:34:54 Event: ULOG_EXECUTE for Job WorkerNode_Two (28.0.0)
7/10 11:35:59 Event: ULOG_JOB_TERMINATED for Job WorkerNode_Two (28.0.0)
7/10 11:35:59 Job WorkerNode_Two completed successfully.
7/10 11:35:59 Running POST script of Job WorkerNode_Two...
7/10 11:35:59 3/6 done, 0 failed, 0 submitted, 0 ready, 0 pre, 1 post
7/10 11:36:04 Event: ULOG_POST_SCRIPT_TERMINATED for Job WorkerNode_Two (28.0.0)
7/10 11:36:04 POST Script of Job WorkerNode_Two completed successfully.
7/10 11:36:04 Submitting Job CollectResults ...
7/10 11:36:04 assigned Condor ID (29.0.0)
7/10 11:36:04 Event: ULOG_SUBMIT for Job CollectResults (29.0.0)
7/10 11:36:04 4/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 11:36:19 Event: ULOG_GLOBUS_SUBMIT for Job CollectResults (29.0.0)
7/10 11:36:19 Event: ULOG_EXECUTE for Job CollectResults (29.0.0)
7/10 11:36:34 Event: ULOG_JOB_TERMINATED for Job CollectResults (29.0.0)
7/10 11:36:34 Job CollectResults completed successfully.
7/10 11:36:34 Running POST script of Job CollectResults...
7/10 11:36:34 4/6 done, 0 failed, 0 submitted, 0 ready, 0 pre, 1 post
7/10 11:36:39 Event: ULOG_POST_SCRIPT_TERMINATED for Job CollectResults (29.0.0)
7/10 11:36:39 POST Script of Job CollectResults completed successfully.
7/10 11:36:39 Submitting Job LastNode ...
7/10 11:36:39 assigned Condor ID (30.0.0)
7/10 11:36:39 Event: ULOG_SUBMIT for Job LastNode (30.0.0)
7/10 11:36:39 5/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 11:36:54 Event: ULOG_GLOBUS_SUBMIT for Job LastNode (30.0.0)
7/10 11:36:54 Event: ULOG_EXECUTE for Job LastNode (30.0.0)
7/10 11:37:09 Event: ULOG_JOB_TERMINATED for Job LastNode (30.0.0)
7/10 11:37:09 Job LastNode completed successfully.
7/10 11:37:09 Running POST script of Job LastNode...
7/10 11:37:09 5/6 done, 0 failed, 0 submitted, 0 ready, 0 pre, 1 post
7/10 11:37:14 Event: ULOG_POST_SCRIPT_TERMINATED for Job LastNode (30.0.0)
7/10 11:37:14 POST Script of Job LastNode completed successfully.
7/10 11:37:14 6/6 done, 0 failed, 0 submitted, 0 ready, 0 pre, 0 post
7/10 11:37:14 All jobs Completed!
7/10 11:37:14 **** condor_scheduniv_exec.27.0 (condor_DAGMAN) EXITING WITH STATUS 0

$ cat results.work2.output
I'm process id 30478 on pc-26
Thu Jul 10 11:34:46 CDT 2003
Running as binary /n/uscms_share/home/adesmet2/.globus/.gass_cache/local/md5/23/61b50cd9b278330cac68107dd390d6/md5/5e/004f7216b8b846d548357da00985f4/data WorkerNode2 60
My name (argument 1) is WorkerNode2
My sleep duration (argument 2) is 60
Sleep of 60 seconds finished. Exiting
RESULT: 0 SUCCESS

$ exit
If you did the full VDT install and started your own Condor daemons, shut them down now:

$ condor_off -master
Sent "Kill-Daemon" command for "master" to local master
That's it. There is a lot more you can do with Condor-G and DAGMan, but this basic introduction is all you need to know to get started. Good luck!