Title: Condor Practical
Subtitle: A Simple DAG
Tutors: Alain Roy and Todd Tannenbaum
Authors: Alain Roy and Ben Burnett

A Simple DAG

6.1 What is DAGMan?

Your tutorial leader will introduce you to DAGMan and DAGs. In short, DAGMan lets you submit complex sequences of jobs, as long as they can be expressed as a directed acyclic graph (DAG). For example, you may wish to run a large parameter sweep, but before the sweep runs you need to prepare your data, and after the sweep runs you need to collate the results. If you sweep over five parameters, the DAG has one data preparation job, followed by five sweep jobs, followed by one collation job.

DAGMan has many other abilities, such as throttling jobs and recovering from failures. More information about DAGMan can be found in the Condor manual.
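
As a rough sketch of the parameter sweep described above, the Python script below writes out the text of such a DAG file. The node names (Prepare, Sweep0 to Sweep4, Collate) and the submit file names are hypothetical and are not used elsewhere in this practical; only the Job and PARENT ... CHILD lines are DAGMan syntax.

```python
# Sketch: build a DAG file for "prepare -> five sweep jobs -> collate".
# All node names and submit file names here are made up for illustration.

def build_sweep_dag(num_params=5):
    """Return the text of a DAG file with one prepare node,
    num_params sweep nodes, and one collate node."""
    sweep_nodes = ["Sweep%d" % i for i in range(num_params)]

    # One "Job <node name> <submit file>" line per node.
    lines = ["Job Prepare prepare.sub"]
    for i, node in enumerate(sweep_nodes):
        lines.append("Job %s sweep%d.sub" % (node, i))
    lines.append("Job Collate collate.sub")

    # Every sweep node waits for Prepare; Collate waits for every sweep node.
    lines.append("PARENT Prepare CHILD " + " ".join(sweep_nodes))
    lines.append("PARENT " + " ".join(sweep_nodes) + " CHILD Collate")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_sweep_dag(), end="")
```

Saving the printed text to a file would give a DAG of the same kind as the one-node simple.dag used in the next section, only with more nodes and explicit dependencies.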

6.2 Submitting a simple DAG

Make sure that your submit file has only one queue command in it, as when we first wrote it.

Universe                = vanilla
Executable              = simple.bat
Arguments               = 4 10
Log                     = simple.log.txt
Output                  = simple.out.txt
Error                   = simple.err.txt
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
Queue

We are going to get a bit more sophisticated in submitting our jobs now. Let's have three windows open. In the first window you will submit the job, in the second you will watch the queue, and in the third you will watch what DAGMan does. To prepare for this, we'll create a script to help watch the queue. Name it watch_condor_q.bat. (Where the listing says Ctrl + Z, press that key combination rather than typing the words, then press Enter. This ends the input to copy.)

C:\condor-test> copy con watch_condor_q.bat
@echo off
:loop
condor_q -dag
rem sleep for 10 seconds (no OS support?!)
ping -n 10 127.0.0.1 >NUL 2>&1
goto loop
Ctrl + Z

If you like, modify watch_condor_q.bat so that it shows only your jobs, not those of everyone who is using your computer.

Now we will create the smallest possible DAG: a DAG with just one node.

C:\condor-test> copy con simple.dag
Job Simple simple.sub
Ctrl + Z

In your first window, submit the job:

C:\condor-test> del simple.log.txt simple.out.txt

C:\condor-test> condor_submit_dag -force simple.dag

Checking all your submit files for log file names.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor           : simple.dag.condor.sub
Log of DAGMan debugging messages                 : simple.dag.dagman.out
Log of Condor library output                     : simple.dag.lib.out
Log of Condor library error messages             : simple.dag.lib.err
Log of the life of condor_dagman itself          : simple.dag.dagman.log

Condor Log file for all jobs of this DAG         : C:\Documents and Settings\Adm
inistrator\condor-test\simple.log.txt
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 9.
WARNING: the line `remove_kill_sig = SIGUSR1' was unused by condor_submit. Is it
 a typo? <=====Don't worry about this!
-----------------------------------------------------------------------

C:\condor-test> condor_reschedule <=====Don't miss this!

In the second window, watch the queue.

C:\condor-test> watch_condor_q.bat
-- Submitter: lab-21 : <129.215.30.181:2207> : lab-21
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   9.0   Administrator  11/27 11:46   0+00:00:00 R  0   1.2  condor_dagman.exe

1 jobs; 0 idle, 1 running, 0 held


-- Submitter: lab-21 : <129.215.30.181:2207> : lab-21
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   9.0   Administrator  11/27 11:46   0+00:00:09 R  0   1.2  condor_dagman.exe

1 jobs; 0 idle, 1 running, 0 held


-- Submitter: lab-21 : <129.215.30.181:2207> : lab-21
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   9.0   Administrator  11/27 11:46   0+00:00:19 R  0   1.2  condor_dagman.exe
  10.0    |-Simple      11/27 11:46   0+00:00:00 I  0   0.0  simple.bat 4 10

2 jobs; 1 idle, 1 running, 0 held


-- Submitter: lab-21 : <129.215.30.181:2207> : lab-21
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   9.0   Administrator  11/27 11:46   0+00:00:28 R  0   1.2  condor_dagman.exe
  10.0    |-Simple      11/27 11:46   0+00:00:00 I  0   0.0  simple.bat 4 10

2 jobs; 1 idle, 1 running, 0 held


-- Submitter: lab-21 : <129.215.30.181:2207> : lab-21
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   9.0   Administrator  11/27 11:46   0+00:00:37 R  0   1.2  condor_dagman.exe
  10.0    |-Simple      11/27 11:46   0+00:00:00 I  0   0.0  simple.bat 4 10

2 jobs; 1 idle, 1 running, 0 held


-- Submitter: lab-21 : <129.215.30.181:2207> : lab-21
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   9.0   Administrator  11/27 11:46   0+00:00:46 R  0   1.2  condor_dagman.exe
  10.0    |-Simple      11/27 11:46   0+00:00:04 R  0   0.0  simple.bat 4 10

2 jobs; 0 idle, 2 running, 0 held


-- Submitter: lab-21 : <129.215.30.181:2207> : lab-21
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held

Ctrl + C

Terminate batch job (Y/N)? y

In the third window, watch what DAGMan does:

C:\condor-test> more simple.dag.dagman.out
11/20 10:20:00 ******************************************************
11/20 10:20:00 ** condor_scheduniv_exec.9.0 (CONDOR_DAGMAN) STARTING UP
11/20 10:20:00 ** C:\condor\bin\condor_dagman.exe
11/20 10:20:00 ** $CondorVersion: 6.9.5 Nov 18 2007 $
11/20 10:20:00 ** $CondorPlatform: INTEL-WINNT50 $
11/20 10:20:00 ** PID = 1592
11/20 10:20:00 ** Log last touched time unavailable (No such file or directory)
11/20 10:20:00 ******************************************************
11/20 10:20:00 Using config source: C:\condor\condor_config
11/20 10:20:00 Using local config sources: 
11/20 10:20:00    C:\condor/condor_config.local
11/20 10:20:00 DaemonCore: Command Socket at <128.105.48.96:63921>
11/20 10:20:00 DAGMAN_SUBMIT_DELAY setting: 0
11/20 10:20:00 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
11/20 10:20:00 DAGMAN_STARTUP_CYCLE_DETECT setting: 0
11/20 10:20:00 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
11/20 10:20:00 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
11/20 10:20:00 DAGMAN_RETRY_SUBMIT_FIRST setting: 1
11/20 10:20:00 DAGMAN_RETRY_NODE_FIRST setting: 0
11/20 10:20:00 DAGMAN_MAX_JOBS_IDLE setting: 0
11/20 10:20:00 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
11/20 10:20:00 DAGMAN_MUNGE_NODE_NAMES setting: 1
11/20 10:20:00 DAGMAN_DELETE_OLD_LOGS setting: 1
11/20 10:20:00 DAGMAN_PROHIBIT_MULTI_JOBS setting: 0
11/20 10:20:00 DAGMAN_SUBMIT_DEPTH_FIRST setting: 0
11/20 10:20:00 DAGMAN_ABORT_DUPLICATES setting: 1
11/20 10:20:00 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: 1
11/20 10:20:00 DAGMAN_PENDING_REPORT_INTERVAL setting: 600
11/20 10:20:00 argv[0] == "condor_scheduniv_exec.9.0"
11/20 10:20:00 argv[1] == "-Debug"
11/20 10:20:00 argv[2] == "3"
11/20 10:20:00 argv[3] == "-Lockfile"
11/20 10:20:00 argv[4] == "simple.dag.lock"
11/20 10:20:00 argv[5] == "-Condorlog"
11/20 10:20:00 argv[6] == "C:\condor-test\simple.log.txt"
11/20 10:20:00 argv[7] == "-Dag"
11/20 10:20:00 argv[8] == "simple.dag"
11/20 10:20:00 argv[9] == "-Rescue"
11/20 10:20:00 argv[10] == "simple.dag.rescue"
11/20 10:20:00 DAG Lockfile will be written to simple.dag.lock
11/20 10:20:00 DAG Input file is simple.dag
11/20 10:20:00 Rescue DAG will be written to simple.dag.rescue
11/20 10:20:00 All DAG node user log files:
11/20 10:20:00   C:\condor-test\simple.log.txt (Condor)
11/20 10:20:00 Parsing simple.dag ...
11/20 10:20:00 Dag contains 1 total jobs
11/20 10:20:00 Truncating any older versions of log files...
11/20 10:20:00 Sleeping for 12 seconds to ensure ProcessId uniqueness
11/20 10:20:12 WARNING: ProcessId not confirmed unique
11/20 10:20:12 Bootstrapping...
11/20 10:20:12 Number of pre-completed nodes: 0
11/20 10:20:12 Registering condor_event_timer...
11/20 10:20:13 Submitting Condor Node Simple job(s)...
11/20 10:20:13 submitting: condor_submit -a dag_node_name' '=' 'Simple -a +DAGManJobId' '=' '9 \
               -a DAGManJobId' '=' '9 -a submit_event_notes' '=' 'DAG' 'Node:' 'Simple -a \
               +DAGParentNodeNames' '=' '"" simple.sub
11/20 10:20:13 From submit: Submitting job(s).
11/20 10:20:13 From submit: Logging submit event(s).
11/20 10:20:13 From submit: 1 job(s) submitted to cluster 10.
11/20 10:20:13 	assigned Condor ID (10.0)
11/20 10:20:13 Just submitted 1 job this cycle...
11/20 10:20:13 Event: ULOG_SUBMIT for Condor Node Simple (10.0)
11/20 10:20:13 Number of idle job procs: 1
11/20 10:20:13 Of 1 nodes total:
11/20 10:20:13  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
11/20 10:20:13   ===     ===      ===     ===     ===        ===      ===
11/20 10:20:13     0       0        1       0       0          0        0
11/20 10:20:38 Event: ULOG_EXECUTE for Condor Node Simple (10.0)
11/20 10:20:38 Number of idle job procs: 0
11/20 10:20:38 Event: ULOG_JOB_TERMINATED for Condor Node Simple (10.0)
11/20 10:20:38 Node Simple job proc (10.0) completed successfully.
11/20 10:20:38 Node Simple job completed
11/20 10:20:38 Number of idle job procs: 0
11/20 10:20:38 Of 1 nodes total:
11/20 10:20:38  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
11/20 10:20:38   ===     ===      ===     ===     ===        ===      ===
11/20 10:20:38     1       0        0       0       0          0        0
11/20 10:20:38 All jobs Completed!
11/20 10:20:38 Note: 0 total job deferrals because of -MaxJobs limit (0)
11/20 10:20:38 Note: 0 total job deferrals because of -MaxIdle limit (0)
11/20 10:20:38 Note: 0 total job deferrals because of node category throttles
11/20 10:20:38 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
11/20 10:20:38 Note: 0 total POST script deferrals because of -MaxPost limit (0)
11/20 10:20:38 **** condor_scheduniv_exec.9.0 (condor_DAGMAN) EXITING WITH STATUS 0
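
The node status table that DAGMan prints in simple.dag.dagman.out is a quick way to see how far a DAG has progressed. As a minimal sketch, assuming the table keeps the seven-column layout shown above, the following Python function (a hypothetical helper, not part of Condor) picks out the most recent row of counts:

```python
# Column names as they appear in the status table of the log above.
STATUS_FIELDS = ["Done", "Pre", "Queued", "Post", "Ready", "Un-Ready", "Failed"]


def last_node_status(log_text):
    """Return the latest node-status counts found in DAGMan's debug log
    as a dict, or None if the log contains no status row."""
    status = None
    for line in log_text.splitlines():
        # Skip the "MM/DD HH:MM:SS" timestamp, then look for a row that
        # consists of exactly seven non-negative integers.
        fields = line.split()[2:]
        if len(fields) == len(STATUS_FIELDS) and all(f.isdigit() for f in fields):
            status = dict(zip(STATUS_FIELDS, (int(f) for f in fields)))
    return status
```

To try it, read the file with open("simple.dag.dagman.out").read() and pass the text to last_node_status. For the run shown above, the last row reports one node as Done and none as Failed.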

Now verify your results:

C:\condor-test> more simple.log.txt
000 (010.000.000) 11/27 11:46:49 Job submitted from host: <129.215.30.181:2207>
    DAG Node: Simple
...
001 (010.000.000) 11/27 11:47:21 Job executing on host: <129.215.30.173:2217>
...
005 (010.000.000) 11/27 11:47:25 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        84  -  Run Bytes Sent By Job
        431  -  Run Bytes Received By Job
        84  -  Total Bytes Sent By Job
        431  -  Total Bytes Received By Job
...

C:\condor-test> more simple.out.txt
1
2
3
4
5
6
7
8
9
10
Thinking really hard for 4 seconds...
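
The check on simple.log.txt can also be scripted. The sketch below assumes the log format shown above, where each event starts with a three-digit code and a job ID in parentheses, and where a 005 event ("Job terminated") is followed by a line reporting the return value. The function name is hypothetical; it is not part of Condor.

```python
import re

# An event header looks like: "005 (010.000.000) 11/27 11:47:25 Job terminated."
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) ")
# The termination detail looks like: "(1) Normal termination (return value 0)"
RETURN_RE = re.compile(r"\(return value (-?\d+)\)")


def job_return_values(log_text):
    """Map (cluster, proc) to the return value reported by each
    005 (job terminated) event in a job log."""
    results = {}
    current = None  # job whose 005 event we are currently inside, if any
    for line in log_text.splitlines():
        header = EVENT_RE.match(line)
        if header:
            code = header.group(1)
            job = (int(header.group(2)), int(header.group(3)))
            current = job if code == "005" else None
            continue
        if current is not None:
            found = RETURN_RE.search(line)
            if found:
                results[current] = int(found.group(1))
    return results
```

For the log above, the result would contain a single entry for job 10.0 with return value 0. Jobs that were killed by a signal or never terminated simply do not appear in the result.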

Looking at DAGMan's various files, we see that DAGMan itself ran as a Condor job (specifically, a scheduler universe job).

C:\condor-test> dir simple.dag.*
 Volume in drive C has no label.
 Volume Serial Number is 14E3-4F7E

 Directory of C:\condor-test

11/16/2007  01:50 PM                23 simple.dag
11/20/2007  10:20 AM               901 simple.dag.condor.sub
11/20/2007  10:20 AM               621 simple.dag.dagman.log
11/20/2007  10:20 AM             4,749 simple.dag.dagman.out
11/20/2007  10:20 AM                 0 simple.dag.lib.err
11/20/2007  10:20 AM                30 simple.dag.lib.out
               6 File(s)          6,324 bytes
               0 Dir(s)  30,574,940,160 bytes free

C:\condor-test> more simple.dag.condor.sub
# Filename: simple.dag.condor.sub
# Generated by condor_submit_dag simple.dag
universe        = scheduler
executable      = C:\condor\bin\condor_dagman.exe
getenv          = True
output          = simple.dag.lib.out
error           = simple.dag.lib.err
log             = simple.dag.dagman.log
remove_kill_sig = SIGUSR1
# Note: default on_exit_remove expression:
# ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <=2))
# attempts to ensure that DAGMan is automatically
# requeued by the schedd if it exits abnormally or
# is killed (e.g., during a reboot).
on_exit_remove  = ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2))
copy_to_spool   = False
arguments       = -f -l . -Debug 3 -Lockfile simple.dag.lock -Condorlog C:\condo
r-test\simple.log.txt -Dag simple.dag -Rescue simple.dag.rescue
environment     = _CONDOR_DAGMAN_LOG=simple.dag.dagman.out|_CONDOR_MAX_DAGMAN_LOG=0
queue
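
The on_exit_remove expression in this file decides whether the DAGMan job leaves the queue when it exits or is requeued instead. The Python function below only illustrates that logic, with None standing in for a ClassAd UNDEFINED value; the function name is hypothetical, and this is not how Condor itself evaluates the expression.

```python
def dagman_leaves_queue(exit_code=None, exit_signal=None):
    """Mirror the on_exit_remove expression from simple.dag.condor.sub.

    True means the DAGMan job is removed from the queue when it exits.
    False means it stays in the queue so that it can be run again.
    """
    # ExitSignal =?= 11
    killed_by_signal_11 = exit_signal == 11
    # ExitCode =!= UNDEFINED && ExitCode >= 0 && ExitCode <= 2
    exited_with_code_0_to_2 = exit_code is not None and 0 <= exit_code <= 2
    return killed_by_signal_11 or exited_with_code_0_to_2


if __name__ == "__main__":
    print(dagman_leaves_queue(exit_code=0))    # normal exit, as in the run above
    print(dagman_leaves_queue(exit_signal=9))  # killed by some other signal
```

In the run above, DAGMan exited with status 0, so the expression is true and the job leaves the queue.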

C:\condor-test> more simple.dag.lib.out
Executing condor dagman ...

C:\condor-test> more simple.dag.dagman.log
000 (009.000.000) 11/20 10:20:00 Job submitted from host: <128.105.48.96:63700>
...
001 (009.000.000) 11/20 10:20:00 Job executing on host: <128.105.48.96:63700>
...
005 (009.000.000) 11/20 10:20:38 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...

Clean up some of these files:

C:\condor-test> del simple.dag.condor.sub simple.dag.dagman.log simple.dag.dagman.out simple.dag.lib.err simple.dag.lib.out


Question: Why does DAGMan run as a Condor job?

Next: A more complex DAG
