A Simple DAG

6.1 What is DAGMan?

Your tutorial leader will introduce you to DAGMan and DAGs. In short, DAGMan lets you submit complex sequences of jobs, as long as they can be expressed as a directed acyclic graph. For example, you may wish to run a large parameter sweep, but before the sweep runs you need to prepare your data, and after the sweep runs you need to collate the results. Assuming you want to sweep over five parameters, the graph might look like this:
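A graph like that is described in a DAG input file. The sketch below is illustrative only: the node names and submit-file names are invented for this example, and each Job line points at an ordinary Condor submit file.

```text
# prepare runs first; the five sweep jobs may then run in parallel;
# collate runs only after every sweep job succeeds
Job Prepare prepare.submit
Job Sweep1  sweep1.submit
Job Sweep2  sweep2.submit
Job Sweep3  sweep3.submit
Job Sweep4  sweep4.submit
Job Sweep5  sweep5.submit
Job Collate collate.submit

PARENT Prepare CHILD Sweep1 Sweep2 Sweep3 Sweep4 Sweep5
PARENT Sweep1 Sweep2 Sweep3 Sweep4 Sweep5 CHILD Collate
```

DAGMan will not submit a node's job until all of that node's parents have completed successfully.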
DAGMan has many abilities, such as throttling jobs, recovery from failures, and more. More information about DAGMan can be found in the Condor manual.

6.2 Submitting a simple DAG

Make sure that your submit file has only one queue command in it, as when we first wrote it. We will just run vanilla universe jobs for now, though we could equally well run standard universe jobs.

    Universe = vanilla
    Executable = simple
    Arguments = 4 10
    Log = simple.log
    Output = simple.out
    Error = simple.error
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT
    Queue

We are going to get a bit more sophisticated in submitting our jobs now. Let's have three windows open. In one window you'll submit the job, in another you will watch the queue, and in the third you will watch what DAGMan does.

To prepare for this, we'll create a script to help watch the queue. Name it watch_condor_q. (Where it says Ctrl-D, type the character, not the full name. This will end the input for cat.)

    % cat > watch_condor_q
    #! /bin/sh
    while true; do
        condor_q -dag
        sleep 10
    done
    Ctrl-D
    % chmod a+x watch_condor_q

If you like, modify watch_condor_q so that it watches just your jobs, not those of everyone using your computer.

Now we will create the most minimal DAG possible: a DAG with just one node.

    % cat > simple.dag
    Job Simple submit
    Ctrl-D

In your first window, submit the job:

    % rm -f simple.log simple.out
    % condor_submit_dag -force simple.dag
    -----------------------------------------------------------------------
    File for submitting this DAG to Condor           : simple.dag.condor.sub
    Log of DAGMan debugging messages                 : simple.dag.dagman.out
    Log of Condor library debug messages             : simple.dag.lib.out
    Log of the life of condor_dagman itself          : simple.dag.dagman.log
    Condor Log file for all Condor jobs of this DAG  : simple.dag.dummy_log
    Submitting job(s).
    Logging submit event(s).
    1 job(s) submitted to cluster 7.
    -----------------------------------------------------------------------
    % condor_reschedule          <===== Don't miss this!

In the second window, watch the queue:

    % ./watch_condor_q

    -- Submitter: osg-edu.cs.wisc.edu : <193.206.208.141:9603> : osg-edu.cs.wisc.edu
     ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
      18.0   aroy            2/4  23:08   0+00:00:09 R  0   9.8  condor_dagman -f -
      19.0    |-Simple       2/4  23:08   0+00:00:00 I  0   9.8  simple 4 10

    2 jobs; 1 idle, 1 running, 0 held

    -- Submitter: osg-edu.cs.wisc.edu : <193.206.208.141:9603> : osg-edu.cs.wisc.edu
     ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
      18.0   aroy            2/4  23:08   0+00:00:19 R  0   9.8  condor_dagman -f -
      19.0    |-Simple       2/4  23:08   0+00:00:00 I  0   9.8  simple 4 10

    2 jobs; 1 idle, 1 running, 0 held

    -- Submitter: osg-edu.cs.wisc.edu : <193.206.208.141:9603> : osg-edu.cs.wisc.edu
     ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
      18.0   aroy            2/4  23:08   0+00:00:29 R  0   9.8  condor_dagman -f -
      19.0    |-Simple       2/4  23:08   0+00:00:05 R  0   9.8  simple 4 10

    2 jobs; 0 idle, 2 running, 0 held

    -- Submitter: osg-edu.cs.wisc.edu : <193.206.208.141:9603> : osg-edu.cs.wisc.edu
     ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD

    0 jobs; 0 idle, 0 running, 0 held
    Ctrl-C

In the third window, watch what DAGMan does:

    % tail -f --lines=500 simple.dag.dagman.out
    2/4 23:08:30 ******************************************************
    2/4 23:08:30 ** condor_scheduniv_exec.18.0 (CONDOR_DAGMAN) STARTING UP
    2/4 23:08:30 ** /usr/local/condor/bin/condor_dagman
    2/4 23:08:30 ** $CondorVersion: 6.8.6 Sep 13 2007 $
    2/4 23:08:30 ** $CondorPlatform: I386-LINUX_RH9 $
    2/4 23:08:30 ** PID = 577
    2/4 23:08:30 ** Log last touched time unavailable (No such file or directory)
    2/4 23:08:30 ******************************************************
    2/4 23:08:30 Using config source: /usr/local/condor/etc/condor_config
    2/4 23:08:30 Using local config sources:
    2/4 23:08:30    /var/local/condor/condor_config.local
    2/4 23:08:30 DaemonCore: Command Socket at <193.206.208.141:9684>
    2/4 23:08:30 DAGMAN_SUBMIT_DELAY setting: 0
    2/4 23:08:30 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
    2/4 23:08:30 DAGMAN_STARTUP_CYCLE_DETECT setting: 0
    2/4 23:08:30 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
    2/4 23:08:30 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
    2/4 23:08:30 DAGMAN_RETRY_SUBMIT_FIRST setting: 1
    2/4 23:08:30 DAGMAN_RETRY_NODE_FIRST setting: 0
    2/4 23:08:30 DAGMAN_MAX_JOBS_IDLE setting: 0
    2/4 23:08:30 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
    2/4 23:08:30 DAGMAN_MUNGE_NODE_NAMES setting: 1
    2/4 23:08:30 DAGMAN_DELETE_OLD_LOGS setting: 1
    2/4 23:08:30 DAGMAN_PROHIBIT_MULTI_JOBS setting: 0
    2/4 23:08:30 DAGMAN_ABORT_DUPLICATES setting: 0
    2/4 23:08:30 DAGMAN_PENDING_REPORT_INTERVAL setting: 600
    2/4 23:08:30 argv[0] == "condor_scheduniv_exec.18.0"
    2/4 23:08:30 argv[1] == "-Debug"
    2/4 23:08:30 argv[2] == "3"
    2/4 23:08:30 argv[3] == "-Lockfile"
    2/4 23:08:30 argv[4] == "simple.dag.lock"
    2/4 23:08:30 argv[5] == "-Condorlog"
    2/4 23:08:30 argv[6] == "/condor/aroy/condor-test/simple.log"
    2/4 23:08:30 argv[7] == "-Dag"
    2/4 23:08:30 argv[8] == "simple.dag"
    2/4 23:08:30 argv[9] == "-Rescue"
    2/4 23:08:30 argv[10] == "simple.dag.rescue"
    2/4 23:08:30 DAG Lockfile will be written to simple.dag.lock
    2/4 23:08:30 DAG Input file is simple.dag
    2/4 23:08:30 Rescue DAG will be written to simple.dag.rescue
    2/4 23:08:30 All DAG node user log files:
    2/4 23:08:30   /condor/aroy/condor-test/simple.log (Condor)
    2/4 23:08:30 Parsing simple.dag ...
    2/4 23:08:30 Dag contains 1 total jobs
    2/4 23:08:30 Truncating any older versions of log files...
    2/4 23:08:30 Bootstrapping...
    2/4 23:08:30 Number of pre-completed nodes: 0
    2/4 23:08:30 Registering condor_event_timer...
    2/4 23:08:31 Got node Simple from the ready queue
    2/4 23:08:31 Submitting Condor Node Simple job(s)...
    2/4 23:08:31 submitting: condor_submit -a dag_node_name' '=' 'Simple -a +DAGManJobId' '=' '18 -a DAGManJobId' '=' '18 -a submit_event_notes' '=' 'DAG' 'Node:' 'Simple -a +DAGParentNodeNames' '=' '"" submit
    2/4 23:08:31 From submit: Submitting job(s).
    2/4 23:08:31 From submit: Logging submit event(s).
    2/4 23:08:31 From submit: 1 job(s) submitted to cluster 19.
    2/4 23:08:31   assigned Condor ID (19.0)
    2/4 23:08:31 Just submitted 1 job this cycle...
    2/4 23:08:31 Event: ULOG_SUBMIT for Condor Node Simple (19.0)
    2/4 23:08:31 Number of idle job procs: 1
    2/4 23:08:31 Of 1 nodes total:
    2/4 23:08:31  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
    2/4 23:08:31   ===     ===      ===     ===     ===        ===      ===
    2/4 23:08:31     0       0        1       0       0          0        0
    2/4 23:08:56 Event: ULOG_EXECUTE for Condor Node Simple (19.0)
    2/4 23:08:56 Number of idle job procs: 0
    2/4 23:09:01 Event: ULOG_JOB_TERMINATED for Condor Node Simple (19.0)
    2/4 23:09:01 Node Simple job proc (19.0) completed successfully.
    2/4 23:09:01 Node Simple job completed
    2/4 23:09:01 Number of idle job procs: 0
    2/4 23:09:01 Of 1 nodes total:
    2/4 23:09:01  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
    2/4 23:09:01   ===     ===      ===     ===     ===        ===      ===
    2/4 23:09:01     1       0        0       0       0          0        0
    2/4 23:09:01 All jobs Completed!
    2/4 23:09:01 Note: 0 total job deferrals because of -MaxJobs limit (0)
    2/4 23:09:01 Note: 0 total job deferrals because of -MaxIdle limit (0)
    2/4 23:09:01 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
    2/4 23:09:01 Note: 0 total POST script deferrals because of -MaxPost limit (0)
    2/4 23:09:01 **** condor_scheduniv_exec.18.0 (condor_DAGMAN) EXITING WITH STATUS 0

Now verify your results:

    % cat simple.log
    000 (019.000.000) 02/04 23:08:31 Job submitted from host: <193.206.208.141:9603>
        DAG Node: Simple
    ...
    001 (019.000.000) 02/04 23:08:55 Job executing on host: <193.206.208.205:9674>
    ...
    005 (019.000.000) 02/04 23:08:59 Job terminated.
        (1) Normal termination (return value 0)
            Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
            Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
            Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
            Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        56  -  Run Bytes Sent By Job
        5052  -  Run Bytes Received By Job
        56  -  Total Bytes Sent By Job
        5052  -  Total Bytes Received By Job
    ...

    % cat simple.out
    Thinking really hard for 4 seconds...
    We calculated: 20

Looking at DAGMan's various files, we see that DAGMan itself ran as a Condor job (specifically, a scheduler universe job).

    % ls -1 simple.dag.*
    simple.dag.condor.sub
    simple.dag.dagman.log
    simple.dag.dagman.out
    simple.dag.lib.err
    simple.dag.lib.out

    % cat simple.dag.condor.sub
    # Filename: simple.dag.condor.sub
    # Generated by condor_submit_dag simple.dag
    universe        = scheduler
    executable      = /usr/local/condor/bin/condor_dagman
    getenv          = True
    output          = simple.dag.lib.out
    error           = simple.dag.lib.err
    log             = simple.dag.dagman.log
    remove_kill_sig = SIGUSR1
    # ensure DAGMan is automatically requeued by the schedd if it
    # exits abnormally or is killed (e.g., during a reboot)
    on_exit_remove  = ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >= 0 && ExitCode <= 2))
    copy_to_spool   = False
    arguments       = -f -l . -Debug 3 -Lockfile simple.dag.lock -Condorlog /condor/aroy/condor-test/simple.log -Dag simple.dag -Rescue simple.dag.rescue
    environment     = _CONDOR_DAGMAN_LOG=simple.dag.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0
    queue

    % cat simple.dag.dagman.log
    000 (018.000.000) 02/04 23:08:30 Job submitted from host: <193.206.208.141:9603>
    ...
    001 (018.000.000) 02/04 23:08:30 Job executing on host: <193.206.208.141:9603>
    ...
    005 (018.000.000) 02/04 23:09:01 Job terminated.
        (1) Normal termination (return value 0)
            Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
            Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
            Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
            Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
    ...

Clean up some of these files:

    % rm -f simple.dag.*
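Earlier we noted that DAGMan can throttle jobs and recover from failures. As a rough sketch of how that looks in practice (the retry count below is an arbitrary choice for illustration, not a value from this tutorial):

```text
# simple.dag with an added Retry line: if the Simple node's job
# exits with a non-zero code, DAGMan resubmits it up to 3 more times
Job Simple submit
Retry Simple 3
```

Throttling is chosen at submit time; for example, condor_submit_dag -maxjobs 2 simple.dag limits DAGMan to two node jobs in the queue at once. And if a DAG fails partway through, DAGMan writes a rescue DAG (here it would be simple.dag.rescue, as shown in the dagman.out log) that can be submitted to resume from the point of failure. See the Condor manual for the full set of options.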
Question
Why does DAGMan run as a Condor job?