Condor-G and DAGMan Hands-On Lab

Part III: Held Jobs (Being prepared for failure)

When a problem occurs in the middleware, Condor-G places your job on "Hold". Held jobs remain in the queue but wait for user intervention. Once you resolve the problem, use condor_release to let the job continue.

You can also place jobs on hold yourself with condor_hold, for example if you want to delay a run.
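
As a minimal sketch of that hold-and-release round trip, the two helper functions below wrap condor_hold and condor_release for a single job. The function names delay_job and resume_job are made up for illustration, and the job ID 2.0 in the usage comment is only an example; use the ID that condor_q reports for your own job.

```shell
# Hypothetical helpers (the names are illustrative, not part of Condor).

# Place one job on hold. $1 is the job ID as shown by condor_q, e.g. 2.0
delay_job() {
    condor_hold "$1"
}

# Let a held job continue. $1 is the same job ID.
resume_job() {
    condor_release "$1"
}

# Example usage (commented out so nothing runs by accident):
#   delay_job 2.0
#   resume_job 2.0
```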

For this example, we'll make the output file non-writable. The job will be unable to copy the results back and will be placed on hold.

Submit the job again, but this time immediately after submitting it, mark the output file as read-only:

$ condor_submit myjob.submit ; chmod a-w results.output
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 2.

Watch the job with tail. When the job goes on hold, use Ctrl-C to exit tail.

$ tail -f --lines=500 results.log
000 (003.000.000) 07/12 22:35:44 Job submitted from host: <192.167.1.23:32864>
...
017 (003.000.000) 07/12 22:35:57 Job submitted to Globus
    RM-Contact: my-gatekeeper.cs.wisc.edu:/jobmanager-fork
    JM-Contact: https://my-gatekeeper.cs.wisc.edu:33178/12497/1058042148/
    Can-Restart-JM: 1
...
001 (003.000.000) 07/12 22:35:57 Job executing on host: my-gatekeeper.cs.wisc.edu
...
012 (003.000.000) 07/12 22:36:52 Job was held.
        Globus error 129: the standard output/error size is different
...
Ctrl-C
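
If you would rather check the log from a script than watch tail, the event code at the start of each log entry is enough to tell what happened. The sketch below assumes the log layout shown above, where each event line starts with a three-digit code (012 for "Job was held" in the excerpt). The function names last_event and job_is_held are hypothetical.

```shell
# Print the three-digit code of the most recent event in a Condor user log.
# Assumes event lines look like:
#   012 (003.000.000) 07/12 22:36:52 Job was held.
last_event() {
    grep -E '^[0-9]{3} \(' "$1" | tail -n 1 | cut -c1-3
}

# Succeed (exit status 0) if the most recent event is "Job was held" (012).
job_is_held() {
    [ "$(last_event "$1")" = "012" ]
}

# Example usage:
#   job_is_held results.log && echo "job is on hold"
```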

Note that condor_q reports that the job is in the "H" or Held state.

$ condor_q

 
-- Submitter: pc-23.gs.unina.it : <192.167.1.23:32864> : pc-23.gs.unina.it
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   2.0   adesmet         7/12 22:35   0+00:00:55 H  0   0.0  myscript.sh TestJo
 
1 jobs; 0 idle, 0 running, 1 held
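
To pick out held jobs without reading the whole listing, you can filter on the ST column. The sketch below assumes the column layout shown above (ID, OWNER, SUBMITTED date and time, RUN_TIME, ST), so the state is the sixth whitespace-separated field. The name held_jobs is hypothetical, and the field number may need adjusting if your condor_q prints different columns.

```shell
# Print the IDs of held jobs, reading condor_q output on standard input.
# Assumes the job state (ST) is the sixth field, as in the listing above.
held_jobs() {
    awk '$1 ~ /^[0-9]+\.[0-9]+$/ && $6 == "H" { print $1 }'
}

# Example usage:
#   condor_q | held_jobs
```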

Fix the problem (make the file writable again), then release the job. You can specify the job's ID, or just use "-all" to release all held jobs.

$ chmod u+w results.output
$ condor_release -all
All jobs released.

Again, watch the log until the job finishes:

$ tail -f --lines=500 results.log
000 (003.000.000) 07/12 22:35:44 Job submitted from host: <192.167.1.23:32864>
...
017 (003.000.000) 07/12 22:35:57 Job submitted to Globus
    RM-Contact: my-gatekeeper.cs.wisc.edu:/jobmanager-fork
    JM-Contact: https://my-gatekeeper.cs.wisc.edu:33178/12497/1058042148/
    Can-Restart-JM: 1
...
001 (003.000.000) 07/12 22:35:57 Job executing on host: my-gatekeeper.cs.wisc.edu
...
012 (003.000.000) 07/12 22:36:52 Job was held.
        Globus error 129: the standard output/error size is different
...
013 (003.000.000) 07/12 22:44:33 Job was released.
        via condor_release (by user user??)
...
001 (003.000.000) 07/12 22:44:46 Job executing on host: my-gatekeeper.cs.wisc.edu
...
005 (003.000.000) 07/12 22:44:51 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...
Ctrl-C
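
A script can confirm the outcome from the log as well. This sketch assumes the "Normal termination (return value N)" wording shown in the log above and prints N for the last termination recorded. The name job_return_value is hypothetical.

```shell
# Print the return value from the last "Normal termination" entry in the log.
# Prints nothing if the log has no such entry.
job_return_value() {
    sed -n 's/.*Normal termination (return value \([0-9][0-9]*\)).*/\1/p' "$1" | tail -n 1
}

# Example usage:
#   job_return_value results.log
```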

Your job has finished and the results have been retrieved successfully:

$ cat results.output
I'm process id 12528 on my-gatekeeper.cs.wisc.edu
Sat Jul 12 22:35:53 CEST 2003
Running as binary /home/user??/.globus/.gass_cache/local/md5/6d/217f3f7926c06a529143f6129bf269/md5/a7/2af94ba728c69c588e523a99baaefd/data TestJob 10
My name (argument 1) is TestJob
My sleep duration (argument 2) is 10
Sleep of 10 seconds finished.  Exiting
RESULT: 0 SUCCESS

Before continuing, clean up the results:

$ rm results.*
