When a problem occurs in the middleware, Condor-G places your job on "Hold". Held jobs remain in the queue but wait for user intervention. Once you have resolved the problem, use condor_release to let the job continue.
You can also place jobs on hold yourself with condor_hold, for example if you want to delay a run.
For this example, we'll make the output file non-writable. The job will be unable to copy the results back and will be placed on hold.
Submit the job again, but this time immediately after submitting it, mark the output file as read-only:
    $ condor_submit myjob.submit ; chmod a-w results.output
    Submitting job(s).
    Logging submit event(s).
    1 job(s) submitted to cluster 2.
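The effect of the chmod in the command above can be checked on a scratch file before relying on it for real job output. The following is only a minimal sketch and not part of the tutorial's own steps; the temporary file and the `mode_of` helper are hypothetical names introduced here for illustration.

```shell
#!/bin/sh
# Sketch: show how chmod a-w and chmod u+w change a file's permission bits.
# mode_of is a hypothetical helper; it prints the 10-character permission
# string from the first column of ls -l.
mode_of() {
    ls -l "$1" | awk '{print substr($1, 1, 10)}'
}

scratch=$(mktemp)          # stand-in for results.output
chmod 644 "$scratch"       # start from a known mode (rw-r--r--)

chmod a-w "$scratch"       # remove write permission for everyone
ro_mode=$(mode_of "$scratch")
echo "after chmod a-w: $ro_mode"

chmod u+w "$scratch"       # give write permission back to the owner only
restored_mode=$(mode_of "$scratch")
echo "after chmod u+w: $restored_mode"

rm -f "$scratch"
```

Note that `chmod u+w` restores write access for the owner only, which is all the job needs in order to copy its results back.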
Watch the job with tail. When the job goes on hold, use Ctrl-C to exit tail.
    $ tail -f --lines=500 results.log
    000 (003.000.000) 07/12 22:35:44 Job submitted from host: <192.167.1.23:32864>
    ...
    017 (003.000.000) 07/12 22:35:57 Job submitted to Globus
        RM-Contact: my-gatekeeper.cs.wisc.edu:/jobmanager-fork
        JM-Contact: https://my-gatekeeper.cs.wisc.edu:33178/12497/1058042148/
        Can-Restart-JM: 1
    ...
    001 (003.000.000) 07/12 22:35:57 Job executing on host: my-gatekeeper.gs.unina.it
    ...
    012 (003.000.000) 07/12 22:36:52 Job was held.
        Globus error 129: the standard output/error size is different
    ...
    Ctrl-C
Note that condor_q reports that the job is in the "H" or Held state.
    $ condor_q

    -- Submitter: pc-23.gs.unina.it : <192.167.1.23:32864> : pc-23.gs.unina.it
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
       2.0   adesmet         7/12 22:35   0+00:00:55 H  0   0.0  myscript.sh TestJo

    1 jobs; 0 idle, 0 running, 1 held
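Instead of watching tail interactively, the hold reason can be pulled out of the log with a small filter. This is a sketch under one assumption: that the user log puts the reason on its own line directly after the "Job was held." event line. The `hold_reasons` helper and the sample log file are hypothetical and exist only so the example runs without Condor installed.

```shell
#!/bin/sh
# Sketch: print the reason line that follows each "Job was held." event.
# hold_reasons is a hypothetical helper; it assumes the reason is on the
# line immediately after the event line.
hold_reasons() {
    awk '/Job was held\./ {
        getline reason                 # read the line after the event
        sub(/^[ \t]+/, "", reason)     # strip leading indentation
        print reason
    }' "$1"
}

# Small sample log using the events from this tutorial.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
001 (003.000.000) 07/12 22:35:57 Job executing on host: my-gatekeeper.cs.wisc.edu
...
012 (003.000.000) 07/12 22:36:52 Job was held.
        Globus error 129: the standard output/error size is different
...
EOF

reason=$(hold_reasons "$sample_log")
echo "$reason"
rm -f "$sample_log"
```

Against the real log from this exercise, the call would be `hold_reasons results.log`.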
Fix the problem (make the file writable again), then release the job. You can specify the job's ID, or just use "-all" to release all held jobs.
    $ chmod u+w results.output
    $ condor_release -all
    All jobs released.
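If you would rather release jobs by ID than use "-all", the held job IDs can be picked out of the condor_q listing. The sketch below assumes the column layout shown in this tutorial, where the state (ST) is the sixth whitespace-separated field of a job line; other Condor versions may print different columns. `held_ids` is a hypothetical helper, and the sample text is copied from the condor_q output above so the example runs without Condor.

```shell
#!/bin/sh
# Sketch: list the IDs of held jobs from condor_q-style output on stdin.
# held_ids is a hypothetical helper. It assumes the job state is field 6
# and that job IDs look like "cluster.proc" (for example 2.0).
held_ids() {
    awk '$6 == "H" && $1 ~ /^[0-9]+\.[0-9]+$/ { print $1 }'
}

# Sample listing taken from this tutorial.
sample='-- Submitter: pc-23.gs.unina.it : <192.167.1.23:32864> : pc-23.gs.unina.it
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   2.0   adesmet         7/12 22:35   0+00:00:55 H  0   0.0  myscript.sh TestJo
1 jobs; 0 idle, 0 running, 1 held'

ids=$(printf '%s\n' "$sample" | held_ids)
echo "$ids"
```

With a live queue, the same filter could feed condor_release one job at a time, for example `for id in $(condor_q | held_ids); do condor_release "$id"; done`.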
Again, watch the log until the job finishes:
    $ tail -f --lines=500 results.log
    000 (003.000.000) 07/12 22:35:44 Job submitted from host: <192.167.1.23:32864>
    ...
    017 (003.000.000) 07/12 22:35:57 Job submitted to Globus
        RM-Contact: my-gatekeeper.cs.wisc.edu:/jobmanager-fork
        JM-Contact: https://my-gatekeeper.cs.wisc.edu:33178/12497/1058042148/
        Can-Restart-JM: 1
    ...
    001 (003.000.000) 07/12 22:35:57 Job executing on host: my-gatekeeper.cs.wisc.edu
    ...
    012 (003.000.000) 07/12 22:36:52 Job was held.
        Globus error 129: the standard output/error size is different
    ...
    013 (003.000.000) 07/12 22:44:33 Job was released.
        via condor_release (by user user??)
    ...
    001 (003.000.000) 07/12 22:44:46 Job executing on host: my-gatekeeper.cs.wisc.edu
    ...
    005 (003.000.000) 07/12 22:44:51 Job terminated.
        (1) Normal termination (return value 0)
            Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
            Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
            Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
            Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
        0 - Run Bytes Sent By Job
        0 - Run Bytes Received By Job
        0 - Total Bytes Sent By Job
        0 - Total Bytes Received By Job
    ...
    Ctrl-C
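To confirm that the job ended normally without reading the whole log, the return value can be extracted from the termination event. This is a rough sketch: `exit_value` is a hypothetical helper, and it only recognises the "Normal termination (return value N)" wording seen in the log above.

```shell
#!/bin/sh
# Sketch: pull the return value out of a "Normal termination" log line.
# exit_value is a hypothetical helper; it prints nothing if no such line exists.
exit_value() {
    sed -n 's/.*Normal termination (return value \([0-9][0-9]*\)).*/\1/p' "$1"
}

# Small sample log using the termination event from this tutorial.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
005 (003.000.000) 07/12 22:44:51 Job terminated.
        (1) Normal termination (return value 0)
EOF

value=$(exit_value "$sample_log")
echo "return value: $value"
rm -f "$sample_log"
```

Against the real log from this exercise, the call would be `exit_value results.log`.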
Your job finished, and the results have been retrieved successfully:
    $ cat results.output
    I'm process id 12528 on my-gatekeeper.cs.wisc.edu
    Sat Jul 12 22:35:53 CEST 2003
    Running as binary /home/user??/.globus/.gass_cache/local/md5/6d/217f3f7926c06a529143f6129bf269/md5/a7/2af94ba728c69c588e523a99baaefd/data TestJob 10
    My name (argument 1) is TestJob
    My sleep duration (argument 2) is 10
    Sleep of 10 seconds finished. Exiting
    RESULT: 0 SUCCESS
Before continuing, clean up the results:
    $ rm results.*