At this point, you should have finished the Condor exercises and the Search for Knowledge exercises. If everything went smoothly for you, if you have extra time, and if you feel comfortable with C++ on Linux, you can continue on to learn about MW. Because C++ is not a prerequisite for this class, we do not expect you to go through this, but we want to provide it as an option for those of you who are interested.
Master Worker (MW for short) is an addition to Condor: it is not provided with Condor but is an extra download. It has no hidden knowledge of Condor, but is built on top of Condor using public interfaces. The first thing to do is to download MW into your home directory:
% cd ~
% wget http://www.cs.wisc.edu/condor/mw/mw-0.2.tar.gz
We apologize in advance that the documentation for MW is rather light, but you can read what there is online.
In your home directory, first extract the MW source code and rename the extracted directory to mw-src, then make a directory named mw to install MW into:
% tar xzf mw-0.2.tar.gz
% mv mw mw-src
% mkdir ~/mw
Now configure and build MW. Make sure CONDOR_CONFIG is set properly and that you can find the Condor binaries in /opt/condor-6.6.10. We're going to build MW without PVM and use the socket implementation instead. From a high-level perspective, it doesn't make a difference which you use. The socket implementation is slightly less capable, but it is much easier to use and to debug if there are problems. The entire configure/make process should take just a couple of minutes. Make sure you edit the prefix that you give to configure so that it points to your own install directory. Note that because your home directory is on NFS, it may build slowly.
% which condor_version
/opt/condor-6.6.10/bin/condor_version
% echo $CONDOR_CONFIG
/opt/condor-6.6.10/etc/condor_config
% cd mw-src
% ./configure --with-condor=/opt/condor-6.6.10 \
    --prefix=~/mw \
    --without-pvm
checking for g++... g++
checking for C++ compiler default output... a.out
checking whether the C++ compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables...
checking for suffix of object files... o
[output trimmed...]
% make
[ "__src examples" = "__" ] || for subdir in `echo "src examples"`; do (cd $subdir && make all) ; done
make[1]: Entering directory `/gine/roy/mw-src/src'
/usr/bin/g++ -DPACKAGE_NAME=\"\" -DPACKAGE_TARNAME=\"\" -DPACKAGE_VERSION=\"\" -DPACKAGE_STRING=\"\" -DPACKAGE_BUGREPORT=\"\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_FCNTL_H=1 -DHAVE_LIMITS_H=1 -DHAVE_SYS_TIME_H=1 -DHAVE_UNISTD_H=1 -DTIME_WITH_SYS_TIME=1 -DHAVE_VPRINTF=1 -DHAVE_GETCWD=1 -DHAVE_GETHOSTNAME=1 -DHAVE_GETTIMEOFDAY=1 -DHAVE_MKDIR=1 -DHAVE_STRSTR=1 -DHAVE_DYNAMIC_CAST= -DCONDOR_DIR=\"/opt/condor-6.6.10\" -I. -I. -IRMComm -IMW-File -IMW-CondorPVM -IMW-Socket -IMWControlTasks -g -O2 -Wall -c MW.C
[output trimmed...]
% make install
[ "__src examples" = "__" ] || for subdir in `echo "src examples"`; do (cd $subdir && make install) ; done
make[1]: Entering directory `/home/roy//mw-src/src'
/bin/sh ../mkinstalldirs /home/roy/mw/lib
mkdir /home/roy/school/mw/lib
/usr/bin/install -c -m 644 libMW.a /home/roy/mw/lib/libMW.a
[output trimmed...]
Assuming you don't see any errors, you're set to go!
MW provides several examples. They are all in the mw-src/examples directory.
% cd examples
% ls
Makefile  Makefile.in  fib/  knapsack/  matmul/  n-queens/  newmatmul/  newskel/  skel/
MW can run applications within Condor, but it can also run them without Condor, just on your computer. Running without Condor creates only a single worker, which executes the tasks serially. This can be easier to try out and easier to debug. Let's run matmul in independent mode. The matrices to be multiplied are in the file named in_master.
% cd matmaul % cat in_master 1 1 workermatmul_condorpvm.LINUX 0 10 10 10 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 % ./mastermatmul_indp < in_master 10:42:52 MWDriver is pid 32316. 10:42:52 Starting from the beginning. 10:42:52 argc=1, argv[0]=./mastermatmul_indp 10:42:52 workermatmul_condorpvm.LINUX 10:42:52 tempnum_executables = 0 10:42:52 Good to go. 10:42:52 num_TODO = 10, num_run = 0, num_done = 0 10:42:52 CONTINUE -- todo list has at least task number: 10 [output trimmed...] 10:42:55 The resulting Matrix is as follows 10:42:55 0 45 90 135 180 225 270 315 360 405 10:42:55 0 45 90 135 180 225 270 315 360 405 10:42:55 0 45 90 135 180 225 270 315 360 405 10:42:55 0 45 90 135 180 225 270 315 360 405 10:42:55 0 45 90 135 180 225 270 315 360 405 10:42:55 0 45 90 135 180 225 270 315 360 405 10:42:55 0 45 90 135 180 225 270 315 360 405 10:42:55 0 45 90 135 180 225 270 315 360 405 10:42:55 0 45 90 135 180 225 270 315 360 405 10:42:55 0 45 90 135 180 225 270 315 360 405 [output trimmed...] 10:42:55 Killing workers: 10:42:55 MWList::Can't remove any element from empty list. 10:42:55 MWList::Can't remove any element from empty list. 10:42:55 MWList::Can't remove any element from empty list. 10:42:55 MWList::Can't remove any element from empty list.Ignore those error messages at the end (Can't remove any element...).
Congratulations! You've successfully run your first MW job, albeit a simple one.
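If you want to convince yourself that the result matrix is right, the following is a minimal standalone C++ sketch of the same computation. It does not use MW at all. It assumes that each of the ten tasks reported by num_TODO corresponds to one row of the product, which is a guess based on the log rather than something taken from the MW source, and the names compute_row_task, run_serially, and make_input are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<long>>;

// One "task": compute a single row of the product a * b.
// Splitting the work by row is an assumption made for illustration.
std::vector<long> compute_row_task(const Matrix& a, const Matrix& b,
                                   std::size_t row) {
    assert(!b.empty() && a[row].size() == b.size());
    std::vector<long> out(b[0].size(), 0);
    for (std::size_t k = 0; k < b.size(); ++k)
        for (std::size_t j = 0; j < b[0].size(); ++j)
            out[j] += a[row][k] * b[k][j];
    return out;
}

// A single worker running every task in order, one after another,
// which is how the independent mode run above behaves.
Matrix run_serially(const Matrix& a, const Matrix& b) {
    Matrix result;
    for (std::size_t row = 0; row < a.size(); ++row)
        result.push_back(compute_row_task(a, b, row));
    return result;
}

// Builds an n x n matrix whose every row is 0 1 2 ... n-1,
// the same values that appear in in_master.
Matrix make_input(std::size_t n = 10) {
    Matrix m(n, std::vector<long>(n));
    for (auto& row : m)
        for (std::size_t j = 0; j < n; ++j)
            row[j] = static_cast<long>(j);
    return m;
}
```

Calling run_serially(make_input(), make_input()) from a small main function produces ten identical rows, 0 45 90 ... 405, which matches the matrix printed by mastermatmul_indp. Each entry in column j is the sum of k * j for k from 0 to 9, that is, 45 * j.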
The submit file for the matmul example is submit_socket. Theoretically you could use submit_pvm, but we don't have PVM installed. You could also use submit_file, which uses Condor's standard universe, but there is no particular advantage for our short-running job.
Look at submit_socket:
# Now we're in the scheduler universe
universe = Scheduler
# The name of our executable
Executable = mastermatmul_socket
# Assume a max image size of 16 Megabytes.
Image_Size = 4 Meg
+MemoryRequirements = 4
# This goes into stdin for the master.
Input = in_master.socket
# Set the output of this job to go to out_master
Output = out_master.socket
# Set the stderr of this job to go to out_worker. It is named
# out_worker because the output of the workers is directed to stderr
Error = out_worker.socket
# Keep a log in case of problems.
Log = work.log
notify_user = chang@cs.wisc.edu
Queue
Notice two things about this submit file:
Now submit the job and watch it run:
% condor_submit submit_socket
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 1.
% condor_q
-- Submitter: chopin.cs.wisc.edu : <128.105.121.21:50689> : chopin.cs.wisc.edu
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   roy     7/5  10:51  0+00:00:00 R  0   4.0  mastermatmul_socke
1 jobs; 0 idle, 1 running, 0 held
% condor_q
-- Submitter: chopin.cs.wisc.edu : <128.105.121.21:50689> : chopin.cs.wisc.edu
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   roy     7/5  10:51  0+00:00:01 R  0   4.0  mastermatmul_socke
   2.0   roy     7/5  10:51  0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
   2.1   roy     7/5  10:51  0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
   2.2   roy     7/5  10:51  0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
   2.3   roy     7/5  10:51  0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
   2.4   roy     7/5  10:51  0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
   2.5   roy     7/5  10:51  0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
% condor_q
-- Submitter: chopin.cs.wisc.edu : <128.105.121.21:50689> : chopin.cs.wisc.edu
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   roy     7/5  10:51  0+00:00:26 R  0   4.0  mastermatmul_socke
   2.0   roy     7/5  10:51  0+00:00:03 R  0   0.0  mw_exec0.$$(Opsys)
   2.1   roy     7/5  10:51  0+00:00:01 R  0   0.0  mw_exec0.$$(Opsys)
   2.2   roy     7/5  10:51  0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
   2.3   roy     7/5  10:51  0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
   2.4   roy     7/5  10:51  0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
   2.5   roy     7/5  10:51  0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
% condor_q
-- Submitter: chopin.cs.wisc.edu : <128.105.121.21:50689> : chopin.cs.wisc.edu
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
We saw the master submit six workers. Two of them started to run, and the job finished before the other four ever left the idle state. Look at out_master.socket to see the result of the run:
% cat out_master.socket
10:51:19 MWDriver is pid 507.
10:51:19 Socket bound to port: 8997
10:51:19 Starting from the beginning.
10:51:19 argc=1, argv[0]=condor_scheduniv_exec.1.0
10:51:19 workermatmul_socket
10:51:19 tempnum_executables = 0
10:51:19 Making a link from workermatmul_socket to mw_exec0.LINUX.INTEL
10:51:19 In MWSocketRC::init_beginning_workers()
10:51:19 Good to go.
[output trimmed...]
If you look at the output carefully, you'll notice that only one worker did all of the tasks. That is because the time to do the tasks in this simple case was really short.
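To see why short tasks can all end up on one worker, here is a small C++ sketch of a simplified model. It is not MW's actual scheduling code: it simply assumes that each task goes to whichever worker is free earliest, that every task takes the same time, and that workers become available at different moments. The function name tasks_per_worker and all of the numbers are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns how many tasks each worker completes under a simplified model:
// every task is handed to the worker that becomes free earliest.
//   start_time[w]  when worker w first becomes available (e.g. in ms)
//   task_cost      time one task takes, in the same units
//   num_tasks      total number of tasks to hand out
std::vector<int> tasks_per_worker(const std::vector<double>& start_time,
                                  double task_cost, int num_tasks) {
    assert(!start_time.empty() && task_cost >= 0.0 && num_tasks >= 0);
    std::vector<double> free_at = start_time;  // next moment each worker is idle
    std::vector<int> done(start_time.size(), 0);
    for (int t = 0; t < num_tasks; ++t) {
        std::size_t best = 0;  // index of the earliest-free worker so far
        for (std::size_t w = 1; w < free_at.size(); ++w)
            if (free_at[w] < free_at[best]) best = w;
        free_at[best] += task_cost;  // that worker is busy for one task
        ++done[best];
    }
    return done;
}
```

With two workers that become available at 0 and 2000 time units, tasks_per_worker({0.0, 2000.0}, 1.0, 10) gives the first worker all ten tasks, because it finishes everything long before the second worker is ready. Raising the task cost to 1000.0 spreads the same ten tasks six to four, which suggests that longer tasks are needed before extra workers pay off.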
Now that you've tried out the basics, we'll let you explore by yourself. Here are some ideas:
RMC->set_target_num_workers( target_num_workers );