Master Worker

Please Note

At this point, you should have finished the Condor exercises and the Search for Knowledge exercises. If everything went smoothly, you have extra time, and you feel comfortable with C++ on Linux, you can continue on to learn about MW. Because C++ is not a prerequisite for this class, we do not expect you to go through this section, but we want to provide it as an option for those of you who are interested.

Getting Ready

Master Worker (MW for short) is an addition to Condor: it is not provided with Condor but is an extra download. It has no hidden knowledge of Condor, but is built on top of Condor using public interfaces. The first thing to do is to download MW into your home directory:

% cd ~

% wget http://www.cs.wisc.edu/condor/mw/mw-0.2.tar.gz

We apologize in advance that the documentation for MW is rather light, but you can read what there is online.

Compiling MW

In your home directory, first extract the MW source code and rename the extracted directory to mw-src, then make a separate ~/mw directory to install MW into:

% tar xzf mw-0.2.tar.gz

% mv mw mw-src

% mkdir ~/mw

Now configure and build MW. Make sure that CONDOR_CONFIG is properly set and that you can find the Condor binaries in /opt/condor-6.6.10. We're going to build without PVM and use the socket implementation instead. From a high-level perspective, it doesn't matter which you use: the socket implementation is slightly less capable, but it is much easier to use and to debug if there are problems. The entire configure/make process should take only a couple of minutes. Make sure you edit the prefix that you give to configure. Note that because your home directory is on NFS, it may build slowly.

% which condor_version
/opt/condor-6.6.10/bin/condor_version

% echo $CONDOR_CONFIG
/opt/condor-6.6.10/etc/condor_config

% cd mw-src

% ./configure --with-condor=/opt/condor-6.6.10 \
              --prefix=~/mw                    \
              --without-pvm
checking for g++... g++
checking for C++ compiler default output... a.out
checking whether the C++ compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables... 
checking for suffix of object files... o
[output trimmed...]

% make
[ "__src examples" = "__" ] || for subdir in `echo "src examples"`; do (cd $subdir && make all) ; done
make[1]: Entering directory `/home/roy/mw-src/src'
/usr/bin/g++ -DPACKAGE_NAME=\"\"
-DPACKAGE_TARNAME=\"\" -DPACKAGE_VERSION=\"\" -DPACKAGE_STRING=\"\"
-DPACKAGE_BUGREPORT=\"\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1
-DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1
-DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1
-DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_FCNTL_H=1 -DHAVE_LIMITS_H=1
-DHAVE_SYS_TIME_H=1 -DHAVE_UNISTD_H=1 -DTIME_WITH_SYS_TIME=1
-DHAVE_VPRINTF=1 -DHAVE_GETCWD=1 -DHAVE_GETHOSTNAME=1
-DHAVE_GETTIMEOFDAY=1 -DHAVE_MKDIR=1 -DHAVE_STRSTR=1
-DHAVE_DYNAMIC_CAST=
-DCONDOR_DIR=\"/opt/condor-6.6.10\"  -I. -I. -IRMComm
-IMW-File -IMW-CondorPVM -IMW-Socket -IMWControlTasks      -g -O2
-Wall -c MW.C
[output trimmed...]

% make install
[ "__src examples" = "__" ] || for subdir in `echo "src examples"`; do (cd $subdir && make install) ; done
make[1]: Entering directory `/home/roy/mw-src/src'
/bin/sh ../mkinstalldirs /home/roy/mw/lib
mkdir /home/roy/mw/lib
/usr/bin/install -c -m 644 libMW.a  /home/roy/mw/lib/libMW.a
[output trimmed...]

Assuming you don't see any errors, you're set to go!

The examples

MW comes with several examples. They are all in the mw-src/examples directory.

% cd examples

% ls 
Makefile  Makefile.in  fib/  knapsack/  matmul/  n-queens/  newmatmul/  newskel/  skel/

Trying an example in independent mode

MW can run applications within Condor, but it can also run them without Condor, just on your computer. This creates only a single worker, which executes the tasks serially, making it easier to try out and easier to debug. Let's run matmul in independent mode. The matrices to be multiplied are in the file named in_master.

% cd matmul

% cat in_master
1
1
workermatmul_condorpvm.LINUX 0
10
10
10

0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9

0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9

% ./mastermatmul_indp < in_master
10:42:52 MWDriver is pid 32316.
10:42:52 Starting from the beginning.
10:42:52 argc=1, argv[0]=./mastermatmul_indp
10:42:52  workermatmul_condorpvm.LINUX
10:42:52 tempnum_executables = 0
10:42:52 Good to go.
10:42:52 num_TODO = 10, num_run = 0, num_done = 0
10:42:52  CONTINUE -- todo list has at least task number: 10

[output trimmed...]

10:42:55 The resulting Matrix is as follows
10:42:55 0 45 90 135 180 225 270 315 360 405 
10:42:55 0 45 90 135 180 225 270 315 360 405 
10:42:55 0 45 90 135 180 225 270 315 360 405 
10:42:55 0 45 90 135 180 225 270 315 360 405 
10:42:55 0 45 90 135 180 225 270 315 360 405 
10:42:55 0 45 90 135 180 225 270 315 360 405 
10:42:55 0 45 90 135 180 225 270 315 360 405 
10:42:55 0 45 90 135 180 225 270 315 360 405 
10:42:55 0 45 90 135 180 225 270 315 360 405 
10:42:55 0 45 90 135 180 225 270 315 360 405 

[output trimmed...]

10:42:55 Killing workers:
10:42:55 MWList::Can't remove any element from empty list.
10:42:55 MWList::Can't remove any element from empty list.
10:42:55 MWList::Can't remove any element from empty list.
10:42:55 MWList::Can't remove any element from empty list.

Ignore those error messages at the end ("Can't remove any element..."); they are harmless.

Congratulations! You've successfully run your first MW job, albeit a simple one.

Trying an example as a Condor job

The submit file for the matmul example is submit_socket. In theory you could use submit_pvm, but we don't have PVM installed. You could also use submit_file, which uses Condor's standard universe, but that offers no particular advantage for our short-running job.

Look at submit_socket:

# Now we're in the scheduler universe

universe = Scheduler

# The name of our executable

Executable     = mastermatmul_socket

# Assume a max image size of 4 megabytes.

Image_Size     = 4 Meg
+MemoryRequirements = 4

# This goes into stdin for the master.

Input   = in_master.socket

# Set the output of this job to go to out_master

Output  = out_master.socket

# Set the stderr of this job to go to out_worker.  It is named
# out_worker because the output of the workers is directed to stderr

Error   = out_worker.socket

# Keep a log in case of problems.

Log = work.log

notify_user = chang@cs.wisc.edu

Queue

Notice two things about this submit file:
  1. Change the notify_user line to be correct for you.
  2. This is a scheduler universe job. We haven't talked about those very much. It is a job that runs on the submit computer as soon as you submit it, so you get all the benefits of Condor (reliability, logging, etc.) with a job that executes locally. We use the scheduler universe for DAGMan and for MW: the job submits other jobs and watches over them. In this case, that job is the master, which spawns the workers (as Condor jobs) and sends them their tasks.

Now submit the job and watch it run:

% condor_submit submit_socket
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 1.

% condor_q
-- Submitter: chopin.cs.wisc.edu : <128.105.121.21:50689> : chopin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   roy             7/5  10:51   0+00:00:00 R  0   4.0  mastermatmul_socke

1 jobs; 0 idle, 1 running, 0 held

% condor_q
-- Submitter: chopin.cs.wisc.edu : <128.105.121.21:50689> : chopin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   roy             7/5  10:51   0+00:00:01 R  0   4.0  mastermatmul_socke
   2.0   roy             7/5  10:51   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
   2.1   roy             7/5  10:51   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
   2.2   roy             7/5  10:51   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
   2.3   roy             7/5  10:51   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
   2.4   roy             7/5  10:51   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
   2.5   roy             7/5  10:51   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)

% condor_q
-- Submitter: chopin.cs.wisc.edu : <128.105.121.21:50689> : chopin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   roy             7/5  10:51   0+00:00:26 R  0   4.0  mastermatmul_socke
   2.0   roy             7/5  10:51   0+00:00:03 R  0   0.0  mw_exec0.$$(Opsys)
   2.1   roy             7/5  10:51   0+00:00:01 R  0   0.0  mw_exec0.$$(Opsys)
   2.2   roy             7/5  10:51   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
   2.3   roy             7/5  10:51   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
   2.4   roy             7/5  10:51   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
   2.5   roy             7/5  10:51   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)

% condor_q
-- Submitter: chopin.cs.wisc.edu : <128.105.121.21:50689> : chopin.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 idle, 0 running, 0 held

We saw the master submit six workers, and two of them started to run. Look at out_master.socket to see the result of the run:

% cat out_master.socket 
10:51:19 MWDriver is pid 507.
10:51:19 Socket bound to port: 8997
10:51:19 Starting from the beginning.
10:51:19 argc=1, argv[0]=condor_scheduniv_exec.1.0
10:51:19  workermatmul_socket
10:51:19 tempnum_executables = 0
10:51:19 Making a link from workermatmul_socket to mw_exec0.LINUX.INTEL
10:51:19 In MWSocketRC::init_beginning_workers()
10:51:19 Good to go.

[output trimmed...]

If you look at the output carefully, you'll notice that only one worker did all of the tasks. That is because each task in this simple example takes so little time that the first worker to start finishes them all before the others can claim any.

It's your turn

Now that you've tried out the basics, we'll let you explore by yourself. Here are some ideas: