Condor VM Tutorial for OSG Site Admin Workshop

Preliminaries

You might want to refer to the online Condor manual.
You may enjoy browsing the Condor web page.

Getting started

You will need a laptop with a web browser and an ssh client. The web browser is to read these directions, and the ssh client is to log into a computer that has Condor set up and ready to go. The computer's name is tg-condor.purdue.teragrid.org, and we have set up guest accounts for you. We'll hand out the usernames and passwords.

What is the VM universe?

In the VM universe, Condor jobs can be virtual machines instead of simple executables. Virtual machines give users greater flexibility in the types of jobs they can submit: an application written for one platform can run on top of an arbitrary platform, without the need to port it to the new platform. The VM universe supports several virtual machine applications. Today we will be looking at VMware Server, but similar jobs can be run using Xen, etc.

Submitting a VM job

For your convenience, we have created a VM for this exercise. It is a small Linux VM. You can find it under $TG_COMMUNITY/osg-vm/condorvm/ on the tutorial machine. You can also download the configuration file and disk image from this webpage.

You also need a Condor submit file. We've provided one under $TG_COMMUNITY/osg-vm/condorvm.desc. Let's take a look at it...

universe                     = vm
executable                   = any_name_you_like
log                          = condorvm.log
vm_type                      = vmware
vm_memory                    = 64
vmware_dir                   = $ENV(TG_COMMUNITY)/osg-vm/condorvm
vmware_should_transfer_files = yes
queue

Note the lack of a real executable in this universe (as we mentioned above, the VM image itself is the executable). So why do we have an executable name? The executable name is there to identify the job when you run condor_q. Accordingly, you can change it to something more representative, such as linux_vm_test.
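
If you want to submit several VM jobs with different labels, you could generate the submit file instead of editing it by hand. The following is only a sketch: the function name make_vm_submit and its parameters are made up for this example, and it simply reproduces the fields of the submit file shown above.

```python
def make_vm_submit(label, vmware_dir, memory_mb=64, log="condorvm.log"):
    """Return the text of a VM universe submit description.

    `label` goes in the `executable` field, which only names the job
    in condor_q output; it does not have to exist as a file.
    (Hypothetical helper, not part of Condor.)
    """
    fields = [
        ("universe", "vm"),
        ("executable", label),
        ("log", log),
        ("vm_type", "vmware"),
        ("vm_memory", str(memory_mb)),
        ("vmware_dir", vmware_dir),
        ("vmware_should_transfer_files", "yes"),
    ]
    # Pad the names so the "=" signs line up, as in the tutorial file.
    width = max(len(name) for name, _ in fields)
    lines = [f"{name.ljust(width)} = {value}" for name, value in fields]
    lines.append("queue")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_vm_submit("linux_vm_test",
                         "$ENV(TG_COMMUNITY)/osg-vm/condorvm"), end="")
```

Save the printed text to a file and pass that file to condor_submit in the usual way.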

Now submit your job:

% mkdir ~/condor-test
% cd ~/condor-test
% condor_submit $TG_COMMUNITY/osg-vm/condorvm.desc
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 26.

% condor_q

-- Submitter: leovinus : <128.105.48.96:50589> : leovinus
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   6.0   aroy           11/20 15:31   0+00:02:46 R  0   0.0  any_name_you_like

1 jobs; 0 idle, 1 running, 0 held


-- Submitter: leovinus : <128.105.48.96:50589> : leovinus
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   6.0   aroy           11/20 15:31   0+00:02:56 R  0   0.0  any_name_you_like

1 jobs; 0 idle, 1 running, 0 held


-- Submitter: leovinus : <128.105.48.96:50589> : leovinus
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   6.0   aroy           11/20 15:31   0+00:03:06 R  0   0.0  any_name_you_like

1 jobs; 0 idle, 1 running, 0 held


-- Submitter: leovinus : <128.105.48.96:50589> : leovinus
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   6.0   aroy           11/20 15:31   0+00:03:16 R  0   0.0  any_name_you_like

1 jobs; 0 idle, 1 running, 0 held

...

-- Submitter: leovinus : <128.105.48.96:50589> : leovinus
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held
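
Rather than reading condor_q output by eye, a script can look at the summary line to tell when the queue has emptied. This is a rough sketch that assumes the summary line has exactly the form shown in the transcript above (other Condor versions may print it differently). The function names are invented for this example.

```python
import re

# Matches the summary line as printed in this tutorial's transcript,
# e.g. "1 jobs; 0 idle, 1 running, 0 held".
SUMMARY_RE = re.compile(r"(\d+) jobs; (\d+) idle, (\d+) running, (\d+) held")


def parse_queue_summary(condor_q_output):
    """Return the job counts from the summary line, or None if absent."""
    match = SUMMARY_RE.search(condor_q_output)
    if match is None:
        return None
    total, idle, running, held = (int(g) for g in match.groups())
    return {"jobs": total, "idle": idle, "running": running, "held": held}


def queue_is_empty(condor_q_output):
    """True only when a summary line is present and reports zero jobs."""
    summary = parse_queue_summary(condor_q_output)
    return summary is not None and summary["jobs"] == 0
```

You would feed these functions the text printed by condor_q, for example captured with subprocess.run on the tutorial machine.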

The first time the image starts up, it will run its fake "job", which will run for 10-15 minutes: just enough time to ask your instructors some difficult questions. The second time the job is run, it will do nothing. This is so we can open up the image using VMware Server and view /root/job.out for the results.

This is how the Linux image works: the job is run from /etc/rc.d/rc.start/60.job. It invokes /root/job in the background to do the actual work. The job itself will run if and only if /root/job.out doesn't exist. This is so you can extract the output during the next boot. By removing /root/job.out you can force the job to run again.
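
The "run only if the output file doesn't exist" rule is a common run-once guard. The sketch below illustrates the idea in Python. It is not the script inside the tutorial image (that script is not shown here), and the name run_job_once is made up.

```python
import os
import tempfile


def run_job_once(output_path, job):
    """Run `job` and save its output, unless the output already exists.

    Returns True if the job ran, False if it was skipped.
    """
    if os.path.exists(output_path):
        return False  # output from an earlier boot is still there
    with open(output_path, "w") as out:
        out.write(job())
    return True


if __name__ == "__main__":
    # Demonstrate with a scratch directory instead of /root.
    scratch = tempfile.mkdtemp()
    out_file = os.path.join(scratch, "job.out")
    print(run_job_once(out_file, lambda: "result\n"))  # first boot: runs
    print(run_job_once(out_file, lambda: "result\n"))  # second boot: skipped
    os.remove(out_file)                                # like removing /root/job.out
    print(run_job_once(out_file, lambda: "result\n"))  # runs again
```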

Examining the results

When the job completes (and disappears from condor_q), Condor will transfer the modified VM files back to your submit machine.

% ls
condorvm-000001.vmdk  vmware-0.log                    vmxHpmJe_condor.vmsd
condorvm.log          vmware.log                      vmxHpmJe_condor.vmx
nvram                 vmxHpmJe_condor-Snapshot1.vmsn
%

condorvm.log is a file written by Condor that contains the execution history of your job. When the job completes, it'll look something like this:

% cat condorvm.log

000 (013.000.000) 06/20 01:41:16 Job submitted from host: <193.10.156.74:40295>
...
001 (013.000.000) 06/20 01:41:20 Job executing on host: <193.10.156.74:40304>
...
006 (013.000.000) 06/20 01:42:12 Image size of job updated: 66632
...
005 (013.000.000) 06/20 01:41:35 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:11  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:11  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        1237936  -  Run Bytes Sent By Job
        10945025  -  Run Bytes Received By Job
        1237936  -  Total Bytes Sent By Job
        10945025  -  Total Bytes Received By Job
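
If you want to check the log from a script, the event header lines are easy to pick out. The sketch below assumes the layout shown in the listing above: a three-digit event code, the job ID in parentheses, a date, a time, and a message. Event code 005 is the "Job terminated" event in that listing. The function names are invented for this example.

```python
import re

# Header line of an event, as it appears in the listing above, e.g.
# "005 (013.000.000) 06/20 01:41:35 Job terminated."
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (.*)$"
)


def parse_events(log_text):
    """Return one dict per event header line; other lines are ignored."""
    events = []
    for line in log_text.splitlines():
        match = EVENT_RE.match(line)
        if match is None:
            continue  # "..." separators and indented detail lines
        code, cluster, proc, _subproc, date, time, message = match.groups()
        events.append({
            "code": code,
            "cluster": int(cluster),
            "proc": int(proc),
            "date": date,
            "time": time,
            "message": message,
        })
    return events


def job_terminated(events):
    """True if any event carries code 005 (job terminated)."""
    return any(event["code"] == "005" for event in events)
```

For example, job_terminated(parse_events(open("condorvm.log").read())) tells you whether the job has finished.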

For VMware VMs, Condor creates a snapshot of the original VM image and returns the snapshot disk image (condorvm-000001.vmdk). This snapshot image contains the changes made to the original disk image and is much smaller than the original image. The file contains a reference to the original image file.

Congratulations, you've submitted a VM job to Condor!

VMs on your Condor pool

Condor keeps track of which computers have a functional VMware Server and which version it is. You can find this out by using condor_status:

% condor_status -vm

Name               VMType Ver        State     Activity LoadAv VMMe ActvtyTime    VMNetworking

slot1@tg-data-01.r vmware server1.0  Claimed   Busy     0.000   960  0+00:03:36 nat
slot2@tg-data-01.r vmware server1.0  Unclaimed Idle     0.000   960  1+00:11:40 nat
...

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX    32     2       1        29       0          0        0

               Total    32     2       1        29       0          0        0

% condor_status -l slot1@tg-data-01.rcac.purdue.edu | grep VM
...
HasVM = TRUE
VM_AvailNum = 10000
VM_GAHP_VERSION = "0.0.1"
VM_Type = "vmware"
VM_Version = "server1.0"
VM_Memory = 960
VM_Networking = TRUE
VM_Networking_Types = "nat"
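
The "Name = Value" lines printed by condor_status -l are simple enough to load into a dictionary, which is handy if you want to check VM attributes from a script. This sketch only handles the value forms that appear in the output above (TRUE/FALSE, quoted strings, and integers). It is not a general ClassAd parser, and the function name is made up.

```python
def parse_classad_lines(text):
    """Turn 'Name = Value' lines into a dict; skip anything else."""
    attrs = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" = ")
        if not sep:
            continue  # e.g. the "..." line
        value = value.strip()
        if value.upper() in ("TRUE", "FALSE"):
            parsed = value.upper() == "TRUE"
        elif len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            parsed = value[1:-1]  # drop the surrounding quotes
        else:
            try:
                parsed = int(value)
            except ValueError:
                parsed = value  # leave unrecognized forms as text
        attrs[name.strip()] = parsed
    return attrs
```

With the output above, the resulting dictionary would let you test things like attrs["HasVM"] or compare attrs["VM_Memory"] against the vm_memory your job requests.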