Condor tutorial for Condor Week 2005

Preliminaries

You might want to refer to the online Condor manual.
You may enjoy browsing the Condor web page.
You may like to look at pictures of the tutorial writer's baby.

Getting started

This tutorial is taking place in an instructional lab. Everyone has a computer, and there are about 30 computers available. If more people want to participate, you'll need to "buddy-up" at a computer.

User names and passwords will be supplied by your tutorial leader. These accounts will be deleted shortly after the tutorial.

These computers have names beginning with royal, like royal01.cs.wisc.edu. Each computer is capable of submitting jobs to a Condor pool that is a subset of the University of Wisconsin-Madison Computer Science Department's Condor pool.

When you log into royal, you will see a message that looks something like this:

Last login: Tue Mar 15 13:37:37 2005 from 144.92.79.136
===============================================================================
        REMINDER: NO FOOD or DRINK IN THE CS INSTRUCTIONAL COMPUTER LABS
                  NEVER POWER DOWN WORKSTATIONS IN THE COMPUTER LABS
===============================================================================
royal01(1)% 
These computers are running Tao Linux 1, which is a clone of RedHat Enterprise Linux 3. For the rest of this tutorial, I will assume that you know the basics of Linux. If you don't, sit next to someone who does.

Checking out Condor

Let's make sure that you can find Condor. Your PATH should already be set up. Make sure that it is:

% which condor_submit
/unsup/condor/bin/condor_submit
And check out which version of Condor seems to be running:
% condor_version
$CondorVersion: 6.6.9 Mar 10 2005 $
$CondorPlatform: I386-LINUX_RH9 $

You might be surprised that it reports RedHat 9 instead of Tao Linux 1. It is reporting the operating system that it was compiled on, not the operating system that is in use.

Condor 6.6.9 is the most recent stable release of Condor. (Actually, Condor 6.6.9 hasn't been released yet. We are testing it thoroughly by using it ourselves. We expect it will be released soon.) Condor has two coexisting versions at any given time. Condor 6.6.x is considered a stable release. You know it is stable because the second digit (a 6) is an even number. This is the eighth release in the stable series. It has the same features as Condor 6.6.0, but has had a number of bug fixes. Some people may think that the fact that there have been so many stable releases is a sign that Condor has a lot of bugs. However, you should think of it differently: we actively support our stable series and work hard to make sure that you have the best possible version.

Condor 6.7.6 is the latest development release of Condor. You know it's a development release because the second digit (a 7) is an odd number. Development releases add new features and are more likely to have serious bugs.

We expect Condor 6.6.9 and 6.7.6 to be released soon. Your tutorial leader can tell you more about these releases.
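The even/odd rule is easy to check mechanically. Here is a minimal Python sketch, assuming the `$CondorVersion: ...$` line format shown in the condor_version output above. The function names `parse_condor_version` and `release_series` are made up for this illustration and are not part of Condor.

```python
import re

def parse_condor_version(line):
    """Extract (major, minor, patch) from a '$CondorVersion: ...' line."""
    match = re.search(r"\$CondorVersion:\s+(\d+)\.(\d+)\.(\d+)", line)
    if match is None:
        raise ValueError("not a CondorVersion line: %r" % line)
    return tuple(int(part) for part in match.groups())

def release_series(version):
    """Even second digit means stable series, odd means development series."""
    _major, minor, _patch = version
    return "stable" if minor % 2 == 0 else "development"

# Sample line copied from the condor_version output shown earlier.
sample = "$CondorVersion: 6.6.9 Mar 10 2005 $"
version = parse_condor_version(sample)
print(version, release_series(version))      # (6, 6, 9) stable
print((6, 7, 6), release_series((6, 7, 6)))  # (6, 7, 6) development
```

On a machine with Condor installed, you could feed the first line of real condor_version output to `parse_condor_version` instead of the sample string.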

Just because we can run condor_version doesn't mean that Condor is actually running. You can find out if it is running by looking at the running processes:

% ps auwx | grep condor
condor    3857  0.0  0.1  5228 1484 ?        S    06:38   0:05 /unsup/condor/sbin/condor_master
condor    3885  0.0  0.2  6304 2628 ?        S    06:38   0:08 condor_startd -f
condor    3886  0.0  0.2  6108 2272 ?        S    06:38   0:00 condor_schedd -f
temp-01   8701  0.0  0.0  1556  404 pts/2    S    16:07   0:00 grep condor

The output you see may be slightly different. Let's look at what we see here:

condor_master: This program runs constantly and ensures that all other parts of Condor are running. If they hang or crash, it restarts them.

condor_collector: This program is part of the Condor central manager. It collects information about all computers in the pool as well as which users want to run jobs. It is what normally responds to the condor_status command. It's not running on your computer, but on condor.cs.wisc.edu.

condor_negotiator: This program is part of the Condor central manager. It decides what jobs should be run where. It's not running on your computer, but on condor.cs.wisc.edu.

condor_startd: If this program is running, it allows jobs to be started up on this computer--that is, your royal computer is an "execute machine". It advertises the computer to the central manager (more on that later) so that the central manager knows about it. It will start up the jobs that run here.

condor_schedd: If this program is running, it allows jobs to be submitted from this computer--that is, your royal computer is a "submit machine". It will advertise jobs to the central manager so that it knows about them. It will contact a condor_startd on other execute machines for each job that needs to be started.

condor_shadow: (Not shown above.) For each job that has been submitted from this computer, there is one condor_shadow running. It will watch over the job as it runs remotely. In some cases it will provide some assistance (see the standard universe later). You may or may not see any condor_shadow processes running, depending on what is happening on the computer when you try it out.
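If you would rather not read the ps listing by eye, the same check can be scripted. Here is a rough Python sketch, assuming the 11-column `ps auwx` layout shown above. The daemon names come from the descriptions in this section; `SAMPLE_PS` and `running_daemons` are names invented for this example.

```python
# Daemon names taken from the descriptions in this section.
CONDOR_DAEMONS = ("condor_master", "condor_startd", "condor_schedd",
                  "condor_collector", "condor_negotiator", "condor_shadow")

# Sample copied from the ps listing shown earlier.
SAMPLE_PS = """\
condor    3857  0.0  0.1  5228 1484 ?        S    06:38   0:05 /unsup/condor/sbin/condor_master
condor    3885  0.0  0.2  6304 2628 ?        S    06:38   0:08 condor_startd -f
condor    3886  0.0  0.2  6108 2272 ?        S    06:38   0:00 condor_schedd -f
temp-01   8701  0.0  0.0  1556  404 pts/2    S    16:07   0:00 grep condor
"""

def running_daemons(ps_output):
    """Return the set of Condor daemon names found in `ps auwx` output."""
    found = set()
    for line in ps_output.splitlines():
        # Assumed layout: 10 leading columns, then the command with its arguments.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        program = fields[10].split()[0]      # drop arguments such as -f
        name = program.rsplit("/", 1)[-1]    # drop any directory prefix
        if name in CONDOR_DAEMONS:
            found.add(name)
    return found

daemons = running_daemons(SAMPLE_PS)
print(sorted(daemons))  # ['condor_master', 'condor_schedd', 'condor_startd']
print("execute machine:", "condor_startd" in daemons)
print("submit machine:", "condor_schedd" in daemons)
```

Looking only at the program name in the command column keeps the `grep condor` line from being counted as a daemon.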

We have a graphic representation of these daemons, drawn by Sarah Miller, age 12.

Condor_q

You can find out what jobs have been submitted on your computer with the condor_q command:

% condor_q

-- Submitter: royal01.cs.wisc.edu : <128.105.112.101:32775> : royal01.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 idle, 0 running, 0 held

Nothing is in the queue right now. If jobs had been submitted, you would see output like this:

% condor_q

-- Submitter: royal01.cs.wisc.edu : <128.105.112.101:32775> : royal01.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
4589.0   doronn          3/30 18:07  19+09:26:01 I  0   0.0  go1               
5140.0   araddan         7/18 15:59   8+08:16:47 I  0   0.0  .condor_run.23359 
5145.0   araddan         7/18 17:22   0+21:29:41 I  0   0.0  matlab-script.txt 
6041.0   grishas        12/7  18:41   7+08:03:25 R  0   45.7 a.out             
6042.0   grishas        12/7  18:42   8+07:47:14 R  0   45.7 a.out             
6044.0   grishas        12/9  11:15   6+17:14:46 R  0   45.7 a.out         

The output that you see will be different depending on what jobs are running. Notice what we can see from this:
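One thing the listing shows is the state of each job in the ST column, where I means idle and R means running. As a small illustration, the Python sketch below tallies those states from the sample output. It assumes the column layout shown above, with SUBMITTED taking up two whitespace-separated fields. `SAMPLE_Q` and `count_job_states` are names made up for this example.

```python
import re
from collections import Counter

# Sample copied from the condor_q listing shown earlier.
SAMPLE_Q = """\
-- Submitter: royal01.cs.wisc.edu : <128.105.112.101:32775> : royal01.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
4589.0   doronn          3/30 18:07  19+09:26:01 I  0   0.0  go1
5140.0   araddan         7/18 15:59   8+08:16:47 I  0   0.0  .condor_run.23359
5145.0   araddan         7/18 17:22   0+21:29:41 I  0   0.0  matlab-script.txt
6041.0   grishas        12/7  18:41   7+08:03:25 R  0   45.7 a.out
6042.0   grishas        12/7  18:42   8+07:47:14 R  0   45.7 a.out
6044.0   grishas        12/9  11:15   6+17:14:46 R  0   45.7 a.out
"""

def count_job_states(condor_q_output):
    """Count jobs per ST code in condor_q output with the layout above."""
    counts = Counter()
    for line in condor_q_output.splitlines():
        fields = line.split()
        # Job rows have at least 9 fields and start with an ID such as 4589.0.
        if len(fields) < 9 or not re.fullmatch(r"\d+\.\d+", fields[0]):
            continue
        counts[fields[5]] += 1  # sixth field is the ST column
    return counts

print(dict(count_job_states(SAMPLE_Q)))  # {'I': 3, 'R': 3}
```

Header, banner, and summary lines are skipped because their first field does not look like a job ID.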

Extra credit

What else can you find out with condor_q? Try any one of:

Double bonus points

How do you use the -constraint or -format options to condor_q? When would you want them? When would you use the -l option?

Condor_status

You can find out what computers are in your Condor pool. (A pool is similar to a cluster, but it doesn't have the connotation that all computers are dedicated full-time to computation: some may be desktop computers owned by users.) To look, use condor_status:

% condor_status

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

vm1@f01.cs.wi LINUX       INTEL  Unclaimed  Idle       0.030  2001  0+00:10:04
vm2@f01.cs.wi LINUX       INTEL  Unclaimed  Idle       0.000  2001  0+00:10:05
vm1@f02.cs.wi LINUX       INTEL  Unclaimed  Idle       0.000  2001  0+00:05:04
vm2@f02.cs.wi LINUX       INTEL  Unclaimed  Idle       0.000  2001  0+00:05:05
vm1@f03.cs.wi LINUX       INTEL  Unclaimed  Idle       0.870  2001  0+00:00:04
vm2@f03.cs.wi LINUX       INTEL  Unclaimed  Idle       0.000  2001  0+00:00:05
vm1@f04.cs.wi LINUX       INTEL  Unclaimed  Idle       1.000  2001  0+00:00:04

[snip]

                     Machines Owner Claimed Unclaimed Matched Preempting

         INTEL/LINUX       98    11       0        87       0          0

               Total       98    11       0        87       0          0

I've trimmed the output: you will see many more computers listed. The f computers are a dedicated cluster, and the royal computers are an opportunistic cluster that you are using right now. We can only use the royal computers if they are idle, that is, if fewer than 30 people are doing the tutorial.
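Each line of the listing describes one virtual machine (vm1, vm2) on a computer, so a single computer can appear more than once. The Python sketch below groups the sample lines by computer. It assumes the eight-column layout shown above, and `SAMPLE_STATUS` and `slots_by_machine` are names invented for this illustration. Host names stay truncated exactly as condor_status printed them.

```python
from collections import defaultdict

# Sample copied from the (trimmed) condor_status listing shown earlier.
SAMPLE_STATUS = """\
Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

vm1@f01.cs.wi LINUX       INTEL  Unclaimed  Idle       0.030  2001  0+00:10:04
vm2@f01.cs.wi LINUX       INTEL  Unclaimed  Idle       0.000  2001  0+00:10:05
vm1@f02.cs.wi LINUX       INTEL  Unclaimed  Idle       0.000  2001  0+00:05:04
vm2@f02.cs.wi LINUX       INTEL  Unclaimed  Idle       0.000  2001  0+00:05:05
vm1@f03.cs.wi LINUX       INTEL  Unclaimed  Idle       0.870  2001  0+00:00:04
vm2@f03.cs.wi LINUX       INTEL  Unclaimed  Idle       0.000  2001  0+00:00:05
vm1@f04.cs.wi LINUX       INTEL  Unclaimed  Idle       1.000  2001  0+00:00:04
"""

def slots_by_machine(status_output):
    """Group (vm, state, activity) tuples by the host part of the Name column."""
    machines = defaultdict(list)
    for line in status_output.splitlines():
        fields = line.split()
        # Data rows have 8 fields and a Name of the form vmN@host.
        if len(fields) != 8 or "@" not in fields[0]:
            continue
        vm, host = fields[0].split("@", 1)
        machines[host].append((vm, fields[3], fields[4]))
    return dict(machines)

for host, slots in slots_by_machine(SAMPLE_STATUS).items():
    print(host, [vm for vm, _state, _activity in slots])
```

In this sample f04 shows only vm1 because the listing was cut off at that point, not because the computer has a single virtual machine.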

Extra credit

What else can you find out with condor_status? Try any one of:

Note in particular the options like -master and -schedd. When would these be useful? When would the -l option be useful?

Next: Submitting your first Condor job