Condor tutorial for Fermilab

Preliminaries

You might want to refer to the online Condor manual.
You may enjoy browsing the Condor web page.
You may like to look at pictures of the tutorial writer's baby.

Getting started

You will need a laptop with a web browser and an ssh client. The web browser is to read these directions, and the ssh client is to log into a computer that has Condor set up and ready to go. The computer's name will be dynamically assigned when we connect to the Fermi network, so your tutorial leader will announce it. For convenience, we will refer to the computer as hal. You have been assigned a username and password already. If you do not know your username or password, talk to someone in charge.

When you log into hal, you will see a message that looks something like this:

Last login: Tue Jan 25 10:14:31 2005
Welcome to the Condor tutorial!
% 
Hal is a Fedora Core 3 computer. For the rest of this tutorial, I will assume that you know the basics of Linux. If you don't, sit next to someone who does.

Checking out Condor

Let's make sure that you can find Condor. Your PATH should already be set up. Make sure that it is:

% which condor_submit
/opt/condor/bin/condor_submit
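If you want to check several Condor commands at once, a small shell function can do it. This is only a sketch: the name check_tools is made up for this tutorial (it is not part of Condor), and it relies on nothing more than the shell's standard "command -v" builtin.

```shell
# check_tools: report which of the named commands cannot be found in PATH.
# (check_tools is a hypothetical helper, not a Condor command.)
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not found in PATH: $tool"
            missing=1
        fi
    done
    # Return 0 if every command was found, 1 otherwise.
    return "$missing"
}

# Example: check_tools condor_submit condor_q condor_status
```

If the function prints nothing, every command you named was found.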
Next, check which version of Condor is installed:
% condor_version
$CondorVersion: 6.7.3 Dec 28 2004 $
$CondorPlatform: I386-LINUX_RH9 $

You might be surprised that it reports RedHat 9 instead of Fedora Core 3. It is reporting the operating system that it was compiled on, not the operating system that is in use.

Condor has two coexisting release series at any given time. Condor 6.6.x is the stable series. You know it is stable because the second digit (a 6) is an even number. The most recent stable release, Condor 6.6.7, is the eighth release in the stable series. It has the same features as Condor 6.6.0, but has had a number of bug fixes. Some people may think that so many stable releases are a sign that Condor has a lot of bugs. However, you should think of it differently: we actively support our stable series and work hard to make sure that you have the best possible version.

Condor 6.7.3 is the latest development release of Condor. You know it's a development release because the second digit (a 7) is an odd number. Development releases add new features and are more likely to have serious bugs. We are using the development release because we want to live on the edge. Besides, Condor-C only exists in the development release.
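The even/odd rule is easy to turn into a few lines of shell. The sketch below assumes the numbering convention described above; the function name condor_series is invented for this tutorial and is not a Condor command.

```shell
# condor_series: classify a version string such as "6.7.3" as "stable"
# (even second digit) or "development" (odd second digit).
# (condor_series is a hypothetical helper, not a Condor command.)
condor_series() {
    # Take the second dot-separated field, e.g. "7" from "6.7.3".
    # The -s option yields an empty result when there is no dot at all.
    minor=$(echo "$1" | cut -s -d. -f2)
    case "$minor" in
        ''|*[!0-9]*) echo "unknown"; return 1 ;;
        *[02468])    echo "stable" ;;
        *)           echo "development" ;;
    esac
}

# Example, assuming condor_version prints a line such as
# "$CondorVersion: 6.7.3 Dec 28 2004 $":
#   condor_series "$(condor_version | awk '/CondorVersion/ {print $2}')"
```

The function only looks at the version string you give it, so you can try it with any version number, not just the one installed on hal.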

We expect Condor 6.6.8 and 6.7.4 to be released soon. Your tutorial leader can tell you more about these releases.

Just because we can run condor_version doesn't mean that Condor is actually running. You can find out if it is running by looking at the running processes:

% ps auwx | grep condor
condor    5557  0.0  0.5  5220 2088 ?        Ss   12:09   0:00 condor_master
condor    5558  0.0  0.6  5800 2644 ?        Ss   12:09   0:00 condor_collector -f
condor    5559  0.0  0.6  5564 2544 ?        Ss   12:09   0:00 condor_negotiator -f
condor    5560  1.7  0.7  6524 3052 ?        Ss   12:09   0:05 condor_startd -f
condor    5561  0.0  0.6  6156 2608 ?        Ss   12:09   0:00 condor_schedd -f
roy       5673  0.0  0.1  3788  680 pts/3    R+   12:15   0:00 grep condor

The output you see may be slightly different. Let's look at what we see here:

condor_master: This program runs constantly and ensures that all other parts of Condor are running. If they hang or crash, it restarts them.

condor_collector: This program is part of the Condor central manager. It collects information about all computers in the pool as well as which users want to run jobs. It is what normally responds to the condor_status command.

condor_negotiator: This program is part of the Condor central manager. It decides what jobs should be run where.

condor_startd: If this program is running, it allows jobs to be started up on this computer--that is, hal is an "execute machine". This advertises hal to the central manager (more on that later) so that it knows about this computer. It will start up the jobs that run.

condor_schedd: If this program is running, it allows jobs to be submitted from this computer--that is, hal is a "submit machine". This will advertise jobs to the central manager so that it knows about them. It will contact a condor_startd on other execute machines for each job that needs to be started.

condor_shadow: (Not shown above.) For each job that has been submitted from this computer, there is one condor_shadow running. It will watch over the job as it runs remotely. In some cases it will provide some assistance (see the standard universe, later). You may or may not see any condor_shadow processes running, depending on what is happening on the computer when you try it out.
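If you would rather not read the ps listing by eye, you can let the shell look for each daemon. This is a rough sketch: missing_daemons is a name made up for this tutorial, the list of daemons is simply the one shown for hal above, and the match is a plain substring search, so any other process whose command line mentions a daemon's name would also count as a match.

```shell
# missing_daemons: read "ps" output on standard input and print the name of
# each expected Condor daemon that does not appear in it.
# (missing_daemons is a hypothetical helper. The daemon list matches what
# hal runs; machines with other roles will run a different set.)
missing_daemons() {
    listing=$(cat)
    for daemon in condor_master condor_collector condor_negotiator \
                  condor_startd condor_schedd; do
        if ! printf '%s\n' "$listing" | grep -q "$daemon"; then
            echo "$daemon"
        fi
    done
}

# Example: ps auwx | missing_daemons
```

No output means that all five daemons were found.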

We have a graphic representation of these daemons, drawn by Sarah Miller, age 12.

Condor_q

You can find out what jobs have been submitted on hal with the condor_q command:

% condor_q

-- Submitter: hal.fnal.gov : <128.105.48.160:32787> : hal.fnal.gov
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 idle, 0 running, 0 held

No jobs are in the queue right now. If jobs had been submitted, you would see output like this:

% condor_q

-- Submitter: hal.fnal.gov : <132.67.192.133:43609> : hal.fnal.gov
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
4589.0   doronn          3/30 18:07  19+09:26:01 I  0   0.0  go1               
5140.0   araddan         7/18 15:59   8+08:16:47 I  0   0.0  .condor_run.23359 
5145.0   araddan         7/18 17:22   0+21:29:41 I  0   0.0  matlab-script.txt 
6041.0   grishas        12/7  18:41   7+08:03:25 R  0   45.7 a.out             
6042.0   grishas        12/7  18:42   8+07:47:14 R  0   45.7 a.out             
6044.0   grishas        12/9  11:15   6+17:14:46 R  0   45.7 a.out         

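The output that you see will be different depending on what jobs are running. Notice what we can see from this: each line shows a job's ID, its owner, when it was submitted, how long it has run, its state (I for idle, R for running), its priority, its size, and its command.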
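The ST column is worth a closer look, because it tells you how many jobs are waiting and how many are running. As a rough sketch, awk can tally it for you. The name count_job_states is made up for this tutorial, and the function assumes the column layout shown above, where ST is the sixth whitespace-separated field.

```shell
# count_job_states: read condor_q output on standard input and print how many
# jobs are in each state (the ST column).
# (count_job_states is a hypothetical helper. It assumes the layout shown
# above: SUBMITTED takes up two fields, a date and a time, so ST is field 6.)
count_job_states() {
    # Only lines that start with a job ID such as "6041.0" are counted.
    awk '$1 ~ /^[0-9]+\.[0-9]+$/ { count[$6]++ }
         END { for (state in count) print state, count[state] }' | sort
}

# Example: condor_q | count_job_states
```

If your version of condor_q prints different columns, the field number will need to change.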

Extra credit

What else can you find out with condor_q? Try any one of:

Double bonus points

How do you use the -constraint or -format options to condor_q? When would you want them? When would you use the -l option?

Condor_status

You can find out what computers are in your Condor pool. (A pool is similar to a cluster, but it doesn't have the connotation that all computers are dedicated full-time to computation: some may be desktop computers owned by users.) To look, use condor_status:

% condor_status

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

vm10@wireless LINUX       INTEL  Unclaimed  Idle       0.000    12  0+00:05:05
vm11@wireless LINUX       INTEL  Unclaimed  Idle       0.000    12  0+00:05:06
vm12@wireless LINUX       INTEL  Unclaimed  Idle       0.000    12  0+00:05:07
vm13@wireless LINUX       INTEL  Unclaimed  Idle       0.000    12  0+00:05:08
vm14@wireless LINUX       INTEL  Unclaimed  Idle       0.000    12  0+00:05:09
[snip]
                     Machines Owner Claimed Unclaimed Matched Preempting

         INTEL/LINUX       30     0       0        30       0          0

               Total       30     0       0        30       0          0

I've trimmed the output: you will see many more apparent computers. What can you learn from this output?
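One thing the listing shows is the state of every machine, which condor_status also totals at the bottom. As a sketch, you can reproduce those totals from the listing yourself. The name count_machine_states is made up for this tutorial, and the function assumes the layout shown above: State is the fourth field, and each machine line ends with an ActvtyTime value such as 0+00:05:05.

```shell
# count_machine_states: read condor_status output on standard input and print
# how many machines are in each state.
# (count_machine_states is a hypothetical helper. It assumes the layout shown
# above: State is field 4 and machine lines end with a time like 0+00:05:05.)
count_machine_states() {
    awk '$NF ~ /^[0-9]+[+][0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ { count[$4]++ }
         END { for (state in count) print state, count[state] }' | sort
}

# Example: condor_status | count_machine_states
```

Comparing the result with the totals that condor_status prints is a quick way to check that you are reading the columns correctly.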

Extra credit

What else can you find out with condor_status? Try any one of:

Note in particular the options like -master and -schedd. When would these be useful? When would the -l option be useful?

Next: Submitting your first Condor job