Condor tutorial for Tel Aviv University

Preliminaries

You might want to refer to the online Condor manual.
You may enjoy browsing the Condor web page.
You may like to look at pictures of the tutorial writer's baby.

Getting started

You will need a laptop with a web browser and an ssh client. The web browser is to read these directions, and the ssh client is to log into a computer that has Condor set up and ready to go. The computer's name is nova.cs.tau.ac.il (called nova for short), and you have been assigned a username and password already. If you do not know your username or password, talk to someone in charge.

When you log into nova, you will see a message that looks something like this:

Last login: Thu Dec 16 19:40:27 2004 from wireless28.cs.wisc.edu

Welcome to the CS school at Tel-Aviv University.
Our FAQ can be found at ,
the helpdesk can be reached at system@cs.tau.ac.il.
The system team

nova 1% 

Nova is a Red Hat 7.3 computer. For the rest of this tutorial, I will assume that you know the basics of Linux. If you don't, sit next to someone who does.

Checking out Condor

Let's make sure that you can find Condor. Your PATH should already be set up. Make sure that it is:

nova 1% which condor_submit
/usr/local/bin/condor_submit

Next, check which version of Condor is installed:

nova 2% condor_version
$CondorVersion: 6.6.7 Oct 11 2004 $
$CondorPlatform: I386-LINUX_RH72 $

Condor 6.6.7 is the most recent stable release of Condor. Condor has two coexisting versions at any given time. Condor 6.6.x is considered a stable release. You know it is stable because the second digit (a 6) is an even number. This is the eighth release in the stable series. It has the same features as Condor 6.6.0, but has had a number of bug fixes. Some people may think that the fact that there have been so many stable releases is a sign that Condor has a lot of bugs. However, you should think of it differently: we actively support our stable series and work hard to make sure that you have the best possible version.

Condor 6.7.2 is the latest development release of Condor. You know it's a development release because the second digit (a 7) is an odd number. Development releases add new features and are more likely to have serious bugs.

We expect Condor 6.6.8 and 6.7.3 to be released soon. Your tutorial leader can tell you more about these releases.
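
The even/odd rule is easy to check mechanically. Here is a minimal Python sketch that pulls the version number out of the $CondorVersion line shown above and classifies it. The function names parse_condor_version and release_series are my own inventions for illustration; they are not part of Condor.

```python
import re


def parse_condor_version(banner):
    """Extract (major, minor, patch) from a '$CondorVersion: ... $' line."""
    match = re.search(r"\$CondorVersion:\s*(\d+)\.(\d+)\.(\d+)", banner)
    if match is None:
        raise ValueError("no Condor version found in: %r" % banner)
    return tuple(int(part) for part in match.groups())


def release_series(version):
    """Even second digit means stable, odd means development."""
    _major, minor, _patch = version
    return "stable" if minor % 2 == 0 else "development"


# The banner printed by condor_version on nova.
banner = "$CondorVersion: 6.6.7 Oct 11 2004 $"
installed = parse_condor_version(banner)
print(installed, release_series(installed))

# The development release mentioned in the text.
print((6, 7, 2), release_series((6, 7, 2)))
```

The same check applies to the upcoming releases: 6.6.8 has an even second digit and 6.7.3 an odd one.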

Just because we can run condor_version doesn't mean that Condor is actually running. You can find out if it is running by looking at the running processes:

nova 3% ps -auwx | grep condor
condor    1002  0.1  0.0  5252 2108 ?        S    Dec09  16:13 /usr/local/lib/condor/sbin/condor_master
condor    1007  0.0  0.0  5664 1836 ?        S    Dec09   0:19 condor_startd -f
condor    9178  0.0  0.0  7028 3600 ?        S    Dec13   3:59 condor_schedd -f
grishas  20367  0.0  0.0  5496 2544 ?        SN   Dec14   0:05 condor_shadow -f ...
grishas  23205  0.0  0.0  5400 2440 ?        SN   Dec15   0:04 condor_shadow -f ...
alainroy  3009  0.0  0.0  1736  584 pts/15   S    00:00   0:00 grep condor

The output you see may be slightly different. (And I've trimmed the condor_shadow output to be simpler.) Let's look at what we see here:

condor_master: This program runs constantly and ensures that all other parts of Condor are running. If they hang or crash, it restarts them.

condor_startd: If this program is running, it allows jobs to be started up on this computer--that is, nova is an "execute machine". It advertises nova to the central manager (more on that later) so that the central manager knows about this computer, and it starts up the jobs that run here.

condor_schedd: If this program is running, it allows jobs to be submitted from this computer--that is, nova is a "submit machine". It advertises jobs to the central manager so that the central manager knows about them, and it contacts a condor_startd on another execute machine for each job that needs to be started.

condor_shadow: For each job that has been submitted from this computer and is currently running, there is one condor_shadow. It watches over the job as it runs remotely. In some cases it provides some assistance (see the standard universe, later). You may or may not see any condor_shadow processes, depending on what is happening on the computer when you try it out.
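
If you would rather not read the ps listing by eye, here is a rough Python sketch that counts the Condor daemons in it and reports which roles nova is playing. It works on the sample listing from above, pasted in as a string, and assumes the 11-column layout that ps printed there. The names count_daemons and describe_roles are hypothetical helpers of mine, not Condor tools.

```python
import os
from collections import Counter

# The ps listing from the text, pasted in as sample data.
PS_OUTPUT = """\
condor    1002  0.1  0.0  5252 2108 ?        S    Dec09  16:13 /usr/local/lib/condor/sbin/condor_master
condor    1007  0.0  0.0  5664 1836 ?        S    Dec09   0:19 condor_startd -f
condor    9178  0.0  0.0  7028 3600 ?        S    Dec13   3:59 condor_schedd -f
grishas  20367  0.0  0.0  5496 2544 ?        SN   Dec14   0:05 condor_shadow -f ...
grishas  23205  0.0  0.0  5400 2440 ?        SN   Dec15   0:04 condor_shadow -f ...
alainroy  3009  0.0  0.0  1736  584 pts/15   S    00:00   0:00 grep condor
"""


def count_daemons(ps_text):
    """Count processes whose program name starts with 'condor_'."""
    counts = Counter()
    for line in ps_text.splitlines():
        # Ten leading columns, then the full command line as column 11.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        program = os.path.basename(fields[10].split()[0])
        if program.startswith("condor_"):
            counts[program] += 1
    return counts


def describe_roles(counts):
    """Map running daemons to the roles described in the text."""
    roles = []
    if counts["condor_startd"]:
        roles.append("execute machine")
    if counts["condor_schedd"]:
        roles.append("submit machine")
    return roles


daemons = count_daemons(PS_OUTPUT)
print(dict(daemons))
print("nova is a:", ", ".join(describe_roles(daemons)))
```

Note that the grep process itself is skipped, because its program name is grep even though "condor" appears in its arguments.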

We have a graphic representation of these daemons, drawn by Sarah Miller, age 12.

condor_q

You can find out what jobs have been submitted on nova with the condor_q command:

nova 4% condor_q

-- Submitter: nova.cs.tau.ac.il : <132.67.192.133:43609> : nova.cs.tau.ac.il
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
4589.0   doronn          3/30 18:07  19+09:26:01 I  0   0.0  go1               
5140.0   araddan         7/18 15:59   8+08:16:47 I  0   0.0  .condor_run.23359 
5145.0   araddan         7/18 17:22   0+21:29:41 I  0   0.0  matlab-script.txt 
6041.0   grishas        12/7  18:41   7+08:03:25 R  0   45.7 a.out             
6042.0   grishas        12/7  18:42   8+07:47:14 R  0   45.7 a.out             
6044.0   grishas        12/9  11:15   6+17:14:46 R  0   45.7 a.out         

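Again, the output that you see will be different. If there are no jobs in the queue, the list will be empty. Notice what we can see from this: each job's ID, its owner, when it was submitted, how long it has been running, its state (I for idle, R for running), its priority, its size, and the command it runs.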
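
To see at a glance how many jobs are idle and how many are running, the listing can be tallied by its ST column. Here is a small Python sketch that does so for the sample output above, pasted in as a string. It assumes the column layout shown there, and parse_jobs is a hypothetical helper of mine, not a Condor command.

```python
import re
from collections import Counter

# The condor_q listing from the text, pasted in as sample data.
CONDOR_Q_OUTPUT = """\
-- Submitter: nova.cs.tau.ac.il : <132.67.192.133:43609> : nova.cs.tau.ac.il
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
4589.0   doronn          3/30 18:07  19+09:26:01 I  0   0.0  go1
5140.0   araddan         7/18 15:59   8+08:16:47 I  0   0.0  .condor_run.23359
5145.0   araddan         7/18 17:22   0+21:29:41 I  0   0.0  matlab-script.txt
6041.0   grishas        12/7  18:41   7+08:03:25 R  0   45.7 a.out
6042.0   grishas        12/7  18:42   8+07:47:14 R  0   45.7 a.out
6044.0   grishas        12/9  11:15   6+17:14:46 R  0   45.7 a.out
"""

# Job lines start with an ID such as 4589.0; header lines do not.
JOB_LINE = re.compile(r"^\d+\.\d+\s")

# Only the two state codes that appear in the sample are spelled out.
STATE_NAMES = {"I": "idle", "R": "running"}


def parse_jobs(text):
    """Turn each job line into a small dictionary."""
    jobs = []
    for line in text.splitlines():
        if not JOB_LINE.match(line):
            continue
        # Columns: ID, OWNER, date, time, RUN_TIME, ST, PRI, SIZE, CMD
        fields = line.split()
        jobs.append({
            "id": fields[0],
            "owner": fields[1],
            "state": STATE_NAMES.get(fields[5], fields[5]),
            "size": float(fields[7]),
            "cmd": fields[8],
        })
    return jobs


jobs = parse_jobs(CONDOR_Q_OUTPUT)
print(Counter(job["state"] for job in jobs))
print(Counter(job["owner"] for job in jobs))
```

For real use you would feed it the text printed by condor_q on your own machine instead of the pasted sample.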

Extra credit

What else can you find out with condor_q? Try any one of:

Double bonus points

How do you use the -constraint or -format options to condor_q? When would you want them?

condor_status

You can find out what computers are in your Condor pool. (A pool is similar to a cluster, but it doesn't have the connotation that all computers are dedicated full-time to computation: some may be desktop computers owned by users.) To look, use condor_status:

nova 5% condor_status

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

abel-02.cs.ta LINUX       INTEL  Claimed    Busy       1.000   494  0+02:04:36
abel-04.cs.ta LINUX       INTEL  Claimed    Busy       1.000   494  0+06:40:57
abel-05.cs.ta LINUX       INTEL  Unclaimed  Idle       0.020   494  0+03:35:09
...
                     Machines Owner Claimed Unclaimed Matched Preempting

         INTEL/LINUX      135    34      38        63       0          0

               Total      135    34      38        63       0          0

I've trimmed the output: you will see many more computers. What can you learn from this output?
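
One thing to learn is how busy the pool is: in the summary, 63 of the 135 machines are Unclaimed, so they are available to run jobs. As an illustration, here is a Python sketch that tallies the machine lines by state and reads the Total row, using the trimmed sample above pasted in as a string. It assumes the layout shown there, and the helper names tally_states and parse_totals are mine, not Condor's.

```python
from collections import Counter

# The trimmed condor_status output from the text, pasted in as sample data.
CONDOR_STATUS_OUTPUT = """\
Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

abel-02.cs.ta LINUX       INTEL  Claimed    Busy       1.000   494  0+02:04:36
abel-04.cs.ta LINUX       INTEL  Claimed    Busy       1.000   494  0+06:40:57
abel-05.cs.ta LINUX       INTEL  Unclaimed  Idle       0.020   494  0+03:35:09
                     Machines Owner Claimed Unclaimed Matched Preempting

         INTEL/LINUX      135    34      38        63       0          0

               Total      135    34      38        63       0          0
"""


def tally_states(text):
    """Count the listed machines by their State column."""
    states = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Machine lines have 8 columns; skip the header, which starts with Name.
        if len(fields) == 8 and fields[0] != "Name":
            states[fields[3]] += 1
    return states


def parse_totals(text):
    """Pair the summary column names with the numbers on the Total row."""
    header, totals = None, None
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "Machines":
            header = fields
        elif fields and fields[0] == "Total":
            totals = [int(value) for value in fields[1:]]
    if header is None or totals is None:
        raise ValueError("summary table not found")
    return dict(zip(header, totals))


print(tally_states(CONDOR_STATUS_OUTPUT))
totals = parse_totals(CONDOR_STATUS_OUTPUT)
print(totals)
print("available fraction:", totals["Unclaimed"] / totals["Machines"])
```

Because the sample is trimmed to three machines, the per-machine tally will not match the Total row; on the full output the two should agree.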

Extra credit

What else can you find out with condor_status? Try any one of:

Note in particular the options like -master and -schedd. When would these be useful?

Next: Submitting your first Condor job