Title: Condor Practical
Subtitle: Looking at Condor
Tutor: Alain Roy
Authors: Alain Roy

2.0 Looking at Condor

2.1 Where is Condor?

You will find the Condor binaries in /usr/bin:

% which condor_q

You can see the version of Condor with condor_version:

% condor_version
$CondorVersion: 6.8.8 Dec 19 2007 $
$CondorPlatform: I386-LINUX_RHEL3 $

You might be surprised that it reports RHEL3 instead of Scientific Linux 4 (the version of Linux installed on these computers). It is reporting the operating system that it was compiled on, not the operating system that is in use. Don't worry, the RHEL 3 binaries work just fine on Scientific Linux 4.

Check if Condor is running:

% condor_master

% ps auwx | grep condor | grep -v condor_starter
 4554 ?        Ss   592:11 /usr/sbin/condor_master -pid /var/condor/
 4556 ?        Ss    42:03 condor_collector -f
 4558 ?        Ss    21:39 condor_negotiator -f
 4559 ?        Ss    58:25 condor_schedd -f

Excellent! It's running!

The output you see from ps may be slightly different than ours, but as long as it lists all of those Condor programs, it's okay. Let's look at what we see:

condor_master: This program runs constantly and ensures that all other parts of Condor are running. If they hang or crash, it restarts them.

condor_collector: This program is part of the Condor central manager. It collects information about all computers in the pool as well as which users want to run jobs. It is what normally responds to the condor_status command.

condor_negotiator: This program is part of the Condor central manager. It decides what jobs should be run where.

condor_startd: If this program is running, it allows jobs to be started up on this computer--that is, your computer is an "execute machine". This advertises your computer to the central manager (more on that later, but in this case it's also your computer) so that it knows about this computer. It will start up the jobs that run. This isn't listed on osg-edu, but it is on the other computers in our Condor pool.

condor_schedd: If this program is running, it allows jobs to be submitted from this computer--that is, your computer is a "submit machine". This will advertise jobs to the central manager so that it knows about them. It will contact a condor_startd on other execute machines for each job that needs to be started.

condor_shadow: (Not shown above.) For each job that has been submitted from this computer, there is one condor_shadow running. It will watch over the job as it runs remotely. In some cases it will provide some assistance (see the standard universe later). You may or may not see any condor_shadow processes running, depending on what is happening on the computer when you try it out.
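
The mapping from daemons to roles described above can be sketched as a small shell function. This is only an illustration, not part of Condor: condor_roles is a hypothetical helper name, and it simply looks for the daemon names in whatever process listing you feed it on standard input.

```shell
# Sketch: infer a machine's Condor roles from a ps-style listing on stdin.
# condor_roles is a hypothetical helper, not a Condor command.
condor_roles() {
    listing=$(cat)
    for daemon in condor_master condor_collector condor_negotiator \
                  condor_startd condor_schedd; do
        # Report a role only if the daemon name appears in the listing.
        if printf '%s\n' "$listing" | grep -q "$daemon"; then
            case $daemon in
                condor_master)     echo "condor_master: Condor is running" ;;
                condor_collector)  echo "condor_collector: central manager" ;;
                condor_negotiator) echo "condor_negotiator: central manager" ;;
                condor_startd)     echo "condor_startd: execute machine" ;;
                condor_schedd)     echo "condor_schedd: submit machine" ;;
            esac
        fi
    done
}
```

On a live system you could try it with: ps auwx | condor_roles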

We have a graphic representation of these daemons, drawn by Sarah Miller, age 12.


2.2 Condor_q

You can find out what jobs have been submitted on your computer with the condor_q command:

% condor_q

-- Submitter: : <> :
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 idle, 0 running, 0 held

Nothing is running right now. If something was running, you would see output like this:

% condor_q

-- Submitter: : <> :
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
4589.0   doronn          3/30 18:07  19+09:26:01 I  0   0.0  go1               
5140.0   araddan         7/18 15:59   8+08:16:47 I  0   0.0  .condor_run.23359 
5145.0   araddan         7/18 17:22   0+21:29:41 I  0   0.0  matlab-script.txt 
6041.0   grishas        12/7  18:41   7+08:03:25 R  0   45.7 a.out             
6042.0   grishas        12/7  18:42   8+07:47:14 R  0   45.7 a.out             
6044.0   grishas        12/9  11:15   6+17:14:46 R  0   45.7 a.out         

The output that you see will be different depending on what jobs are running. Notice what we can see from this:
  • ID: We can see each job's cluster and process number. For the first job, the cluster is 4589 and the process is 0. In some cases, we may have many processes (jobs) within a single cluster.
  • OWNER: We can see who owns the job.
  • SUBMITTED: We can see when the job was submitted.
  • RUN_TIME: We can see how long the job has been running.
  • ST: We can see what the current state of the job is. I is idle, R is running.
  • PRI: We can see the priority of the job.
  • SIZE: We can see the memory consumption of the job, in megabytes.
  • CMD: We can see the program that is being executed.
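
If you want to practice reading these columns, here is a rough sketch that picks them apart with awk. It is an illustration only: summarize_condor_q is a hypothetical helper name, and it assumes the column layout shown above (ID first, OWNER second, ST sixth). It also counts state H, which condor_q uses for held jobs.

```shell
# Sketch: split each job ID into cluster and process, then count jobs by state.
# summarize_condor_q is a hypothetical helper; it reads condor_q-style text on stdin.
summarize_condor_q() {
    awk '
        # Job lines start with an ID of the form cluster.process
        $1 ~ /^[0-9]+\.[0-9]+$/ {
            split($1, id, ".")
            printf "cluster=%s process=%s owner=%s state=%s\n", id[1], id[2], $2, $6
            count[$6]++
        }
        END {
            printf "idle=%d running=%d held=%d\n", count["I"], count["R"], count["H"]
        }
    '
}
```

You could try it with: condor_q | summarize_condor_q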

Extra credit

What else can you find out with condor_q? Try any one of:

Double bonus points

How do you use the -constraint or -format options to condor_q? When would you want them? When would you use the -l option? This might be an easier exercise to try once you submit some jobs.
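
As a starting point, here is a hedged sketch of how these options might be wrapped in small shell functions. The function names are hypothetical. Owner, JobStatus, ClusterId, and ProcId are job ClassAd attributes, and a JobStatus of 2 means the job is running. Treat this as one possible way to use the options, not the only one.

```shell
# Sketch: wrappers around condor_q options (function names are hypothetical).

# Show only the running jobs (JobStatus == 2) that belong to one user.
running_jobs_for() {
    condor_q -constraint "Owner == \"$1\" && JobStatus == 2"
}

# Print one line per job as "cluster.process owner", using -format.
job_ids() {
    condor_q -format "%d." ClusterId -format "%d " ProcId -format "%s\n" Owner
}

# Show the complete ClassAd for one job, for example: job_details 4589.0
job_details() {
    condor_q -l "$1"
}
```

The -constraint option is handy when the queue is long and you only care about some jobs; -format is handy when another script needs to read the output; -l is handy when you need to know everything Condor knows about a job.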


2.3 Condor_status

You can find out what computers are in the Condor pool. (A pool is similar to a cluster, but it doesn't have the connotation that all computers are dedicated full-time to computation: some may be desktop computers owned by users.) To look, use condor_status:

% condor_status

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

vm1@osgs-c03. LINUX       INTEL  Unclaimed  Idle       0.000  1013  3+21:40:55
vm2@osgs-c03. LINUX       INTEL  Unclaimed  Idle       0.000  1013  0+02:35:29
osgs-c05.cs.w LINUX       INTEL  Unclaimed  Idle       0.010  2026  0+02:00:58
vm1@osgs-c06. LINUX       INTEL  Owner      Idle       0.010  1013  0+01:14:17
vm2@osgs-c06. LINUX       INTEL  Unclaimed  Idle       0.000  1013107+03:55:50
vm1@osgs-c07. LINUX       INTEL  Unclaimed  Idle       0.000  1013107+03:53:50
vm2@osgs-c07. LINUX       INTEL  Unclaimed  Idle       0.000  1013  0+02:48:48
vm1@osgs-c08. LINUX       INTEL  Unclaimed  Idle       0.000  1013  0+02:20:03
vm2@osgs-c08. LINUX       INTEL  Unclaimed  Idle       0.000  1013152+19:19:58
vm1@osgs-c09. LINUX       INTEL  Unclaimed  Idle       0.000  1013  0+02:26:46
vm2@osgs-c09. LINUX       INTEL  Unclaimed  Idle       0.000  1013106+08:38:57

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

         INTEL/LINUX    11     1       0        10       0          0        0

               Total    11     1       0        10       0          0        0

On some of your computers, you might see two apparent computers because we have multiple CPUs. These are listed as "vm1" or "vm2". Let's look at exactly what you can see:

  • Name: The name of the computer. Sometimes this gets chopped off, like above.
  • OpSys: The operating system, though not at the granularity you may wish: It says "LINUX" instead of which distribution and version of Linux.
  • Arch: The architecture, such as INTEL or PPC.
  • State: The state is often Claimed (when it is running a Condor job) or Unclaimed (when it is not running a Condor job). It can be in a few other states as well, such as Matched.
  • Activity: This is usually something like Busy or Idle. Sometimes you may see a computer that is Claimed, but no job has yet begun on the computer. Then it is Claimed/Idle. Hopefully this doesn't last very long.
  • LoadAv: The load average on the computer.
  • Mem: The computer's memory in megabytes.
  • ActvtyTime: How long the computer has been doing what it's been doing.
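
To see how State and Activity combine, here is a small sketch that tallies them from condor_status-style output. It is only an illustration: count_machine_states is a hypothetical helper name, and it assumes the column layout shown above, where State is the fourth column, Activity the fifth, and LoadAv (a decimal number) the sixth.

```shell
# Sketch: count machines by State/Activity from condor_status-style text on stdin.
# count_machine_states is a hypothetical helper, not a Condor command.
count_machine_states() {
    # Machine lines are recognized by a decimal load average in column 6;
    # the header and the totals at the bottom do not match.
    awk '$6 ~ /^[0-9]+\.[0-9]+$/ { count[$4 "/" $5]++ }
         END { for (s in count) print s, count[s] }' | sort
}
```

You could try it with: condor_status | count_machine_states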

Extra credit

What else can you find out with condor_status? Try any one of:

Note in particular the options like -master and -schedd. When would these be useful? When would the -l option be useful?


Next: Submitting your first Condor job