Next: 3.2 Installation
Up: 3. Administrators' Manual
Previous: 3. Administrators' Manual
Contents
Index
Subsections
3.1 Introduction
This is the Condor Administrator's Manual for Unix.
Its purpose is to aid in
the installation and administration of a Condor pool. For help on
using Condor, see the Condor User's Manual.
A Condor pool
is comprised of a single machine which serves as the
central manager,
and an arbitrary number of other machines that
have joined the pool. Conceptually, the pool is a collection of
resources (machines) and resource requests (jobs). The role of Condor
is to match waiting requests with available resources. Every part of
Condor sends periodic updates to the central manager, the centralized
repository of information about the state of the pool. Periodically,
the central manager assesses the current state of the pool and tries
to match pending requests with the appropriate resources.
Each resource has an owner,
the user who works at the machine. This
person has absolute power over their own resource and Condor goes out
of its way to minimize the impact on this owner caused by Condor. It
is up to the resource owner to define a policy for when Condor
requests will
serviced and when they will be denied.
Each resource request has an owner as well: the
user who submitted the job. These people want Condor to provide as
many CPU cycles as possible for their work. Often the interests of
the resource owners are in conflict with the interests of the resource
requesters.
The job of the Condor administrator is to configure the Condor pool to
find the happy medium that keeps both resource owners and users of
resources satisfied. The purpose of this manual is to help you
understand the mechanisms that Condor provides to enable you to find
this happy medium for your particular set of users and resource owners.
3.1.1 The Different Roles a Machine Can Play
Every machine in a Condor pool can serve a variety of roles. Most
machines serve more than one role simultaneously. Certain roles can
only be performed by single machines in your pool. The following list
describes what these roles are and what resources are required on the
machine that is providing that service:
- Central Manager
- There can be only one central manager for your
pool. The machine is the collector of information, and the negotiator
between resources and resource requests. These two halves of the
central manager's responsibility are performed by separate daemons, so
it would be possible to have different machines providing those two
services. However, normally they both live on the same machine. This
machine plays a very important part in the Condor pool and should be
reliable. If this machine crashes, no further matchmaking can be
performed within the Condor system (although all current matches
remain in effect until they are broken by either party involved in the
match). Therefore, choose for central manager
a machine that is likely to be
up and running all the time, or at least one that will be rebooted quickly if
something goes wrong.
The central manager will
ideally have a good network connection to all the
machines in your pool, since they all send updates over the network to
the central manager. All queries go to the central manager.
- Execute
- Any machine in your pool (including your Central
Manager) can be configured for whether or not it should execute Condor
jobs. Obviously, some of your machines will have to serve this
function or your pool won't be very useful. Being an execute machine
doesn't require many resources at all. About the only resource that
might matter is disk space, since if the remote job dumps core, that
file is first dumped to the local disk of the execute machine before
being sent back to the submit machine for the owner of the job.
However, if there isn't much disk space, Condor will simply limit the
size of the core file that a remote job will drop. In general the
more resources a machine has (swap space, real memory, CPU speed,
etc.) the larger the resource requests it can serve. However, if
there are requests that don't require many resources, any machine
in your pool could serve them.
- Submit
- Any machine in your pool (including your Central
Manager) can be configured for whether or not it should allow Condor
jobs to be submitted.
The resource requirements for a submit machine
are actually much greater than the resource requirements for an
execute machine. First of all, every job that you submit that is
currently running on a remote machine generates another process on
your submit machine. So, if you have lots of jobs running, you will
need a fair amount of swap space and/or real memory. In addition all
the checkpoint files from your jobs are stored on the local disk of
the machine you submit from. Therefore, if your jobs have a large
memory image and you submit a lot of them, you will need a lot of disk
space to hold these files. This disk space requirement can be
somewhat alleviated with a checkpoint server (described below),
however the binaries of the jobs you submit are still stored on the
submit machine.
- Checkpoint Server
- One machine in your pool can be configured as
a checkpoint server.
This is optional, and is not part of the
standard Condor binary distribution. The checkpoint server is a
centralized machine that stores all the checkpoint files for the jobs
submitted in your pool. This machine should have lots of disk space
and a good network connection to the rest of your pool, as the traffic
can be quite heavy.
Now that you know the various roles a machine can play in a Condor
pool, we will describe the actual daemons within Condor that implement
these functions.
3.1.2 The Condor Daemons
The following list describes all the daemons and programs that could
be started under Condor and what they do:
- condor_master
- This daemon
is responsible for keeping all the
rest of the Condor daemons running on each machine in your pool. It
spawns the other daemons, and periodically checks to see if there are
new binaries installed for any of them. If there are, the master will
restart the affected daemons. In addition, if any daemon crashes, the
master will send e-mail to the Condor Administrator of your pool and
restart the daemon. The condor_master also supports various
administrative commands that let you start, stop or reconfigure
daemons remotely. The condor_master will run on every machine in
your Condor pool, regardless of what functions each machine are
performing.
- condor_startd
- This daemon
represents a given resource
(namely, a machine capable of running jobs) to the Condor pool. It
advertises certain attributes about that resource that are used to
match it with pending resource requests. The startd will run on any
machine in your pool that you wish to be able to execute jobs. It is
responsible for enforcing the policy that resource owners configure
which determines under what conditions remote jobs will be started,
suspended, resumed, vacated, or killed. When the startd is ready to
execute a Condor job, it spawns the condor_starter, described below.
- condor_starter
- This program
is the entity that actually
spawns the remote Condor job on a given machine. It sets up the
execution environment and monitors the job once it is running. When a
job completes, the starter notices this, sends back any status
information to the submitting machine, and exits.
- condor_schedd
- This daemon
represents resource requests to
the Condor pool. Any machine that you wish to allow users to submit
jobs from needs to have a condor_schedd running. When users submit
jobs, they go to the schedd, where they are stored in the job
queue, which the schedd manages. Various tools to view and
manipulate the job queue (such as condor_submit, condor_q, or
condor_rm) all must connect to the schedd to do their work. If the
schedd is down on a given machine, none of these commands will work.
The condor_schedd advertises the number of waiting jobs in its job queue and
is responsible for claiming available resources to serve those
requests. Once a schedd has been matched with a given resource, the
schedd spawns a condor_shadow (described below) to serve that
particular request.
- condor_shadow
- This program
runs on the machine where a given
request was submitted and acts as the resource manager for the
request. Jobs that are linked for Condor's standard universe, which
perform remote system calls, do so via the condor_shadow. Any
system call performed on the remote execute machine is sent over the
network, back to the condor_shadow which actually performs the
system call (such as file I/O) on the submit machine, and the result
is sent back over the network to the remote job. In addition, the
shadow is responsible for making decisions about the request (such as
where checkpoint files should be stored, how certain files should be
accessed, etc).
- condor_collector
- This daemon
is responsible for collecting
all the information about the status of a Condor pool. All other
daemons periodically send ClassAd updates to
the collector. These ClassAds contain all the information about the
state of the daemons, the resources they represent or resource
requests in the pool (such as jobs that have been submitted to a given
schedd). The condor_status command can be used to query the
collector for specific information about various parts of Condor. In
addition, the Condor daemons themselves query the collector for
important information, such as what address to use for sending
commands to a remote machine.
- condor_negotiator
- This daemon
is responsible for all the
match-making within the Condor system. Periodically, the negotiator
begins a negotiation cycle, where it queries the collector for
the current state of all the resources in the pool. It contacts each
schedd that has waiting resource requests in priority order, and tries
to match available resources with those requests. The negotiator is
responsible for enforcing user priorities in the system, where the
more resources a given user has claimed, the less priority they have
to acquire more resources. If a user with a better priority has jobs
that are waiting to run, and resources are claimed by a user with a
worse priority, the negotiator can preempt that resource and match it
with the user with better priority.
NOTE: A higher numerical value of the user priority in Condor
translate into worse priority for that user. The best priority you
can have is 0.5, the lowest numerical value, and your priority gets
worse as this number grows.
- condor_kbdd
- This daemon
is used on Linux and Windows.
On those platforms, the condor_startd frequently cannot determine
console (keyboard or mouse) activity directly from the system, and
requires a separate process to do so.
On Linux, the
condor_kbdd connects to the X Server and periodically checks to see
if there has been any activity.
On Windows, the condor_kbdd runs as the logged-in user and registers
with the system to receive keyboard and mouse events.
When it detects console activity, the condor_kbdd sends a
command to the startd. That way, the startd knows the machine owner
is using the machine again and can perform whatever actions are
necessary, given the policy it has been configured to enforce.
- condor_ckpt_server
- This is the checkpoint server.
It services requests to store and retrieve checkpoint files. If your
pool is configured to use a checkpoint server but that machine (or the
server itself is down) Condor will revert to sending the checkpoint
files for a given job back to the submit machine.
- condor_quill
- This daemon
builds and manages a database that represents a copy of the
Condor job queue.
The condor_q and condor_history tools can then query the database.
- condor_dbmsd
- This daemon
assists the condor_quill daemon.
- condor_gridmanager
- This daemon
handles management and execution of all grid
universe jobs.
The condor_schedd invokes the condor_gridmanager when
there are grid universe jobs in the queue,
and the condor_gridmanager exits when there are no more
grid universe jobs in the queue.
- condor_credd
- This daemon
runs on Windows platforms to manage password storage in a secure manner.
- condor_had
- This daemon
implements the high availability of a pool's central manager
through monitoring the communication of necessary daemons.
If the current, functioning, central manager machine
stops working, then this daemon ensures that another
machine takes its place, and becomes the central manager of
the pool.
- condor_replication
- This daemon
assists the condor_had daemon by keeping an updated copy of the
pool's state. This state provides a better transition
from one machine to the next, in the event
that the central manager machine stops working.
- condor_transferer
- This short lived daemon is invoked by
the condor_replication daemon to accomplish the task of transferring
a state file before exiting.
- condor_procd
- This daemon
controls and monitors process families within Condor. Its use
is optional in general but it must be used if privilege separation
(see Section 3.6.14) or group-ID based tracking (see
Section 3.13.12) is enabled.
- condor_job_router
- This daemon
transforms vanilla universe jobs into grid
universe jobs, such that the transformed jobs are capable
of running elsewhere, as appropriate.
- condor_lease_manager
- This daemon
manages leases in a persistent manner.
Leases are represented by ClassAds.
- condor_rooster
- This daemon wakes hibernating machines based
upon configuration details.
- condor_shared_port
- This daemon listens for incoming TCP packets
on behalf of Condor daemons, thereby reducing the number of required
ports that must be opened when Condor is accessible through a firewall.
- condor_hdfs
- This daemon manages the configuration of a
Hadoop file system as well as the invocation of a properly configured
Hadoop file system.
See figure 3.1 for a graphical representation of the
pool architecture.
Figure 3.1:
Pool Architecture
|
Next: 3.2 Installation
Up: 3. Administrators' Manual
Previous: 3. Administrators' Manual
Contents
Index
htcondor-admin@cs.wisc.edu