condor_glidein
add a remote grid resource to a local Condor pool
Synopsis
condor_glidein
[-help]
condor_glidein
[-admin address]
[-anybody]
[-archdir dir]
[-basedir basedir]
[-count CPU count]
[<Execute Task Options>]
[<Generate File Options>]
[-gsi_daemon_name cert_name]
[-idletime minutes]
[-install_gsi_trusted_ca_dir path]
[-install_gsi_gridmap file]
[-localdir dir]
[-memory MBytes]
[-project name]
[-queue name]
[-runtime minutes]
[-runonly]
[<Set Up Task Options>]
[-suffix suffix]
[-slots slot count]
<contact argument>
Description
condor_glidein temporarily adds a grid resource to a local Condor pool.
The addition is accomplished by installing and executing some of the Condor
daemons on the remote grid resource,
such that the resource reports in as part of the local Condor pool.
condor_glidein performs two separate tasks: set up and execution.
Keeping these tasks separate allows flexibility:
the user may run condor_glidein to do only one of the tasks or both,
and may customize each task.
The set up task generates a script that may be used
to start the Condor daemons during the execution task,
places this script on the remote grid resource,
composes and installs a configuration file,
and installs the condor_master, condor_startd,
and condor_starter daemons on the grid resource.
The execution task runs the script generated by
the set up task;
the goal of the script is to invoke the condor_master daemon.
The Condor job glidein_startup appears in the queue of the local
Condor pool for each invocation of condor_glidein.
To remove the grid resource from the local Condor pool,
use condor_rm to remove the glidein_startup job.
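For example, locate the glidein_startup job with condor_q,
and pass its job id to condor_rm; the job id 17.0 below is hypothetical:
% condor_q
% condor_rm 17.0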
The Condor jobs that perform the set up and execute tasks
use Condor-G and the Globus gt2 protocols to communicate with the remote resource.
Therefore,
an X.509 certificate (proxy) is required
for the user running condor_glidein.
Specify the remote grid machine with the command line
argument <contact argument>, which takes one of four forms:
- hostname
- Globus contact string
- hostname/jobmanager-<schedulername>
- -contactfile filename
The argument -contactfile filename specifies the full
path and file name of a file that contains Globus contact strings.
Each of the resources given by a Globus contact string
is added to the local Condor pool.
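As a sketch, the four forms might look like the following;
the host name, port, and file path are hypothetical:
% condor_glidein gatekeeper.site.edu
% condor_glidein gatekeeper.site.edu:2119/jobmanager-pbs
% condor_glidein gatekeeper.site.edu/jobmanager-pbs
% condor_glidein -contactfile /home/user/contact_strings.txt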
The set up task of condor_glidein
copies the binaries for the correct platform from a central server.
To obtain access to the server,
or to set up your own server, follow
instructions on the Glidein Server Setup page,
at http://www.cs.wisc.edu/condor/glidein.
Set up need only be done once per site, as the installation
is never removed.
By default, all files installed on the remote grid resource are placed in
the directory
$(HOME)/Condor_glidein.
$(HOME) is evaluated and defined
on the remote machine using a grid map.
This directory must be in a shared file system accessible
by all machines that will run the Condor daemons.
By default, the daemons' log files will also be written in this
directory.
Change this directory with the -localdir option to make
Condor daemons write to local scratch space on the execution machine.
For debugging initial problems, it may be convenient to have the log
files in the more accessible default directory.
If using the default directory,
occasionally clean up old log and execute directories
to avoid running out of space.
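For example, to keep the installation in the default base directory,
but write the log and execute directories to local scratch space
(the path is hypothetical):
% condor_glidein -localdir /scratch/condor_glidein gatekeeper.site.edu/jobmanager-pbs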
To have
10 grid resources running PBS at a grid site with a
gatekeeper named gatekeeper.site.edu join the local Condor pool:
% condor_glidein -count 10 gatekeeper.site.edu/jobmanager-pbs
If you try something like the above and condor_glidein is not able to
automatically determine everything it needs to know about the remote site,
it will ask you to provide more information. A typical result of this
process is something like the following command:
% condor_glidein \
-count 10 \
-arch 6.6.7-i686-pc-Linux-2.4 \
-setup_jobmanager jobmanager-fork \
gatekeeper.site.edu/jobmanager-pbs
The Condor jobs that do the set up and execute tasks
will appear in the queue for the local Condor pool.
After a successful glidein,
use condor_status to verify that the remote grid resources
have become part of the local Condor pool.
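For example, once the glidein daemons have started, the remote machines
appear in the output of condor_status; the constraint below, with a
hypothetical host name, narrows the listing to one remote machine:
% condor_status
% condor_status -constraint 'Machine == "node01.site.edu"'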
A list of common problems and solutions is presented below.
Options
- -genconfig
- Create a local copy of the configuration file that may be used
on the remote resource.
The file is named glidein_condor_config.<suffix>.
The string defined by suffix defaults to the process
id (PID) of the condor_glidein process or is defined with
the -suffix command line option.
The configuration file may be edited for later use with
the -useconfig option; see the combined example following this list.
- -genstartup
- Create a local copy of the script used
on the remote resource to invoke the condor_master.
The file is named glidein_startup.<suffix>.
The string defined by suffix defaults to the process
id (PID) of the condor_glidein process or is defined with
the -suffix command line option.
The file may be edited for later use with the
-usestartup option.
- -gensubmit
- Generate submit description files,
but do not submit.
The submit description file for the set up task is
named glidein_setup.submit.<suffix>.
The submit description file for the execute task is
named glidein_run.submit.<suffix>.
The string defined by suffix defaults to the process
id (PID) of the condor_glidein process or is defined with
the -suffix command line option.
- -setuponly
- Do only the set up task of condor_glidein.
This option cannot be used together with -runonly.
- -setup_here
- Do the set up task on the local machine,
instead of at a remote grid resource.
This may be used, for example,
to do the set up task of condor_glidein in an AFS area that is read-only
from the remote grid resource.
- -forcesetup
- During the set up task, force the copying of files,
even if this overwrites existing files.
Use this to push out changes to the configuration.
- -useconfig config_file
- The set up task copies the specified configuration file,
rather than generating one.
- -usestartup startup_file
- The set up task copies the specified startup script,
rather than generating one.
- -setup_jobmanager jobmanagername
- Identifies the jobmanager on the remote grid resource
to receive the files during the set up task. If a
reasonable default can be discovered through MDS, this is optional.
jobmanagername is a string representing any gt2
name for the job manager. The correct string in most cases
will be jobmanager-fork.
Other common strings may be jobmanager, jobmanager-condor,
jobmanager-pbs, and jobmanager-lsf.
- -runonly
- Starts execution of the Condor daemons on the grid
resource. If any of the necessary files or executables are missing,
condor_glidein exits with an error code.
This option cannot be used together with -setuponly.
- -run_here
- Runs condor_master directly rather than submitting a Condor job
that causes the remote execution.
To instead generate a script that
does this, use -run_here in combination with
-gensubmit. This may be useful for running Condor daemons on
resources that are not directly accessible by Condor.
- -help
- Display brief usage information and exit.
- -basedir basedir
- Specifies the base directory on the remote grid resource
used for placing files.
The default directory is $(HOME)/Condor_glidein
on the grid resource.
- -archdir dir
- Specifies the directory on the remote grid resource for placement
of the Condor executables.
The default value for -archdir is based upon version information from the grid resource.
It is of the form
basedir/<Condor version>-<Globus canonical system name>.
An example of the directory
(without the base directory)
for Condor version 7.6.0
running on a 64-bit Intel processor with RHEL 3 is
7.6.0-x86_64-pc-Linux-2.4-glibc2.3 .
- -localdir dir
- Specifies the directory on the remote grid resource
in which to create log and execution subdirectories needed by Condor.
If limited disk quota in the home or base directory
on the grid resource is a problem,
set -localdir to a large temporary space,
such as /tmp or /scratch. If the batch system
requires invocation of Condor daemons in a temporary scratch directory,
'.' may be used for the definition of the -localdir option.
- -arch architecture
- Identifies the platform of the required tarball
containing the correct Condor daemon executables to download
and install.
If a reasonable default can be discovered through MDS, this is
optional. A list of possible values may be found at
http://www.cs.wisc.edu/condor/glidein/binaries.
The architecture
name is the same as the tarball name without the suffix
tar.gz. An example is 6.6.5-i686-pc-Linux-2.4 .
- -queue name
- The argument name is a string used at the
grid resource to identify a job queue.
- -project name
- The argument name is a string used at the
grid resource to identify a project name.
- -memory MBytes
- The maximum memory size in Megabytes to request from the grid resource.
- -count CPU count
- The number of CPUs requested to join the local pool.
The default is 1.
- -slots slot count
- For machines with multiple CPUs, the CPUs may be divided
up into slots. slot count is the number
of slots that results.
By default, Condor divides multiple-CPU resources such that
each CPU is a slot, each with an equal share of RAM,
disk, and swap space.
This option configures the number of slots, so that
multi-threaded jobs can run in a slot with multiple
CPUs.
For example, if 4 CPUs are requested and
-slots is not specified, Condor will
divide the request up into 4 slots with 1 CPU each.
However, if -slots 2 is specified,
Condor will divide the request up into 2 slots with
2 CPUs each, and if -slots 1 is
specified, Condor will put all 4 CPUs into one slot.
- -idletime minutes
- The amount of time, in minutes, that a remote grid resource
may remain in the idle state before the daemons shut down.
A value of 0 (zero) means that the daemons never shut
down due to remaining in the idle state.
In this case,
the -runtime option defines when the daemons shut down.
The default value is 20 minutes.
- -runtime minutes
- The maximum amount of time the Condor daemons on
the remote grid resource
will run before shutting themselves down. This option is useful
for resources with enforced maximum run times. Setting
-runtime to be a few minutes shorter than the enforced
limit gives the daemons time to perform a graceful shut down.
- -anybody
- Sets the Condor START expression for the added
remote grid resource to True.
This permits any user's job that can run on
the added remote grid resource to do so.
Without this option, only jobs owned by the user executing
condor_glidein can execute on the remote grid resource. WARNING:
Using this option may violate the usage policies of many
institutions.
- -admin address
- Where to send e-mail about problems.
The default is the login of the user running
condor_glidein at the UID domain of the local Condor pool.
- -suffix X
- The suffix to use when generating files. The default is the
process id (PID) of condor_glidein.
- -gsi_daemon_name cert_name
- Includes and enables GSI authentication in the
configuration for the remote grid resource.
The argument is the GSI
certificate name that the daemons will use to authenticate
themselves.
- -install_gsi_trusted_ca_dir path
- The argument identifies the directory
containing the trusted CA certificates that the
daemons are to use
(for example, /etc/grid-security/certificates).
The contents of this directory will be installed at the remote
site in the directory basedir/grid-security.
- -install_gsi_gridmap file
- The argument is the file name
of the GSI-specific X.509 map file that the daemons will use.
The file will be installed at the remote site
in basedir/grid-security.
The file contains
entries mapping certificates to user names. At the
very least, it must contain an entry for the certificate
given by the command-line option -gsi_daemon_name.
If other Condor daemons use different certificates,
this file must also list any certificates that the daemons
will encounter from
the condor_schedd, condor_collector, and condor_negotiator.
See section 3.6.3 for more information.
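As an illustration of how several of these options combine, the
following sketch splits glidein into separate set up and execute steps,
using a locally generated and edited configuration file; the host name
and suffix are hypothetical:
% condor_glidein -genconfig -suffix mysite gatekeeper.site.edu/jobmanager-pbs
  (edit glidein_condor_config.mysite as needed)
% condor_glidein -setuponly -useconfig glidein_condor_config.mysite \
    -setup_jobmanager jobmanager-fork gatekeeper.site.edu/jobmanager-pbs
% condor_glidein -runonly -count 4 -slots 2 -idletime 30 \
    gatekeeper.site.edu/jobmanager-pbs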
Exit Status
condor_glidein exits with a status value of 0 (zero) upon
complete success,
or with a non-zero value upon failure.
The status value will be 1 (one) if
condor_glidein encountered an error making a directory,
was unable to copy a tar file,
encountered an error in parsing the command line,
or was not able to gather required information.
The status value will be 2 (two) if
there was an error in the remote set up.
The status value will be 3 (three) if
there was an error in remote submission.
The status value will be -1 (negative one) if
no resource was specified in the command line.
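A minimal shell sketch that acts on these status values; the contact
string is hypothetical, and an exit status of -1 appears to the shell
as 255, caught by the default case:
#!/bin/sh
# Submit glideins, then report the outcome based on the exit status.
condor_glidein -count 10 gatekeeper.site.edu/jobmanager-pbs
case $? in
  0) echo "complete success" ;;
  1) echo "error making a directory, copying a tar file, parsing the command line, or gathering required information" ;;
  2) echo "error in the remote set up" ;;
  3) echo "error in remote submission" ;;
  *) echo "no resource specified, or other failure (-1 appears as 255)" ;;
esac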
Common problems are listed below. Many of these are best discovered by
looking in the StartLog log file on the remote grid resource.
- WARNING: The file xxx is not writable by condor
- This error occurs when
condor_glidein is run in a directory that does not give Condor
the permissions needed to access files. A common case is an AFS
directory, because Condor does not have the user's AFS ACLs.
- Glideins fail to run due to GLIBC errors
- Check the list of
available glidein binaries
(http://www.cs.wisc.edu/condor/glidein/binaries), and try
specifying the architecture name that includes the correct glibc
version for the remote grid site.
- Glideins join pool but no jobs run on them
- One common
cause of this problem is that the remote grid resources are in a different
file system domain, and the submitted Condor jobs
have an implicit
requirement that they must run in the same file system domain.
See section 2.5.4 for details on
using Condor's file transfer capabilities to solve this problem;
a minimal submit description file sketch also follows this list.
Another cause of this
problem is a communication failure. For example, a firewall may be
preventing the condor_negotiator or the condor_schedd daemons
from connecting to
the condor_startd on the remote grid resource.
Although work is being done to remove
this requirement in the future, it is currently necessary to have full
bidirectional connectivity, at least over a restricted range of
ports. See the manual's information on configuring a port range;
a configuration sketch also follows this list.
- Glideins run but fail to join the pool
- This may be caused by
the local pool's security settings or by a communication failure. Check
that the security settings in the local pool's configuration file allow
write access to the remote grid resource. To avoid modifying
the security settings of the existing pool, run a separate pool
specifically for the remote grid resources,
and use flocking to balance jobs across
the two pools. If the log files
indicate a communication failure, then see the next item.
- The startd cannot connect to the collector
- This may be caused
by several things. One is a firewall. Another is when the compute
nodes do not even have outgoing network access. Configuration
to work without full network access to and from the compute nodes is
still experimental, so for now, the short answer is that
you must at least have a range of open (bidirectional) ports and set
up the configuration file accordingly: use the option -genconfig,
edit the generated configuration file,
and then do the glidein execute task with the option -useconfig.
Another possible cause of connectivity problems may be the use of UDP by
the condor_startd to register itself with the condor_collector.
Force it to use TCP, as in the configuration sketch following this list.
Yet another possible cause of connectivity problems is when the
remote grid resources
have more than one network interface, and the default one chosen
by Condor is not the correct one. One way to fix this is to modify
the glidein startup script using the -genstartup and -usestartup
options.
The script needs to determine the IP address associated with
the correct network interface, and assign this to the environment
variable _condor_NETWORK_INTERFACE.
- NFS file locking problems
- If the -localdir option
places files on NFS (not recommended, but sometimes convenient
for testing), the Condor daemons may have trouble manipulating file
locks. Try inserting the following into the configuration file:
IGNORE_NFS_LOCK_ERRORS = True
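For the file system domain problem above, enabling Condor's file
transfer in the job's submit description file is usually sufficient;
a minimal sketch of the relevant submit description file commands:
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
For the connectivity problems above, a sketch of configuration file
entries that restrict Condor to a fixed port range and force the
condor_startd to send its collector updates over TCP; the port values
are examples only, and the pool's condor_collector may also need
configuration to accept TCP updates:
LOWPORT = 20000
HIGHPORT = 20100
UPDATE_COLLECTOR_WITH_TCP = True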