4.5 Application Program Interfaces
4.5.1 Web Service
Condor's Web Service (WS) API provides a way for application developers
to interact with Condor, without needing to utilize
Condor's command-line tools.
In keeping with the Condor philosophy of reliability and fault-tolerance,
this API is designed to provide a simple and powerful way
to interact with Condor.
Condor daemons understand and implement
the SOAP (Simple Object Access Protocol) XML API
to provide a web service interface for Condor job submission
and management.
To deal with the issues of reliability and fault-tolerance,
a two-phase commit mechanism provides a transaction-based protocol.
The following API description describes interaction
between a client using the API and both the condor_schedd and
condor_collector daemons to illustrate transactions
for use in job submission, queue management and ClassAd
management functions.
4.5.1.1 Transactions
All applications using the API to interact with the condor_schedd
will need to use transactions.
A transaction is
an ACID unit of work (atomic, consistent, isolated, and durable).
The API limits the lifetime of a transaction,
and both the client (application) and the server
(the condor_schedd daemon)
may place a limit on the lifetime.
The server reserves the right to specify a maximum
duration for a transaction.
The client initiates a transaction using the
beginTransaction() method.
It ends the transaction with either
a commit (using commitTransaction())
or an abort (using abortTransaction()).
Not all operations in the API need to be performed within a
transaction.
Some accept a null transaction.
A null transaction is a SOAP message with
<transaction xsi:type="ns1:Transaction" xsi:nil="true"/>
Often this is achieved by passing the programming
language's equivalent of null
in place of a transaction identifier.
It is possible that some operations will have access to more
information when they are used inside a transaction. For instance, a
getJobAds() query would have access to the jobs that are pending in a
transaction, which are not committed and therefore not visible
outside of the transaction.
Transactions are as ACID compliant as possible.
Therefore, do not make a decision inside a transaction
based on the results of a query performed
outside of that transaction.
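The begin, commit, and abort pattern described above can be sketched as follows. This is a minimal Python illustration, not Condor code: FakeSchedd is a hypothetical in-memory stand-in for the stub that a SOAP toolkit would generate from the WSDL, and run_in_transaction is a hypothetical helper. Only the method names beginTransaction(), commitTransaction(), and abortTransaction() come from the API.

```python
class FakeSchedd:
    """Hypothetical in-memory stand-in for a toolkit-generated SOAP stub."""

    def __init__(self):
        self.next_id = 1
        self.open_transactions = set()
        self.log = []  # records the order of calls, for illustration only

    def beginTransaction(self, duration):
        txn = self.next_id
        self.next_id += 1
        self.open_transactions.add(txn)
        self.log.append(("begin", txn))
        return txn

    def commitTransaction(self, txn):
        self.open_transactions.remove(txn)
        self.log.append(("commit", txn))

    def abortTransaction(self, txn):
        self.open_transactions.remove(txn)
        self.log.append(("abort", txn))


def run_in_transaction(schedd, work, duration=60):
    """Run work(txn) inside a transaction: commit on success, abort on error."""
    txn = schedd.beginTransaction(duration)
    try:
        result = work(txn)
    except Exception:
        # Any failure inside the unit of work discards the whole transaction.
        schedd.abortTransaction(txn)
        raise
    schedd.commitTransaction(txn)
    return result
```

Wrapping every unit of work this way ensures that a transaction is never left open by the client after an error, which matters because the server limits transaction lifetimes.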
4.5.1.2 Job Submission
A ClassAd is required to describe a job.
The job ClassAd will be
submitted to the condor_schedd within a transaction
using the submit() method.
The complexity of job ClassAd creation may be simplified
by the createJobTemplate() method.
It returns an instance of a ClassAd structure that may be
further modified.
Necessary parts of the job ClassAd are the job attributes
ClusterId and ProcId, which uniquely identify
the cluster and the job within a cluster.
Allocation and assignment of (monotonically increasing)
ClusterId values utilize the newCluster() method.
Jobs may be submitted within the assigned cluster only until
the newCluster() method is invoked a subsequent time.
Each job is allocated and assigned a (monotonically increasing)
ProcId within the current cluster using the newJob()
method.
Therefore, the sequence of method calls to submit a set of jobs
initially calls newCluster().
This is followed by calls to newJob() and then submit()
for each job within the cluster.
As an example, here are sample cluster and job numbers that
result from the ordered calls to submission methods:
- A call to newCluster() assigns a ClusterId of 6.
- A call to newJob() assigns a ProcId of 0, as
this is the first job within the cluster.
- A call to submit() results in a job submission numbered 6.0.
- A call to newJob() assigns a ProcId of 1.
- A call to submit() results in a job submission numbered 6.1.
- A call to newJob() assigns a ProcId of 2.
- A call to submit() results in a job submission numbered 6.2.
- A call to newCluster() assigns a ClusterId of 7.
- A call to newJob() assigns a ProcId of 0, as
this is the first job within the cluster.
- A call to submit() results in a job submission numbered 7.0.
- A call to newJob() assigns a ProcId of 1.
- A call to submit() results in a job submission numbered 7.1.
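The numbering in the list above can be reproduced with a small model. This Python sketch is a toy, not Condor's implementation: IdAllocator and submission_ids are hypothetical names, and the starting ClusterId of 6 is simply taken from the example.

```python
class IdAllocator:
    """Toy model of ClusterId and ProcId assignment."""

    def __init__(self, first_cluster):
        self._next_cluster = first_cluster
        self._cluster = None
        self._next_proc = 0

    def newCluster(self):
        self._cluster = self._next_cluster
        self._next_cluster += 1
        self._next_proc = 0  # ProcId restarts at 0 in every new cluster
        return self._cluster

    def newJob(self):
        if self._cluster is None:
            raise RuntimeError("newCluster() must be called first")
        proc = self._next_proc
        self._next_proc += 1
        return proc


def submission_ids(jobs_per_cluster, first_cluster=6):
    """Return the 'ClusterId.ProcId' labels for the given jobs per cluster."""
    alloc = IdAllocator(first_cluster)
    ids = []
    for job_count in jobs_per_cluster:
        cluster = alloc.newCluster()
        for _ in range(job_count):
            proc = alloc.newJob()
            ids.append(f"{cluster}.{proc}")  # submit() would be called here
    return ids


print(submission_ids([3, 2]))  # ['6.0', '6.1', '6.2', '7.0', '7.1']
```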
There is the
potential that a call to submit() will fail.
Failure means that the
job is in the queue,
and it typically indicates that
something needed by the job has not been sent.
As a result, the job has no hope of running successfully.
It is possible to recover from
such a failure by trying to resend information that the job will
need. It is also completely acceptable to abort and make another
attempt. To simplify the client's effort in figuring out what the job
requires, a discoverJobRequirements() method accepting a
job ClassAd and
returning a list of things that should be sent along with the job is
provided.
4.5.1.3 File Transfer
A common job submission case requires the job's
executable and input files to be transferred
from the machine where the application is running
to the machine where the condor_schedd daemon is running.
This is analogous to running condor_submit
with the -spool or -remote option.
The executable and input files must be sent directly to
the condor_schedd daemon, which places all files
in a spool location.
The two methods
declareFile()
and sendFile() work in tandem to transfer files
to the condor_schedd daemon.
The declareFile() method causes the condor_schedd daemon
to create the file in its spool location,
or indicate in its return value that the file already exists.
This increases efficiency,
as resending an existing file is a waste of resources.
The sendFile() method sends
base64 encoded data.
sendFile() may be used to send an
entire file, or chunks of files as desired.
The declareFile() method has both required and
optional arguments.
declareFile() requires the name of the file
and its size in bytes.
The optional arguments relate to hash information.
A hash type of NOHASH disables file verification;
the condor_schedd daemon will not have a reliable way
to determine the existence of the file being declared.
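Because sendFile() accepts base64 encoded data together with an offset, a client can split a file into blocks and send each block separately. The following Python sketch shows one way to prepare such blocks. The names iter_chunks and reassemble and the chunk size are assumptions made for illustration; the actual call to sendFile() depends on the toolkit-generated stub and is not shown.

```python
import base64


def iter_chunks(data, chunk_size):
    """Yield (offset, base64_text) pairs that cover data in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, len(data), chunk_size):
        block = data[offset:offset + chunk_size]
        yield offset, base64.b64encode(block).decode("ascii")


def reassemble(chunks, size):
    """Receiver-side check: place each decoded block at its offset."""
    buf = bytearray(size)
    for offset, text in chunks:
        block = base64.b64decode(text)
        buf[offset:offset + len(block)] = block
    return bytes(buf)
```

Each (offset, text) pair corresponds to the offset and data arguments of one sendFile() call. Passing a chunk size at least as large as the file yields a single block, which matches sending the entire file at once.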
Methods for retrieving files are most useful when a job is completed.
Consider the categorization of the typical life-cycle for a job:
- Birth:
- The birth of a job begins with submit().
- Childhood:
- The job executes.
- Middle Age:
- A completed job waits to be removed.
As the job enters Middle Age,
its JobStatus ClassAd attribute becomes Completed (the value 4).
- Old Age:
- The job's information goes into the history log.
Once the job enters Middle Age,
the getFile() method retrieves a file.
The listSpool() method assists by providing
a list of all the job's files in the spool location.
The job enters Old Age by the application's use of the
closeSpool() method.
It causes the condor_schedd daemon to remove the
job from the queue,
and the job's spool files are no longer available.
As there is no requirement for the application to invoke
the closeSpool() method,
jobs can potentially remain in the queue forever.
The configuration variable SOAP_LEAVE_IN_QUEUE
may mitigate this problem.
When this boolean variable evaluates to False,
a job enters Old Age.
A reasonable example for this configuration variable is
SOAP_LEAVE_IN_QUEUE = ((JobStatus==4) && ((ServerTime - CompletionDate) < (60 * 60 * 24)))
This expression results in Old Age for a job (removal from the queue)
once the job has been in Middle Age (completed) for 24 hours.
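To make the arithmetic of the example expression concrete, here is a plain Python rendering of it. This is only a sketch: it is not how Condor evaluates ClassAd expressions, it ignores ClassAd UNDEFINED semantics, and the function name leave_in_queue is hypothetical.

```python
COMPLETED = 4            # JobStatus value for Completed, as stated above
ONE_DAY = 60 * 60 * 24   # 86400 seconds


def leave_in_queue(job_status, server_time, completion_date):
    """Mirror of the example expression; True means the job stays queued."""
    return job_status == COMPLETED and (server_time - completion_date) < ONE_DAY
```

For a job that completed at time 1000, the expression is True one hour later (at 4600) and becomes False once 86400 seconds have elapsed (at 87400).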
4.5.1.4 Implementation Details
Condor daemons understand and communicate using the
SOAP XML protocol.
An application seeking to use this protocol
will require code that handles the communication.
The XML WSDL (Web Services Description Language)
that Condor implements is included with the
Condor distribution.
It is in $(RELEASE_DIR)/lib/webservice.
The WSDL must be run through a toolkit to produce
language-specific routines that do communication.
The application is compiled with these routines.
Condor must be configured to enable responses to SOAP calls.
Please see
section 3.3.33 for definitions of the
configuration variables related to the web services API.
The WS interface is listening on the condor_schedd daemon's command port.
To obtain a list of all the condor_schedd daemons in the
pool with a WS interface, issue the command:
% condor_status -schedd -constraint "HasSOAPInterface=?=TRUE"
With this information,
a further command locates the port number to use:
% condor_status -schedd -constraint "HasSOAPInterface=?=TRUE" -l | grep MyAddress
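A client program may want to extract the host and port from the MyAddress line that this command prints. The Python sketch below assumes the value has the form "<host:port>", possibly followed by further parameters before the closing bracket; that format, the sample address, and the helper name parse_my_address are assumptions, and IPv6 addresses are not handled.

```python
import re

# Assumed line shape: MyAddress = "<192.0.2.10:9618>" (sample address is made up)
_MY_ADDRESS = re.compile(r'MyAddress\s*=\s*"<([^:>]+):(\d+)')


def parse_my_address(line):
    """Return (host, port) from a MyAddress line, or None if it does not match."""
    match = _MY_ADDRESS.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))
```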
Condor's security configuration must be set up such that
access is authorized for the SOAP client.
See Section 3.6.7
for information on how to set the
ALLOW_SOAP and DENY_SOAP configuration variables.
The API's routines can be roughly categorized into ones that
deal with
- Transactions
- Job Submission
- File Transfer
- Job Management
- ClassAd Management
- Version Information
The routines for each of these categories are detailed below.
Note that the signature provided will accurately
reflect a routine's name,
but that return values and parameter specification
will vary according to the target programming language.
4.5.1.5 Get These Items Correct
- For jobs that are to be executed on Windows platforms,
explicitly set the job ClassAd attribute NTDomain.
This attribute defines the NT domain within which the
job's owner authenticates. The attribute is necessary,
and it is not set for the job by the createJobTemplate()
function.
4.5.1.6 Methods for Transaction Management
- beginTransaction
- Begin a transaction.
A prototype is
StatusAndTransaction beginTransaction(int duration);
- Parameters
- duration The expected duration of the transaction.
- Return Value
- If the function succeeds, the return value is SUCCESS;
otherwise, see StatusCode for valid return values. Additionally,
on success, the return value contains the new transaction.
- commitTransaction
- Commits a transaction.
A prototype is
Status commitTransaction(Transaction transaction);
- Parameters
- transaction The transaction to be committed.
- Return Value
- If the function succeeds, the return value is SUCCESS;
otherwise, see StatusCode for valid return values.
- abortTransaction
- Abort a transaction.
A prototype is
Status abortTransaction(Transaction transaction);
- Parameters
- transaction The transaction to be aborted.
- Return Value
- If the function succeeds, the return value is SUCCESS;
otherwise, see StatusCode for valid return values.
- extendTransaction
- Request an extension in duration for a specific transaction.
A prototype is
StatusAndTransaction extendTransaction(
Transaction transaction, int duration);
- Parameters
- transaction The transaction to be extended.
- duration The duration of the extension.
- Return Value
- If the function succeeds, the return value is SUCCESS;
otherwise, see StatusCode for valid return values. Additionally,
on success, the return value contains the transaction with the extended
duration.
4.5.1.7 Methods for Job Submission
- submit
- Submit a job.
A prototype is
StatusAndRequirements submit(Transaction transaction,
int clusterId, int jobId, ClassAd jobAd);
- Parameters
- transaction
The transaction in which the submission takes place.
- clusterId The cluster identifier.
- jobId The job identifier.
- jobAd
The ClassAd describing the job. Creation of this ClassAd can be simplified
with createJobTemplate().
- Return Value
- If the function succeeds, the return value is SUCCESS;
otherwise, see StatusCode for valid return values. Additionally,
the return value contains the job's requirements.
- createJobTemplate
- Request a job ClassAd, given some of the job requirements.
This job ClassAd will be suitable for use when submitting the job.
Note that the job attribute NTDomain is not set by this
function, but must be set for jobs that will execute on Windows
platforms.
A prototype is
StatusAndClassAd createJobTemplate(int clusterId,
int jobId, String owner, UniverseType type, String command,
String arguments, String requirements);
- Parameters
- clusterId The cluster identifier.
- jobId The job identifier.
- owner
The name to be associated with the job.
- type
The universe under which the job will run, where type can be
one of the following:
enum UniverseType { STANDARD = 1, VANILLA = 5,
SCHEDULER = 7, MPI = 8, GRID = 9, JAVA = 10,
PARALLEL = 11, LOCALUNIVERSE = 12, VM = 13 };
- command
The command to execute once the job has started.
- arguments
The command-line arguments for command.
- requirements
The requirements expression for the job. For further details
and examples of the expression syntax, please refer to
section 4.1.
- Return Value
- If the function succeeds, the return value is SUCCESS;
otherwise, see StatusCode for valid return values.
- discoverJobRequirements
- Discover the requirements of a job, given a ClassAd. May be helpful
in determining what should be sent along with the job.
A prototype is
StatusAndRequirements discoverJobRequirements(
ClassAd jobAd);
- Parameters
- jobAd The ClassAd of the job.
- Return Value
- If the function succeeds, the return value is SUCCESS;
otherwise, see StatusCode for valid return values. Additionally,
on success, the return value contains the job's requirements.
4.5.1.8 Methods for File Transfer
- declareFile
- Declare a file that may be used by a job.
A prototype is
Status declareFile(Transaction transaction, int clusterId,
int jobId, String name, int size, HashType hashType, String hash);
- Parameters
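- transaction
The transaction in which this file is declared.
- clusterId The cluster identifier.
- jobId
An identifier of the job that will use the file.
- name
The name of the file being declared.
- size
The size of the file, in bytes.
- hashType
The type of hash used to verify the file; NOHASH disables file verification.
- hash
The hash of the file, as a string.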
- Return Value
- If the function succeeds, the return value is SUCCESS;
otherwise, see StatusCode for valid return values.
- sendFile
- Send a file that a job may use.
A prototype is
Status sendFile(Transaction transaction, int clusterId,
int jobId, String name, int offset, Base64 data);
- Parameters
- transaction
The transaction in which this file is sent.
- clusterId The cluster identifier.
- jobId
An identifier of the job that will use the file.
- name
The name of the file being sent.
- offset
The starting offset within the file being sent.
- length
The length from the offset to send.
- data
The data block being sent. This could be the entire file or a
sub-section of the file as defined by offset and
length.
- Return Value
- If the function succeeds, the return value is SUCCESS;
otherwise, see StatusCode for valid return values.
- getFile
- Get a file from a job's spool.
A prototype is
StatusAndBase64 getFile(Transaction transaction,
int clusterId, int jobId, String name, int offset, int length);
- Parameters
- transaction
An optionally nullable transaction, meaning this call does not
need to occur in a transaction.
- clusterId The cluster in which to search.
- jobId
The job identifier the file is associated with.
- name
The name of the file to retrieve.
- offset
The starting offset within the file being retrieved.
- length
The length from the offset to retrieve.
- Return Value
- If the function succeeds, the return value is SUCCESS;
otherwise, see StatusCode for valid return values. Additionally,
on success, the return value contains the file or a sub-section of the
file as defined by offset and length.
- closeSpool
- Close a job's spool.
All the files in the job's spool can be deleted.
A prototype is
Status closeSpool(Transaction transaction, int clusterId,
int jobId);
- Parameters
- transaction
An optionally nullable transaction, meaning this call does not
need to occur in a transaction.
- clusterId
The cluster identifier which the job is associated with.
- jobId
The job identifier for which the spool is to be removed.
- Return Value
- If the function succeeds, the return value is SUCCESS;
otherwise, see StatusCode for valid return values.
- listSpool
- List the files in a job's spool.
A prototype is
StatusAndFileInfoArray listSpool(Transaction transaction,
int clusterId, int jobId);
- Parameters
- transaction
An optionally nullable transaction, meaning this call does not
need to occur in a transaction.
- clusterId The cluster in which to search.
- jobId The job identifier to search for.
- Return Value
- If the function succeeds, the return value is SUCCESS;
otherwise, see StatusCode for valid return values. Additionally,
on success, the return value contains a list of files and their
respective sizes.
4.5.1.9 Methods for Job Management
- newCluster
- Create a new job cluster.
A prototype is
StatusAndInt newCluster(Transaction transaction);
- Parameters
- transaction
The transaction in which this cluster is created.
- Return Value
- If the function succeeds, the return value is SUCCESS;
otherwise, see StatusCode for valid return values. Additionally,
on success, the return value contains the cluster id.
- removeCluster
- Remove a job cluster, and all the jobs within it.
A prototype is
Status removeCluster(Transaction transaction, int clusterId,
String reason);
- Parameters
- transaction
An optionally nullable transaction, meaning this call does not
need to occur in a transaction.
- clusterId
The cluster to remove.
- reason
The reason for the removal.
- Return Value
- If the function succeeds, the return value is SUCCESS;
otherwise, see StatusCode for valid return values.
- newJob
- Creates a new job within the most recently created job cluster.
A prototype is
StatusAndInt newJob(Transaction transaction, int clusterId);
- Parameters
- transaction
The transaction in which this job is created.
- clusterId
The cluster identifier of the most recently created cluster.
- Return Value
- If the function succeeds, the return value is SUCCESS;
otherwise, see StatusCode for valid return values. Additionally,
on success, the return value contains the job id.
- removeJob
- Remove a job, regardless of the job's state.
A prototype is
Status removeJob(Transaction transaction, int clusterId,
int jobId, String reason, boolean forceRemoval);
- Parameters
- transaction
An optionally nullable transaction, meaning this call does not
need to occur in a transaction.
- clusterId The cluster identifier to search in.
- jobId The job identifier to search for.
- reason The reason for the removal.
- forceRemoval
Set if the job should be forcibly removed.
- Return Value
- If the function succeeds, the return value is SUCCESS;
otherwise, see StatusCode for valid return values.
- holdJob
- Put a job into the Hold state, regardless of the job's current state.
A prototype is
Status holdJob(Transaction transaction, int clusterId,
int jobId, String reason, boolean emailUser, boolean emailAdmin,
boolean systemHold);
- Parameters
- transaction
An optionally nullable transaction, meaning this call does not
need to occur in a transaction.
- clusterId The cluster in which to search.
- jobId The job identifier to search for.
- reason The reason for the hold.
- emailUser
Set if the submitting user should be notified.
- emailAdmin
Set if the administrator should be notified.
- systemHold
Set if the job should be put on hold.
- Return Value
- If the function succeeds, the return value is SUCCESS;
otherwise, see StatusCode for valid return values.
- releaseJob
- Release a job that has been in the Hold state.
A prototype is
Status releaseJob(Transaction transaction, int clusterId,
int jobId, String reason, boolean emailUser, boolean emailAdmin);
- Parameters
- transaction
An optionally nullable transaction, meaning this call does not
need to occur in a transaction.
- clusterId The cluster in which to search.
- jobId The job identifier to search for.
- reason The reason for the release.
- emailUser
Set if the submitting user should be notified.
- emailAdmin
Set if the administrator should be notified.
- Return Value
- If the function succeeds, the return value is SUCCESS;
otherwise, see StatusCode for valid return values.
- getJobAds
- Find all job ClassAds that match a constraint.
A prototype is
StatusAndClassAdArray getJobAds(Transaction transaction,
String constraint);
- Parameters
- transaction
An optionally nullable transaction, meaning this call does not
need to occur in a transaction.
- constraint
A string constraining the number of ClassAds to return. For further details
and examples of the constraint syntax, please refer to
section 4.1.
- Return Value
- If the function succeeds, the return value is SUCCESS;
otherwise, see StatusCode for valid return values. Additionally,
on success, the return value contains all job ClassAds matching the
given constraint.
- getJobAd
- Finds a specific job ClassAd.
This method returns much the same result as the first element of the array
returned by
getJobAds(transaction, "(ClusterId==clusterId && ProcId==jobId)")
A prototype is
StatusAndClassAd getJobAd(Transaction transaction,
int clusterId, int jobId);
- Parameters
- transaction
An optionally nullable transaction, meaning this call does not
need to occur in a transaction.
- clusterId The cluster in which to search.
- jobId The job identifier to search for.
- Return Value
- If the function succeeds, the return value is SUCCESS;
otherwise, see StatusCode for valid return values. Additionally,
on success, the return value contains the requested ClassAd.
- requestReschedule
- Request a condor_reschedule from the condor_schedd daemon.
A prototype is
Status requestReschedule();
- Return Value
- If the function succeeds, the return value is SUCCESS;
otherwise, see StatusCode for valid return values.
4.5.1.10 Methods for ClassAd Management
- insertAd
- A prototype is
Status insertAd(ClassAdType type, ClassAdStruct ad);
- Parameters
- type
The type of ClassAd to insert, where type can be one of the
following:
enum ClassAdType {
STARTD_AD_TYPE, QUILL_AD_TYPE,
SCHEDD_AD_TYPE, SUBMITTOR_AD_TYPE,
LICENSE_AD_TYPE, MASTER_AD_TYPE,
CKPTSRVR_AD_TYPE, COLLECTOR_AD_TYPE,
STORAGE_AD_TYPE, NEGOTIATOR_AD_TYPE,
HAD_AD_TYPE, GENERIC_AD_TYPE };
- ad The ClassAd to insert.
- Return Value
- If the function succeeds, the return value is SUCCESS;
otherwise, see StatusCode for valid return values.
- queryStartdAds
- A prototype is
ClassAdArray queryStartdAds(String constraint);
- Parameters
- constraint
A string constraining the number of ClassAds to return. For further details
and examples of the constraint syntax, please refer to
section 4.1.
- Return Value
- A list of all the condor_startd ClassAds matching the
given constraint.
- queryScheddAds
- A prototype is
ClassAdArray queryScheddAds(String constraint);
- Parameters
- constraint
A string constraining the number of ClassAds to return. For further details
and examples of the constraint syntax, please refer to
section 4.1.
- Return Value
- A list of all the condor_schedd ClassAds matching the given
constraint.
- queryMasterAds
- A prototype is
ClassAdArray queryMasterAds(String constraint);
- Parameters
- constraint
A string constraining the number of ClassAds to return. For further details
and examples of the constraint syntax, please refer to
section 4.1.
- Return Value
- A list of all the condor_master ClassAds matching the given
constraint.
- querySubmittorAds
- A prototype is
ClassAdArray querySubmittorAds(String constraint);
- Parameters
- constraint
A string constraining the number of ClassAds to return. For further details
and examples of the constraint syntax, please refer to
section 4.1.
- Return Value
- A list of all the submitter ClassAds matching the given
constraint.
- queryLicenseAds
- A prototype is
ClassAdArray queryLicenseAds(String constraint);
- Parameters
- constraint
A string constraining the number of ClassAds to return. For further details
and examples of the constraint syntax, please refer to
section 4.1.
- Return Value
- A list of all the license ClassAds matching the given constraint.
- queryStorageAds
- A prototype is
ClassAdArray queryStorageAds(String constraint);
- Parameters
- constraint
A string constraining the number of ClassAds to return. For further details
and examples of the constraint syntax, please refer to
section 4.1.
- Return Value
- A list of all the storage ClassAds matching the given constraint.
- queryAnyAds
- A prototype is
ClassAdArray queryAnyAds(String constraint);
- Parameters
- constraint
A string constraining the number of ClassAds to return. For further details
and examples of the constraint syntax, please refer to
section 4.1.
- Return Value
- A list of all the ClassAds matching the given constraint.
4.5.1.11 Methods for Version Information
- getVersionString
- A prototype is
StatusAndString getVersionString();
- Return Value
- Returns the Condor version as a string.
- getPlatformString
- A prototype is
StatusAndString getPlatformString();
- Return Value
- Returns, as a string, information about the platform on which Condor is running.
4.5.1.12 Common Data Structures
Many methods return a status.
Table 4.5 lists and defines the
StatusCode return values.
Table 4.5: StatusCode definitions
Value | Identifier | Definition
0 | SUCCESS | All OK
1 | FAIL | An error occurred that is not specific to another error code
2 | INVALIDTRANSACTION | No such transaction exists
3 | UNKNOWNCLUSTER | The specified cluster is not the currently active one
4 | UNKNOWNJOB | The specified job does not exist within the specified cluster
5 | UNKNOWNFILE |
6 | INCOMPLETE |
7 | INVALIDOFFSET |
8 | ALREADYEXISTS | For this job, the specified file already exists
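A client can mirror the values of Table 4.5 in its own code so that return statuses are checked by name rather than by number. The Python sketch below is one possible arrangement; the enumeration values come from the table, while the helper name check and the choice to raise RuntimeError are assumptions made for illustration.

```python
from enum import IntEnum


class StatusCode(IntEnum):
    """Values and identifiers as listed in Table 4.5."""
    SUCCESS = 0
    FAIL = 1
    INVALIDTRANSACTION = 2
    UNKNOWNCLUSTER = 3
    UNKNOWNJOB = 4
    UNKNOWNFILE = 5
    INCOMPLETE = 6
    INVALIDOFFSET = 7
    ALREADYEXISTS = 8


def check(code):
    """Raise RuntimeError unless code is SUCCESS; return None otherwise."""
    status = StatusCode(code)
    if status is not StatusCode.SUCCESS:
        raise RuntimeError(f"Condor call failed: {status.name} ({status.value})")
```

A caller of declareFile() may prefer to treat ALREADYEXISTS as a signal to skip sendFile() rather than as an error, since it indicates that the file is already present in the spool.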
4.5.2 The DRMAA API
The following quote from the DRMAA Specification 1.0 abstract
nicely describes the purpose of the API:
The Distributed Resource Management Application API (DRMAA),
developed by a working group of the Global Grid Forum (GGF),
provides a generalized API to distributed resource management systems
(DRMSs) in order to facilitate integration of application programs.
The scope of DRMAA is limited to job submission,
job monitoring and control,
and the retrieval of the finished job status.
DRMAA provides application developers and
distributed resource management builders
with a programming model that enables
the development of distributed applications
tightly coupled to an underlying DRMS.
For deployers of such distributed applications,
DRMAA preserves flexibility and choice in system design.
The API allows users who write programs using DRMAA functions
and link to a DRMAA library to submit,
control, and retrieve information about jobs to a Grid system.
The Condor implementation of a portion of the API
allows programs (applications) to use the library
functions provided to submit, monitor and control
Condor jobs.
See the DRMAA site
(http://www.drmaa.org) to find the
API specification for DRMAA 1.0 for further details on the API.
4.5.2.1 Implementation Details
The library was developed from the DRMAA API Specification 1.0 of January 2004
and the DRMAA C Bindings v0.9 of September 2003.
It is a static C library that expects a POSIX thread model
on Unix systems and a Windows thread model on Windows systems.
Unix systems that do not support POSIX threads
are not guaranteed thread safety when calling the library's functions.
The object library file is called libcondordrmaa.a,
and it is located within
the <release>/lib directory in the Condor download.
Its header file is called lib_condor_drmaa.h, and it is located within
the <release>/include directory in the Condor download.
Also within <release>/include is the file
lib_condor_drmaa.README,
which gives further details on the implementation.
Use of the library requires that a
local condor_schedd daemon be running,
and the program linked to the library must have
sufficient spool space.
This space should be in /tmp
or specified by the environment variables
TEMP, TMP, or SPOOL.
The program linked to the library and the local condor_schedd daemon
must have read, write, and traverse rights to the spool space.
The library currently supports the following specification-defined
job attributes:
- DRMAA_REMOTE_COMMAND
- DRMAA_JS_STATE
- DRMAA_NATIVE_SPECIFICATION
- DRMAA_BLOCK_EMAIL
- DRMAA_INPUT_PATH
- DRMAA_OUTPUT_PATH
- DRMAA_ERROR_PATH
- DRMAA_V_ARGV
- DRMAA_V_ENV
- DRMAA_V_EMAIL
The attribute DRMAA_NATIVE_SPECIFICATION can be used
to direct all commands supported within
submit description files.
See the condor_submit manual page at
section 9 for a complete list.
Multiple commands can be specified if separated by newlines.
As in the normal submit file,
arbitrary attributes can be added to the job's ClassAd
by prefixing the attribute with +. In this case, you will need to put
string values in quotation marks, the same as in a submit file.
Thus, to tell Condor that the job will likely use 64 megabytes of memory (65536
kilobytes), to more highly rank machines with more memory, and to add the
arbitrary attribute of department set to chemistry, you would set
DRMAA_NATIVE_SPECIFICATION to the C string:
drmaa_set_attribute(jobtemplate, DRMAA_NATIVE_SPECIFICATION,
"image_size=65536\nrank=Memory\n+department=\"chemistry\"",
err_buf, sizeof(err_buf)-1);
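The rules above (one command per line, a + prefix for arbitrary attributes, and quotation marks around string values) can be captured in a small helper that composes the native specification string. This Python sketch only illustrates the string format; the function name native_specification is hypothetical, it is not part of the DRMAA library, and it does not escape quotation marks that appear inside values.

```python
def native_specification(commands, extra_attrs=None):
    """Build a newline-separated string of submit commands.

    commands:    submit description file commands, written as given
    extra_attrs: arbitrary ClassAd attributes, emitted with a '+' prefix;
                 string values are wrapped in quotation marks
    """
    lines = [f"{name}={value}" for name, value in commands.items()]
    for name, value in (extra_attrs or {}).items():
        if isinstance(value, str):
            value = '"' + value + '"'
        lines.append(f"+{name}={value}")
    return "\n".join(lines)


spec = native_specification(
    {"image_size": 65536, "rank": "Memory"},
    {"department": "chemistry"},
)
```

Here spec holds the same three lines as the C string literal in the example above.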
4.5.3 The Condor User and Job Log Reader API
Condor has the ability to log a Condor job's significant events during
its lifetime.
This is enabled in the job's submit description file with the
Log command.
This section describes the API defined by the C++ ReadUserLog class,
which provides a programming interface for applications
to read and parse events,
poll for events, and save and restore reader state.
The following define enumerated types useful to the API.
- ULogEventOutcome (defined in condor_event.h):
- ULOG_OK: Event is valid
- ULOG_NO_EVENT: No event occurred (like EOF)
- ULOG_RD_ERROR: Error reading log file
- ULOG_MISSED_EVENT: Missed event
- ULOG_UNK_ERROR: Unknown Error
- ReadUserLog::FileStatus
- LOG_STATUS_ERROR: An error was encountered
- LOG_STATUS_NOCHANGE: No change in file size
- LOG_STATUS_GROWN: File has grown
- LOG_STATUS_SHRUNK: File has shrunk
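The meaning of the FileStatus values can be illustrated by comparing a log file's current size with the size seen at the previous poll. The Python sketch below is only a model of that idea, not the ReadUserLog implementation; the function names classify_size_change and poll_log are hypothetical.

```python
import os
from enum import Enum, auto


class FileStatus(Enum):
    """Names taken from ReadUserLog::FileStatus; numeric values are arbitrary."""
    LOG_STATUS_ERROR = auto()
    LOG_STATUS_NOCHANGE = auto()
    LOG_STATUS_GROWN = auto()
    LOG_STATUS_SHRUNK = auto()


def classify_size_change(previous_size, current_size):
    """Map a pair of file sizes onto a FileStatus value."""
    if current_size is None or current_size < 0:
        return FileStatus.LOG_STATUS_ERROR
    if current_size > previous_size:
        return FileStatus.LOG_STATUS_GROWN
    if current_size < previous_size:
        return FileStatus.LOG_STATUS_SHRUNK
    return FileStatus.LOG_STATUS_NOCHANGE


def poll_log(path, previous_size):
    """Return (status, size_to_remember) for the log file at path."""
    try:
        current_size = os.stat(path).st_size
    except OSError:
        # The file could not be examined; keep the old size for the next poll.
        return FileStatus.LOG_STATUS_ERROR, previous_size
    return classify_size_change(previous_size, current_size), current_size
```

A polling application would typically attempt to read new events only when the status is LOG_STATUS_GROWN.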
All ReadUserLog constructors invoke one of the initialize()
methods.
Since C++ constructors cannot return errors,
an application using any but the default constructor should call
isInitialized() to verify that the object initialized correctly
and, for example, had permission to open the required files.
Note that because the constructors cannot return status information,
most of these constructors will be eliminated in the future.
All constructors, except for the default constructor with no parameters,
will be removed.
The application will need to call the appropriate initialize() method.
- ReadUserLog::ReadUserLog(bool isEventLog)
Synopsis: Default constructor
Returns: None
Constructor parameters:
- bool isEventLog (Optional with default = false)
If true, the ReadUserLog object is
initialized to read the schedd-wide event log.
NOTE: If isEventLog is true, the initialization may
silently fail, so the value of ReadUserLog::isInitialized
should be checked to verify that the initialization was successful.
NOTE: The isEventLog parameter will be removed in the future.
- ReadUserLog::ReadUserLog(FILE *fp, bool is_xml, bool enable_close)
Synopsis: Constructor of a limited functionality reader: no rotation handling, no locking
Returns: None
Constructor parameters:
- FILE * fp
File pointer to the previously opened log file to read.
- bool is_xml
If true, the file is treated as XML; otherwise, it will be
read as an old style file.
- bool enable_close (Optional with default = false)
If true, the reader will open the file read-only.
NOTE: The ReadUserLog::isInitialized method
should be invoked to verify that this constructor was initialized
successfully.
NOTE: This constructor will be removed in the future.
- ReadUserLog::ReadUserLog(const char *filename, bool read_only)
Synopsis: Constructor to read a specific log file
Returns: None
Constructor parameters:
- const char * filename
Path to the log file to read
- bool read_only (Optional with default = false)
If true, the reader will open the file read-only and
disable locking.
NOTE: This constructor will be removed in the future.
- ReadUserLog::ReadUserLog(const FileState &state, bool read_only)
Synopsis: Constructor to continue from a persisted reader state
Returns: None
Constructor parameters:
- const FileState & state
Reference to the persisted state to restore from
- bool read_only (Optional with default = false)
If true, the reader will open the file read-only and
disable locking.
NOTE: The ReadUserLog::isInitialized
method should be invoked to verify that the object was
initialized successfully.
NOTE: This constructor will be removed in the future.
- ReadUserLog::~ReadUserLog(void)
Synopsis: Destructor
Returns: None
Destructor parameters: None
These methods perform the initialization of
ReadUserLog objects. They are used by all constructors
that do real work.
Applications should avoid those constructors;
instead, they should use the default constructor
and then call one of these initializer methods.
All of these methods return false if there is a problem,
such as being unable to open the log file,
and true if successful.
- bool ReadUserLog::initialize(void)
Synopsis: Initialize to read the EventLog file.
NOTE: This method will likely be eliminated in the future, and this
functionality will be moved to a new ReadEventLog class.
Returns: bool; true: success, false: failed
Method parameters: None
- bool ReadUserLog::initialize(const char *filename, bool handle_rotation,
bool check_for_rotated, bool read_only)
Synopsis: Initialize to read a specific log file.
Returns: bool; true: success, false: failed
Method parameters:
- const char * filename
Path to the log file to read
- bool handle_rotation (Optional with default = false)
If true, enable the reader to handle rotating log files,
which is only useful for global user logs
- bool check_for_rotated (Optional with default = false)
If true, try to open the rotated files
(with file names appended with .old or .1, .2, ... )
first.
- bool read_only (Optional with default = false)
If true, the reader will open the file read-only and
disable locking.
- bool ReadUserLog::initialize(const char *filename, int max_rotation,
bool check_for_rotated, bool read_only)
Synopsis: Initialize to read a specific log file.
Returns: bool; true: success, false: failed
Method parameters:
- const char * filename
Path to the log file to read
- int max_rotation
Limits what previously rotated files will be considered by the number
given in the file name suffix.
A value of 0 disables looking for rotated files.
A value of 1 limits the rotated file to be that with the file name suffix
of .old.
As only event logs are rotated, this parameter is only useful for
event logs.
- bool check_for_rotated (Optional with default = false)
If true, try to open the rotated files
(with file names appended with .old or .1, .2, ... )
first.
- bool read_only (Optional with default = false)
If true, the reader will open the file read-only and
disable locking.
- bool ReadUserLog::initialize(const FileState &state, bool read_only)
Synopsis: Initialize to continue from a persisted reader state.
Returns: bool; true: success, false: failed
Method parameters:
- const FileState & state
Reference to the persisted state to restore from
- bool read_only (Optional with default = false)
If true, the reader will open the file read-only and
disable locking.
- bool ReadUserLog::initialize(const FileState &state, int max_rotation, bool read_only)
Synopsis: Initialize to continue from a persisted reader state and set the
rotation parameters.
Returns: bool; true: success, false: failed
Method parameters:
- const FileState & state
Reference to the persisted state to restore from
- int max_rotation
Limits what previously rotated files will be considered by the number
given in the file name suffix.
A value of 0 disables looking for rotated files.
A value of 1 limits the rotated file to be that with the file name suffix
of .old.
As only event logs are rotated, this parameter is only useful for
event logs.
- bool read_only (Optional with default = false)
If true, the reader will open the file read-only and
disable locking.
- ULogEventOutcome ReadUserLog::readEvent(ULogEvent *& event)
Synopsis: Read the next event from the log file.
Returns: ULogEventOutcome; Outcome of the log read attempt. ULogEventOutcome is an enumerated
type.
Method parameters:
- ULogEvent *& event
Reference to a pointer to a ULogEvent that is allocated by this call to
ReadUserLog::readEvent.
If no event is allocated, this pointer is
set to NULL. Otherwise, the application is responsible for deleting the event.
- bool ReadUserLog::synchronize(void)
Synopsis: Synchronize the log file if the last event read was an error. This
safeguard function should be called if there is an error reading an
event, but there are events after it in the file.
It skips over the
bad event, meaning that it reads up to and including the event separator,
so that the rest of the events can be read.
Returns: bool; true: success, false: failed
Method parameters: None
- ReadUserLog::FileStatus ReadUserLog::CheckFileStatus(void)
Synopsis: Check the status of the file, and whether it has grown, shrunk, etc.
Returns: ReadUserLog::FileStatus; the status of the log file, an
enumerated type.
Method parameters: None
- ReadUserLog::FileStatus ReadUserLog::CheckFileStatus(bool &is_empty)
Synopsis: Check the status of the file, and whether it has grown, shrunk, etc.
Returns: ReadUserLog::FileStatus; the status of the log file, an
enumerated type.
Method parameters:
- bool & is_empty
Set to true if the file is empty, false otherwise.
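The interplay of readEvent() and synchronize() described above can be sketched with a small, self-contained stand-in. This is only an illustration under stated assumptions: SketchReader, Outcome, and readAll are hypothetical names invented here, events are modeled as strings rather than ULogEvent pointers, and the outcome codes are not the real ULogEventOutcome enumerators.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Stand-in outcome codes for this sketch only; they are NOT the real
// ULogEventOutcome enumerators.
enum class Outcome { Ok, NoEvent, ReadError };

// Hypothetical in-memory reader that imitates the readEvent()/synchronize()
// contract described above. An empty string models a corrupt event.
class SketchReader {
public:
    explicit SketchReader(std::vector<std::string> records)
        : records_(std::move(records)), pos_(0) {}

    // Ok: 'event' holds the next event. NoEvent: end of log reached.
    // ReadError: the next record is corrupt and was not consumed.
    Outcome readEvent(std::string &event) {
        if (pos_ >= records_.size()) return Outcome::NoEvent;
        if (records_[pos_].empty()) return Outcome::ReadError;
        event = records_[pos_++];
        return Outcome::Ok;
    }

    // Skip the bad record so that the events after it can be read.
    bool synchronize() {
        if (pos_ >= records_.size()) return false;
        ++pos_;
        return true;
    }

private:
    std::vector<std::string> records_;
    std::size_t pos_;
};

// Collect every readable event, resynchronizing after each read error.
std::vector<std::string> readAll(SketchReader &reader) {
    std::vector<std::string> events;
    for (;;) {
        std::string event;
        Outcome outcome = reader.readEvent(event);
        if (outcome == Outcome::Ok) {
            events.push_back(event);
        } else if (outcome == Outcome::ReadError) {
            if (!reader.synchronize()) break;  // could not recover
        } else {
            break;  // NoEvent: nothing more to read for now
        }
    }
    return events;
}
```

With the real class, the same loop shape would call ReadUserLog::readEvent() and, after a read error, ReadUserLog::synchronize(); the application would also delete each event it receives.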
The ReadUserLog::FileState structure is used to save and
restore the state of a ReadUserLog object for persistence. The
application should always use InitFileState() to initialize this
structure.
All of these methods take a reference to a state buffer
as their only parameter.
All of these methods return true upon success.
To save the state, do something like this:
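ReadUserLog reader;
ReadUserLog::FileState statebuf;
bool status;
// Initialize the state buffer before its first use
status = ReadUserLog::InitFileState( statebuf );
// Refresh the buffer from the reader, then persist it
status = reader.GetFileState( statebuf );
write( fd, statebuf.buf, statebuf.size );
...
// The buffer is not updated automatically; refresh it before each write
status = reader.GetFileState( statebuf );
write( fd, statebuf.buf, statebuf.size );
...
// Free the memory held by the state buffer
status = ReadUserLog::UninitFileState( statebuf );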
To restore the state, do something like this:
ReadUserLog::FileState statebuf;
bool status;
status = ReadUserLog::InitFileState( statebuf );
// Load the previously persisted state
read( fd, statebuf.buf, statebuf.size );
ReadUserLog reader;
// Resume reading from the restored state
status = reader.initialize( statebuf );
status = ReadUserLog::UninitFileState( statebuf );
...
- static bool ReadUserLog::InitFileState(ReadUserLog::FileState &state)
Synopsis: Initialize a file state buffer
Returns: bool; true if successful, false otherwise
Method parameters:
- ReadUserLog::FileState & state
The file state buffer to initialize.
- static bool ReadUserLog::UninitFileState(ReadUserLog::FileState &state)
Synopsis: Clean up a file state buffer and free allocated memory
Returns: bool; true if successful, false otherwise
Method parameters:
- ReadUserLog::FileState & state
The file state buffer to un-initialize.
- bool ReadUserLog::GetFileState(ReadUserLog::FileState &state) const
Synopsis: Get the current state to persist it or save it off to disk
Returns: bool; true if successful, false otherwise
Method parameters:
- ReadUserLog::FileState & state
The file state buffer to read the state into.
- bool ReadUserLog::SetFileState(const ReadUserLog::FileState &state)
Synopsis: Use this method to set the current state, after restoring it.
NOTE: The state buffer is NOT automatically updated; a call
MUST be made to
the GetFileState() method each time before persisting the
buffer to disk or to any other storage.
Returns: bool; true if successful, false otherwise
Method parameters:
- const ReadUserLog::FileState & state
The file state buffer to restore from.
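The NOTE above, that a state buffer is a snapshot and is not refreshed automatically, can be illustrated with a minimal stand-in. Everything in this sketch is hypothetical (SketchState, SnapshotReader, and their methods are invented for illustration, and the only state modeled is a byte offset); it is not the layout or behavior of the real ReadUserLog::FileState.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for a state buffer; the real
// ReadUserLog::FileState layout is not modeled here.
struct SketchState {
    unsigned char bytes[sizeof(long)];
};

// Hypothetical reader whose only state is a byte offset.
class SnapshotReader {
public:
    SnapshotReader() : offset_(0) {}

    void advance(long n) { offset_ += n; }  // pretend to read n bytes

    // Copy the current offset into the caller's buffer (a snapshot).
    bool getState(SketchState &state) const {
        std::memcpy(state.bytes, &offset_, sizeof offset_);
        return true;
    }

    // Resume from a previously captured snapshot.
    bool setState(const SketchState &state) {
        std::memcpy(&offset_, state.bytes, sizeof offset_);
        return true;
    }

    long offset() const { return offset_; }

private:
    long offset_;
};
```

If the reader advances after getState() is called, the buffer still holds the older offset. A second reader restored from that buffer resumes at the stale position, which is why the buffer should be refreshed immediately before each save.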
If the application needs access to the data elements in a persistent
state, it should instantiate a ReadUserLogStateAccess object.
- Constructors / Destructors
- ReadUserLogStateAccess::ReadUserLogStateAccess(const ReadUserLog::FileState &state)
Synopsis: Constructor; initializes from a persisted state
Returns: None
Constructor parameters:
- const ReadUserLog::FileState & state
Reference to the persistent state data to initialize from.
- ReadUserLogStateAccess::~ReadUserLogStateAccess(void)
Synopsis: Destructor
Returns: None
Destructor parameters: None
- Accessor Methods
- bool ReadUserLogStateAccess::isInitialized(void) const
Synopsis: Checks if the buffer is initialized
Returns: bool; true if successfully initialized, false otherwise
Method parameters: None
- bool ReadUserLogStateAccess::isValid(void) const
Synopsis: Checks if the buffer is valid for use by
ReadUserLog::initialize()
Returns: bool; true if successful, false otherwise
Method parameters: None
- bool ReadUserLogStateAccess::getFileOffset(unsigned long &pos) const
Synopsis: Get position within individual file.
NOTE: Can return an error if the result is too large to be
stored in a long.
Returns: bool; true if successful, false otherwise
Method parameters:
- unsigned long & pos
Byte position within the current log file
- bool ReadUserLogStateAccess::getFileEventNum(unsigned long &num) const
Synopsis: Get event number in individual file.
NOTE: Can return an error if the result is too large to be
stored in a long.
Returns: bool; true if successful, false otherwise
Method parameters:
- unsigned long & num
Event number of the current event in the current log file
- bool ReadUserLogStateAccess::getLogPosition(unsigned long &pos) const
Synopsis: Position of the start of the current file in overall log.
NOTE: Can return an error if the result is too large
to be stored in a long.
Returns: bool; true if successful, false otherwise
Method parameters:
- unsigned long & pos
Byte offset of the start of the current file in the overall
logical log stream.
- bool ReadUserLogStateAccess::getEventNumber(unsigned long &num) const
Synopsis: Get the event number of the first event in the current file.
NOTE: Can return an error if the result is too large
to be stored in a long.
Returns: bool; true if successful, false otherwise
Method parameters:
- unsigned long & num
This is the absolute event number of the first event in the
current file in the overall logical log stream.
- bool ReadUserLogStateAccess::getUniqId(char *buf, int size) const
Synopsis: Get the unique ID of the associated state file.
Returns: bool; true if successful, false otherwise
Method parameters:
- char * buf
Buffer to fill with the unique ID of the current file.
- int size
Size in bytes of buf.
This is to prevent ReadUserLogStateAccess::getUniqId
from writing past the end of buf.
- bool ReadUserLogStateAccess::getSequenceNumber(int &seqno) const
Synopsis: Get the sequence number of the associated state file.
Returns: bool; true if successful, false otherwise
Method parameters:
- int & seqno
Sequence number of the current file
- Comparison Methods
- bool ReadUserLogStateAccess::getFileOffsetDiff(const ReadUserLogStateAccess &other, long &diff) const
Synopsis: Get the position difference of two states given by this
and other.
NOTE: Can return an error if the result is too large to be
stored in a long.
Returns: bool; true if successful, false otherwise
Method parameters:
- const ReadUserLogStateAccess & other
Reference to the state to compare to.
- long & diff
Difference in the positions
- bool ReadUserLogStateAccess::getFileEventNumDiff(const ReadUserLogStateAccess &other, long &diff) const
Synopsis: Get the difference between the event numbers within the
individual files of two states given by this and other.
NOTE: Can return an error if the result is too large to be
stored in a long.
Returns: bool; true if successful, false otherwise
Method parameters:
- const ReadUserLogStateAccess & other
Reference to the state to compare to.
- long & diff
Difference between the event number of the current event in
the current log file and that of other.
- bool ReadUserLogStateAccess::getLogPositionDiff(const ReadUserLogStateAccess &other, long &diff) const
Synopsis: Get the position difference of two states given by this
and other.
NOTE: Can return an error if the result is too large
to be stored in a long.
Returns: bool; true if successful, false otherwise
Method parameters:
- const ReadUserLogStateAccess & other
Reference to the state to compare to.
- long & diff
Difference between the byte offset of the start of the current
file in the overall logical log stream and that of other.
- bool ReadUserLogStateAccess::getEventNumberDiff(const ReadUserLogStateAccess &other, long &diff) const
Synopsis: Get the difference between the event number of the first event in
two state buffers (this - other).
NOTE: Can return an error if the result is too large
to be stored in a long.
Returns: bool; true if successful, false otherwise
Method parameters:
- const ReadUserLogStateAccess & other
Reference to the state to compare to.
- long & diff
Difference between the absolute event number of the first event in
the current file in the overall logical log stream and that of
other.
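The NOTE repeated on the comparison methods above, that a call can fail when the result is too large to be stored in a long, can be made concrete with a small sketch. The helper offsetDiff below is hypothetical and is not part of the Condor API; it assumes that positions are held as unsigned 64-bit values and that the difference is computed as this minus other.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Hypothetical helper: store (mine - other) in 'diff'.
// Returns false when the magnitude of the difference exceeds LONG_MAX,
// so the value LONG_MIN itself is conservatively rejected as well.
bool offsetDiff(std::uint64_t mine, std::uint64_t other, long &diff) {
    const std::uint64_t limit = static_cast<std::uint64_t>(LONG_MAX);
    if (mine >= other) {
        const std::uint64_t d = mine - other;
        if (d > limit) return false;
        diff = static_cast<long>(d);
    } else {
        const std::uint64_t d = other - mine;
        if (d > limit) return false;
        diff = -static_cast<long>(d);
    }
    return true;
}
```

A caller written in this style checks the boolean result before using diff, in the same way that the accessor and comparison methods above report success or failure through their return value.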
The ReadUserLog::FileState structure will likely be replaced with a new
C++ class, ReadUserLog::NewFileState or a similarly named class, that
will self-initialize.
Additionally, the functionality of ReadUserLogStateAccess will
be integrated into this class.
4.5.4 Chirp
4.5.5 The Command Line Interface
4.5.6 The Condor GAHP
4.5.7 The Condor Perl Module
The Condor Perl module facilitates automatic submitting and monitoring of
Condor jobs, along with automated administration of Condor.
The most common
use of this module is the monitoring of Condor jobs.
The Condor Perl module can be used as a meta scheduler for the submission
of Condor jobs.
The Condor Perl module provides several subroutines.
Some of the subroutines are used as callbacks;
an event triggers the execution of a specific subroutine.
Other subroutines denote actions to be taken by Perl.
Some of these subroutines take other subroutines as arguments.
- Submit(submit_description_file)
- This subroutine takes the action of submitting a job to Condor.
The argument is the name of a submit description file.
The condor_submit program should be in the
path of the user. If the user wishes to monitor the job with Condor,
a log file must be specified in the submit description file. The cluster
number of the submitted job is returned. For more information,
see the condor_submit man page.
- Vacate(machine)
- This subroutine takes the action of sending a
condor_vacate command to the machine specified as an argument.
The machine may be specified
either by host name, or by sinful string. For more information
see the condor_vacate man page.
- Reschedule(machine)
- This subroutine takes the action of sending a
condor_reschedule command to the machine specified as an argument.
The machine may be specified either
by host name, or by sinful string. For more information see
the condor_reschedule man page.
- Monitor(cluster)
- Takes the action of monitoring this cluster.
It returns when all jobs in cluster terminate.
- Wait()
- Takes the action of waiting until all monitor subroutines finish,
and then exits the Perl script.
- DebugOn()
- Takes the action of turning debug messages on.
This may be useful when attempting to debug the Perl script.
- DebugOff()
- Takes the action of turning debug messages off.
- RegisterEvicted(sub)
- Register a subroutine (called sub)
to be used as a callback when a job from
a specified cluster is evicted. The subroutine will be
called with two arguments: cluster and job. The cluster
and job are the cluster number and process number of the job that
was evicted.
- RegisterEvictedWithCheckpoint(sub)
- Same as RegisterEvicted except that the handler is called when the
evicted job was checkpointed.
- RegisterEvictedWithoutCheckpoint(sub)
- Same as RegisterEvicted except that the handler is called when the
evicted job was not checkpointed.
- RegisterExit(sub)
- Register a termination handler that is called when a job exits.
The termination handler will be called with two arguments: cluster and
job. The cluster and job are the cluster and process numbers of the
exiting job.
- RegisterExitSuccess(sub)
- Register a termination handler that is called when a job exits without
errors. The termination handler will be called with two arguments:
cluster and job. The cluster and job are the cluster and process
numbers of the exiting job.
- RegisterExitFailure(sub)
- Register a termination handler that is called when a job exits with
errors. The termination handler will be called with three arguments:
cluster, job, and retval. The cluster and job are the cluster
and process numbers of the exiting job, and retval is the exit
code of the job.
- RegisterExitAbnormal(sub)
- Register a termination handler that is called when a job exits
abnormally (segmentation fault, bus error, ...). The termination handler
will be called with four arguments: cluster, job, signal, and
core. The cluster and job are the cluster and process numbers of
the exiting job. The signal indicates the signal that the job
died with, and core indicates whether a core file was created and, if
so, the full path to the core file.
- RegisterAbort(sub)
- Register a handler that is called when a job is aborted by a user.
- RegisterJobErr(sub)
- Register a handler that is called when a job is not executable.
- RegisterExecute(sub)
- Register an execution handler that is called whenever a job starts
running on a given host. The handler is called with four arguments:
cluster, job, host, and sinful. Cluster and job are the cluster and
process numbers for the job, host is the Internet address of the
machine running the job, and sinful is the Internet address and
command port of the condor_starter supervising the job.
- RegisterSubmit(sub)
- Register a submit handler that is called whenever a job is submitted
with the given cluster. The handler is called with four arguments:
cluster, job, host, and sinful. Cluster and job are the cluster and
process numbers for the job, host is the Internet address of the
machine running the job, and sinful is the Internet address and
command port of the condor_schedd responsible for the job.
- TestSubmit(command_file)
- This subroutine submits a job to Condor for testing, and places
all variables from the command file into
the Perl hash %submit_info.
Does not reset the state of variables, so that testing preserves
callbacks.
- SubmitDagman(DAG_file, DAGMan_args)
- Takes the action of submitting a DAG using condor_dagman.
The first argument is the name of the DAG input file,
and the second argument is the command line arguments for
condor_dagman.
Information from the submit description file generated by
condor_dagman is placed into the Perl hash %submit_info
for access during callbacks.
- TestSubmitDagman(DAG_file, DAGMan_args)
- This subroutine submits a condor_dagman to Condor for testing,
and places information from the submit description file generated by
condor_dagman into the Perl hash %submit_info
for access during callbacks.
The first argument is the name of the DAG input file,
and the second argument is the command line arguments for
condor_dagman.
Does not reset the state of variables, so that testing preserves
callbacks.
- RegisterEvictedWithRequeue(sub)
- Register a subroutine (called sub)
to be used as a callback when a job from
a specified cluster is requeued. The subroutine will be
called with two arguments: cluster and job. The cluster
and job are the cluster number and process number of the job that
was requeued.
- RegisterShadow(sub)
- Register a subroutine (called sub)
to be used as a callback when a shadow exception occurs.
- RegisterHold(sub)
- Register a subroutine (called sub)
to be used as a callback when a job enters the hold state.
- RegisterRelease(sub)
- Register a subroutine (called sub)
to be used as a callback when a job is released.
- RegisterWantError(sub)
- Register a subroutine (called sub)
to be used as a callback when a system call invoked using runCommand
experiences an error.
- runCommand(string)
- string identifies a syscall that is invoked.
If the syscall exits abnormally or exits with an error, the callback
registered with RegisterWantError() is called, and
an error message is issued.
- RegisterTimed(sub, seconds)
- Register a subroutine (called sub)
to be called back after the delay given by seconds,
measured from the time of this registration. Only one callback may be registered,
as subsequent calls modify the timer only.
- RemoveTimed()
- Remove the single, timed callback registered with RegisterTimed().
4.5.7.2 Examples
The following is an example that uses the Condor Perl module.
The example uses the submit description file
mycmdfile.cmd to specify the submission of a job.
As the job is matched with a machine and begins to execute,
a callback subroutine (called execute)
sends a condor_vacate signal to the job,
and it increments a counter which keeps track of the
number of times this callback executes.
A second callback keeps a count of the number of times
that the job was evicted before the job completes.
After the job completes, the termination
callback (called normal) prints out a summary of what happened.
#!/usr/bin/perl
use Condor;

$CMD_FILE = 'mycmdfile.cmd';
$evicts = 0;
$vacates = 0;

# A subroutine that will be used as the normal termination callback
$normal = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};
    print "Job $cluster.$job exited normally without errors.\n";
    print "Job was vacated $vacates times and evicted $evicts times\n";
    exit(0);
};

# Callback used each time the job is evicted
$evicted = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};
    print "Job $cluster, $job was evicted.\n";
    $evicts++;
    &Condor::Reschedule();
};

# Callback used each time the job begins to execute
$execute = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};
    $host = $parameters{'host'};
    $sinful = $parameters{'sinful'};
    print "Job running on $sinful, vacating...\n";
    &Condor::Vacate($sinful);
    $vacates++;
};

$cluster = Condor::Submit($CMD_FILE);
# Stop if the submission failed
if (($cluster) == 0)
{
    printf("Could not open. Access Denied\n");
    exit(1);
}
&Condor::RegisterExitSuccess($normal);
&Condor::RegisterEvicted($evicted);
&Condor::RegisterExecute($execute);
&Condor::Monitor($cluster);
&Condor::Wait();
This example program will submit the command file 'mycmdfile.cmd' and attempt
to vacate any machine that the job runs on. The termination
handler then prints out a summary of what has happened.
A second example Perl script facilitates the meta-scheduling of
two Condor jobs.
It submits a second job if the first job successfully completes.
#!/s/std/bin/perl
# tell Perl where to find the Condor library
use lib '/unsup/condor/lib';
# tell Perl to use what it finds in the Condor library
use Condor;

$SUBMIT_FILE1 = 'Asubmit.cmd';
$SUBMIT_FILE2 = 'Bsubmit.cmd';

# Callback used when first job exits without errors.
$firstOK = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};
    $cluster = Condor::Submit($SUBMIT_FILE2);
    if (($cluster) == 0)
    {
        printf("Could not open $SUBMIT_FILE2.\n");
        exit(1);   # nothing to monitor if the submission failed
    }
    &Condor::RegisterExitSuccess($secondOK);
    &Condor::RegisterExitFailure($secondfails);
    &Condor::Monitor($cluster);
};

# Callback used when first job exits WITH an error.
$firstfails = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};
    print "The first job, $cluster.$job failed, exiting with an error. \n";
    exit(0);
};

# Callback used when second job exits without errors.
$secondOK = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};
    print "The second job, $cluster.$job successfully completed. \n";
    exit(0);
};

# Callback used when second job exits WITH an error.
$secondfails = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};
    print "The second job ($cluster.$job) failed. \n";
    exit(0);
};

$cluster = Condor::Submit($SUBMIT_FILE1);
if (($cluster) == 0)
{
    printf("Could not open $SUBMIT_FILE1. \n");
    exit(1);   # nothing to monitor if the submission failed
}
&Condor::RegisterExitSuccess($firstOK);
&Condor::RegisterExitFailure($firstfails);
&Condor::Monitor($cluster);
&Condor::Wait();
Some notes are in order about this example.
The same task could be accomplished using the Condor DAGMan
metascheduler.
The first job is the parent, and the second job is the child.
The input file to DAGMan is significantly simpler than this
Perl script.
A third example using the Condor Perl module
expands upon the second example.
Whereas the second example could have been more easily
implemented using DAGMan, this third example shows
the versatility of using Perl as a metascheduler.
In this example, the result generated from the successful completion of
the first job is used to decide which subsequent job should be
submitted.
This is a very simple example of a branch and bound technique,
to focus the search for a problem solution.
#!/s/std/bin/perl
# tell Perl where to find the Condor library
use lib '/unsup/condor/lib';
# tell Perl to use what it finds in the Condor library
use Condor;

$SUBMIT_FILE1 = 'Asubmit.cmd';
$SUBMIT_FILE2 = 'Bsubmit.cmd';
$SUBMIT_FILE3 = 'Csubmit.cmd';

# Callback used when first job exits without errors.
$firstOK = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};
    # open output file from first job, and read the result
    if ( -f "A.output" )
    {
        open(RESULTFILE, "A.output") or die "Could not open result file.";
        $result = <RESULTFILE>;
        close(RESULTFILE);
        # next job to submit is based on output from first job
        if ($result < 100)
        {
            $cluster = Condor::Submit($SUBMIT_FILE2);
            if (($cluster) == 0)
            {
                printf("Could not open $SUBMIT_FILE2.\n");
                exit(1);   # nothing to monitor if the submission failed
            }
            &Condor::RegisterExitSuccess($secondOK);
            &Condor::RegisterExitFailure($secondfails);
            &Condor::Monitor($cluster);
        }
        else
        {
            $cluster = Condor::Submit($SUBMIT_FILE3);
            if (($cluster) == 0)
            {
                printf("Could not open $SUBMIT_FILE3.\n");
                exit(1);   # nothing to monitor if the submission failed
            }
            &Condor::RegisterExitSuccess($thirdOK);
            &Condor::RegisterExitFailure($thirdfails);
            &Condor::Monitor($cluster);
        }
    }
    else
    {
        printf("Results file does not exist.\n");
    }
};

# Callback used when first job exits WITH an error.
$firstfails = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};
    print "The first job, $cluster.$job failed, exiting with an error. \n";
    exit(0);
};

# Callback used when second job exits without errors.
$secondOK = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};
    print "The second job, $cluster.$job successfully completed. \n";
    exit(0);
};

# Callback used when third job exits without errors.
$thirdOK = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};
    print "The third job, $cluster.$job successfully completed. \n";
    exit(0);
};

# Callback used when second job exits WITH an error.
$secondfails = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};
    print "The second job ($cluster.$job) failed. \n";
    exit(0);
};

# Callback used when third job exits WITH an error.
$thirdfails = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};
    print "The third job ($cluster.$job) failed. \n";
    exit(0);
};

$cluster = Condor::Submit($SUBMIT_FILE1);
if (($cluster) == 0)
{
    printf("Could not open $SUBMIT_FILE1. \n");
    exit(1);   # nothing to monitor if the submission failed
}
&Condor::RegisterExitSuccess($firstOK);
&Condor::RegisterExitFailure($firstfails);
&Condor::Monitor($cluster);
&Condor::Wait();
htcondor-admin@cs.wisc.edu