In general, this release on NT works the same as the release of Condor for Unix.
However, the following items are not supported in this version:
universe = vanilla
Except for those items listed above, almost everything works the same way in Condor NT as it does in the Unix release. This release is based on the Condor Version 6.2.2 source tree, and thus the feature set is the same as Condor Version 6.2.2 for Unix. For instance, all of the following work in Condor NT:
Condor remote system calls and the ability to access network shares are not yet supported on NT; both will be in the near future. For now, Condor NT users must use the Condor File Transfer mechanism.
When Condor finds a machine willing to execute a job, it will create a temporary subdirectory for the job on the execute machine. The Condor File Transfer mechanism will then send (by TCP) the job executable(s) and input files from the submitting machine to the temporary subdirectory on the execute machine. After the input files have been transferred, the execute machine starts running the job with the temporary directory as the job's current working directory. When the job completes or is kicked off, Condor File Transfer automatically sends back to the submit machine any output files created by the job. After the files have been successfully sent back, the temporary working directory on the execute machine is removed.
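The sequence described above can be mimicked locally with a short Python sketch. This is not Condor's implementation: the function name run_with_file_transfer is hypothetical, files are copied on the local disk rather than sent by TCP, and "created by the job" is approximated by comparing modification times before and after the run.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_file_transfer(submit_dir, input_files, command):
    """Stage inputs into a scratch directory, run the job there, and
    copy back any file the job created or modified.

    Hypothetical helper for illustration only; not part of Condor.
    Returns the sorted list of file names copied back to submit_dir.
    """
    submit_dir = Path(submit_dir)
    # Temporary subdirectory that plays the role of the job's working directory.
    scratch = Path(tempfile.mkdtemp(prefix="dir_"))
    try:
        # "Transfer" the input files (a local copy stands in for the TCP transfer).
        for name in input_files:
            shutil.copy2(submit_dir / name, scratch / name)
        # Remember modification times so changed files can be detected later.
        before = {p.name: p.stat().st_mtime_ns for p in scratch.iterdir()}

        # Run the job with the scratch directory as its current working directory.
        subprocess.run(command, cwd=scratch, check=True)

        # Send back every file that is new or whose modification time changed.
        returned = []
        for path in sorted(scratch.iterdir()):
            if path.is_file() and before.get(path.name) != path.stat().st_mtime_ns:
                shutil.copy2(path, submit_dir / path.name)
                returned.append(path.name)
        return returned
    finally:
        # The temporary working directory is removed once the transfer is done.
        shutil.rmtree(scratch, ignore_errors=True)
```

Calling run_with_file_transfer(submit_dir, ["in.txt"], [sys.executable, "-c", "..."]) with a command that writes out.txt would leave out.txt in submit_dir and return ["out.txt"], while the unmodified in.txt is not copied back.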
Condor's File Transfer mechanism has several features to ensure data integrity in a non-dedicated environment. For instance, transfers of multiple files are performed atomically.
The following new commands are available for use in the submit description file:
It is highly recommended that the Requirements expression in the submit description file include a requirement for the size of the Disk attribute. Doing so ensures that Condor picks a machine with enough local disk space for the job. Here is a sample submit description file:
# Condor submit file for program "foo.exe".
#
# foo reads from files "my-input-data" and "my-other-input-data".
# foo then writes out results into several files.
# The total disk space foo uses for all input and output files
# is never more than 10 megabytes.
#
executable = foo.exe
universe = vanilla
# Now set Requirements saying that the machine which runs our job
# must have more than 10megs of free disk space. Note that "Disk"
# is expressed in kilobytes; 10 Mbytes is 10000 kbytes.
requirements = Disk > 10000
#
queue
If the requirements do not specify the necessary amount of local disk space, condor_submit appends Disk >= DiskUsage to the job's Requirements expression. The DiskUsage attribute represents the maximum amount of total disk space required by the job, in kilobytes. While the job runs, Condor automatically updates the DiskUsage attribute approximately every 20 minutes with the amount of space the job is using on the execute machine.
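The defaulting rule can be illustrated with a small Python sketch. It is a simplification, not the logic condor_submit actually uses: the function name is hypothetical, the check is a plain text search rather than a ClassAd parse, and treating attribute names as case-insensitive is an assumption of this sketch.

```python
import re


def ensure_disk_requirement(requirements):
    """Return the requirements expression, adding a Disk clause if it is missing.

    Hypothetical helper; a textual approximation of the behavior described
    in the text, not condor_submit's implementation.
    """
    # \bDisk\b matches the Disk attribute but not DiskUsage.
    if re.search(r"\bDisk\b", requirements, flags=re.IGNORECASE):
        return requirements  # the user already constrained Disk
    if not requirements.strip():
        return "(Disk >= DiskUsage)"
    return "(" + requirements + ") && (Disk >= DiskUsage)"
```

With this sketch, "Disk > 10000" is returned unchanged, while "Memory > 64" becomes "(Memory > 64) && (Disk >= DiskUsage)".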
Itemized below are some current limitations of the File Transfer mechanism. We anticipate improvement in upcoming releases.
This section provides some details on how Condor NT starts and stops jobs. This discussion is geared for the Condor administrator or advanced user who is already familiar with the material in the Administrators' Manual and wishes to know detailed information on what Condor NT does when starting and stopping jobs.
When Condor NT is about to start a job, the condor_startd on the execute machine spawns a condor_starter process. The condor_starter then creates:
Next, the condor_starter (called the starter) contacts the condor_shadow (called the shadow) process, which is running on the submitting machine, and pulls over the job's executable and input files. These files are placed into the temporary working directory for the job. After all files have been received, the starter spawns the user's executable as user "condor-run-dir_XXX" with its current working directory set to the temporary working directory (that is, $(EXECUTE)/dir_XXX).
While the job is running, the starter closely monitors the CPU usage and image size of all processes started by the job. Every 20 minutes the starter sends this information, along with the total size of all files contained in the job's temporary working directory, to the shadow. The shadow then inserts this information into the job's ClassAd so that policy and scheduling expressions can make use of this dynamic information.
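As a rough illustration of the disk portion of this periodic report, the following Python sketch totals the size of every file under a working directory and expresses it in kilobytes. The function and constant names are hypothetical, and rounding up with 1024 bytes per kilobyte is an assumption; the text does not say how the starter rounds.

```python
import os

# Reporting interval mentioned in the text: every 20 minutes.
UPDATE_INTERVAL_SECONDS = 20 * 60


def directory_usage_kb(directory):
    """Total size of all files under `directory`, in kilobytes (rounded up).

    Hypothetical helper for illustration; assumes 1 kilobyte = 1024 bytes.
    """
    total_bytes = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
    # Round up so a partially filled kilobyte still counts.
    return (total_bytes + 1023) // 1024
```

A monitoring loop would call directory_usage_kb on the job's temporary working directory once per UPDATE_INTERVAL_SECONDS and forward the result.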
If the job exits of its own accord (that is, the job completes), the starter first terminates any processes started by the job which could still be around if the job did not clean up after itself. The starter examines the job's temporary working directory for any files which have been created or modified and sends these files back to the shadow running on the submit machine. The shadow places these files into the initialdir specified in the submit description file; if no initialdir was specified, the files go into the directory where the user invoked condor_submit. Once all the output files are safely transferred back, the job is removed from the queue. If, however, the condor_startd forcibly kills the job before all output files could be transferred, the job is not removed from the queue but instead switches back to the Idle state.
If the condor_startd decides to vacate a job prematurely, the starter sends a WM_CLOSE message to the job. If the job spawned multiple child processes, the WM_CLOSE message is only sent to the parent process (that is, the one started by the starter). The WM_CLOSE message is the preferred way to terminate a process on Windows NT, since this method allows the job to clean up and free any resources it may have allocated. When the job exits, the starter cleans up any processes left behind. At this point, if transfer_files is set to ONEXIT (the default) in the job's submit description file, the job switches states, from Running to Idle, and no files are transferred back. If transfer_files is set to ALWAYS, then any files in the job's temporary working directory which were created or modified are first sent back to the submitting machine. This time, however, the shadow places these so-called intermediate files into a subdirectory created in the $(SPOOL) directory on the submitting machine ($(SPOOL) is specified in Condor's configuration file). The job is then switched back to the Idle state until Condor finds a different machine on which to run it. When the job is started again, Condor places into the job's temporary working directory the executable and input files as before, plus any files stored in the submit machine's $(SPOOL) directory for that job.
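The two ways a job can leave a machine, and the effect of transfer_files on each, can be summarized as a small decision function. This Python sketch only restates the behavior described in the text; the function name, the event labels "exit" and "vacate", and the returned dictionary are hypothetical and do not correspond to any Condor interface.

```python
def job_departure_outcome(event, transfer_files="ONEXIT"):
    """Summarize where changed files go and what happens to the job.

    event: "exit" (job completed on its own) or "vacate" (job was vacated).
    transfer_files: "ONEXIT" (the default) or "ALWAYS".
    Hypothetical helper for illustration only.
    """
    mode = transfer_files.upper()
    if mode not in ("ONEXIT", "ALWAYS"):
        raise ValueError("unsupported transfer_files value: " + transfer_files)

    if event == "exit":
        # Completed job: output goes to initialdir and the job leaves the queue.
        return {"files_go_to": "initialdir", "job": "removed from queue"}
    if event == "vacate":
        # Vacated job: it returns to Idle; files are saved only with ALWAYS.
        destination = "$(SPOOL) subdirectory" if mode == "ALWAYS" else None
        return {"files_go_to": destination, "job": "Idle"}
    raise ValueError("unknown event: " + event)
```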
NOTE: A Windows console process can intercept a WM_CLOSE message via the Win32 SetConsoleCtrlHandler() function if it needs to do special cleanup work at vacate time; a WM_CLOSE message generates a CTRL_CLOSE_EVENT. See SetConsoleCtrlHandler() in the Win32 documentation for more info.
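For a job written in Python, a handler of the kind this note describes might be registered through ctypes, as in the sketch below. The function install_close_handler and its cleanup argument are hypothetical names; the sketch assumes a console process, does nothing on non-Windows platforms, and has not been verified against Condor NT itself.

```python
import sys

CTRL_CLOSE_EVENT = 2  # control-type value defined by the Win32 headers


def install_close_handler(cleanup):
    """Register `cleanup` to run when the console receives CTRL_CLOSE_EVENT.

    Returns the ctypes callback object (keep a reference to it so it is
    not garbage collected), or None when not running on Windows.
    Hypothetical helper for illustration only.
    """
    if sys.platform != "win32":
        return None

    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    # HandlerRoutine has the signature: BOOL WINAPI HandlerRoutine(DWORD).
    handler_type = ctypes.WINFUNCTYPE(wintypes.BOOL, wintypes.DWORD)

    def _handler(ctrl_type):
        if ctrl_type == CTRL_CLOSE_EVENT:
            cleanup()      # do the special cleanup work at vacate time
            return True    # report the event as handled
        return False       # let other handlers process other events

    callback = handler_type(_handler)
    if not kernel32.SetConsoleCtrlHandler(callback, True):
        raise ctypes.WinError(ctypes.get_last_error())
    return callback
```

The cleanup function should be short, since the job is expected to exit promptly after the close event is delivered.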
NOTE: The default handler in Windows NT for a WM_CLOSE message is for the process to exit. Of course, the job could be coded to ignore it and not exit, but eventually the condor_startd will get impatient and hard-kill the job (if that is the policy desired by the administrator).
Finally, after the job has left and any files have been transferred back, the starter deletes the temporary working directory, the temporary account, the WindowStation and the Desktop before exiting itself. If the starter should terminate abnormally, the condor_startd attempts the cleanup. If for some reason the condor_startd should disappear as well (for example, if the entire machine was power-cycled hard), the condor_startd will clean up when Condor is restarted.
On the execute machine, the user job is run using the access token of an
account dynamically created by Condor which has bare-bones access rights and
privileges. For instance, if your machines are configured so that only
Administrators have write access to C:\WINNT, then certainly no
Condor job run on that machine would be able to write anything there. The
only files the job should be able to access on the execute machine are files
accessible by group Everyone and files in the job's temporary working
directory.
On the submit machine, Condor permits the File Transfer mechanism to only
read files which the submitting user has access to read, and only write
files to which the submitting user has access to write. For example, say
only Administrators can write to C:\WINNT on the submit machine,
and a user gives the following to condor_submit:
executable = mytrojan.exe
initialdir = c:\winnt
output = explorer.exe
queue
Unless that user is in group Administrators, Condor will not permit
explorer.exe to be overwritten.
If for some reason the submitting user's account disappears between the time condor_submit was run and when the job runs, Condor is not able to check whether the now-defunct submitting user has read/write access to a given file. In this case, Condor will ensure that group "Everyone" has read or write access to any file the job subsequently tries to read or write. This is in consideration for some network setups, where the user account only exists for as long as the user is logged in.
Condor also provides protection to the job queue. It would be bad if the integrity of the job queue were compromised, because a malicious user could remove other users' jobs or even change what executable a user's job will run. To guard against this, in Condor's default configuration all connections to the condor_schedd (the process which manages the job queue on a given machine) are authenticated using Windows NT's SSPI security layer. The user is then authenticated using the same challenge-response protocol that NT uses to authenticate users to Windows NT file servers. Once authenticated, the only users allowed to edit a job entry in the queue are:
To protect the actual job queue files themselves, the Condor NT installation program will automatically set permissions on the entire Condor release directory so that only Administrators have write access.
Finally, Condor NT has all the IP/Host-based security mechanisms present
in the full-blown version of Condor. See section 3.8
for complete information
on how to allow/deny access to Condor based upon machine hostname or
IP address.
Unix machines and Windows NT machines running Condor can happily co-exist in the same Condor pool without any problems. For now, the only restriction is that jobs submitted on Windows NT must run on Windows NT, and jobs submitted on Unix must run on Unix. You will get this behavior by default, since condor_submit will automatically set a Requirements expression in the job ClassAd stating that the execute machine must have the same architecture and operating system as the submit machine.
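To make the default concrete, the Python sketch below builds a clause of the kind described, using the machine ClassAd attributes Arch and OpSys. The function name is hypothetical, the exact text condor_submit generates may differ, and the values "INTEL" and "WINNT40" in the usage note are example values assumed for illustration.

```python
def same_platform_requirement(arch, opsys, user_requirements=""):
    """Combine a user's requirements with a same-platform clause.

    Hypothetical helper for illustration; not condor_submit's implementation.
    """
    clause = '(Arch == "%s") && (OpSys == "%s")' % (arch, opsys)
    if not user_requirements.strip():
        return clause
    return "(" + user_requirements + ") && " + clause
```

For example, same_platform_requirement("INTEL", "WINNT40", "Disk > 10000") yields (Disk > 10000) && (Arch == "INTEL") && (OpSys == "WINNT40").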
There is absolutely no need to run more than one Condor central manager, even if you have both Unix and NT machines. The Condor central manager itself can run on either Unix or NT; there is no advantage to choosing one over the other. Here at the University of Wisconsin-Madison, for instance, we have hundreds of Unix (Solaris, Linux, Irix, etc.) and Windows NT machines in our Computer Science Department Condor pool. Our central manager is running on Windows NT. All is happy.