Windows is a strategic platform for Condor, and therefore we have been working toward a complete port to Windows. Our goal is to make Condor every bit as capable on Windows as it is on Unix - or even more capable.
Porting Condor from Unix to Windows is a formidable task, because many components of Condor must interact closely with the underlying operating system. Provided is a clipped version of Condor for Windows. A clipped version is one in which there is no checkpointing and there are no remote system calls.
This section contains additional information specific to running Condor on Windows. In order to effectively use Condor, first read the overview chapter (section 1.1) and the user's manual (section 2.1). If administrating or customizing the policy and set up of Condor, also read the administrator's manual chapter (section 3.1). After reading these chapters, review the information in this chapter for important information and differences when using and administrating Condor on Windows. For information on installing Condor for Windows, see section 3.2.5.
In general, this release for Windows works the same as the release of Condor for Unix. However, the following items are not supported in this version:
Except for those items listed above, most everything works the same way in Condor as it does in the Unix release. This release is based on the Condor Version 7.6.10 source tree, and thus the feature set is the same as Condor Version 7.6.10 for Unix. For instance, all of the following work in Condor:
In order for Condor to operate properly, it must at times be able to act on behalf of users who submit jobs. This is required on submit machines, so that Condor can access a job's input files, create and access the job's output files, and write to the job's log file from within the appropriate security context. On Unix systems, arbitrarily changing what user Condor performs its actions as is easily done when Condor is started with root privileges. On Windows, however, performing an action as a particular user or on behalf of a particular user requires knowledge of that user's password, even when running at the maximum privilege level. Condor provides secure password storage through the use of the condor_store_cred tool. Passwords managed by Condor are encrypted and stored in a secure location within the Windows registry. When Condor needs to perform an action as or on behalf of a particular user, it uses the securely stored password to do so. This implies that a password is stored for every user that will submit jobs from the Windows submit machine.
A further feature permits Condor to execute the job itself under the security context of its submitting user, specifying the run_as_owner command in the job's submit description file. With this feature, it is necessary to configure and run a centralized condor_credd daemon to manage the secure password storage. This makes each user's password available, via an encrypted connection to the condor_credd, to any execute machine that may need it.
By default, the secure password store for a submit machine when no condor_credd is running is managed by the condor_schedd. This approach works in environments where the user's password is only needed on the submit machine.
By default, Condor executes jobs on Windows using dedicated run accounts that have minimal access rights and privileges, and which are recreated for each new job. As an alternative, Condor can be configured to allow users to run jobs using their Windows login accounts. This may be useful if jobs need access to files on a network share, or to other resources that are not available to the low-privilege run account.
This feature requires use of a condor_credd daemon for secure password storage and retrieval. With the condor_credd daemon running, the user's password must be stored, using the condor_store_cred tool. Then, a user that wants a job to run using their own account places into the job's submit description file
run_as_owner = True
It is first necessary to select the single machine on which to run the condor_credd. Often, the machine acting as the pool's central manager is a good choice. An important restriction, however, is that the condor_credd host must be a machine running Windows.
All configuration settings necessary to enable the condor_credd are
contained in the example file etc\condor_config.local.credd
from the Condor distribution. Copy these settings into a local
configuration file for the machine that will run the condor_credd.
Run condor_restart for these new settings to take effect, then
verify (via Task Manager) that a condor_credd process is running.
A second set of configuration variables specify security for the
communication among Condor daemons.
These variables must be set for all machines in the pool.
The following example settings are in the comments contained in the
etc\condor_config.local.credd
example file.
These sample settings rely on the PASSWORD method for
authentication among daemons,
including communication with the condor_credd daemon.
The LOCAL_CREDD variable must be customized to point to the machine
hosting the condor_credd and the ALLOW_CONFIG variable
will be customized, if needed, to refer to an administrative account
that exists on all Condor nodes.
CREDD_HOST = credd.cs.wisc.edu CREDD_CACHE_LOCALLY = True STARTER_ALLOW_RUNAS_OWNER = True ALLOW_CONFIG = Administrator@* SEC_CLIENT_AUTHENTICATION_METHODS = NTSSPI, PASSWORD SEC_CONFIG_NEGOTIATION = REQUIRED SEC_CONFIG_AUTHENTICATION = REQUIRED SEC_CONFIG_ENCRYPTION = REQUIRED SEC_CONFIG_INTEGRITY = REQUIRED
In order for PASSWORD authenticated communication to work, a pool password must be chosen and distributed. The chosen pool password must be stored identically for each machine. The pool password first should be stored on the condor_credd host, then on the other machines in the pool.
To store the pool password on a Windows machine, run
condor_store_cred add -cwhen logged in with the administrative account on that machine, and enter the password when prompted. If the administrative account is shared across all machines, that is if it is a domain account or has the same password on all machines, logging in separately to each machine in the pool can be avoided. Instead, the pool password can be securely pushed out for each Windows machine using a command of the form
condor_store_cred add -c -n exec01.cs.wisc.edu
Once the pool password is distributed, but before submitting jobs, all machines must reevaluate their configuration, so execute
condor_reconfig -allfrom the central manager. This will cause each execute machine to test its ability to authenticate with the condor_credd. To see whether this test worked for each machine in the pool, run the command
condor_status -f "%s\t" Name -f "%s\n" ifThenElse(isUndefined(LocalCredd),\"UNDEF\",LocalCredd)Any rows in the output with the UNDEF string indicate machines where secure communication is not working properly. Verify that the pool password is stored correctly on these machines.
This may be useful if the job requires direct access to the user's registry entries. It also may be useful when the job requires an application, and the application requires registry access. This feature is always enabled on the condor_startd, but it is limited to the dedicated run account. For security reasons, the profiles are removed after the job has completed and exited. This ensures that malicious jobs cannot discover what any previous job has done, nor sabotage the registry for future jobs. It also ensures the next job has a fresh registry hive.
A user that then wants a job to run with a profile uses the load_profile command in the job's submit description file:
load_profile = True
This feature is currently not compatible with run_as_owner, and will be ignored if both are specified.
Condor has added support for scripting jobs on Windows. Previously, Condor jobs on Windows were limited to executables or batch files. With this new support, Condor determines how to interpret the script using the file name's extension. Without a file name extension, the file will be treated as it has been in the past: as a Windows executable.
This feature may not require any modifications to Condor's configuration. An example that does not require administrative intervention are Perl scripts using ActivePerl.
Windows Scripting Host scripts do require configuration to work correctly. The configuration variables set values to be used in registry look up, which results in a command that invokes the correct interpreter, with the correct command line arguments for the specific scripting language. In Microsoft nomenclature, verbs are actions that can be taken upon a given a file. The familiar examples of Open, Print, and Edit, can be found on the context menu when a user right clicks on a file. The command lines to be used for each of these verbs are stored in the registry under the HKEY_CLASSES_ROOT hive. In general, a registry look up uses the form:
HKEY_CLASSES_ROOT\<FileType>\Shell\<OpenVerb>\Command
Within this specification,
<FileType>
is the name of a file type
(and therefore a scripting language),
and is obtained from the file name extension.
<OpenVerb>
identifies the verb,
and is obtained from the Condor configuration.
The Condor configuration sets the selection of a verb, to aid in the registry look up. The file name extension sets the name of the Condor configuration variable. This variable name is of the form:
OPEN_VERB_FOR_<EXT>_FILES
<EXT>
represents the file name extension.
The following configuration example uses the Open2
verb for
a Windows Scripting Host registry look up for several scripting
languages:
OPEN_VERB_FOR_JS_FILES = Open2 OPEN_VERB_FOR_VBS_FILES = Open2 OPEN_VERB_FOR_VBE_FILES = Open2 OPEN_VERB_FOR_JSE_FILES = Open2 OPEN_VERB_FOR_WSF_FILES = Open2 OPEN_VERB_FOR_WSH_FILES = Open2
In this example, Condor specifies the Open2
verb,
instead of the default Open
verb,
for a script with the file name extension of wsh
.
The Windows Scripting Host's Open2
verb allows standard input,
standard output, and standard error to be redirected
as needed for Condor jobs.
A common difficulty is encountered when a script interpreter requires access to the user's registry. Note that the user's registry is different than the root registry. If not given access to the user's registry, some scripts, such as Windows Scripting Host scripts, will fail. The failure error message appears as:
CScript Error: Loading your settings failed. (Access is denied.)
The fix for this error is to give explicit access to the submitting user's registry hive. This can be accomplished with the addition of the load_profile command in the job's submit description file:
load_profile = True
With this command, there should be no registry access errors. This command should also work for other interpreters. Note that not all interpreters will require access. For example, ActivePerl does not by default require access to the user's registry hive.
This section provides some details on how Condor starts and stops jobs. This discussion is geared for the Condor administrator or advanced user who is already familiar with the material in the Administrator's Manual and wishes to know detailed information on what Condor does when starting and stopping jobs.
When Condor is about to start a job, the condor_startd on the execute machine spawns a condor_starter process. The condor_starter then creates:
Next, the condor_starter daemon contacts the condor_shadow daemon, which is running on the submitting machine, and the condor_starter pulls over the job's executable and input files. These files are placed into the temporary working directory for the job. After all files have been received, the condor_starter spawns the user's executable. Its current working directory set to the temporary working directory.
While the job is running, the condor_starter closely monitors the CPU usage and image size of all processes started by the job. Every 20 minutes the condor_starter sends this information, along with the total size of all files contained in the job's temporary working directory, to the condor_shadow. The condor_shadow then inserts this information into the job's ClassAd so that policy and scheduling expressions can make use of this dynamic information.
If the job exits of its own accord (that is, the job completes), the condor_starter first terminates any processes started by the job which could still be around if the job did not clean up after itself. The condor_starter examines the job's temporary working directory for any files which have been created or modified and sends these files back to the condor_shadow running on the submit machine. The condor_shadow places these files into the initialdir specified in the submit description file; if no initialdir was specified, the files go into the directory where the user invoked condor_submit. Once all the output files are safely transferred back, the job is removed from the queue. If, however, the condor_startd forcibly kills the job before all output files could be transferred, the job is not removed from the queue but instead switches back to the Idle state.
If the condor_startd decides to vacate a job prematurely, the condor_starter sends a WM_CLOSE message to the job. If the job spawned multiple child processes, the WM_CLOSE message is only sent to the parent process. This is the one started by the condor_starter. The WM_CLOSE message is the preferred way to terminate a process on Windows, since this method allows the job to clean up and free any resources it may have allocated. When the job exits, the condor_starter cleans up any processes left behind. At this point, if when_to_transfer_output is set to ON_EXIT (the default) in the job's submit description file, the job switches states, from Running to Idle, and no files are transferred back. If when_to_transfer_output is set to ON_EXIT_OR_EVICT, then any files in the job's temporary working directory which were changed or modified are first sent back to the submitting machine. But this time, the condor_shadow places these intermediate files into a subdirectory created in the $(SPOOL) directory on the submitting machine. The job is then switched back to the Idle state until Condor finds a different machine on which to run. When the job is started again, Condor places into the job's temporary working directory the executable and input files as before, plus any files stored in the submit machine's $(SPOOL) directory for that job.
NOTE: A Windows console process can intercept a WM_CLOSE message via the Win32 SetConsoleCtrlHandler() function, if it needs to do special cleanup work at vacate time; a WM_CLOSE message generates a CTRL_CLOSE_EVENT. See SetConsoleCtrlHandler() in the Win32 documentation for more info.
NOTE: The default handler in Windows for a WM_CLOSE message is for the process to exit. Of course, the job could be coded to ignore it and not exit, but eventually the condor_startd will become impatient and hard-kill the job, if that is the policy desired by the administrator.
Finally, after the job has left and any files transferred back, the condor_starter deletes the temporary working directory, the temporary account if one was created, the Window Station and the Desktop before exiting. If the condor_starter should terminate abnormally, the condor_startd attempts the clean up. If for some reason the condor_startd should disappear as well (that is, if the entire machine was power-cycled hard), the condor_startd will clean up when Condor is restarted.
On the execute machine (by default), the user job is run using the
access token of an account dynamically created by Condor which has
bare-bones access rights and privileges. For instance, if your
machines are configured so that only Administrators have write access
to
C:\WINNT
, then certainly no Condor job run on that machine
would be able to write anything there. The only files the job should
be able to access on the execute machine are files accessible by the
Users and Everyone groups, and files in the job's temporary working
directory. Of course, if the job is configured to run using the
account of the submitting user (as described in section
6.2.4), it will be able to do anything that
the user is able to do on the execute machine it runs on.
On the submit machine, Condor impersonates the submitting user, therefore
the File Transfer mechanism has the same access rights as the submitting
user. For example, say only Administrators can write to
C:\WINNT
on the submit machine,
and a user gives the following to condor_submit :
executable = mytrojan.exe initialdir = c:\winnt output = explorer.exe queueUnless that user is in group Administrators, Condor will not permit explorer.exe to be overwritten.
If for some reason the submitting user's account disappears between the time condor_submit was run and when the job runs, Condor is not able to check and see if the now-defunct submitting user has read/write access to a given file. In this case, Condor will ensure that group ``Everyone'' has read or write access to any file the job subsequently tries to read or write. This is in consideration for some network setups, where the user account only exists for as long as the user is logged in.
Condor also provides protection to the job queue. It would be bad if the integrity of the job queue is compromised, because a malicious user could remove other user's jobs or even change what executable a user's job will run. To guard against this, in Condor's default configuration all connections to the condor_schedd (the process which manages the job queue on a given machine) are authenticated using Windows' eSSPI security layer. The user is then authenticated using the same challenge-response protocol that Windows uses to authenticate users to Windows file servers. Once authenticated, the only users allowed to edit job entry in the queue are:
To protect the actual job queue files themselves, the Condor installation program will automatically set permissions on the entire Condor release directory so that only Administrators have write access.
Finally, Condor has all the IP/Host-based security mechanisms present in the full-blown version of Condor. See section 3.6.9 starting on page for complete information on how to allow/deny access to Condor based upon machine host name or IP address.
Condor can work well with a network file server. The recommended approach to having jobs access files on network shares is to configure jobs to run using the security context of the submitting user (see section 6.2.4). If this is done, the job will be able to access resources on the network in the same way as the user can when logged in interactively.
In some environments, running jobs as their submitting users is not a feasible option. This section outlines some possible alternatives. The heart of the difficulty in this case is that on the execute machine, Condor creates a temporary user that will run the job. The file server has never heard of this user before.
Choose one of these methods to make it work:
All of these methods have advantages and disadvantages.
Here are the methods in more detail:
METHOD A - access the file server as a different user via a net use command with a login and password
Example: you want to copy a file off of a server before running it....
@echo off net use \\myserver\someshare MYPASSWORD /USER:MYLOGIN copy \\myserver\someshare\my-program.exe my-program.exe
The idea here is to simply authenticate to the file server with a different login than the temporary Condor login. This is easy with the "net use" command as shown above. Of course, the obvious disadvantage is this user's password is stored and transferred as clear text.
METHOD B - access the file server as guest
Example: you want to copy a file off of a server before running it as GUEST
@echo off net use \\myserver\someshare copy \\myserver\someshare\my-program.exe my-program.exe
In this example, you'd contact the server MYSERVER as the Condor temporary user. However, if you have the GUEST account enabled on MYSERVER, you will be authenticated to the server as user "GUEST". If your file permissions (ACLs) are setup so that either user GUEST (or group EVERYONE) has access the share "someshare" and the directories/files that live there, you can use this method. The downside of this method is you need to enable the GUEST account on your file server. WARNING: This should be done *with extreme caution* and only if your file server is well protected behind a firewall that blocks SMB traffic.
METHOD C - access the file server with a "NULL" descriptor
One more option is to use NULL Security Descriptors. In this way, you can specify which shares are accessible by NULL Descriptor by adding them to your registry. You can then use the batch file wrapper like:
net use z: \\myserver\someshare /USER:"" z:\my-program.exe
so long as 'someshare' is in the list of allowed NULL session shares. To edit this list, run regedit.exe and navigate to the key:
HKEY_LOCAL_MACHINE\ SYSTEM\ CurrentControlSet\ Services\ LanmanServer\ Parameters\ NullSessionShares
and edit it. unfortunately it is a binary value, so you'll then need to type in the hex ASCII codes to spell out your share. each share is separated by a null (0x00) and the last in the list is terminated with two nulls.
although a little more difficult to set up, this method of sharing is a relatively safe way to have one quasi-public share without opening the whole guest account. you can control specifically which shares can be accessed or not via the registry value mentioned above.
METHOD D - create and have Condor use a special account
Create a permanent account (called condor-guest in this description) under which Condor will run jobs. On all Windows machines, and on the file server, create the condor-guest account.
On the network file server, give the condor-guest user permissions to access files needed to run Condor jobs.
Securely store the password of the condor-guest user in the Windows registry using condor_store_cred on all Windows machines.
Tell Condor to use the condor-guest user as the owner of jobs, when required. Details for this are in section 3.6.13.
METHOD E - access with the contrib module from Bristol
Another option: some hardcore Condor users at Bristol University developed their own module for starting jobs under Condor NT to access file servers. It involves storing submitting user's passwords on a centralized server. Below I have included the README from this contrib module, which will soon appear on our website within a week or two. If you want it before that, let me know, and I could e-mail it to you.
Here is the README from the Bristol Condor contrib module:
README Compilation Instructions Build the projects in the following order CondorCredSvc CondorAuthSvc Crun Carun AfsEncrypt RegisterService DeleteService Only the first 3 need to be built in order. This just makes sure that the RPC stubs are correctly rebuilt if required. The last 2 are only helper applications to install/remove the services. All projects are Visual Studio 6 projects. The nmakefiles have been exported for each. Only the project for Carun should need to be modified to change the location of the AFS libraries if needed. Details CondorCredSvc CondorCredSvc is a simple RPC service that serves the domain account credentials. It reads the account name and password from the registry of the machine it's running on. At the moment these details are stored in clear text under the key HKEY_LOCAL_MACHINE\Software\Condor\CredService The account name and password are held in REG_SZ values "Account" and "Password" respectively. In addition there is an optional REG_SZ value "Port" which holds the clear text port number (e.g. "1234"). If this value is not present the service defaults to using port 3654. At the moment there is no attempt to encrypt the username/password when it is sent over the wire - but this should be reasonably straightforward to change. This service can sit on any machine so keeping the registry entries secure ought to be fine. Certainly the ACL on the key could be set to only allow administrators and SYSTEM access. CondorAuthSvc and Crun These two programs do the hard work of getting the job authenticated and running in the right place. CondorAuthSvc actually handles the process creation while Crun deals with getting the winstation/desktop/working directory and grabbing the console output from the job so that Condor's output handling mechanisms still work as advertised. Probably the easiest way to see how the two interact is to run through the job creation process: The first thing to realize is that condor itself only runs Crun.exe. Crun treats its command line parameters as the program to really run. e.g. "Crun \\mymachine\myshare\myjob.exe" actually causes \\mymachine\myshare\myjob.exe to be executed in the context of the domain account served by CondorCredSvc. This is how it works: When Crun starts up it gets its window station and desktop - these are the ones created by condor. It also gets its current directory - again already created by condor. It then makes sure that SYSTEM has permission to modify the DACL on the window station, desktop and directory. Next it creates a shared memory section and copies its environment variable block into it. Then, so that it can get hold of STDOUT and STDERR from the job it makes two named pipes on the machine it's running on and attaches a thread to each which just prints out anything that comes in on the pipe to the appropriate stream. These pipes currently have a NULL DACL, but only one instance of each is allowed so there shouldn't be any issues involving malicious people putting garbage into them. The shared memory section and both named pipes are tagged with the ID of Crun's process in case we're on a multi-processor machine that might be running more than one job. Crun then makes an RPC call to CondorAuthSvc to actually start the job, passing the names of the window station, desktop, executable to run, current directory, pipes and shared memory section (it only attempts to call CondorAuthSvc on the same machine as it is running on). If the jobs starts successfully it gets the process ID back from the RPC call and then just waits for the new process to finish before closing the pipes and exiting. Technically, it does this by synchronizing on a handle to the process and waiting for it to exit. CondorAuthSvc sets the ACL on the process to allow EVERYONE to synchronize on it. [ Technical note: Crun adds "C:\WINNT\SYSTEM32\CMD.EXE /C" to the start of the command line. This is because the process is created with the network context of the caller i.e. LOCALSYSTEM. Pre-pending cmd.exe gets round any unexpected "Access Denied" errors. ] If Crun gets a WM_CLOSE (CTRL_CLOSE_EVENT) while the job is running it attempts to stop the job, again with an RPC call to CondorAuthSvc passing the job's process ID. CondorAuthSvc runs as a service under the LOCALSYSTEM account and does the work of starting the job. By default it listens on port 3655, but this can be changed by setting the optional REG_SZ value "Port" under the registry key HKEY_LOCAL_MACHINE\Software\Condor\AuthService (Crun also checks this registry key when attempting to contact CondorAuthSvc.) When it gets the RPC to start a job CondorAuthSvc first connects to the pipes for STDOUT and STDERR to prevent anyone else sending data to them. It also opens the shared memory section with the environment stored by Crun. It then makes an RPC call to CondorCredSvc (to get the name and password of the domain account) which is most likely running on another system. The location information is stored in the registry under the key HKEY_LOCAL_MACHINE\Software\Condor\CredService The name of the machine running CondorCredSvc must be held in the REG_SZ value "Host". This should be the fully qualified domain name of the machine. You can also specify the optional "Port" REG_SZ value in case you are running CondorCredSvc on a different port. Once the domain account credentials have been received the account is logged on through a call to LogonUser. The DACLs on the window station, desktop and current directory are then modified to allow the domain account access to them and the job is started in that window station and desktop with a call to CreateProcessAsUser. The starting directory is set to the same as sent by Crun, STDOUT and STDERR handles are set to the named pipes and the environment sent by Crun is used. CondorAuthSvc also starts a thread which waits on the new process handle until it terminates to close the named pipes. If the process starts correctly the process ID is returned to Crun. If Crun requests that the job be stopped (again via RPC), CondorAuthSvc loops over all windows on the window station and desktop specified until it finds the one associated with the required process ID. It then sends that window a WM_CLOSE message, so any termination handling built in to the job should work correctly. [Security Note: CondorAuthSvc currently makes no attempt to verify the origin of the call starting the job. This is, in principal, a bad thing since if the format of the RPC call is known it could let anyone start a job on the machine in the context of the domain user. If sensible security practices have been followed and the ACLs on sensitive system directories (such as C:\WINNT) do not allow write access to anyone other than trusted users the problem should not be too serious.] Carun and AFSEncrypt Carun and AFSEncrypt are a couple of utilities to allow jobs to access AFS without any special recompilation. AFSEncrypt encrypts an AFS username/password into a file (called .afs.xxx) using a simple XOR algorithm. It's not a particularly secure way to do it, but it's simple and self-inverse. Carun reads this file and gets an AFS token before running whatever job is on its command line as a child process. It waits on the process handle and a 24 hour timer. If the timer expires first it briefly suspends the primary thread of the child process and attempts to get a new AFS token before restarting the job, the idea being that the job should have uninterrupted access to AFS if it runs for more than 25 hours (the default token lifetime). As a security measure, the AFS credentials are cached by Carun in memory and the .afs.xxx file deleted as soon as the username/password have been read for the first time. Carun needs the machine to be running either the IBM AFS client or the OpenAFS client to work. It also needs the client libraries if you want to rebuild it. For example, if you wanted to get a list of your AFS tokens under Condor you would run the following: Crun \\mymachine\myshare\Carun tokens.exe Running a job To run a job using this mechanism specify the following in your job submission (assuming Crun is in C:\CondorAuth): Executable= c:\CondorAuth\Crun.exe Arguments = \\mymachine\myshare\carun.exe \\anothermachine\anothershare\myjob.exe Transfer_Input_Files = .afs.xxx along with your usual settings. Installation A basic installation script for use with the Inno Setup installation package compiler can be found in the Install folder.
Unix machines and Windows machines running Condor can happily co-exist in the same Condor pool without any problems. Jobs submitted on Windows can run on Windows or Unix, and jobs submitted on Unix can run on Unix or Windows. Without any specification (using the requirements expression in the submit description file), the default behavior will be to require the execute machine to be of the same architecture and operating system as the submit machine.
There is absolutely no need to run more than one Condor central manager, even if you have both Unix and Windows machines. The Condor central manager itself can run on either Unix or Windows; there is no advantage to choosing one over the other. Here at University of Wisconsin-Madison, for instance, we have hundreds of Unix and Windows machines in our Computer Science Department Condor pool.