

   
5.2 Release Notes for Condor NT Version 6.2.2

5.2.0.1 What is missing from Condor NT Version 6.2.2?

In general, this release on NT works the same as the release of Condor for Unix.

However, the following items are not supported in this version:

5.2.0.2 What is included in Condor NT Version 6.2.2?

Except for those items listed above, almost everything works the same way in Condor NT as it does in the Unix release. This release is based on the Condor Version 6.2.2 source tree, and thus the feature set is the same as Condor Version 6.2.2 for Unix. For instance, all of the following work in Condor NT:

  
5.2.1 Condor File Transfer Mechanism

Condor remote system calls and the ability to access network shares are not yet supported on NT -- they will be in the near future. For now, Condor NT users must use the Condor File Transfer mechanism.

When Condor finds a machine willing to execute a job, it will create a temporary subdirectory for the job on the execute machine. The Condor File Transfer mechanism will then send (by TCP) the job executable(s) and input files from the submitting machine to the temporary subdirectory on the execute machine. After the input files have been transferred, the execute machine starts running the job with the temporary directory as the job's current working directory. When the job completes or is kicked off, Condor File Transfer automatically sends back to the submit machine any output files created by the job. After the files have been successfully sent back, the temporary working directory on the execute machine is removed.

Condor's File Transfer mechanism has several features to ensure data integrity in a non-dedicated environment. For instance, transfers of multiple files are performed atomically.

5.2.1.1 File Transfer Commands

Condor File Transfer behavior is specified at job submit time by the submit description file and by condor_submit. Along with all the other job description commands (see section 8), the following new commands are available for use in the submit description file:

transfer_input_files = < file1, file2, file... >
Lists all the files to be transferred into the working directory for the job before the job is started. Separate multiple filenames with a comma. By default, the file specified in the executable command and any file specified in the input command (for example, stdin) are transferred.

transfer_output_files = < file1, file2, file... >
This command forms an explicit list of output files to be transferred back from the temporary working directory on the execute machine to the submit machine. Most of the time, there is no need to use this command. If transfer_output_files is not specified, Condor will automatically transfer back all files in the job's temporary working directory which have been modified or created by the job. This is usually the desired behavior. Explicitly listing output files is typically only done when the job creates many files, and the user wants to keep a subset of those files. WARNING: Do not specify transfer_output_files in the submit description file unless there is a really good reason - it is best to let Condor figure things out by itself based upon what the job produces.

transfer_files = <ONEXIT | ALWAYS>
Setting transfer_files equal to ONEXIT will cause Condor to transfer the job's output files back to the submitting machine only when the job completes (exits). If not specified, ONEXIT is used as the default. Specifying ALWAYS tells Condor to transfer back the output files when the job completes or when the job is preempted or kicked off a machine prior to job completion. The ALWAYS option is intended for fault-tolerant jobs which periodically save their own state and can restart where they left off. Any output files transferred back to the submit machine are automatically sent back out again as input files when the job restarts.
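
The default behavior of transfer_output_files described above (send back every file that the job created or modified) can be pictured with a small sketch. This is not Condor's implementation, only an illustration of the idea under the assumption that a change is detected by comparing each file's modification time and size before and after the job runs. The function names and the file name results.out are hypothetical.

```python
import os
import tempfile


def snapshot(directory):
    """Map each file name in `directory` to its (mtime_ns, size)."""
    state = {}
    for entry in os.scandir(directory):
        if entry.is_file():
            info = entry.stat()
            state[entry.name] = (info.st_mtime_ns, info.st_size)
    return state


def changed_files(before, after):
    """Return the names that are new, or whose (mtime, size) differs."""
    return sorted(name for name, sig in after.items() if before.get(name) != sig)


def demo():
    """Simulate a job that creates one output file in its working directory."""
    with tempfile.TemporaryDirectory() as workdir:
        # Pretend these files were transferred in before the job started.
        for name in ("foo.exe", "my-input-data"):
            with open(os.path.join(workdir, name), "w") as f:
                f.write("input")
        before = snapshot(workdir)
        # The "job" writes a result and leaves its inputs untouched.
        with open(os.path.join(workdir, "results.out"), "w") as f:
            f.write("42\n")
        return changed_files(before, snapshot(workdir))
```

In this sketch only results.out would be selected for transfer back; the untouched executable and input file are skipped.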

5.2.1.2 Ensuring File Transfer has enough disk space

It is highly recommended that the Requirements expression in the submit description file includes a requirement for the size of the Disk attribute. Doing so ensures that Condor picks a machine with enough local disk space for the job. Here is a sample submit description file:

        # Condor submit file for program "foo.exe".
        #
        # foo reads from files "my-input-data" and "my-other-input-data".
        # foo then writes out results into several files.
        # The total disk space foo uses for all input and output files
        # is never more than 10 megabytes.
        #
        executable = foo.exe
        universe = vanilla
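        # Now set Requirements saying that the machine which runs our job
        # must have more than 10 megabytes of free disk space.  Note that
        # "Disk" is expressed in kilobytes; 10 Mbytes is 10000 kbytes.
        requirements = Disk > 10000
        #
        # List the input files so that File Transfer sends them along
        # with the executable.
        transfer_input_files = my-input-data, my-other-input-data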
        queue

If the requirements do not specify the necessary amount of local disk space, condor_submit appends Disk >= DiskUsage to the job's Requirements. The DiskUsage attribute represents the maximum amount of total disk space required by the job in kilobytes. While the job runs, Condor automatically updates the DiskUsage attribute approximately every 20 minutes with the amount of space being used by the job on the execute machine.
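
To choose a sensible number for the Disk requirement, it can help to measure the input files and add an estimate for the output. The following is a minimal sketch of that arithmetic, not part of Condor. The helper names are hypothetical, and it assumes 1024 bytes per kilobyte with sizes rounded up; the sample submit file above uses the rounder figure of 10000 kilobytes for 10 megabytes.

```python
import os
import tempfile


def dir_usage_kb(directory):
    """Total size of all files under `directory`, rounded up to whole kilobytes."""
    total_bytes = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
    return (total_bytes + 1023) // 1024


def disk_requirement(input_kb, expected_output_kb):
    """Build a Requirements clause comparing Disk (in kilobytes) to the job's needs."""
    return "Disk > %d" % (input_kb + expected_output_kb)


def demo():
    """Measure a directory holding 2049 bytes of input and add 10 KB for output."""
    with tempfile.TemporaryDirectory() as workdir:
        with open(os.path.join(workdir, "my-input-data"), "wb") as f:
            f.write(b"x" * 2048)
        with open(os.path.join(workdir, "my-other-input-data"), "wb") as f:
            f.write(b"y")
        return disk_requirement(dir_usage_kb(workdir), 10)
```

The resulting string can be pasted into the requirements command of the submit description file.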

5.2.1.3 Current Limitations of File Transfer

Itemized below are some current limitations of the File Transfer mechanism. We anticipate improvement in upcoming releases.

5.2.2 Details on how Condor NT starts/stops a job

This section provides some details on how Condor NT starts and stops jobs. This discussion is geared for the Condor administrator or advanced user who is already familiar with the material in the Administrators' Manual and wishes to know detailed information on what Condor NT does when starting and stopping jobs.

When Condor NT is about to start a job, the condor_startd on the execute machine spawns a condor_starter process. The condor_starter then creates:

1. a new temporary run account on the machine with a login name of ``condor-run-dir_XXX'', where XXX is the process ID of the condor_starter. This account is added to group Users and group Everyone.

2. a new temporary working directory for the job on the execute machine. This directory is named ``dir_XXX'', where XXX is the process ID of the condor_starter. The directory is created in the $(EXECUTE) directory as specified in Condor's configuration file. Condor then grants write permission to this directory for the user account newly created for the job.

3. a new, non-visible Window Station and Desktop for the job. Permissions are set so that only the user account newly created has access rights to this Desktop. Any windows created by this job are not seen by anyone; the job is run in the background.

Next, the condor_starter (called the starter) contacts the condor_shadow (called the shadow) process, which is running on the submitting machine, and pulls over the job's executable and input files. These files are placed into the temporary working directory for the job. After all files have been received, the starter spawns the user's executable as user ``condor-run-dir_XXX'' with its current working directory set to the temporary working directory (that is, $(EXECUTE)/dir_XXX).

While the job is running, the starter closely monitors the CPU usage and image size of all processes started by the job. Every 20 minutes the starter sends this information, along with the total size of all files contained in the job's temporary working directory, to the shadow. The shadow then inserts this information into the job's ClassAd so that policy and scheduling expressions can make use of this dynamic information.

If the job exits of its own accord (that is, the job completes), the starter first terminates any processes started by the job which could still be around if the job did not clean up after itself. The starter examines the job's temporary working directory for any files which have been created or modified and sends these files back to the shadow running on the submit machine. The shadow places these files into the initialdir specified in the submit description file; if no initialdir was specified, the files go into the directory where the user invoked condor_submit. Once all the output files are safely transferred back, the job is removed from the queue. If, however, the condor_startd forcibly kills the job before all output files could be transferred, the job is not removed from the queue but instead switches back to the Idle state.

If the condor_startd decides to vacate a job prematurely, the starter sends a WM_CLOSE message to the job. If the job spawned multiple child processes, the WM_CLOSE message is only sent to the parent process (that is, the one started by the starter). The WM_CLOSE message is the preferred way to terminate a process on Windows NT, since this method allows the job to clean up and free any resources it may have allocated. When the job exits, the starter cleans up any processes left behind. At this point, if transfer_files is set to ONEXIT (the default) in the job's submit description file, the job switches states, from Running to Idle, and no files are transferred back. If transfer_files is set to ALWAYS, then any files in the job's temporary working directory which were created or modified are first sent back to the submitting machine. In this case, however, the shadow places these so-called intermediate files into a subdirectory created in the $(SPOOL) directory on the submitting machine ($(SPOOL) is specified in Condor's configuration file). The job is then switched back to the Idle state until Condor finds a different machine on which to run it. When the job is started again, Condor places into the job's temporary working directory the executable and input files as before, plus any files stored in the submit machine's $(SPOOL) directory for that job.
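
A job only benefits from transfer_files = ALWAYS if it saves its own state and resumes from it. The sketch below shows one way such a job might be structured; it is an illustration, not code from Condor. The state file name foo.state and the stop_after parameter (which simulates being vacated) are hypothetical. The state file lives in the job's working directory, so it would be among the intermediate files sent back and later restored.

```python
import os
import tempfile

STATE_FILE = "foo.state"  # hypothetical name for the job's saved state


def load_state(workdir):
    """Return the next step to run, or 0 if no saved state exists."""
    path = os.path.join(workdir, STATE_FILE)
    if os.path.exists(path):
        with open(path) as f:
            return int(f.read().strip())
    return 0


def save_state(workdir, next_step):
    """Write to a temporary name, then rename, so the state file is never half-written."""
    path = os.path.join(workdir, STATE_FILE)
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(next_step))
    os.replace(tmp, path)


def run(workdir, total_steps, stop_after=None):
    """Resume from the saved step; `stop_after` simulates a vacate. Returns the step reached."""
    step = load_state(workdir)
    done_this_run = 0
    while step < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            break
        # ... one unit of real work would go here ...
        step += 1
        done_this_run += 1
        save_state(workdir, step)
    return step


def demo():
    """First run is interrupted after 4 steps; the second run finishes the rest."""
    with tempfile.TemporaryDirectory() as workdir:
        first = run(workdir, 10, stop_after=4)
        second = run(workdir, 10)
        return first, second
```

A real job would pass its current working directory as workdir, since Condor starts the job inside the temporary working directory.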

NOTE: A Windows console process can intercept a WM_CLOSE message via the Win32 SetConsoleCtrlHandler() function if it needs to do special cleanup work at vacate time; a WM_CLOSE message generates a CTRL_CLOSE_EVENT. See SetConsoleCtrlHandler() in the Win32 documentation for more info.

NOTE: The default handler in Windows NT for a WM_CLOSE message is for the process to exit. Of course, the job could be coded to ignore it and not exit, but eventually the condor_startd will get impatient and hard-kill the job (if that is the policy desired by the administrator).
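
As an illustration of the first note, the sketch below registers a console control handler from Python through ctypes. It assumes the job is a console process running under a Python interpreter on Windows; on other platforms install_handler() does nothing. SetConsoleCtrlHandler() and CTRL_CLOSE_EVENT are the Win32 names mentioned above; the function names and the cleanup_done list are hypothetical stand-ins for real cleanup work.

```python
import ctypes
import sys

CTRL_CLOSE_EVENT = 2  # value defined in the Win32 headers

cleanup_done = []  # hypothetical record of cleanup work, for illustration


def on_console_event(event_type):
    """Handle CTRL_CLOSE_EVENT; return False so other events go to the next handler."""
    if event_type == CTRL_CLOSE_EVENT:
        # Do all cleanup before returning; the process may be
        # terminated as soon as the handler returns.
        cleanup_done.append("state saved")
        return True
    return False


def install_handler():
    """Register on_console_event with Win32. Returns None when not on Windows."""
    if sys.platform != "win32":
        return None
    handler_type = ctypes.WINFUNCTYPE(ctypes.c_int, ctypes.c_uint)
    callback = handler_type(on_console_event)
    if not ctypes.windll.kernel32.SetConsoleCtrlHandler(callback, 1):
        raise ctypes.WinError()
    # The caller must keep a reference to the callback for as long as
    # the handler is installed, or it may be garbage collected.
    return callback
```

A job would call install_handler() once at startup and hold on to the returned object until it exits.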

Finally, after the job has left and any files have been transferred back, the starter deletes the temporary working directory, the temporary account, the Window Station and the Desktop before exiting itself. If the starter should terminate abnormally, the condor_startd attempts the cleanup. If for some reason the condor_startd should disappear as well (for example, if the entire machine was power-cycled hard), the condor_startd will clean up when Condor is restarted.

5.2.3 Security considerations in Condor NT

On the execute machine, the user job is run using the access token of an account dynamically created by Condor which has bare-bones access rights and privileges. For instance, if your machines are configured so that only Administrators have write access to C:\WINNT, then certainly no Condor job run on that machine would be able to write anything there. The only files the job should be able to access on the execute machine are files accessible by group Everyone and files in the job's temporary working directory.

On the submit machine, Condor permits the File Transfer mechanism to only read files which the submitting user has access to read, and only write files to which the submitting user has access to write. For example, say only Administrators can write to C:\WINNT on the submit machine, and a user gives the following to condor_submit:

         executable = mytrojan.exe
         initialdir = c:\winnt
         output = explorer.exe
         queue
Unless that user is in group Administrators, Condor will not permit explorer.exe to be overwritten.

If for some reason the submitting user's account disappears between the time condor_submit was run and when the job runs, Condor is not able to check and see if the now-defunct submitting user has read/write access to a given file. In this case, Condor will ensure that group ``Everyone'' has read or write access to any file the job subsequently tries to read or write. This is in consideration for some network setups, where the user account only exists for as long as the user is logged in.

Condor also provides protection to the job queue. It would be bad if the integrity of the job queue were compromised, because a malicious user could remove other users' jobs or even change what executable a user's job will run. To guard against this, in Condor's default configuration all connections to the condor_schedd (the process which manages the job queue on a given machine) are authenticated using Windows NT's SSPI security layer. The user is then authenticated using the same challenge-response protocol that NT uses to authenticate users to Windows NT file servers. Once authenticated, the only users allowed to edit a job entry in the queue are:

1. the user who originally submitted that job (i.e. Condor allows users to remove or edit their own jobs)

2. users listed in the condor_config file parameter QUEUE_SUPER_USERS. In the default configuration, only the ``SYSTEM'' (LocalSystem) account is listed here.

WARNING: Do not remove ``SYSTEM'' from QUEUE_SUPER_USERS, or Condor itself will not be able to access the job queue when needed. If the LocalSystem account on your machine is compromised, you have all sorts of problems!
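
The rule for who may edit a job entry reduces to a simple check, sketched below. This illustrates the policy only and is not Condor's code. The function name is hypothetical, and the case-insensitive comparison is an assumption based on Windows NT account names not being case-sensitive.

```python
def may_edit_job(requesting_user, job_owner, queue_super_users=("SYSTEM",)):
    """Return True if an authenticated user may edit or remove the given job.

    Allowed: the job's owner, or any account listed in QUEUE_SUPER_USERS.
    Names are compared case-insensitively (an assumption of this sketch).
    """
    user = requesting_user.lower()
    if user == job_owner.lower():
        return True
    return user in {name.lower() for name in queue_super_users}
```

For example, may_edit_job("alice", "bob") is False, while may_edit_job("SYSTEM", "bob") is True under the default list.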

To protect the actual job queue files themselves, the Condor NT installation program will automatically set permissions on the entire Condor release directory so that only Administrators have write access.

Finally, Condor NT has all the IP/Host-based security mechanisms present in the full-blown version of Condor. See section 3.8 for complete information on how to allow/deny access to Condor based upon machine hostname or IP address.

5.2.4 Interoperability between Condor for Unix and Condor NT

Unix machines and Windows NT machines running Condor can happily co-exist in the same Condor pool without any problems. For now, the only restriction is that jobs submitted on Windows NT must run on Windows NT, and jobs submitted on Unix must run on Unix. You will get this behavior by default, since condor_submit will automatically set a Requirements expression in the job ClassAd stating that the execute machine must have the same architecture and operating system as the submit machine.
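
The effect of that automatic Requirements expression can be sketched as string construction. This is an illustration rather than condor_submit's actual logic. Arch and OpSys are the machine ClassAd attributes for architecture and operating system; the function names are hypothetical, and the values "INTEL" and "WINNT40" in the usage note are examples that may not match your pool.

```python
def same_platform_clause(arch, opsys):
    """Clause requiring the execute machine to match the submit machine's platform."""
    return '(Arch == "%s") && (OpSys == "%s")' % (arch, opsys)


def with_default_platform(requirements, arch, opsys):
    """Combine the user's Requirements (possibly empty) with the platform clause."""
    clause = same_platform_clause(arch, opsys)
    if not requirements:
        return clause
    return "(%s) && %s" % (requirements, clause)
```

With a user requirement of Disk > 10000 submitted from an NT machine, with_default_platform("Disk > 10000", "INTEL", "WINNT40") yields (Disk > 10000) && (Arch == "INTEL") && (OpSys == "WINNT40"), which no Unix machine can satisfy.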

There is absolutely no need to run more than one Condor central manager, even if you have both Unix and NT machines. The Condor central manager itself can run on either Unix or NT; there is no advantage to choosing one over the other. Here at the University of Wisconsin-Madison, for instance, we have hundreds of Unix (Solaris, Linux, Irix, etc.) and Windows NT machines in our Computer Science Department Condor pool. Our central manager is running on Windows NT. All is happy.

5.2.5 Some differences between Condor for Unix -vs- Condor NT

 
condor-admin@cs.wisc.edu