CONDOR TECHNICAL SUMMARY Allan Bricker Michael Litzkow and Miron Livny Computer Sciences Department University of Wisconsin - Madison allan@chorus.fr, mike@cs.wisc.edu, miron@cs.wisc.edu 1. Introduction to the Problem A common computing environment consists of many workstations connected together by a high speed local area network. These workstations have grown in power over the years, and if viewed as an aggregate they can represent a significant computing resource. However in many cases even though these workstations are owned by a single organization, they are dedicated to the exclusive use of indivi duals. In examining the usage patterns of the workstations, we find it useful to identify three ``typical'' types of users. ``Type 1'' users are individuals who mostly use their workstations for sending and receiving mail or preparing papers. Theoreticians and administrative people often fall into this category. We identify many software development people as ``type 2'' users. These people are fre quently involved in the debugging cycle where they edit software, compile, then run it possibly using some kind of debugger. This cycle is repeated many times during a typical working day. Type 2 users sometimes have too much computing capacity on their workstations such as when editing, but then dur ing the compilation and debugging phases they could often use more CPU power. Finally there are ``type 3'' users. These are people who frequently do large numbers of simulations, or combinitoric searches. These people are almost never happy with just a workstation, because it really isn't powerful enough to meet their needs. Another point is that most type 1 and type 2 users leave their machines completely idle when they are not working, while type 3 users often keep their machines busy 24 hours a day. Condor is an attempt to make use of the idle cycles from type 1 and 2 users to help satisfy the needs of the type 3 users. The condor software monitors the activity on all the participating worksta tions in the local network. Those machines which are determined to be idle, are placed into a resource pool or ``processor bank''. Machines are then allocated from the bank for the execution of jobs belonging to the type 3 users. The bank is a dynamic entity; workstations enter the bank when they become idle, and leave again when they get busy. 2. Design Features (1) No special programming is required to use condor. Condor is able to run normal UNIX 1 pro grams, only requiring the user to relink, not recompile them or change any code. (2) The local execution environment is preserved for remotely executing processes. Users do not have to worry about moving data files to remote workstations before executing programs there. ************************************ 1 UNIX is a trademark of AT&T. Version 4.1b 10/9/91 Version 4.1b 10/9/91 (3) The condor software is responsible for locating and allocating idle workstations. Condor users do not have to search for idle machines, nor are they restricted to using machines only during a static portion of the day. (4) ``Owners'' of workstations have complete priority over their own machines. Workstation own ers are generally happy to let somebody else compute on their machines while they are out, but they want their machines back promptly upon returning, and they don't want to have to take special action to regain control. Condor handles this automatically. (5) Users of condor may be assured that their jobs will eventually complete. If a user submits a job to condor which runs on somebody else's workstation, but the job is not finished when the workstation owner returns, the job will be checkpointed and restarted as soon as possible on another machine. (6) Measures have been taken to assure owners of workstations that their filesystems will not be touched by remotely executing jobs. (7) Condor does its work completely outside the kernel, and is compatible with Berkeley 4.2 and 4.3 UNIX kernels and many of their derivitives. You do not have to run a custom operating system to get the benefits of condor. 3. Limitations (1) Only single process jobs are supported, i.e. the fork(2), exec(2), and similar calls are not imple mented. (2) Signals and signal handlers are not supported, i.e. the signal(3), sigvec(2), and kill(2) calls are not implemented. (3) Interprocess communication (IPC) calls are not supported, i.e. the socket(2), send(2), recv(2), and similar calls are not implemented. (4) All file operations must be idempotent -- read-only and write-only file accesses work correctly, but programs which both read and write the same file may not. (5) Each condor job has an associated ``checkpoint file'' which is approximately the size of the address space of the process. Disk space must be available to store the checkpoint file both on the submitting and executing machines. (6) Condor does a significant amount of work to prevent security hazards, but some loopholes are known to exist. One problem is that condor user jobs are supposed to do only remote system calls, but this is impossible to guarantee. User programs are limited to running as an ordinary user on the executing machine, but a sufficiently malicious and clever user could still cause problems by doing local system calls on the executing machine. (7) A different security problem exists for owners of condor jobs who necessarily give remotely running processes access to their own file system. 4. Overview of Condor Software In some circumstances condor user programs may utilize ``remote system calls'' to access files on the machine from which they were submitted. In other situations files on the submitting machine are accessed more efficiently by use of NFS. In either case the user program is provided with the illusion that it is operating in the environment of the submitting machine. Programs written for operation in the local environment are converted to using remote file access simply by relinking with a special library. The remote file access mechanisms are described in Section 5. Condor user programs are constructed in such a way that they can be checkpointed and restarted at will. This assures users that their jobs will complete, even if they are interrupted during execution by the return of a hosting workstation's owner. Checkpointing is also implemented by linking with the special library. The checkpointing mechanism is described more fully in Section 6. Condor includes control software consisting of three daemons which run on each member of the condor pool, and two other daemons which run on a single machine called the central manager. This software automatically locates and releases ``target machines'' and manages the queue of jobs waiting CONDOR TECHNICAL SUMMARY 2 Version 4.1b 10/9/91 for condor resources. The control software is described in Section 7. 5. Remote File Access Condor programs executing on a remote workstation may access files on the submitting worksta tion in one of two ways. The preferred mechanism is direct access to the file via NFS, but this is only possible if those files appear to be in the filesystem of the executing machine, i.e. they are either physi cally located on the executing machine, or are mounted there via NFS. If the desired file does not appear in the filesystem of the executing workstation, condor provides called ``remote system calls'' which allows access to most of the normal system calls available on the submitting machine, including those that access files. In either case, the remote access is completely transparent to the user program, i.e. it simply executes such system calls as open(), close(), read(), and write(). The condor library pro vides the remote access below the system call level. To better understand how the condor remote system calls work, it is appropriate to quickly review how normal UNIX system calls work. Figure 1 illustrates the normal UNIX system call mechanism. The user program is linked with a standard library called the ``C library''. This is true even for programs written in languages other than C. The C library contains routines, often referred to as ``system call stubs'', which cause the actual system calls to happen. What the stubs really do is push the system call number, and system call arguments onto the stack, then execute an instruction which causes a trap to the kernel. When the kernel trap handler is called, it reads the system call number and arguments, and performs the system call on behalf of the user program. The trap handler will then place the system call return value in a well known register or registers, and return control to the user program. The system call stub then returns the result to the calling process, completing the system call. Figure 2 illustrates how this mechanism has been altered by condor to implement remote system calls. Whenever condor is executing a user program remotely, it also runs a ``shadow'' program on the initiating host. The shadow acts an agent for the remotely executing program in doing system calls. Condor user programs are linked with a special version of the C library. The special version contains all of the functions provided by the normal C library, but the system call stubs have been changed to accomplish remote system calls. The remote system call stubs package up the system call number and arguments and send them to the shadow using the network. The shadow, which is linked with the nor mal C library, then executes the system call on behalf of the remotely running job in the normal way. The shadow then packages up the results of the system call and sends them back to the system call stub in the special C library on the remote machine. The remote system call stub then returns its result to the User Program C Library (Trap to Kernel) Kernel Kernel Services e.g. File System Figure 1: Normal UNIX System Calls CONDOR TECHNICAL SUMMARY 3 Version 4.1b 10/9/91 Initiating Machine Executing Machine Figure 2: Remote System Calls Shadow (UID = User) C Library Kernel Condor User Program (UID = User) Special C Library Local File System System Call Request Reply System Call calling procedure which is unaware that the call was done remotely rather than locally. Note that the shadow runs with its UID set to the owner of the remotely running job so that it has the correct permis sions into the local file system. In many cases, it is more efficient to access files using NFS rather than via the remote system call mechanism. This is generally the case when the desired file is not physically located on the submitting machine, e.g. the file actually resides on a fileserver. In such a situation data transferred to or from the file would require two trips over the network, one via NFS to the shadow, and another via remote sys tem call to the condor user program. The open() system call provided in the condor version of the C library can detect such circumstances, and will open files via NFS rather than remote system calls when this is possible. The condor open() routine does this by sending the desired pathname to the shadow Initiating Machine Executing Machine Figure 3: NFS File Access Shadow Condor User Program Special C Library /u2/john fileserver:/staff/john / var usr1 usr2 Fileserver / faculty staff students / var u1 u2 open(/usr1/john) open(/u2/john) CONDOR TECHNICAL SUMMARY 4 Version 4.1b 10/9/91 program on the submitting machine along with a translation request. The shadow replies with the name of the host where the file physically resides along with a pathname for the file which is appropriate on the host where the file actually resides. The open() routine then examines the mount table on the exe cuting machine to determine whether the file is accessible via NFS and what pathname it is known by. This pathanme translation is repeated whenever the user job moves from one execution machine to another. Note that condor does not assume that all files are available from all machines, nor that every machine will mount filesystems in such a way that the same pathnames refer to the same physical files. Figure 3 illustrates a situation where the condor user program opens a file which is known as ``/u2/john'' on the submitting machine, but the same file is known as ``/usr1/jobn'' on the executing machine. 6. Checkpointing To checkpoint a UNIX process, several things must be preserved. The text, data, stack, and register contents are needed, as well as information about what files are open, where they are seek'd to, and what mode they were opened in. The data, and stack are available in a core file, while the text is available in the original executable. Condor gathers the information about currently open files through the special C library. In condor's special C library the system call stubs for ``open'', ``close'', and ``dup'' not only do those things remotely, but they also record which files are opened in what mode, and which file descriptors correspond to which files. Condor causes a running job to checkpoint by sending it a signal. When the program is linked, a special version of ``crt0'' is included which sets up CKPT() as that signal handler. When CKPT() is called, it updates the table of open files by seeking each one to the current location and recording the file position. Next a setjmp(3) is executed to save key register contents in a global data area, then the process sends itself a signal which results in a core dump. The condor software then combines the ori ginal executable file, and the core file to produce a ``checkpoint'' file, (figure 4). The checkpoint file is data stack uarea text initialized data symbol and debugging info Core Executable Checkpoint text initialized data stack Figure 4: Creating a Checkpiont File symbol and debugging info CONDOR TECHNICAL SUMMARY 5 Version 4.1b 10/9/91 itself executable. When the checkpoint file is restarted, it starts from the crt0 code just like any UNIX executable, but again this code is special, and it will set up the restart() routine as a signal handler with a special signal stack, then send itself that signal. When restart() is called, it will operate in the temporary stack area and read the saved stack in from the checkpoint file, reopen and reposition all files from the saved file state information, and execute a longjmp(3) back to CKPT(). When the restart routine returns, it does so with respect to the restored stack, and CKPT() returns to the routine which was active at the time of the checkpoint signal, not crt0. To the user code, checkpointing looks exactly like a signal handler was called, and restarting from a checkpoint looks like a return from that signal handler. 7. Control Software Each machine in the condor pool runs two daemons, the schedd and the startd. In addition, one machine runs two other daemons called the collector and the negotiator. While the collector and the negotiator are separate processes, they work closely together, and for purposes of this discussion can be considered one logical process called the central manager. The central manager has the job of keeping track of which machines are idle, and allocating those machines to other machines which have condor jobs to run. On each machine the schedd maintains a queue of condor jobs, and negotiates with the central manager to get permission to run those jobs on remote machines. The startd determines whether its machine is idle, and also is responsible for starting and managing foreign jobs which it may be hosting. On machines running the X window system, an additional daemon the kbdd will periodi cally inform the startd of the keyboard and mouse ``idle time''. Periodically the startd will examine its machine, and update the central manager on its degree of "idleness". Also periodically the schedd will examine its job queue and update the central manager on how many jobs it wants to run and how many jobs it is currently running. Figure 5 illustrates the situation when no condor jobs are running. At some point the central manager may learn that machine b is idle, and decide that machine c should execute one of its jobs remotely on machine b. The central manager will then contact the schedd on machine c and give it ``permission'' to run a job on machine b. The schedd on machine c will then select a job from its queue and spawn off a shadow process to run it. The shadow will then contact the startd on machine b and tell it that it would like to run a job. If the situation on machine b hasn't changed since the last update to the central manager, machine b will still be idle, and will respond with an OK. The startd on machine b then spawns a process called the starter. It's the Initiating Machine Central Manager Machine Central Manager Execution Machine Startd Kbdd Startd Kbdd Schedd Schedd Figure 5: Condor Processes With No Jobs Running Legend process started by fork/exec communication link CONDOR TECHNICAL SUMMARY 6 Version 4.1b 10/9/91 starter's job to start and manage the remotely running job (figure 6). The shadow on machine c will transfer the checkpoint file to the starter on machine b. The starter then sets a timer and spawns off the remotely running job from machine c (figure 7). The sha dow on machine c will handle all system calls for the job. When the starter's timer expires it will send the user job a checkpoint signal, causing it to save its file state and stack, then dump core. The starter then builds a new version of the checkpoint file which is stored temporarily on machine b. The starter restarts the job from the new checkpoint file, and the cycle of execute and checkpoint continues. At some point, either the job will finish, or machine b's user will return. If the job finishes, the job's owner is notified by mail, and the starter and shadow clean up. If machine b becomes busy, the startd on machine b will detect that either by noting recent activity on one of the tty or pty's, or by the rising load average. When the startd on machine b detects this activity, it will send a ``suspend'' signal to the starter, and the starter will temporarily suspend the user job. This is because frequently the own ers of machines are active for only a few seconds, then become idle again. This would be the case if the owner were just checking to see if there were new mail for example. If machine b remains busy for a period of about 5 minutes, the startd there will send a ``vacate'' signal to the starter. In this case, the starter will abort the user job and return the latest checkpoint file to the shadow on machine c. If the job had not run long enough on machine b to reach a checkpoint, the job is just aborted, and will be restarted later from the most recent checkpoint on machine c. Notice that the starter checkpoints the condor user job periodically rather than waiting until the remote workstation's owner wants it back. Checkpointing, and in particular core dumping, is an I/O intensive activity which we avoid doing when the hosting workstation's owner is active. 8. Control Expressions The condor control software is driven by a set of powerful ``control expressions''. These expres sions are read from the file ``~condor/condor_config'' on each machine at run time. It is often con venient for many machines of the same type to share common control expressions, and this may be done through a fileserver. To allow flexibility for control of individual machines, the file ``~condor/condor_config.local'' is provided, and expressions defined there take precedence over those defined in condor_config. Following are examples of a few of the more important condor control expressions with explanations. See condor_config(5) for a detailed description of all the control expressions. Initiating Machine Central Manager Machine Central Manager Shadow Execution Machine Starter Startd Kbdd Startd Kbdd Schedd Schedd Figure 6: Condor Processes While Starting a Job Legend process started by fork/exec communication link CONDOR TECHNICAL SUMMARY 7 Version 4.1b 10/9/91 Initiating Machine Central Manager Machine Central Manager Shadow Execution Machine Starter Startd Kbdd Startd Kbdd Schedd Schedd Figure 7: Condor Processes With One Job Running User Job Legend process started by fork/exec communication link 8.1. Starting Foreign Jobs This set of expressions is used by the startd to determine when to allow a foreign job to begin execution. BackgroundLoad = 0.3 StartIdleTime = 15 * $(MINUTE) CPU_Idle = LoadAvg <= $(BackgroundLoad) START : $(CPU_Idle) && KeyboardIdle > $(StartIdleTime) This example of the START expression specifies that to begin execution of a foreign job the load average must be less than 0.3, and there must have been no keyboard activity during the past 15 minutes. Other expressions are used to determine when to suspend, resume, and abort foreign jobs. 8.2. Prioritizing Jobs The schedd must prioritize its own jobs and negotiate with the central manager to get per mission to run them. It uses a control expression to assign priorities to its local jobs. PRIO : (UserPrio * 10) + $(Expanded) - (QDate / 1000000000.0) ``UserPrio'' is a number defined by the jobs owner in a similar spirit to the UNIX ``nice'' com mand. ``Expanded'' will be 1 if the job has already completed some execution, and 0 otherwise. This is an issue because expanded jobs require more disk space than unexpanded ones. ``QDate'' is the UNIX time when the job was submitted. The constants are chosen so that ``UserPrio'' will be the major criteria, ``Expanded'' will be less important, and ``QDate'' will be the minor criteria in determining job priority. ``UserPrio'', ``Expanded'', and ``QDate'' are variables known to the schedd which it determines for each job before applying the PRIO expression. 8.3. Prioritizing Machines The central manager does not keep track of individual jobs on the member machines. Instead it keeps track of how many jobs a machine wants to run, and how many it is running at any CONDOR TECHNICAL SUMMARY 8 Version 4.1b 10/9/91 particular time. This keeps the information that must be transmitted between the schedd and the central manager to a minimum. The central manager has the job of prioritizing the machines which want to run jobs, then it can give permission to the schedd on high priority machines and let them make their own decision about what jobs to run. UPDATE_PRIO : Prio + Users - Running Periodically the central manager will apply this expression to all of the machines in the pool. The priority of each machine will be incremented by the number of individual users on that machine who have jobs in the queue, and decremented by the number of jobs that machine is already execut ing remotely. Machines which are running lots of jobs will tend to have low priorities, and machines which have jobs to run, but can't run them, will accumulate high priorities. 9. Acknowledgements This project is based on the idea of a ``processor bank'', which was introduced by Maurice Wilkes in connection with his work on the Cambridge Ring. 2 We would like to thank Don Neuhengen and Tom Virgilio for their pioneering work on the remote system call implementation; Matt Mutka and Miron Livny for first convincing us that a general checkpointing mechanism could be practical and for ideas on how to distribute control and prioritize the jobs; and David Dewitt and Marvin Solomon for their continued guidance and support throughout this project. This research was supported by the National Science Foundataion under grants MCS81-05904 and DCR-8512862, by a Digital Equipment Corporation External Research Grant, and by an Interna tional Business Machines Department Grant. Porting to the SGI 4D Workstation was funded by NRL/SFA. 10. Copyright Information Copyright 1986, 1987, 1988, 1989, 1990, 1991 by the Condor Design Team Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee is hereby granted, provided that the above copyright notice appear in all copies and that both that copyright notice and this permission notice appear in supporting documentation, and that the name of the University of Wisconsin not be used in advertising or publicity pertaining to distribution of the software without specific, written prior permission. The University of Wisconsin and the Condor Design team make no representations about the suitability of this software for any purpose. It is pro vided "as is" without express or implied warranty. THE UNIVERSITY OF WISCONSIN AND THE CONDOR DESIGN TEAM DISCLAIM ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRAN TIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE UNIVERSITY OF WISCONSIN OR THE CONDOR DESIGN TEAM BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLI GENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. Authors: Allan Bricker, Michael J. Litzkow, and others. University of Wisconsin, Computer Sciences Dept. 11. Bibliography (1) Mutka, M. and Livny, M. ``Profiling Workstations' Available Capacity For Remote Execu tion''. Proceedings of Performance-87, The 12th IFIP W.G. 7.3 International Symposium on Computer Performance Modeling, Measurement and Evaluation. Brussels, Belgium, ************************************ 2 Wilkes, M. V., Invited Keynote Address, 10th Annual International Symposium on Computer Architecture, June 1983. CONDOR TECHNICAL SUMMARY 9 Version 4.1b 10/9/91 December 1987. (2) Litzkow, M. ``Remote Unix -- Turning Idle Workstations Into Cycle Servers''. Proceedings of the Summer 1987 Usenix Conference. Phoenix, Arizona. June 1987 (3) Mutka, M. Sharing in a Privately Owned Workstation Environment. Ph.D. Th., University of Wisconsin, May 1988. (4) Litzkow, M., Livny, M. and Mutka, M. ``Condor -- A Hunter of Idle Workstations''. Proceedings of the 8th International Conference on Distributed Computing Systems. San Jose, Calif. June 1988 (5) Bricker, A. and Litzkow M. ``Condor Installation Guide''. May 1989 (6) Bricker, A. and Litzkow, M. Unix manual pages: condor_intro(1), condor(1), condor_q(1), condor_rm(1), condor_status(1), condor_summary(1), condor_config(5), condor_control(8), and condor_master(8). January 1991 CONDOR TECHNICAL SUMMARY 10