Condor Pool Goodput
Goodput is the portion of allocation time during which an application
uses a remote workstation to make forward progress. Goodput can be
significantly less than total allocation time.
An application cannot use the workstation while it is
waiting for the network. Additionally, forward progress is lost when
an application must roll back to an earlier state due to a failure.
Goodput = Throughput - Network Wait Time - Roll-back
Network Wait Time includes:
- application placement: transferring the
executable and checkpoint data at the start of the allocation
- periodic checkpoints: suspending execution to
perform a periodic checkpoint
- preemption: suspending execution to see if the
workstation will become available again soon and/or performing a
checkpoint when preempted
Roll-back occurs when the application does not
successfully write a checkpoint when preempted.
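The accounting above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Condor's implementation: the `Allocation` record and all of its field names are hypothetical, every value is assumed to be a duration in seconds, and the "Throughput" term is taken to be the total allocation time.

```python
from dataclasses import dataclass


@dataclass
class Allocation:
    """One allocation of a remote workstation (hypothetical record)."""
    total_seconds: float          # total allocation time (the "Throughput" term)
    placement_seconds: float      # transferring executable and checkpoint data
    periodic_ckpt_seconds: float  # suspended while writing periodic checkpoints
    preemption_seconds: float     # suspended or checkpointing at preemption
    rollback_seconds: float       # work lost when a checkpoint was not written

    def network_wait(self) -> float:
        # Network Wait Time is the sum of the three components listed above.
        return (self.placement_seconds
                + self.periodic_ckpt_seconds
                + self.preemption_seconds)

    def goodput(self) -> float:
        # Goodput = Throughput - Network Wait Time - Roll-back
        return self.total_seconds - self.network_wait() - self.rollback_seconds


# Made-up numbers for a one-hour allocation.
alloc = Allocation(
    total_seconds=3600.0,
    placement_seconds=120.0,
    periodic_ckpt_seconds=90.0,
    preemption_seconds=30.0,
    rollback_seconds=600.0,
)
print(alloc.goodput())  # 2760.0
```

In this made-up case, 240 seconds of network wait and 600 seconds of roll-back leave 2760 of the 3600 allocated seconds as goodput.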
We published a paper on this subject, titled "Improving Goodput by
Co-scheduling CPU and Network Capacity," in the Fall 1999 International
Journal of High Performance Computing Applications, Volume 13(3).
Statistics maintained by the checkpoint server are also useful for
monitoring goodput. The checkpoint server maintains a record of all
attempted checkpoint transfers. This record is used to report network
usage by the system, including success rate and network throughput.
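As a rough sketch of how a success rate and a network throughput figure could be derived from such a record, consider the following. The `Transfer` tuple, its fields, and the `summarize` function are hypothetical and do not reflect the checkpoint server's actual log format. Throughput here is assumed to mean total bytes moved divided by total transfer time, counting failed attempts.

```python
from typing import NamedTuple, Sequence


class Transfer(NamedTuple):
    """One attempted checkpoint transfer (hypothetical record layout)."""
    bytes_sent: int
    seconds: float
    succeeded: bool


def summarize(transfers: Sequence[Transfer]) -> dict:
    """Return the success rate and aggregate throughput for a set of transfers."""
    attempted = len(transfers)
    total_seconds = sum(t.seconds for t in transfers)
    total_bytes = sum(t.bytes_sent for t in transfers)
    succeeded = sum(1 for t in transfers if t.succeeded)
    return {
        # Guard against empty input and zero elapsed time.
        "success_rate": succeeded / attempted if attempted else 0.0,
        "bytes_per_second": total_bytes / total_seconds if total_seconds else 0.0,
    }


# Made-up records: two successful transfers and one failed attempt.
records = [
    Transfer(100_000_000, 10.0, True),
    Transfer(50_000_000, 10.0, False),
    Transfer(200_000_000, 20.0, True),
]
stats = summarize(records)
print(stats)
```

Counting failed attempts in the throughput figure is a design choice made for this sketch. It reflects the load placed on the network rather than only the useful data delivered.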
The matchmaker initiates all job placements in the Condor pool. The
matchmaker maintains a record of all matches it performs, which
includes an estimate of the placement cost of each job.
Condor monitors keyboard activity on all of the machines it manages.
Statistics about when and how long machines are idle can influence
goodput scheduling decisions. For example, if machines are mostly
idle for short periods of time, checkpointing plays an increased role
in the goodput of the pool.
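To illustrate why short idle periods raise the relative cost of checkpointing, the sketch below computes the fraction of an idle period consumed by placement plus a single checkpoint at preemption. The function name, its parameters, and the fixed-cost model are assumptions made for illustration. Condor's actual scheduling logic is not shown here.

```python
def overhead_fraction(idle_seconds: float,
                      placement_seconds: float,
                      checkpoint_seconds: float) -> float:
    """Fraction of one idle period spent on placement and one checkpoint.

    Assumes fixed placement and checkpoint costs (a simplification).
    The result is capped at 1.0 when the costs exceed the idle period.
    """
    if idle_seconds <= 0:
        raise ValueError("idle_seconds must be positive")
    return min(1.0, (placement_seconds + checkpoint_seconds) / idle_seconds)


# Made-up costs: 60 s to place the job and 60 s to checkpoint it.
short_idle = overhead_fraction(600.0, 60.0, 60.0)   # ten-minute idle period
long_idle = overhead_fraction(7200.0, 60.0, 60.0)   # two-hour idle period
print(short_idle, long_idle)
```

With these made-up costs, the ten-minute idle period loses a much larger share of its time to transfers than the two-hour one, which is the effect described above.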
One method we use to monitor goodput is to keep a small number of
representative applications running at all times and track the goodput
obtained by these applications. Together, these representative
applications form a "goodput index." Note: we are not currently
running the representative applications.
Goodput statistics are updated daily. All graphs are also available
on a single goodput statistics page.
condor-admin@cs.wisc.edu