next up previous contents index
Next: 3.5 Policy Configuration for Up: 3. Administrators' Manual Previous: 3.3 Configuration   Contents   Index

Subsections


3.4 User Priorities and Negotiation

HTCondor uses priorities to determine machine allocation for jobs. This section details the priorities and the allocation of machines (negotiation).

For accounting purposes, each user is identified by username@uid_domain. Each user is assigned a priority value even if submitting jobs from different machines in the same domain, or even if submitting from multiple machines in the different domains.

The numerical priority value assigned to a user is inversely related to the goodness of the priority. A user with a numerical priority of 5 gets more resources than a user with a numerical priority of 50. There are two priority values assigned to HTCondor users:

This section describes these two priorities and how they affect resource allocations in HTCondor. Documentation on configuring and controlling priorities may be found in section 3.3.17.


3.4.1 Real User Priority (RUP)

A user's RUP measures the resource usage of the user through time. Every user begins with a RUP of one half (0.5), and at steady state, the RUP of a user equilibrates to the number of resources used by that user. Therefore, if a specific user continuously uses exactly ten resources for a long period of time, the RUP of that user stabilizes at ten.

However, if the user decreases the number of resources used, the RUP gets better. The rate at which the priority value decays can be set by the macro PRIORITY_HALFLIFE , a time period defined in seconds. Intuitively, if the PRIORITY_HALFLIFE in a pool is set to 86400 (one day), and if a user whose RUP was 10 has no running jobs, that user's RUP would be 5 one day later, 2.5 two days later, and so on.


3.4.2 Effective User Priority (EUP)

The effective user priority (EUP) of a user is used to determine how many resources that user may receive. The EUP is linearly related to the RUP by a priority factor which may be defined on a per-user basis. Unless otherwise configured, an initial priority factor for all users as they first submit jobs is set by the configuration variable DEFAULT_PRIO_FACTOR , and defaults to the value 1.0. This default makes the EUP the same as the RUP. If desired, the priority factors of specific users can be increased using condor_userprio, so that some are served preferentially.

The number of resources that a user may receive is inversely related to the ratio between the EUPs of submitting users. Therefore user $A$ with EUP=5 will receive twice as many resources as user $B$ with EUP=10 and four times as many resources as user $C$ with EUP=20. However, if $A$ does not use the full number of resources that $A$ may be given, the available resources are repartitioned and distributed among remaining users according to the inverse ratio rule.

HTCondor supplies mechanisms to directly support two policies in which EUP may be useful:

Nice users
A job may be submitted with the submit command nice_user set to True. This nice user job will have its RUP boosted by the NICE_USER_PRIO_FACTOR priority factor specified in the configuration, leading to a very large EUP. This corresponds to a low priority for resources, therefore using resources not used by other HTCondor users.

Remote Users
HTCondor's flocking feature (see section 5.2) allows jobs to run in a pool other than the local one. In addition, the submit-only feature allows a user to submit jobs to another pool. In such situations, submitters from other domains can submit to the local pool. It may be desirable to have HTCondor treat local users preferentially over these remote users. If configured, HTCondor will boost the RUPs of remote users by REMOTE_PRIO_FACTOR specified in the configuration, thereby lowering their priority for resources.

The priority boost factors for individual users can be set with the setfactor option of condor_userprio. Details may be found in the condor_userprio manual page on page [*].


3.4.3 Priorities in Negotiation and Preemption

Priorities are used to ensure that users get their fair share of resources. The priority values are used at allocation time, meaning during negotiation and matchmaking. Therefore, there are ClassAd attributes that take on defined values only during negotiation, making them ephemeral. In addition to allocation, HTCondor may preempt a machine claim and reallocate it when conditions change.

Too many preemptions lead to thrashing, a condition in which negotiation for a machine identifies a new job with a better priority most every cycle. Each job is, in turn, preempted, and no job finishes. To avoid this situation, the PREEMPTION_REQUIREMENTS configuration variable is defined for and used only by the condor_negotiator daemon to specify the conditions that must be met for a preemption to occur. It is usually defined to deny preemption if a current running job has been running for a relatively short period of time. This effectively limits the number of preemptions per resource per time interval. Note that PREEMPTION_REQUIREMENTS only applies to preemptions due to user priority. It does not have any effect if the machine's RANK expression prefers a different job, or if the machine's policy causes the job to vacate due to other activity on the machine. See section 3.5.9 for a general discussion of limiting preemption.

The following ephemeral attributes may be used within policy definitions. Care should be taken when using these attributes, due to their ephemeral nature; they are not always defined, so the usage of an expression to check if defined such as

  (RemoteUserPrio =?= UNDEFINED)
is likely necessary.

Within these attributes, those with names that contain the string Submitter refer to characteristics about the candidate job's user; those with names that contain the string Remote refer to characteristics about the user currently using the resource. Further, those with names that end with the string ResourcesInUse have values that may change within the time period associated with a single negotiation cycle. Therefore, the configuration variables PREEMPTION_REQUIREMENTS_STABLE and and PREEMPTION_RANK_STABLE exist to inform the condor_negotiator daemon that values may change. See section 3.3.17 on page [*] for definitions of these configuration variables.

SubmitterUserPrio:
A floating point value representing the user priority of the candidate job.
SubmitterUserResourcesInUse:
The integer number of slots currently utilized by the user submitting the candidate job.
RemoteUserPrio:
A floating point value representing the user priority of the job currently running on the machine. This version of the attribute, with no slot represented in the attribute name, refers to the current slot being evaluated.
Slot<N>_RemoteUserPrio:
A floating point value representing the user priority of the job currently running on the particular slot represented by <N> on the machine.
RemoteUserResourcesInUse:
The integer number of slots currently utilized by the user of the job currently running on the machine.
SubmitterGroupResourcesInUse:
If the owner of the candidate job is a member of a valid accounting group, with a defined group quota, then this attribute is the integer number of slots currently utilized by the group.
SubmitterGroup:
The accounting group name of the requesting submitter.
SubmitterGroupQuota:
If the owner of the candidate job is a member of a valid accounting group, with a defined group quota, then this attribute is the integer number of slots defined as the group's quota.
RemoteGroupResourcesInUse:
If the owner of the currently running job is a member of a valid accounting group, with a defined group quota, then this attribute is the integer number of slots currently utilized by the group.
RemoteGroup:
The accounting group name of the owner of the currently running job.
RemoteGroupQuota:
If the owner of the currently running job is a member of a valid accounting group, with a defined group quota, then this attribute is the integer number of slots defined as the group's quota.
SubmitterNegotiatingGroup:
The accounting group name that the candidate job is negotiating under.
RemoteNegotiatingGroup:
The accounting group name that the currently running job negotiated under.
SubmitterAutoregroup:
Boolean attribute is True if candidate job is negotiated via autoregoup.
RemoteAutoregroup:
Boolean attribute is True if currently running job negotiated via autoregoup.

3.4.4 Priority Calculation

This section may be skipped if the reader so feels, but for the curious, here is HTCondor's priority calculation algorithm.

The RUP of a user $u$ at time $t$, $\pi_r(u,t)$, is calculated every time interval $\delta t$ using the formula

\begin{displaymath}\pi_r(u,t) = \beta\times\pi(u,t-\delta t) + (1-\beta)\times\rho(u,t)\end{displaymath}

where $\rho(u,t)$ is the number of resources used by user $u$ at time $t$, and $\beta=0.5^{{\delta t}/h}$. $h$ is the half life period set by PRIORITY_HALFLIFE .

The EUP of user $u$ at time $t$, $\pi_e(u,t)$ is calculated by

\begin{displaymath}\pi_e(u,t) = \pi_r(u,t)\times f(u,t)\end{displaymath}

where $f(u,t)$ is the priority boost factor for user $u$ at time $t$.

As mentioned previously, the RUP calculation is designed so that at steady state, each user's RUP stabilizes at the number of resources used by that user. The definition of $\beta$ ensures that the calculation of $\pi_r(u,t)$ can be calculated over non-uniform time intervals $\delta t$ without affecting the calculation. The time interval $\delta t$ varies due to events internal to the system, but HTCondor guarantees that unless the central manager machine is down, no matches will be unaccounted for due to this variance.


3.4.5 Negotiation

Negotiation is the method HTCondor undergoes periodically to match queued jobs with resources capable of running jobs. The condor_negotiator daemon is responsible for negotiation.

During a negotiation cycle, the condor_negotiator daemon accomplishes the following ordered list of items.

  1. Build a list of all possible resources, regardless of the state of those resources.
  2. Obtain a list of all job submitters (for the entire pool).
  3. Sort the list of all job submitters based on EUP (see section 3.4.2 for an explanation of EUP). The submitter with the best priority is first within the sorted list.
  4. Iterate until there are either no more resources to match, or no more jobs to match.
    For each submitter (in EUP order):
    For each submitter, get each job. Since jobs may be submitted from more than one machine (hence to more than one condor_schedd daemon), here is a further definition of the ordering of these jobs. With jobs from a single condor_schedd daemon, jobs are typically returned in job priority order. When more than one condor_schedd daemon is involved, they are contacted in an undefined order. All jobs from a single condor_schedd daemon are considered before moving on to the next. For each job:
    • For each machine in the pool that can execute jobs:
      1. If machine.requirements evaluates to False or job.requirements evaluates to False, skip this machine
      2. If the machine is in the Claimed state, but not running a job, skip this machine.
      3. If this machine is not running a job, add it to the potential match list by reason of No Preemption.
      4. If the machine is running a job
        • If the machine.RANK on this job is better than the running job, add this machine to the potential match list by reason of Rank.
        • If the EUP of this job is better than the EUP of the currently running job, and PREEMPTION_REQUIREMENTS is True, and the machine.RANK on this job is not worse than the currently running job, add this machine to the potential match list by reason of Priority.
    • Of machines in the potential match list, sort by NEGOTIATOR_PRE_JOB_RANK, job.RANK, NEGOTIATOR_POST_JOB_RANK, Reason for claim (No Preemption, then Rank, then Priority), PREEMPTION_RANK
    • The job is assigned to the top machine on the potential match list. The machine is removed from the list of resources to match (on this negotiation cycle).

The condor_negotiator asks the condor_schedd for the "next job" from a given submitter/user. Typically, the condor_schedd returns jobs in the order of job priority. If priorities are the same, job submission time is used; older jobs go first. If a cluster has multiple procs in it and one of the jobs cannot be matched, the condor_schedd will not return any more jobs in that cluster on that negotiation pass. This is an optimization based on the theory that the cluster jobs are similar. The configuration variable NEGOTIATE_ALL_JOBS_IN_CLUSTER disables the cluster-skipping optimization. Use of the configuration variable SIGNIFICANT_ATTRIBUTES will change the definition of what the condor_schedd considers a cluster from the default definition of all jobs that share the same ClusterId.


3.4.6 The Layperson's Description of the Pie Spin and Pie Slice

HTCondor schedules in a variety of ways. First, it takes all users who have submitted jobs and calculates their priority. Then, it totals the number of resources available at the moment, and using the ratios of the user priorities, it calculates the number of machines each user could get. This is their pie slice.

The HTCondor matchmaker goes in user priority order, contacts each user, and asks for job information. The condor_schedd daemon (on behalf of a user) tells the matchmaker about a job, and the matchmaker looks at available resources to create a list of resources that match the requirements expression. With the list of resources that match, it sorts them according to the rank expressions within ClassAds. If a machine prefers a job, the job is assigned to that machine, potentially preempting a job that might already be running on that machine. Otherwise, give the machine to the job that the job ranks highest. If the machine ranked highest is already running a job, we may preempt running job for the new job. A default policy for preemption states that the user must have a 20% better priority in order for preemption to succeed. If the job has no preferences as to what sort of machine it gets, matchmaking gives it the first idle resource to meet its requirements.

This matchmaking cycle continues until the user has received all of the machines in their pie slice. The matchmaker then contacts the next highest priority user and offers that user their pie slice worth of machines. After contacting all users, the cycle is repeated with any still available resources and recomputed pie slices. The matchmaker continues spinning the pie until it runs out of machines or all the condor_schedd daemons say they have no more jobs.


3.4.7 Group Accounting

By default, HTCondor does all accounting on a per-user basis, and this accounting is primarily used to compute priorities for HTCondor's fair-share scheduling algorithms. However, accounting can also be done on a per-group basis. Multiple users can all submit jobs into the same accounting group, and all jobs with the same accounting group will be treated with the same priority. Jobs that do not specify an accounting group have all accounting and priority based on the user, which may be identified by the job ClassAd attribute Owner. Jobs that do specify an accounting group have all accounting and priority based on the specified accounting group. Therefore, accounting based on groups only works when the jobs correctly identify their group membership.

The preferred method for having a job associate itself with an accounting group adds a command to the submit description file that specifies the group name:

  accounting_group = group_physics
This command causes the job ClassAd attribute AcctGroup to be set with this group name.

If the user name of the job submitter should be other than the Owner job ClassAd attribute, an additional command specifies the user name:

  accounting_group_user = albert
This command causes the job ClassAd attribute AcctGroupUser to be set with this user name.

The previous method for defining accounting groups is no longer recommended. It inserted the job ClassAd attribute AccountingGroup by setting it in the submit description file using the syntax in this example:

+AccountingGroup = "group_physics.albert"

In this previous method for defining accounting groups, the AccountingGroup attribute is a string, and it therefore must be enclosed in double quote marks.

Much of the reason that the previous method for defining accounting groups is no longer recommended is that the name of an accounting is that it used the period (.) character to separate the group name from the user name. Therefore, the syntax did not work if a user name contained a period.

The name should not be qualified with a domain. Certain parts of the HTCondor system do append the value $(UID_DOMAIN) (as specified in the configuration file on the submit machine) to this string for internal use. For example, if the value of UID_DOMAIN is example.com, and the accounting group name is as specified, condor_userprio will show statistics for this accounting group using the appended domain, for example

                                    Effective
User Name                           Priority
------------------------------      ---------
group_physics@example.com                0.50
user@example.com                        23.11
heavyuser@example.com                  111.13
...

Additionally, the condor_userprio command allows administrators to remove an entity from the accounting system in HTCondor. The -delete option to condor_userprio accomplishes this if all the jobs from a given accounting group are completed, and the administrator wishes to remove that group from the system. The -delete option identifies the accounting group with the fully-qualified name of the accounting group. For example

condor_userprio -delete group_physics@example.com

HTCondor removes entities itself as they are no longer relevant. Intervention by an administrator to delete entities can be beneficial when the use of thousands of short term accounting groups leads to scalability issues.


3.4.8 Accounting Groups with Hierarchical Group Quotas

An upper limit on the number of slots/machines allocated for a group of users can be specified with group quotas. The use of these group quotas modifies the negotiation for available resources (machines) within an HTCondor pool. When accounting groups together with group quotas are specified, the priorities and usage information are calculated per user. Numbers of slots/machines can be negotiated for preferentially by group. This may be useful when different groups (of varying size) own computers, and the groups choose to combine their computers to form an HTCondor pool. Consider an imaginary HTCondor pool example with thirty computers; twenty computers are owned by the physics group and ten computers are owned by the chemistry group. One notion of fair allocation could be implemented by configuring the twenty machines owned by the physics group to prefer (using the RANK configuration macro) jobs submitted by the users identified as associated with the physics group. Likewise, the ten machines owned by the chemistry group are configured to prefer jobs from users associated with the the chemistry group. This routes jobs to execute on specific machines, perhaps causing more preemption than necessary. The (fair allocation) policy desired is likely somewhat different, if these thirty machines have been pooled. The desired policy does not tie users to specific sets of machines, but caps the maximum number of slots/machines (a quota). Given thirty similar machines, the desired policy allows users within the physics group to have preference on up to twenty of the machines within the pool, and the machines can be any of the machines that are available.

The implementation of quotas is hierarchical, such that quotas may be described for groups, subgroups, sub subgroups, etc. The hierarchy is described by adherence to a naming scheme set up in advance. Configuration defines the quotas in terms of limiting the number of cores allocated for a group or subgroup. And, the jobs state which group or subgroup they are part of.

A quota for a set of users requires an identification of the set; members are called group users. Group users' jobs identify themselves as desiring negotiation under the group by setting the accounting_group command in the submit description file, as specified in section 3.4.7.

The hierarchy is identified by using the period ('.') character to separate a group name from a subgroup name from a sub subgroup name, etc. Group names are case-insensitive for negotiation. The topmost level group name is not required to begin with the string "group_", as in the examples "group_physics.experiment1" and "group_chemistry.organic", but it is a useful convention, because group names must not conflict with subgroup names.

Configuration controls the order of negotiation for groups, subgroups within the hierarchy defined, and individual users, as well as sets quotas.

When at least one group quota is defined, HTCondor must still do negotiation for jobs that are not party to the groups. Any job that is not part of any group is considered part of the invented "<none>" group that is the top or root of the hierarchical tree of groups. This string will appear in the output of condor_status.

Quotas are categorized as either static or dynamic. A static quota specifies an integral numbers of machines (slots), independent of the size of the pool. A dynamic quota specifies a percentage of machines (slots) calculated based on the current number of machines in the pool. It is intended that only one of static or a dynamic quotas are defined for a specified group and for the pool. If both are defined for a group, then the static quota is implemented, and the dynamic quota is ignored. The behavior is not defined and not documented for pools with both static and dynamic quotas defined.

Static Quotas
In the hierarchical implementation, there are two cases defined here, to specify for the allocation of machines where there is both a group and a subgroup. In the first case, the sum for the numbers of machines within all of a group's subgroups totals to fewer than the specification for the group's static quota. For example:
  GROUP_QUOTA_group_physics = 100
  GROUP_QUOTA_group_physics.experiment1 = 20
  GROUP_QUOTA_group_physics.experiment2 = 70
In this case, the unused quota of 10 machines is assigned to the group_physics submitters.

In the second case, the specification for the numbers of machines of a set of subgroups totals to more than the specification for the group's quota. For example:

  GROUP_QUOTA_group_chemistry = 100
  GROUP_QUOTA_group_chemistry.lab1 = 40
  GROUP_QUOTA_group_chemistry.lab2 = 80
In this case, a warning is written to the log for the condor_negotiator daemon, and each of the subgroups will have their static quota scaled. In this example, the ratio 100/120 scales each subgroup. lab1 will have a revised (floating point) quota of 33.333 machines, and lab2 will have a revised (floating point) quota of 66.667 machines. As numbers of machines are always integer values, the floating point values are truncated for quota allocation. Fractional remainders resulting from the truncation are summed and assigned to the next higher level within the group hierarchy.

Dynamic Quotas
A dynamic quota specifies a percentage of machines (slots) calculated based on the quota of the next higher level group within the hierarchy. For groups at the top level, a dynamic quota specifies a percentage of machines (slots) that currently exist in the pool. The quota is specified for a group (subgroup, etc.) by a floating point value in range 0.0 to 1.0 (inclusive).

Like static quota specification, there are two cases defined: when the dynamic quotas of all sub groups of a specific group sum to a fraction less than 1.0, and when the dynamic quotas of all sub groups of a specific group sum to greater than 1.0.

Here is an example configuration in which dynamic group quotas are assigned for a single group and its subgroups.

  GROUP_QUOTA_DYNAMIC_group_econ = .6
  GROUP_QUOTA_DYNAMIC_group_econ.project1 = .2
  GROUP_QUOTA_DYNAMIC_group_econ.project2 = .15
  GROUP_QUOTA_DYNAMIC_group_econ.project3 = .2
The sum of dynamic quotas for the subgroups is .55, which is less than 1.0. If the pool has 100 slots, then the project1 subgroup is assigned a quota that equals (100)(.6)(.2) = 12 machines. The project2 subgroup is assigned a quota that equals (100)(.6)(.15) = 9 machines. The project3 subgroup is assigned a quota that equals (100)(.6)(.2) = 12 machines. The 60-33=27 machines unused by the subgroups are assigned for use by job submitters in the parent group_econ group.

If the calculated dynamic quota of the subgroups resulted in non integer numbers of machines, integer numbers of machines are assigned based on the truncation of the non integer dynamic group quota. The unused, surplus quota of machines resulting from fractional remainders resulting from the truncation are summed and assigned to the next higher level within the group hierarchy.

Here is another example configuration in which dynamic group quotas are assigned for a single group and its subgroups.

  GROUP_QUOTA_DYNAMIC_group_stat = .5
  GROUP_QUOTA_DYNAMIC_group_stat.project1 = .4
  GROUP_QUOTA_DYNAMIC_group_stat.project2 = .3
  GROUP_QUOTA_DYNAMIC_group_stat.project3 = .4
In this case, the sum of dynamic quotas for the subgroups is 1.1, which is greater than 1.0 . A warning is written to the log for the condor_negotiator daemon, and each of the subgroups will have their dynamic group quota scaled for this example. .4 becomes .4/1.1=.3636, and .3 becomes .3/1.1=.2727 . If the pool has 100 slots, then each of the project1 and project3 subgroups is assigned a dynamic quota of (100)(.5)(.3636), which is 18.1818 machines. The project2 subgroup is assigned a dynamic quota of (100)(.5)(.2727), which is 13.6364 machines. The quota for each of project1 and project3 results in the truncated amount of 18 machines, and project2 results in the truncated amount of 13 machines, with the 0.1818 + .6364 + .1818 = 1.0 remaining machine assigned to job submitters in the parent group, group_stat.


next up previous contents index
Next: 3.5 Policy Configuration for Up: 3. Administrators' Manual Previous: 3.3 Configuration   Contents   Index
htcondor-admin@cs.wisc.edu