

Time-based fairshare is a new scheduling mode, available for Kubernetes clusters in NVIDIA Run:ai v2.24, that brings time awareness to fair-share scheduling of over-quota resources. The capability, built on the open source KAI Scheduler that powers NVIDIA Run:ai, addresses a long-standing problem with shared GPU infrastructure. Consider two teams sharing a cluster with equal priority: Team A regularly submits smaller jobs, while Team B needs to run a larger job that requires more resources.
Team A's smaller jobs get scheduled as soon as resources free up. Team B's larger job keeps waiting for enough resources to become available, but before that happens, the next small job from Team A grabs the freed capacity. As a result, Team A completes job after job while Team B's job sits in the queue indefinitely, despite both teams having identical entitlements and priority. Time-based fairshare resolves this by giving the scheduler memory. Instead of recalculating fair share from scratch at each point in time, the scheduler now adjusts each queue's fair share based on its past resource consumption.
Queues that have been waiting get a boost, while queues that recently consumed more resources score lower for over-quota allocation. The result is that compute time converges to the intended proportions over days and weeks. This enables true time-sharing of GPU resources, burst access for occasional large jobs, and resource planning against weekly or monthly GPU-hour budgets. Importantly, queue priorities and guaranteed quotas operate exactly as before. This post explains the problem in more depth, walks through a real-world use case, and shows how to enable time-based fairshare in KAI Scheduler and NVIDIA Run:ai.
Why does over-quota GPU resource fairness matter?
Enterprise deployments show that cluster usage becomes significantly more dynamic once organizations move from static GPU allocation to dynamic scheduling. One of the most heavily used resource types is the shared pool of over-quota resources, capacity beyond the guaranteed quotas. Teams routinely burst past their guaranteed amounts, which means researchers get more compute and GPUs stay busy. This is why over-quota fairness is essential: when this shared pool accounts for a significant portion of the cluster's value, it must be divided fairly over time.
Traditional stateless fair-share algorithms divide cluster resources in two phases. First, each queue is allocated its Deserved Quota, the guaranteed resources it is entitled to. This allocation ignores past usage and always happens first; time-based fairshare does not change this behavior. Whatever capacity remains becomes the Over-Quota Pool, a shared surplus that queues compete for according to their weights once deserved quotas are satisfied. This is where point-in-time fairness breaks down.
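The split can be pictured with a small sketch. The function and queue names below are illustrative assumptions, not KAI Scheduler internals; the point is only that the Deserved Quota is satisfied first and the leftover pool is divided purely by weight, with no memory of past usage.

```python
# Minimal sketch of a stateless (point-in-time) fair-share split.
# Names and structure are illustrative, not the KAI Scheduler implementation.

def point_in_time_fair_share(cluster_gpus, queues):
    """queues: dict of name -> {"deserved": gpus, "weight": w}."""
    # Phase 1: deserved quota is always satisfied first and ignores history.
    shares = {name: q["deserved"] for name, q in queues.items()}
    remaining = cluster_gpus - sum(shares.values())

    # Phase 2: the remaining capacity (over-quota pool) is split by weight.
    total_weight = sum(q["weight"] for q in queues.values())
    for name, q in queues.items():
        shares[name] += remaining * q["weight"] / total_weight
    return shares

# Example: 16 GPUs, two queues with 4 GPUs deserved each and weights 3:1.
print(point_in_time_fair_share(16, {
    "team-a": {"deserved": 4, "weight": 3},   # 4 + 8 * 3/4 = 10 GPUs
    "team-b": {"deserved": 4, "weight": 1},   # 4 + 8 * 1/4 = 6 GPUs
}))
```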
Consider two scenarios for queues competing for over-quota resources:
When queues have equal weight, their calculated fair share is identical. When resources free up after a job finishes, both queues are in the same state: both have pending jobs, the same allocation (zero), and the same fair share. With nothing to distinguish them, the scheduler falls back on tie-breakers (queue creation timestamp and alphabetical order), so the same queue wins every time. When queues have different weights, the higher-weight queue correctly receives the larger fair share. However, the point-in-time calculation never checks whether queues actually receive their fair share over time.
For instance, if Queue A has weight 3 and Queue B has weight 1, the scheduler correctly determines that A is entitled to 75% of the over-quota resources (3/4) and B to 25% (1/4). But if Queue B submits many smaller workloads while Queue A submits larger ones, Queue B's small jobs fit under its fair share easily, while Queue A's large workloads push it above its fair share or cannot find enough free capacity at once. Because Queue B appears "underallocated" at each decision point, the scheduler still prefers it, and over time Queue B runs far more workloads than its 25% entitlement. In both scenarios the scheduler has no memory: it does not know that one team just finished a job while the other has been waiting for a long time.
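The drift is easy to reproduce with a toy model. The numbers below are made up for illustration (a 4-GPU over-quota pool, one pending 4-GPU job for Queue A, a steady stream of 1-GPU jobs from Queue B) and do not reflect KAI Scheduler internals, but they show how stateless point-in-time decisions let the small jobs backfill every freed GPU:

```python
# Toy model of the drift above. Pool: 4 over-quota GPUs. Queue A (weight 3,
# 75% entitlement) has one pending 4-GPU job; Queue B (weight 1, 25%) always
# has another 1-GPU job ready. All numbers are made up for illustration.

consumed = {"A": 0, "B": 0}   # over-quota GPU-time actually used per queue
free_gpus = 1                 # a single GPU has just been freed

for decision_point in range(100):
    # Point-in-time view: Queue A looks underallocated (0 of its 3-GPU share),
    # but its 4-GPU job cannot be placed in the currently free capacity ...
    if free_gpus >= 4:
        consumed["A"] += 4
        free_gpus -= 4
    else:
        # ... so one of Queue B's 1-GPU jobs backfills the freed GPU instead.
        consumed["B"] += 1
        free_gpus -= 1
    free_gpus += 1            # another short job finishes before the next decision

print(consumed)               # {'A': 0, 'B': 100}: nowhere near the 75/25 split
```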
Time-based fairshare works by comparing each queue's actual consumption of over-quota resources over a configured time window against the proportion it should have received based on its weight, and then adjusting future allocations accordingly.
For instance, if Queue A has weight 3 and Queue B has weight 1, then 75% of the over-quota resources should go to Queue A and 25% to Queue B. If the scheduler finds that Queue A actually consumed 90% of the over-quota resources while Queue B received only 10%, it increases Queue B's effective weight and decreases Queue A's, balancing future allocations back toward the 75/25 split. Everything else remains unchanged: deserved quotas are still satisfied first, priority ordering still applies, and queue hierarchies function as before. Time-based fairshare only alters how the over-quota pool is distributed.
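As a rough illustration of that adjustment, the sketch below scales each queue's configured weight by the ratio of its target share to the share it actually received over the history window. The formula and names are assumptions for illustration, not the actual KAI Scheduler algorithm:

```python
# Sketch of a time-based adjustment: boost queues that received less than
# their share of over-quota resources in the history window, damp the rest.
# Illustrative only; not the KAI Scheduler's internal formula.

def effective_weights(weights, usage_gpu_hours):
    """weights: name -> configured weight; usage_gpu_hours: name -> over-quota
    usage observed in the history window. Returns adjusted weights."""
    total_w = sum(weights.values())
    total_u = sum(usage_gpu_hours.values()) or 1
    adjusted = {}
    for name, w in weights.items():
        target = w / total_w                        # e.g. 0.75 for Queue A
        actual = usage_gpu_hours[name] / total_u    # e.g. 0.90 observed
        # Scale by target/actual; arbitrary strong boost if a queue got nothing.
        adjusted[name] = w * (target / actual) if actual > 0 else w * 2
    return adjusted

# Queue A (weight 3) consumed 90% of the window; Queue B (weight 1) got 10%.
print(effective_weights({"A": 3, "B": 1}, {"A": 90, "B": 10}))
# {'A': 2.5, 'B': 2.5} -> future allocations tilt back toward the 75/25 split
```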