How Data 360 Optimized Kubernetes Scheduling Architecture, Delivering 13% Cost Savings

Summary

The article discusses how the Data 360 Compute Fabric team at Salesforce optimized Kubernetes scheduling to enhance resource efficiency and reduce costs. By evolving the default kube-scheduler behavior, the team addressed issues such as node fragmentation and the inefficiencies of reactive autoscaling. They implemented a custom scheduler using a MostAllocated approach to consolidate executor pods, which improved CPU and memory utilization by approximately 15% and resulted in a 13% reduction in compute costs. The changes also enhanced operational reliability by reducing node disruption rates and ensuring that Spark applications experienced fewer executor losses and more predictable runtimes.
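
To make the contrast between the two scoring strategies concrete, here is a minimal sketch of how each one ranks candidate nodes for a single pod. It is a simplification that averages CPU and memory with equal weight, not the Data 360 team's scheduler or the kube-scheduler source. The `Node` class, the node sizes, and the pod request are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical node model: allocatable capacity and what is already requested."""
    name: str
    cpu_allocatable: float  # cores
    mem_allocatable: float  # GiB
    cpu_requested: float = 0.0
    mem_requested: float = 0.0


def fits(node: Node, cpu: float, mem: float) -> bool:
    """True if the pod's requests fit in the node's remaining capacity."""
    return (node.cpu_requested + cpu <= node.cpu_allocatable
            and node.mem_requested + mem <= node.mem_allocatable)


def utilization_after(node: Node, cpu: float, mem: float) -> tuple[float, float]:
    """CPU and memory fractions requested if the pod were placed on the node."""
    return ((node.cpu_requested + cpu) / node.cpu_allocatable,
            (node.mem_requested + mem) / node.mem_allocatable)


def most_allocated_score(node: Node, cpu: float, mem: float) -> float:
    """Higher score for fuller nodes, which packs pods together."""
    cpu_frac, mem_frac = utilization_after(node, cpu, mem)
    return 100 * (cpu_frac + mem_frac) / 2


def least_allocated_score(node: Node, cpu: float, mem: float) -> float:
    """Higher score for emptier nodes, which spreads pods out."""
    cpu_frac, mem_frac = utilization_after(node, cpu, mem)
    return 100 * ((1 - cpu_frac) + (1 - mem_frac)) / 2


def pick_node(nodes: list[Node], cpu: float, mem: float, score) -> Node:
    """Filter out nodes that cannot fit the pod, then take the top-scoring one."""
    candidates = [n for n in nodes if fits(n, cpu, mem)]
    return max(candidates, key=lambda n: score(n, cpu, mem))


# Illustrative cluster: one mostly full node and one idle node.
nodes = [
    Node("busy", 16, 64, cpu_requested=12, mem_requested=48),
    Node("idle", 16, 64),
]

# A hypothetical executor pod requesting 2 cores and 8 GiB.
print(pick_node(nodes, 2, 8, most_allocated_score).name)   # busy
print(pick_node(nodes, 2, 8, least_allocated_score).name)  # idle
```

Under these assumptions, the MostAllocated-style score sends the pod to the node that is already busy and leaves the idle node empty, while the LeastAllocated-style score does the opposite.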

Key Learnings

1. Understanding the limitations of the default kube-scheduler in high-scale Spark environments and the need for a custom scheduling strategy.
2. The trade-off between resource utilization and operational stability, especially in bursty workloads.
3. The importance of proactive scheduling and intelligent resource placement to minimize costs while maintaining reliability.
4. How the MostAllocated scoring strategy can effectively eliminate fragmentation and improve resource efficiency.
5. The impact of scheduling decisions on job-level SLA stability and the overall cost-to-serve.
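
The fragmentation point in the list above can be illustrated with a toy placement loop. This sketch assumes identical pods, identical nodes, CPU as the only resource, and a first-in-list tie-break. All of these are simplifications for illustration, and the function name `place_pods` and the cluster sizes are hypothetical rather than taken from the article.

```python
def place_pods(num_nodes: int, node_cpu: float, num_pods: int,
               pod_cpu: float, prefer_full: bool) -> list[float]:
    """Greedily place identical pods and return the CPU requested on each node.

    prefer_full=True mimics a MostAllocated-style preference (pack pods);
    prefer_full=False mimics a LeastAllocated-style preference (spread pods).
    """
    requested = [0.0] * num_nodes
    for _ in range(num_pods):
        # Keep only nodes with enough remaining CPU for the pod.
        candidates = [i for i in range(num_nodes)
                      if requested[i] + pod_cpu <= node_cpu]
        if not candidates:
            raise RuntimeError("no node has room for the pod")

        def score(i: int) -> float:
            used = (requested[i] + pod_cpu) / node_cpu
            return used if prefer_full else 1 - used

        # max() returns the first best candidate, so ties go to the lowest index.
        best = max(candidates, key=score)
        requested[best] += pod_cpu
    return requested


# Twelve 2-core pods on four 8-core nodes (illustrative numbers).
packed = place_pods(4, 8, 12, 2, prefer_full=True)
spread = place_pods(4, 8, 12, 2, prefer_full=False)

print(packed)  # [8.0, 8.0, 8.0, 0.0]
print(spread)  # [6.0, 6.0, 6.0, 6.0]
print(packed.count(0.0), spread.count(0.0))  # 1 0
```

In this toy run, packing fills three nodes completely and leaves one node empty, so that node could be removed without evicting any pod. Spreading leaves every node at 75% with no node free to remove.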

Who Should Read This

Senior Cloud Engineers with experience in Kubernetes and distributed systems, focusing on optimizing resource utilization and cost management in large-scale data processing environments.

Test Your Knowledge

What are the core reasons the default kube-scheduler's LeastAllocated strategy is ineffective in a high-scale Spark environment?

How does the MostAllocated approach differ from the default scheduling strategy, and what are its advantages?

What challenges arise from using Karpenter for node consolidation in a Spark workload context?

In what ways does proactive scheduling contribute to both cost efficiency and operational stability?

How can the design of a custom scheduler impact the overall performance of Spark applications?

Topics

High Availability, Load Shedding, Service Discovery, Sharding, Replication

Read Full Article at Salesforce
