Salesforce
How Data 360 Optimized Kubernetes Scheduling Architecture, Delivering 13% Cost Savings
Summary
The article discusses how the Data 360 Compute Fabric team at Salesforce optimized Kubernetes scheduling to enhance resource efficiency and reduce costs. By evolving the default kube-scheduler behavior, the team addressed issues such as node fragmentation and the inefficiencies of reactive autoscaling. They implemented a custom scheduler using a MostAllocated approach to consolidate executor pods, which improved CPU and memory utilization by approximately 15% and resulted in a 13% reduction in compute costs. The changes also enhanced operational reliability by reducing node disruption rates and ensuring that Spark applications experienced fewer executor losses and more predictable runtimes.
Key Learnings
1. Understanding the limitations of the default kube-scheduler in high-scale Spark environments and the need for a custom scheduling strategy.
2. The trade-off between resource utilization and operational stability, especially in bursty workloads.
3. The importance of proactive scheduling and intelligent resource placement to minimize costs while maintaining reliability.
4. How the MostAllocated scoring strategy can effectively eliminate fragmentation and improve resource efficiency.
5. The impact of scheduling decisions on job-level SLA stability and the overall cost-to-serve.
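To make the difference between the two scoring strategies concrete, here is a minimal Python sketch. It is not the Data 360 team's scheduler, whose details the summary does not give. The scoring functions are a simplified approximation of the kube-scheduler NodeResourcesFit LeastAllocated and MostAllocated strategies (equal weights for CPU and memory, floating-point instead of integer scores), and the node sizes, pod sizes, and helper names are all hypothetical.

```python
from dataclasses import dataclass

MAX_NODE_SCORE = 100


@dataclass
class Node:
    name: str
    cpu_capacity: float  # cores (hypothetical size)
    mem_capacity: float  # GiB (hypothetical size)
    cpu_used: float = 0.0
    mem_used: float = 0.0

    def fits(self, cpu: float, mem: float) -> bool:
        """Filter step: the pod's requests must fit in the remaining capacity."""
        return (self.cpu_used + cpu <= self.cpu_capacity
                and self.mem_used + mem <= self.mem_capacity)


def utilization_after(node: Node, cpu: float, mem: float) -> float:
    """Mean requested fraction of CPU and memory if the pod landed on this node."""
    cpu_frac = (node.cpu_used + cpu) / node.cpu_capacity
    mem_frac = (node.mem_used + mem) / node.mem_capacity
    return (cpu_frac + mem_frac) / 2


def least_allocated(node: Node, cpu: float, mem: float) -> float:
    """Simplified default behavior: prefer the emptiest node (spreads pods)."""
    return MAX_NODE_SCORE * (1 - utilization_after(node, cpu, mem))


def most_allocated(node: Node, cpu: float, mem: float) -> float:
    """Simplified bin-packing behavior: prefer the fullest node that still fits."""
    return MAX_NODE_SCORE * utilization_after(node, cpu, mem)


def schedule(nodes, pods, score):
    """Place each pod on the highest-scoring feasible node (ties: first node wins)."""
    for cpu, mem in pods:
        candidates = [n for n in nodes if n.fits(cpu, mem)]
        if not candidates:
            raise RuntimeError("no node can fit this pod")
        best = max(candidates, key=lambda n: score(n, cpu, mem))
        best.cpu_used += cpu
        best.mem_used += mem
    return nodes


def nodes_in_use(nodes) -> int:
    return sum(1 for n in nodes if n.cpu_used > 0)


def make_cluster(count: int = 4):
    return [Node(f"node-{i}", cpu_capacity=16, mem_capacity=64) for i in range(count)]


# Eight hypothetical executor pods, each requesting 4 cores and 16 GiB.
executors = [(4, 16)] * 8

spread = schedule(make_cluster(), executors, least_allocated)
packed = schedule(make_cluster(), executors, most_allocated)

print("LeastAllocated nodes in use:", nodes_in_use(spread))
print("MostAllocated nodes in use:", nodes_in_use(packed))
```

In this toy cluster, the LeastAllocated variant leaves every node half full, while the MostAllocated variant fills two nodes completely and leaves two empty, which an autoscaler could then remove. A real scheduler also has to weigh the stability trade-off named in the list above, which this sketch does not model.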
Who Should Read This
Senior Cloud Engineers with experience in Kubernetes and distributed systems, focusing on optimizing resource utilization and cost management in large-scale data processing environments.
Test Your Knowledge
What are the core reasons the default kube-scheduler's LeastAllocated strategy is ineffective in a high-scale Spark environment?
How does the MostAllocated approach differ from the default scheduling strategy, and what are its advantages?
What challenges arise from using Karpenter for node consolidation in a Spark workload context?
In what ways does proactive scheduling contribute to both cost efficiency and operational stability?
How can the design of a custom scheduler impact the overall performance of Spark applications?