Salesforce
5 min read

How Data 360 Optimized Kubernetes Scheduling Architecture, Delivering 13% Cost Savings

Summary

The article describes how the Data 360 Compute Fabric team at Salesforce optimized Kubernetes scheduling to improve resource efficiency and reduce costs. The default kube-scheduler behavior left nodes fragmented and relied on reactive autoscaling to clean up afterward. To address this, the team implemented a custom scheduler that uses a MostAllocated approach to consolidate executor pods onto fewer nodes. This improved CPU and memory utilization by approximately 15% and reduced compute costs by 13%. The changes also improved operational reliability: node disruption rates dropped, and Spark applications experienced fewer executor losses and more predictable runtimes.
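
To make the difference between the two scoring strategies concrete, here is a minimal toy simulation in Python. It is a sketch under stated assumptions, not the team's scheduler: the node sizes, pod requests, equal CPU/memory weighting, and every name in the code are hypothetical. It scores each candidate node by its utilization after placing the pod (high utilization wins under MostAllocated, low utilization wins under LeastAllocated) and counts how many nodes end up in use.

```python
from dataclasses import dataclass

MAX_SCORE = 100  # scores are scaled to 0..100 in this sketch


@dataclass
class Node:
    # Hypothetical node model: capacities and usage in CPU cores and GiB.
    name: str
    cpu_capacity: float
    mem_capacity: float
    cpu_used: float = 0.0
    mem_used: float = 0.0

    def fits(self, cpu: float, mem: float) -> bool:
        return (self.cpu_used + cpu <= self.cpu_capacity
                and self.mem_used + mem <= self.mem_capacity)


def utilization_after(node: Node, cpu: float, mem: float) -> float:
    """Mean CPU/memory utilization if the pod were placed on this node.

    Assumption: CPU and memory are weighted equally.
    """
    cpu_frac = (node.cpu_used + cpu) / node.cpu_capacity
    mem_frac = (node.mem_used + mem) / node.mem_capacity
    return (cpu_frac + mem_frac) / 2


def most_allocated(node: Node, cpu: float, mem: float) -> float:
    # Prefer the node that would be fullest: packs pods together.
    return MAX_SCORE * utilization_after(node, cpu, mem)


def least_allocated(node: Node, cpu: float, mem: float) -> float:
    # Prefer the node that would be emptiest: spreads pods out.
    return MAX_SCORE * (1 - utilization_after(node, cpu, mem))


def place(nodes, pods, score) -> int:
    """Place pods one at a time; return the number of nodes in use."""
    for cpu, mem in pods:
        candidates = [n for n in nodes if n.fits(cpu, mem)]
        if not candidates:
            raise RuntimeError("no node can fit this pod")
        # On ties, max() keeps the first candidate in list order.
        best = max(candidates, key=lambda n: score(n, cpu, mem))
        best.cpu_used += cpu
        best.mem_used += mem
    return sum(1 for n in nodes if n.cpu_used > 0)


def simulate(score) -> int:
    # Hypothetical cluster: four 16-core / 64 GiB nodes,
    # four executor pods requesting 4 cores / 16 GiB each.
    nodes = [Node(f"node-{i}", 16, 64) for i in range(4)]
    pods = [(4, 16)] * 4
    return place(nodes, pods, score)


if __name__ == "__main__":
    print("LeastAllocated nodes in use:", simulate(least_allocated))
    print("MostAllocated nodes in use:", simulate(most_allocated))
```

In this toy setup, spreading leaves every node a quarter full, while packing fills a single node and leaves the other three empty. Empty nodes are what an autoscaler can remove without evicting running executors, which is the intuition behind the consolidation described above.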

Key Learnings

  1. Understanding the limitations of the default kube-scheduler in high-scale Spark environments and the need for a custom scheduling strategy.
  2. The trade-off between resource utilization and operational stability, especially in bursty workloads.
  3. The importance of proactive scheduling and intelligent resource placement to minimize costs while maintaining reliability.
  4. How the MostAllocated scoring strategy can effectively eliminate fragmentation and improve resource efficiency.
  5. The impact of scheduling decisions on job-level SLA stability and the overall cost-to-serve.

Who Should Read This

Senior cloud engineers with experience in Kubernetes and distributed systems who focus on optimizing resource utilization and managing costs in large-scale data processing environments.

Test Your Knowledge

What are the core reasons the default kube-scheduler's LeastAllocated strategy is ineffective in a high-scale Spark environment?

How does the MostAllocated approach differ from the default scheduling strategy, and what are its advantages?

What challenges arise from using Karpenter for node consolidation in a Spark workload context?

In what ways does proactive scheduling contribute to both cost efficiency and operational stability?

How can the design of a custom scheduler impact the overall performance of Spark applications?
