Salesforce
9 min read

Migration at Scale: Moving Marketing Cloud Caching from Memcached to Redis at 1.5M RPS Without Downtime

Summary

The article describes the migration of the Marketing Cloud caching infrastructure from Memcached to Redis, completed with zero downtime while the system handled approximately 1.5 million cache events per second. It covers the main challenges, including high-availability, security, and performance requirements, and the Dynamic Cache Router built to shift traffic between the two caching systems. The migration required careful planning to preserve behavioral parity and operational control, with the goal of keeping latency low and performance metrics stable throughout. The article also explains how hot keys were proactively identified and mitigated to keep the Redis Cluster reliable under production load.
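The summary mentions a Dynamic Cache Router that moves traffic between Memcached and Redis without application code changes. The article's actual design is not reproduced here; what follows is a minimal Python sketch under stated assumptions. The class names, the `read_percent_to_target` knob, stable hash bucketing, and dual writes are all hypothetical choices for illustration, and `InMemoryBackend` stands in for real Memcached and Redis clients.

```python
import zlib
from typing import Dict, Optional, Protocol


class CacheBackend(Protocol):
    """Minimal interface both cache clients are assumed to expose (hypothetical)."""

    def get(self, key: str) -> Optional[bytes]: ...
    def set(self, key: str, value: bytes) -> None: ...


class InMemoryBackend:
    """Stand-in for a real Memcached or Redis client, for illustration only."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self.data.get(key)

    def set(self, key: str, value: bytes) -> None:
        self.data[key] = value


class DynamicCacheRouter:
    """Hypothetical router: dual-writes to both caches and shifts reads by percentage."""

    def __init__(self, legacy: CacheBackend, target: CacheBackend,
                 read_percent_to_target: int = 0, dual_write: bool = True) -> None:
        self.legacy = legacy
        self.target = target
        self.read_percent_to_target = read_percent_to_target  # 0..100
        self.dual_write = dual_write

    def _bucket(self, key: str) -> int:
        # Stable hash: a given key always maps to the same 0-99 bucket,
        # so its reads stay on one backend until the percentage changes.
        return zlib.crc32(key.encode("utf-8")) % 100

    def _reads_from_target(self, key: str) -> bool:
        return self._bucket(key) < self.read_percent_to_target

    def get(self, key: str) -> Optional[bytes]:
        backend = self.target if self._reads_from_target(key) else self.legacy
        return backend.get(key)

    def set(self, key: str, value: bytes) -> None:
        # Writing to both systems keeps the new cache warm, so reads can be
        # ramped up (or rolled back) without a cold-cache penalty.
        self.legacy.set(key, value)
        if self.dual_write:
            self.target.set(key, value)
```

In this sketch, raising `read_percent_to_target` from 0 to 100 in steps moves reads to Redis gradually, and setting it back to 0 is an immediate rollback. Because the router exposes the same `get`/`set` shape as the underlying clients, callers would not need to change.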

Key Learnings

  • Achieving zero-downtime migration requires careful planning and the implementation of a Dynamic Cache Router to manage traffic between legacy and new systems.
  • Maintaining behavioral parity during migration is critical to avoid application disruptions, necessitating a compatibility layer for TTL and key handling.
  • Proactive hot-key detection and mitigation strategies are essential for maintaining performance and stability in high-throughput caching scenarios.
  • Using real production traffic for validation is crucial to accurately assess performance metrics and ensure that the new system meets operational expectations.
  • Segregating read and write traffic can significantly improve overall cluster stability and performance, especially under high load.
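The second learning mentions a compatibility layer for TTL handling. One concrete parity gap, assuming the application passes raw Memcached expiration values through: Memcached treats an `exptime` of 0 as "never expire", a negative value as already expired, and any value above 30 days (2,592,000 seconds) as an absolute Unix timestamp, whereas Redis `SET ... EX` expects a positive relative number of seconds. The function below is a hypothetical sketch of such a mapping, not the article's implementation; its name and return convention are illustrative.

```python
import math
import time
from typing import Optional, Tuple

# Memcached treats exptime values above 30 days as absolute Unix timestamps.
MEMCACHED_RELATIVE_TTL_LIMIT = 60 * 60 * 24 * 30  # 2,592,000 seconds


def memcached_exptime_to_redis_ttl(
    exptime: int, now: Optional[float] = None
) -> Tuple[str, Optional[int]]:
    """Hypothetical TTL shim: map a Memcached exptime onto a Redis expiry decision.

    Returns one of:
      ("persist", None)  -> store without expiry (plain SET)
      ("ex", seconds)    -> store with a relative TTL (SET ... EX seconds)
      ("expired", None)  -> value is already expired (DEL the key)
    """
    if now is None:
        now = time.time()
    if exptime == 0:
        # 0 means "never expire" in Memcached.
        return ("persist", None)
    if exptime < 0:
        # Memcached treats a negative exptime as immediately expired.
        return ("expired", None)
    if exptime <= MEMCACHED_RELATIVE_TTL_LIMIT:
        # Relative offset in seconds; passes through unchanged.
        return ("ex", exptime)
    # Absolute timestamp: convert to remaining seconds, rounding up so a
    # fractional remainder is not turned into an invalid EX 0.
    remaining = math.ceil(exptime - now)
    if remaining <= 0:
        return ("expired", None)
    return ("ex", remaining)
```

A caller would translate the result into a plain `SET` for `persist`, `SET key value EX n` for `ex`, and `DEL key` for `expired`. Deleting rather than skipping the write matters because a Memcached set with a past expiry also discards any older value stored under that key.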

Who Should Read This

Senior Data Engineers managing high-throughput caching solutions in production environments

Test Your Knowledge

What are the implications of Memcached's lack of replication, compared with Redis's built-in primary-replica replication, during a migration?

How does the Dynamic Cache Router facilitate a seamless transition between caching systems without requiring application code changes?

What specific challenges arise from maintaining Time-to-Live (TTL) semantics when migrating from Memcached to Redis?

In what ways can proactive hot-key detection improve the performance of a Redis Cluster under high request rates?

Why is it important to validate performance metrics using real production traffic rather than synthetic benchmarks during a migration?

More from Salesforce Engineering

Salesforce
6m

Engineering Platform Trust: Cutting Customer Case Volume 20x with Petabyte-Scale Health Signals

The article details the development of a Technical Health Score system at Salesforce, aimed at quantifying platform trust through analytics pipelines that handle petabytes of telemetry data. By...

Salesforce
5m

How Data 360 Optimized Kubernetes Scheduling Architecture, Delivering 13% Cost Savings

The article discusses how the Data 360 Compute Fabric team at Salesforce optimized Kubernetes scheduling to enhance resource efficiency and reduce costs. By evolving the default kube-scheduler...

Salesforce
6m

Delivering Accurate, Low-Latency Voice-to-Form AI in Real-World Field Conditions

The article explores the development of a hybrid architecture for a voice-to-form AI system used in field service applications. It highlights the integration of on-device speech-to-text capabilities...

Salesforce
7m

Hyperforce Migration at Scale: How Deterministic Automation Replaced Manual Spreadsheets Across 95,000 Organizations

The article outlines the development of the Migration Intake and Processing Service (MIPS) at Salesforce, which automates the migration of over 95,000 organizations to Hyperforce. It highlights the...

Salesforce
5m

Building an AI-Accelerated Compliance Automation Platform for 24x Faster Audits

The article outlines the development of FastTrack, a compliance automation platform by Salesforce, which significantly reduces audit execution time through AI-assisted development and API-based...