Databricks
13 min read

High-Availability Feature Flagging at Databricks

Summary

The article discusses Databricks' in-house feature flagging platform, SAFE, which decouples code deployment from feature enablement, enhancing the reliability and speed of software rollouts across multiple services. It details the architecture of SAFE, which supports over 25,000 active flags and achieves high performance with microsecond-scale latency through strategies such as static dimension pre-evaluation and multi-tiered global delivery. The system incorporates resilience mechanisms to ensure continued operation during delivery pipeline failures, allowing engineers to manage feature rollouts effectively while maintaining operational stability.
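
To make "static dimension pre-evaluation" concrete, here is a minimal sketch of the general idea, not SAFE's actual implementation. It assumes a flag is an ordered list of first-match rules keyed on named dimensions. Some dimensions (for example region or environment) are fixed for the lifetime of a process, while others (for example a workspace ID) are only known per request. The names Rule, pre_evaluate, and evaluate are hypothetical and do not come from the article.

```python
from dataclasses import dataclass
from typing import Any, List, Mapping


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: the flag takes `value` when every condition matches."""
    conditions: Mapping[str, Any]  # dimension name -> required value
    value: Any


def pre_evaluate(rules: List[Rule], static_dims: Mapping[str, Any]) -> List[Rule]:
    """Specialize rules once against dimensions that never change in this process."""
    residual: List[Rule] = []
    for rule in rules:
        # A rule whose static condition disagrees with this process can never match.
        if any(k in static_dims and static_dims[k] != v
               for k, v in rule.conditions.items()):
            continue
        # Static conditions that remain are already satisfied; keep only dynamic ones.
        dynamic_only = {k: v for k, v in rule.conditions.items()
                        if k not in static_dims}
        residual.append(Rule(dynamic_only, rule.value))
        if not dynamic_only:
            break  # unconditional rule: anything after it is unreachable
    return residual


def evaluate(residual: List[Rule], dynamic_dims: Mapping[str, Any], default: Any) -> Any:
    """Per-request hot path: only the remaining dynamic conditions are checked."""
    for rule in residual:
        if all(dynamic_dims.get(k) == v for k, v in rule.conditions.items()):
            return rule.value
    return default


if __name__ == "__main__":
    rules = [
        Rule({"region": "eu", "workspace": "w1"}, True),
        Rule({"region": "us", "workspace": "w2"}, True),
        Rule({"env": "prod"}, False),
    ]
    # Done once at startup (or whenever a new configuration arrives).
    residual = pre_evaluate(rules, {"region": "us", "env": "prod"})
    # Done on every request, against a much smaller rule list.
    print(evaluate(residual, {"workspace": "w2"}, default=None))
```

The design point is that the expensive work of matching against process-wide dimensions happens once per configuration update, so the per-request path only touches conditions that can actually vary.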

Key Learnings

  • SAFE allows for independent feature rollout and binary deployment, enhancing operational safety and flexibility.
  • The architecture employs pre-evaluation of static dimensions to achieve sub-millisecond evaluation latency, crucial for high-throughput services.
  • Multiple layers of resilience, including fail-static behavior and out-of-band delivery, ensure system stability during configuration delivery failures.
  • The integration of a custom DSL for flag configuration allows for complex use cases while maintaining ease of use for engineers.
  • Extensive pre-merge validation processes safeguard against unsafe flag changes, reducing operational risks.
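
The fail-static behavior mentioned in the list above can be illustrated with a small sketch. It assumes a client that holds an in-memory snapshot of flag values and periodically refreshes it from some delivery channel. When a refresh fails, the client keeps serving the last known good snapshot instead of raising an error or reverting to defaults. The class FailStaticFlagStore, its methods, and the fetch callable are hypothetical names used for illustration and are not taken from SAFE.

```python
from typing import Any, Callable, Dict, Mapping


class FailStaticFlagStore:
    """Hypothetical flag store that keeps its last good snapshot on delivery failure."""

    def __init__(self, fetch: Callable[[], Mapping[str, Any]],
                 bootstrap: Mapping[str, Any]) -> None:
        self._fetch = fetch
        # Assumption: a bootstrap snapshot (e.g. shipped with the binary) exists,
        # so flag reads work even before the first successful refresh.
        self._snapshot: Dict[str, Any] = dict(bootstrap)
        self.stale = False

    def refresh(self) -> bool:
        """Try to pull a new snapshot; on any failure, keep the current one."""
        try:
            fresh = self._fetch()
        except Exception:
            # Fail static: leave the snapshot untouched and only record staleness.
            self.stale = True
            return False
        self._snapshot = dict(fresh)
        self.stale = False
        return True

    def get(self, name: str, default: Any) -> Any:
        """Reads never touch the network, so they are unaffected by delivery outages."""
        return self._snapshot.get(name, default)
```

In this sketch a caller would invoke refresh() from a background loop and read flags with get() on the request path. The stale attribute gives monitoring a signal without changing what callers observe.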

Who Should Read This

Senior Software Engineers specializing in feature flagging systems and operational resilience strategies

Test Your Knowledge

What architectural principles underpin the design of the SAFE SDK, and how do they contribute to performance?

How does the separation of configuration delivery from evaluation impact the overall system reliability?

What are the implications of using a custom DSL for flag configuration in terms of usability and complexity?

In what ways does the fail-static approach enhance the resilience of the SAFE system during delivery failures?

What trade-offs were considered when designing the multi-tiered global delivery system for SAFE?
