Building a Multi Region Compliant Customer Data Lake at Scale
Summary
The article outlines Atlassian's approach to building a unified customer data lake to address limitations in analytics and data export as their cloud customer base expanded. It details the challenges faced, including the need for instant analytics, secure data export paths, and the strain of analytical workloads on operational databases. The solution involved implementing a logical replication strategy for change data capture (CDC) across various product teams, utilizing a decentralized data mesh architecture. The article further explains the design of a streaming ETL pipeline using Apache Flink and Delta Lake, emphasizing the importance of schema evolution and the collaborative process between product and data modeling teams.
Key Learnings
- A unified customer data lake can significantly enhance analytics capabilities and data export processes in multi-tenant environments.
- Logical replication for change data capture provides technology independence and stability, allowing product teams to manage their data as a product.
- Implementing a streaming ETL pipeline with tools like Apache Flink and Delta Lake enables near real-time data processing and analytics.
- The use of Protobuf for event schema ensures backward compatibility and efficient data handling across diverse systems.
- Collaborative schema management between product teams and data engineers is crucial for maintaining data integrity and usability in analytics.
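The backward-compatibility point can be illustrated with a minimal sketch. It is plain Python, not the Protobuf library and not Atlassian's code: it only mimics the two rules that make additive schema changes safe, namely that missing fields fall back to defaults and unknown fields are ignored by older readers. The event shape (`IssueEvent`, `issue_id`, `title`, `priority`) is hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass
class IssueEvent:
    # Hypothetical event shape, used only for illustration.
    issue_id: str = ""
    title: str = ""
    priority: str = ""  # field added in a later schema version; defaults when absent


def decode(payload: Dict[str, Any]) -> IssueEvent:
    """Build an event from a dict, ignoring fields this reader does not know."""
    known = {f.name for f in fields(IssueEvent)}
    return IssueEvent(**{k: v for k, v in payload.items() if k in known})


# Payload written by an older producer: no "priority" field.
old_payload = {"issue_id": "I-1", "title": "Fix login"}

# Payload written by a newer producer: carries a "labels" field this reader lacks.
newer_payload = {
    "issue_id": "I-2",
    "title": "Add SSO",
    "priority": "high",
    "labels": ["auth"],
}

old_event = decode(old_payload)
newer_event = decode(newer_payload)
```

Under these assumptions, producers and consumers can be upgraded independently as long as schema changes stay additive; renaming or retyping an existing field would still break readers.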
Who Should Read This
Senior Data Engineers designing scalable data lakes and analytics solutions in cloud environments
Test Your Knowledge
What are the trade-offs between logical and physical replication for change data capture in a multi-product environment?
How does the implementation of a data mesh architecture influence data ownership and governance across teams?
What challenges might arise when evolving schemas in a streaming ETL pipeline, and how can they be mitigated?
Why was Apache Flink chosen for the streaming ETL pipeline, and what advantages does it offer for processing event streams?
How does the 'last write wins' versioning strategy impact data consistency in a distributed system?
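As a starting point for the last question, here is a minimal sketch of a "last write wins" merge over change events. It assumes each event carries a monotonically increasing per-key version, which this summary does not confirm; the names `ChangeEvent`, `apply_event`, and `materialize` are hypothetical and do not come from the article.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ChangeEvent:
    key: str
    version: int              # assumed to increase with every write to the key
    payload: Optional[dict]   # None marks a delete (tombstone)


def apply_event(table: Dict[str, ChangeEvent], event: ChangeEvent) -> bool:
    """Apply an event only if it is newer than what is stored; return True if applied."""
    current = table.get(event.key)
    if current is not None and current.version >= event.version:
        return False  # stale or duplicate event: the later write has already won
    table[event.key] = event  # tombstones are stored too, so old updates cannot resurrect a row
    return True


def materialize(table: Dict[str, ChangeEvent]) -> Dict[str, dict]:
    """Return the live rows, hiding deleted keys."""
    return {k: e.payload for k, e in table.items() if e.payload is not None}


# Events arrive out of order, as they can in a distributed pipeline.
events = [
    ChangeEvent("a", 2, {"status": "done"}),
    ChangeEvent("a", 1, {"status": "open"}),   # late arrival, ignored
    ChangeEvent("b", 1, {"status": "open"}),
    ChangeEvent("b", 3, None),                 # delete
    ChangeEvent("b", 2, {"status": "wip"}),    # late arrival, must not undo the delete
]

table: Dict[str, ChangeEvent] = {}
applied = [apply_event(table, e) for e in events]
rows = materialize(table)
```

The merge converges to the same state regardless of arrival order, but it does so by silently discarding the losing write, which is the consistency trade-off the question asks about.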