Building a Multi Region Compliant Customer Data Lake at Scale

Summary

The article outlines Atlassian's approach to building a unified customer data lake to address limitations in analytics and data export as their cloud customer base expanded. It details the challenges faced, including the need for instant analytics, secure data export paths, and the strain that analytical workloads placed on operational databases. The solution applied logical replication for change data capture (CDC) across product teams within a decentralized data mesh architecture. The article further explains the design of a streaming ETL pipeline built on Apache Flink and Delta Lake, emphasizing the importance of schema evolution and the collaborative process between product and data modeling teams.

Key Learnings

1. A unified customer data lake can significantly enhance analytics capabilities and data export processes in multi-tenant environments.
2. Logical replication for change data capture provides technology independence and stability, allowing product teams to manage their data as a product.
3. Implementing a streaming ETL pipeline with tools like Apache Flink and Delta Lake enables near real-time data processing and analytics.
4. The use of Protobuf for event schema ensures backward compatibility and efficient data handling across diverse systems.
5. Collaborative schema management between product teams and data engineers is crucial for maintaining data integrity and usability in analytics.
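
As a rough illustration of the backward-compatibility point in the fourth learning, the sketch below uses plain Python dataclasses rather than Protobuf itself. The event name, its fields, and the `decode` helper are hypothetical and do not come from the article; the sketch only mirrors the general idea that a field added later carries a default, so older payloads still decode, and that a reader can skip fields it does not know.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping


@dataclass(frozen=True)
class IssueChanged:
    """Hypothetical CDC event; names are illustrative, not from the article."""
    issue_id: str
    status: str
    # Field added in a later schema revision. The default keeps payloads
    # written before the change decodable, similar in spirit to how a
    # Protobuf reader sees a default value for an absent field.
    priority: str = ""


def decode(payload: Mapping[str, Any]) -> IssueChanged:
    """Build an event from a payload written by an older or newer producer."""
    known = {f.name for f in fields(IssueChanged)}
    # Fields this reader does not know about are dropped;
    # fields missing from the payload fall back to their defaults.
    return IssueChanged(**{k: v for k, v in payload.items() if k in known})


# Payload from a producer that predates the `priority` field.
old_event = decode({"issue_id": "A-1", "status": "open"})

# Payload from a producer that is ahead of this reader (extra `sprint` field).
new_event = decode(
    {"issue_id": "A-2", "status": "done", "priority": "high", "sprint": 7}
)
```

Both payloads decode with the same reader, which is the property that lets producers and consumers upgrade independently.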

Who Should Read This

Senior Data Engineers designing scalable data lakes and analytics solutions in cloud environments

Test Your Knowledge

What are the trade-offs between logical and physical replication for change data capture in a multi-product environment?

How does the implementation of a data mesh architecture influence data ownership and governance across teams?

What challenges might arise when evolving schemas in a streaming ETL pipeline, and how can they be mitigated?

Why was Apache Flink chosen for the streaming ETL pipeline, and what advantages does it offer for processing event streams?

How does the 'last write wins' versioning strategy impact data consistency in a distributed system?
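
For readers working through the last question, here is a minimal sketch of a "last write wins" merge over change events. It assumes each event carries a per-key version number that increases with every write; the `ChangeEvent` type and both helper functions are hypothetical and are not taken from the article or from any Flink or Delta Lake API.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical change event; the shape is an assumption for illustration."""
    key: str
    version: int            # assumed to increase with every write to `key`
    value: Optional[dict]   # None marks a delete (tombstone)


def apply_last_write_wins(events: Iterable[ChangeEvent]) -> Dict[str, ChangeEvent]:
    """Keep only the highest-versioned event per key, whatever the arrival order."""
    latest: Dict[str, ChangeEvent] = {}
    for ev in events:
        current = latest.get(ev.key)
        # A late-arriving event with an older version is ignored.
        if current is None or ev.version > current.version:
            latest[ev.key] = ev
    return latest


def materialize(latest: Dict[str, ChangeEvent]) -> Dict[str, dict]:
    """Drop tombstones to get the visible rows."""
    return {k: ev.value for k, ev in latest.items() if ev.value is not None}


# Events arrive out of order: version 1 of "a" shows up after version 2,
# and a stale update to "b" shows up after its delete.
events = [
    ChangeEvent("a", 2, {"status": "done"}),
    ChangeEvent("a", 1, {"status": "open"}),
    ChangeEvent("b", 1, {"status": "open"}),
    ChangeEvent("b", 3, None),
    ChangeEvent("b", 2, {"status": "in progress"}),
]
rows = materialize(apply_last_write_wins(events))
```

Tombstones stay in the intermediate map so that a stale update cannot resurrect a deleted row. The trade-off is that concurrent writes to the same key are not merged: whichever carries the higher version replaces the other entirely.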

Topics

Data Lake, ETL Pipelines, Data Governance, Data Quality, Data Warehousing

Read Full Article at Atlassian
