🚀 Introduction
Modern data platforms often struggle with running data lakes and data warehouses as two separate systems. Databricks addresses this with the Lakehouse architecture, which combines both into one unified system.
This post explains how the Databricks architecture works at a high level.
🏗️ Core Architecture Overview
Databricks splits its architecture into two planes:
1️⃣ Control Plane
- Managed by Databricks (SaaS)
- Handles:
- User authentication
- Workspace management
- Job scheduling
- Metadata storage
👉 Think of it as the brain of Databricks
2️⃣ Compute Plane
- Runs in your cloud account (AWS / Azure / GCP) when you use classic compute; serverless compute runs in Databricks' account instead
- Handles:
- Data processing
- Spark execution
- Cluster workloads
👉 Think of it as the engine that processes data
🧱 Key Components
✅ Workspaces
- Collaborative environment where users run notebooks and jobs
✅ Clusters
- Compute resources to process data (auto-scaling supported)
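To make the auto-scaling idea concrete, here is a minimal sketch of a cluster definition with a worker range. The field names (`cluster_name`, `spark_version`, `node_type_id`, `autoscale`, `autotermination_minutes`) follow the Databricks Clusters REST API as I understand it. The helper function, the cluster name, and the placeholder values are assumptions for illustration, so check them against your own workspace.

```python
import json


def autoscaling_cluster_spec(name, min_workers, max_workers):
    """Build a cluster definition that can scale between two worker counts.

    Hypothetical helper: this function is not part of any Databricks SDK.
    """
    if not 0 <= min_workers <= max_workers:
        raise ValueError("expected 0 <= min_workers <= max_workers")
    return {
        "cluster_name": name,
        # Placeholders: pick a runtime version and node type available in your workspace.
        "spark_version": "<runtime-version>",
        "node_type_id": "<node-type>",
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # Assumed value: shut the cluster down after 30 idle minutes.
        "autotermination_minutes": 30,
    }


spec = autoscaling_cluster_spec("demo-cluster", 2, 8)
print(json.dumps(spec, indent=2))
```

A definition like this is what you would typically hand to the cluster creation UI, API, or a job configuration; the control plane stores it and the compute plane provisions the workers.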
✅ Delta Lake
- Storage layer providing:
- ACID transactions
- Schema enforcement
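Schema enforcement is easiest to see on an append. The sketch below wraps two write patterns: a plain append, which Delta Lake rejects when the incoming columns do not match the table, and an append with the `mergeSchema` option, which allows new columns to be added. The function names are my own and the table path is whatever you pass in; nothing here is specific to a particular workspace.

```python
def append_strict(df, path):
    """Append df to the Delta table at path.

    Delta Lake compares the incoming schema with the table schema and
    fails the write if they do not match (schema enforcement).
    """
    df.write.format("delta").mode("append").save(path)


def append_with_schema_evolution(df, path):
    """Append df and allow new columns to be added to the table schema."""
    (
        df.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")
        .save(path)
    )
```

In a notebook, calling `append_strict` with a DataFrame that carries an extra column would be expected to fail with a schema mismatch error, while `append_with_schema_evolution` lets the new column through.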
✅ Unity Catalog
- Central governance (security, access control, lineage)
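As a rough illustration of access control, Unity Catalog addresses tables with a three-level name (`catalog.schema.table`) and manages permissions with SQL `GRANT` statements. The helper below only builds such a statement as a string. The catalog, schema, table, and group names are hypothetical, and the helper itself is not a Databricks API.

```python
def grant_select(catalog, schema, table, principal):
    """Return a GRANT statement for a table in Unity Catalog's three-level namespace."""
    for part in (catalog, schema, table):
        # Keep the sketch simple: accept only plain identifiers.
        if not part.isidentifier():
            raise ValueError(f"unexpected identifier: {part!r}")
    if "`" in principal:
        raise ValueError("principal must not contain backticks")
    return f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO `{principal}`"


stmt = grant_select("main", "default", "users", "analysts")
print(stmt)
# In a Unity Catalog-enabled notebook this could be run with: spark.sql(stmt)
```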
🔄 Data Flow in Databricks
- Data ingested from source (APIs, logs, databases)
- Stored in cloud storage (S3, ADLS, GCS)
- Processed via Spark clusters
- Stored as Delta Tables
- Consumed by:
- BI dashboards
- ML models
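The flow above can be sketched as one small function. This is a simplified outline, assuming the raw data has already landed in cloud storage as JSON and contains an `age` column; the paths in the usage note are made up. The `spark` argument is the session that a Databricks notebook provides.

```python
def run_pipeline(spark, raw_path, delta_path):
    """Read raw files, clean them, store them as a Delta table, and return the result."""
    # Steps 1-2: raw data already sits in cloud storage (assumed to be JSON here).
    raw = spark.read.format("json").load(raw_path)

    # Step 3: process on the Spark cluster (drop duplicates and rows with no age).
    cleaned = raw.dropDuplicates().filter("age IS NOT NULL")

    # Step 4: store the result as a Delta table.
    cleaned.write.format("delta").mode("overwrite").save(delta_path)

    # Step 5: hand the Delta table to consumers such as dashboards or ML code.
    return spark.read.format("delta").load(delta_path)
```

In a notebook this might be called as `run_pipeline(spark, 's3://example-bucket/raw/users/', '/mnt/delta/users_clean')`, where both locations are placeholders.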
🎯 Conclusion
The Databricks architecture simplifies big data work by combining storage, processing, and governance into a single unified platform.
Databricks uses a managed control plane for workspaces, jobs, and security, while Spark clusters in the compute plane process data stored in Delta Lake and governed by Unity Catalog.
Startup code: notebook setup
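# Common imports
from pyspark.sql import functions as F
from pyspark.sql.types import *
# Helpful session configs for interactive work
# Shuffle partition count (200 is Spark's default; tune it to your data volume)
spark.conf.set('spark.sql.shuffle.partitions', '200')
# Delta write optimizations: better-sized files on write and automatic compaction of small files
spark.conf.set('spark.databricks.delta.optimizeWrite.enabled', 'true')
spark.conf.set('spark.databricks.delta.autoCompact.enabled', 'true')
# `spark` is the SparkSession that Databricks notebooks provide
print('Spark version:', spark.version)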
Converted notes
- Control plane manages notebooks, jobs, workspace collaboration, and security settings.
- Compute plane runs Spark workloads on clusters or SQL warehouses.
- Delta Lake provides reliable ACID storage and versioning, while Unity Catalog centralizes governance and access control.
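Versioning in Delta Lake means every write is recorded as a new table version that can be read back later. Here is a minimal sketch, assuming a Delta table already exists at the path you pass in. `versionAsOf` is the Delta reader option for time travel and `DESCRIBE HISTORY` lists the recorded commits; the helper names are my own.

```python
def read_version(spark, path, version):
    """Load the Delta table at path as it looked at a given table version."""
    return spark.read.format("delta").option("versionAsOf", version).load(path)


def history_sql(path):
    """Return a SQL statement that lists the commits recorded for a Delta table."""
    return f"DESCRIBE HISTORY delta.`{path}`"


print(history_sql("/mnt/delta/users"))
```

In a notebook, `read_version(spark, '/mnt/delta/users', 0)` would return the table as first written, and `spark.sql(history_sql('/mnt/delta/users'))` would show which versions exist.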
Core code examples
Create a DataFrame
data = [('Alice', 25), ('Bob', 30)]
df = spark.createDataFrame(data, ['name', 'age'])
df.show()
Write to Delta Lake
# Parentheses let the method chain span several lines
(
    df.write.format('delta')
    .mode('overwrite')
    .save('/mnt/delta/users')
)
Read from Delta and query with SQL
df = spark.read.format('delta').load('/mnt/delta/users')
df.createOrReplaceTempView('users')
spark.sql('SELECT * FROM users WHERE age > 25').show()