Tuesday, 9 June 2026

Databricks Architecture Explained – A Complete Beginner Guide


🚀 Introduction

Modern data platforms often struggle because data lakes and data warehouses are kept as separate systems. Databricks addresses this with the Lakehouse Architecture, which combines both into one unified system.

This post explains how the Databricks architecture works at a high level.





🏗️ Core Architecture Overview

Databricks splits its architecture into two planes:

1️⃣ Control Plane

  • Managed by Databricks (SaaS)
  • Handles:
    • User authentication
    • Workspace management
    • Job scheduling
    • Metadata storage 

👉 Think of it as the brain of Databricks


2️⃣ Compute Plane

  • Runs in your cloud account (AWS / Azure / GCP) in the classic setup; serverless compute runs in Databricks' account instead
  • Handles:
    • Data processing
    • Spark execution
    • Cluster workloads 

👉 Think of it as the engine that processes data
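
The split between the two planes can be pictured as a simple lookup. The sketch below is only a mental model in plain Python, not a Databricks API: `PLANE_RESPONSIBILITIES` and `plane_for` are hypothetical names, and the entries are just the responsibilities listed above.

```python
# Toy model of the two planes; the names here are illustrative, not Databricks APIs.
PLANE_RESPONSIBILITIES = {
    "control": {
        "user authentication",
        "workspace management",
        "job scheduling",
        "metadata storage",
    },
    "compute": {
        "data processing",
        "spark execution",
        "cluster workloads",
    },
}


def plane_for(responsibility: str) -> str:
    """Return the plane that handles a responsibility in this toy model."""
    wanted = responsibility.strip().lower()
    for plane, duties in PLANE_RESPONSIBILITIES.items():
        if wanted in duties:
            return plane
    raise KeyError(f"unknown responsibility: {responsibility!r}")


print(plane_for("Job scheduling"))   # control
print(plane_for("Spark execution"))  # compute
```

The point of the lookup is that a job is scheduled in one place (the brain) and executed in another (the engine).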


🧱 Key Components

✅ Workspaces

  • Collaborative environment where users run notebooks and jobs 

✅ Clusters

  • Compute resources to process data (auto-scaling supported)

✅ Delta Lake

  • Storage layer providing:
    • ACID transactions
    • Schema enforcement 
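
To make schema enforcement concrete, here is a minimal plain-Python sketch. It is a toy, not how Delta Lake is implemented: the `ToyTable` class is a hypothetical stand-in that simply rejects rows whose columns or types do not match the declared schema.

```python
# Toy illustration of schema enforcement; this is NOT Delta Lake's implementation.
class ToyTable:
    def __init__(self, schema: dict):
        # schema maps column name -> expected Python type, e.g. {"name": str, "age": int}
        self.schema = schema
        self.rows = []

    def append(self, row: dict) -> None:
        """Append a row only if its columns and types match the schema."""
        if set(row) != set(self.schema):
            raise ValueError(
                f"columns {sorted(row)} do not match schema {sorted(self.schema)}"
            )
        for column, expected in self.schema.items():
            if not isinstance(row[column], expected):
                raise TypeError(f"column {column!r} expects {expected.__name__}")
        self.rows.append(row)


users = ToyTable({"name": str, "age": int})
users.append({"name": "Alice", "age": 25})      # matches the schema: accepted

try:
    users.append({"name": "Bob", "age": "30"})  # age is a string: rejected
except TypeError as err:
    print("rejected:", err)

print(len(users.rows))  # 1
```

The bad row never reaches the table, which is the behavior schema enforcement is meant to guarantee.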

✅ Unity Catalog

  • Central governance (security, access control, lineage)
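
Central access control can be pictured as one shared table of grants that every query is checked against. The sketch below is a plain-Python mental model, not the Unity Catalog API: `GRANTS`, `is_allowed`, the group names, and the table name `main.sales.orders` are all made up for illustration (the name follows the `catalog.schema.table` pattern Unity Catalog uses).

```python
# Toy access-control check; a mental model only, not the Unity Catalog API.
GRANTS = {
    # (principal, table) -> privileges; every value here is made up.
    ("analysts", "main.sales.orders"): {"SELECT"},
    ("engineers", "main.sales.orders"): {"SELECT", "MODIFY"},
}


def is_allowed(principal: str, table: str, privilege: str) -> bool:
    """Return True if the principal holds the privilege on the table."""
    return privilege in GRANTS.get((principal, table), set())


print(is_allowed("analysts", "main.sales.orders", "SELECT"))  # True
print(is_allowed("analysts", "main.sales.orders", "MODIFY"))  # False
```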

🔄 Data Flow in Databricks

  1. Data is ingested from sources (APIs, logs, databases)
  2. Raw data lands in cloud storage (S3, ADLS, GCS)
  3. Spark clusters process the raw data
  4. Results are saved as Delta tables
  5. Consumed by:
    • BI dashboards
    • ML models
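
The five steps above can be traced end to end with a plain-Python stand-in. This is only a sketch of the flow: the dictionaries `cloud_storage` and `tables`, the functions `ingest` and `process`, and the sample records are hypothetical placeholders for the real systems, and no Spark or Delta code is involved.

```python
# Plain-Python stand-in for the five steps; each dict or list plays a real system.
def ingest():
    # Step 1: pretend these records came from an API, a log file, or a database.
    return [
        {"name": "Alice", "age": 25},
        {"name": "Bob", "age": 30},
        {"name": None, "age": 41},  # a bad record to clean up later
    ]


def process(raw):
    # Step 3: drop records without a name (Spark would do this at scale).
    return [record for record in raw if record["name"] is not None]


cloud_storage = {}  # Step 2: stand-in for S3 / ADLS / GCS
tables = {}         # Step 4: stand-in for Delta tables

cloud_storage["raw/users"] = ingest()
tables["users"] = process(cloud_storage["raw/users"])

# Step 5: a consumer (dashboard or model) reads the curated table.
average_age = sum(r["age"] for r in tables["users"]) / len(tables["users"])
print(average_age)  # 27.5
```

Note that the raw data stays in storage untouched; only the cleaned result becomes the table that consumers read.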

🎯 Conclusion

Databricks architecture simplifies big data by combining storage, processing, and governance into a single unified platform.

Databricks uses a managed control plane for workspaces, jobs, and security, while Spark clusters in the compute plane process data stored in Delta Lake and governed by Unity Catalog.


Startup code – notebook setup

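# Common imports (`spark` is already defined in Databricks notebooks)
from pyspark.sql import functions as F
from pyspark.sql import types as T

# Helpful session configs for interactive work
# 200 is Spark's default; lower it for small datasets
spark.conf.set('spark.sql.shuffle.partitions', '200')
# Let Delta write fewer, larger files and compact small ones after writes
spark.conf.set('spark.databricks.delta.optimizeWrite.enabled', 'true')
spark.conf.set('spark.databricks.delta.autoCompact.enabled', 'true')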

print('Spark version:', spark.version)

Converted notes

  • Control plane manages notebooks, jobs, workspace collaboration, and security settings.
  • Compute plane runs Spark workloads on clusters or SQL warehouses.
  • Delta Lake provides reliable ACID storage and versioning, while Unity Catalog centralizes governance and access control.
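
Versioning is easier to picture with a small model. The sketch below is a plain-Python toy, not Delta Lake's real transaction log: the hypothetical `VersionedTable` class keeps every committed snapshot so that an earlier version can still be read, which is the idea Delta Lake exposes as time travel.

```python
# Toy versioned table: each write commits a new snapshot.
# This is NOT Delta Lake's real log format, only the idea behind versioning.
class VersionedTable:
    def __init__(self):
        self._snapshots = []  # index in this list == version number

    def overwrite(self, rows):
        """Commit a full new snapshot and return its version number."""
        self._snapshots.append(list(rows))
        return len(self._snapshots) - 1

    def read(self, version=None):
        """Read the latest snapshot, or an earlier one by version number."""
        if not self._snapshots:
            raise ValueError("table has no committed versions")
        if version is None:
            version = len(self._snapshots) - 1
        return list(self._snapshots[version])


table = VersionedTable()
v0 = table.overwrite([("Alice", 25)])
v1 = table.overwrite([("Alice", 25), ("Bob", 30)])

print(v0, v1)                 # 0 1
print(table.read())           # [('Alice', 25), ('Bob', 30)]
print(table.read(version=0))  # [('Alice', 25)]
```

Because old snapshots are never modified, a reader of version 0 is unaffected by the later write.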

Core code examples

Create a DataFrame

data = [('Alice', 25), ('Bob', 30)]
df = spark.createDataFrame(data, ['name', 'age'])
df.show()

Write to Delta Lake

# Parentheses let the method chain span several lines
(
    df.write.format('delta')
    .mode('overwrite')
    .save('/mnt/delta/users')
)

Read from Delta and query with SQL

df = spark.read.format('delta').load('/mnt/delta/users')
df.createOrReplaceTempView('users')
spark.sql('SELECT * FROM users WHERE age > 25').show()


