🚀 Introduction
To manage data quality and scalability, Databricks uses a design pattern called Medallion Architecture.
It organizes data into layers based on quality and refinement.
🥉 Bronze Layer – Raw Data
- Stores raw, unprocessed data
- No transformations applied
- Preserves original data for auditing
✅ Example:
- Logs, API responses, streaming data
🥈 Silver Layer – Cleaned Data
- Data is:
- Cleaned
- Deduplicated
- Validated
✅ Purpose:
- Build an “enterprise view” of data
- Prepare data for analytics
🥇 Gold Layer – Business Data
- Aggregated and optimized for:
- BI dashboards
- Reporting
- Machine learning
✅ Example:
- Revenue reports
- Customer insights
🔁 Data Flow
Raw Data → Bronze → Silver → Gold → Analytics
✅ Benefits
- Improved data quality at each stage
- Easy debugging (trace back to raw data)
- Better performance for BI and ML
- Reprocessing capability
🎯 Conclusion
Medallion architecture ensures: 👉 Clean, reliable, and scalable data pipelines.
Medallion Architecture organizes data into
Bronze, Silver, and Gold layers, improving data quality progressively from raw
ingestion to curated business-ready outputs.
Startup
code – define layer paths and helpers
bronze_path = '/mnt/delta/bronze'
silver_path = '/mnt/delta/silver'
gold_path = '/mnt/delta/gold'
raw_json_path = '/mnt/raw/json_data'
from pyspark.sql.functions import current_timestamp, col
print('Medallion paths ready')
Converted
notes
·
Bronze preserves the source
data with minimal transformation and adds useful ingestion metadata.
·
Silver cleans, deduplicates,
validates, and standardizes data.
·
Gold applies business logic and
aggregates for reports, dashboards, or ML consumption.
Bronze
/ Silver / Gold code examples
Bronze
layer – ingest raw JSON
bronze_df =
spark.read.format('json').load(raw_json_path)
bronze_df = bronze_df.withColumn('ingest_time', current_timestamp())
bronze_df.write.format('delta').mode('append').save(bronze_path)
Silver
layer – clean and validate
silver_df =
(spark.read.format('delta').load(bronze_path)
.dropDuplicates(['id'])
.filter(col('amount').isNotNull()))
silver_df.write.format('delta').mode('overwrite').save(silver_path)
Gold
layer – aggregate for business use
spark.read.format('delta').load(silver_path).createOrReplaceTempView('silver')
gold_df = spark.sql('SELECT customer_id, SUM(amount) AS total_spend FROM silver
GROUP BY customer_id')
gold_df.write.format('delta').mode('overwrite').save(gold_path)
# Bronze Layer
from pyspark.sql.functions import
current_timestamp
bronze_df =
spark.read.format("json").load("/mnt/raw")
bronze_df =
bronze_df.withColumn("ingest_time", current_timestamp())
bronze_df.write.save("/mnt/delta/bronze")
# Silver Layer
from pyspark.sql.functions import col
silver_df =
bronze_df.dropDuplicates().filter(col("amount").isNotNull())
silver_df.write.save("/mnt/delta/silver")
# Gold Layer
gold_df = spark.sql("SELECT category, COUNT(*) FROM silver GROUP BY category")
gold_df.write.save("/mnt/delta/gold")
No comments:
Post a Comment