HTML Dropdown

Tuesday, 9 June 2026

Medallion Architecture in Databricks (Bronze, Silver, Gold)

 

🚀 Introduction

To manage data quality and scalability, Databricks uses a design pattern called Medallion Architecture.

It organizes data into layers based on quality and refinement. 




🥉 Bronze Layer – Raw Data

  • Stores raw, unprocessed data
  • No transformations applied
  • Preserves original data for auditing 

✅ Example:

  • Logs, API responses, streaming data

🥈 Silver Layer – Cleaned Data

  • Data is:
    • Cleaned
    • Deduplicated
    • Validated 

✅ Purpose:

  • Build an “enterprise view” of data
  • Prepare data for analytics

🥇 Gold Layer – Business Data

  • Aggregated and optimized for:
    • BI dashboards
    • Reporting
    • Machine learning 

✅ Example:

  • Revenue reports
  • Customer insights

🔁 Data Flow

Raw Data → Bronze → Silver → Gold → Analytics

✅ Benefits

  • Improved data quality at each stage
  • Easy debugging (trace back to raw data)
  • Better performance for BI and ML
  • Reprocessing capability

🎯 Conclusion

Medallion architecture ensures: 👉 Clean, reliable, and scalable data pipelines.



Medallion Architecture organizes data into Bronze, Silver, and Gold layers, improving data quality progressively from raw ingestion to curated business-ready outputs.

Startup code – define layer paths and helpers

bronze_path = '/mnt/delta/bronze'
silver_path = '/mnt/delta/silver'
gold_path = '/mnt/delta/gold'
raw_json_path = '/mnt/raw/json_data'

from pyspark.sql.functions import current_timestamp, col
print('Medallion paths ready')

Converted notes

·        Bronze preserves the source data with minimal transformation and adds useful ingestion metadata.

·        Silver cleans, deduplicates, validates, and standardizes data.

·        Gold applies business logic and aggregates for reports, dashboards, or ML consumption.

Bronze / Silver / Gold code examples

Bronze layer – ingest raw JSON

bronze_df = spark.read.format('json').load(raw_json_path)
bronze_df = bronze_df.withColumn('ingest_time', current_timestamp())
bronze_df.write.format('delta').mode('append').save(bronze_path)

Silver layer – clean and validate

silver_df = (spark.read.format('delta').load(bronze_path)
    .dropDuplicates(['id'])
    .filter(col('amount').isNotNull()))

silver_df.write.format('delta').mode('overwrite').save(silver_path)

Gold layer – aggregate for business use

spark.read.format('delta').load(silver_path).createOrReplaceTempView('silver')
gold_df = spark.sql('SELECT customer_id, SUM(amount) AS total_spend FROM silver GROUP BY customer_id')
gold_df.write.format('delta').mode('overwrite').save(gold_path)




 

# Bronze Layer

from pyspark.sql.functions import current_timestamp

bronze_df = spark.read.format("json").load("/mnt/raw")

bronze_df = bronze_df.withColumn("ingest_time", current_timestamp())

bronze_df.write.save("/mnt/delta/bronze")

 

# Silver Layer

from pyspark.sql.functions import col

silver_df = bronze_df.dropDuplicates().filter(col("amount").isNotNull())

silver_df.write.save("/mnt/delta/silver")

 

# Gold Layer

 gold_df = spark.sql("SELECT category, COUNT(*) FROM silver GROUP BY category")

gold_df.write.save("/mnt/delta/gold")

No comments:

Post a Comment