Success Stories of Women In Tech: Medallion Architecture in Databricks (Bronze, Silver, Gold)

🚀 Introduction

To manage data quality and scalability, Databricks uses a design pattern called Medallion Architecture.

It organizes data into layers based on quality and refinement.

🥉 Bronze Layer – Raw Data

Stores raw, unprocessed data
No transformations applied
Preserves original data for auditing

✅ Example:

Logs, API responses, streaming data

🥈 Silver Layer – Cleaned Data

Data is:
- Cleaned
- Deduplicated
- Validated

✅ Purpose:

Build an “enterprise view” of data
Prepare data for analytics

🥇 Gold Layer – Business Data

Aggregated and optimized for:
- BI dashboards
- Reporting
- Machine learning

✅ Example:

Revenue reports
Customer insights

🔁 Data Flow

Raw Data → Bronze → Silver → Gold → Analytics

✅ Benefits

Improved data quality at each stage
Easy debugging (trace back to raw data)
Better performance for BI and ML
Reprocessing capability

🎯 Conclusion

Medallion architecture ensures: 👉 Clean, reliable, and scalable data pipelines.

Medallion Architecture organizes data into Bronze, Silver, and Gold layers, improving data quality progressively from raw ingestion to curated business-ready outputs.

Startup code – define layer paths and helpers

bronze_path = '/mnt/delta/bronze'
silver_path = '/mnt/delta/silver'
gold_path = '/mnt/delta/gold'
raw_json_path = '/mnt/raw/json_data'

from pyspark.sql.functions import current_timestamp, col
print('Medallion paths ready')

Converted notes

· Bronze preserves the source data with minimal transformation and adds useful ingestion metadata.

· Silver cleans, deduplicates, validates, and standardizes data.

· Gold applies business logic and aggregates for reports, dashboards, or ML consumption.

Bronze / Silver / Gold code examples

Bronze layer – ingest raw JSON

bronze_df = spark.read.format('json').load(raw_json_path)
bronze_df = bronze_df.withColumn('ingest_time', current_timestamp())
bronze_df.write.format('delta').mode('append').save(bronze_path)

Silver layer – clean and validate

silver_df = (spark.read.format('delta').load(bronze_path)
.dropDuplicates(['id'])
.filter(col('amount').isNotNull()))

silver_df.write.format('delta').mode('overwrite').save(silver_path)

Gold layer – aggregate for business use

spark.read.format('delta').load(silver_path).createOrReplaceTempView('silver')
gold_df = spark.sql('SELECT customer_id, SUM(amount) AS total_spend FROM silver GROUP BY customer_id')
gold_df.write.format('delta').mode('overwrite').save(gold_path)

# Bronze Layer

from pyspark.sql.functions import current_timestamp

bronze_df = spark.read.format("json").load("/mnt/raw")

bronze_df = bronze_df.withColumn("ingest_time", current_timestamp())

bronze_df.write.save("/mnt/delta/bronze")

# Silver Layer

from pyspark.sql.functions import col

silver_df = bronze_df.dropDuplicates().filter(col("amount").isNotNull())

silver_df.write.save("/mnt/delta/silver")

# Gold Layer

gold_df = spark.sql("SELECT category, COUNT(*) FROM silver GROUP BY category")

gold_df.write.save("/mnt/delta/gold")

Success Stories of Women In Tech

HTML Dropdown

Tuesday, 9 June 2026

Medallion Architecture in Databricks (Bronze, Silver, Gold)