🚀 Introduction
To ensure data quality and scalability, Microsoft Fabric uses the Medallion Architecture.
This organizes data into layers of increasing quality.
🥉 Bronze Layer (Raw Data)
- Stores raw, unprocessed data
- No transformations
- Acts as source of truth
🥈 Silver Layer (Cleaned Data)
- Data is cleaned and validated
- Removes duplicates
- Standardizes formats
🥇 Gold Layer (Business Data)
- Aggregated and optimized
- Used for reporting and dashboards
- Supports ML and analytics
🔄 Data Flow
Raw Data → Bronze → Silver → Gold → Analytics
✅ Benefits
- Improved data quality
- Easy debugging (trace to raw data)
- Better performance
- Reprocessing capability
🎯 Conclusion
Medallion architecture ensures: 👉 Reliable, scalable, and maintainable data pipelines
Starter
code – Bronze → Silver → Gold in a Fabric notebook
Bronze
– ingest raw JSON with metadata
from pyspark.sql.functions import
current_timestamp
df_bronze = (spark.read.format('json')
.load('Files/raw/*.json')
.withColumn('ingestion_time',
current_timestamp()))
df_bronze.write.format('delta').mode('append').save('Tables/bronze_sales')
Silver
– clean and deduplicate
from pyspark.sql.functions import col
df_silver = (spark.read.format('delta').load('Tables/bronze_sales')
.dropDuplicates(['transaction_id'])
.filter(col('amount').isNotNull()))
df_silver.write.format('delta').mode('overwrite').save('Tables/silver_sales')
Gold
– aggregate for analytics
df_gold = spark.sql("""
SELECT customer_id, SUM(amount) AS total_spend
FROM silver_sales
GROUP BY customer_id
""")
df_gold.write.format('delta').mode('overwrite').save('Tables/gold_sales')
: Medallion Architecture Code
# Bronze
from pyspark.sql.functions import
current_timestamp
df_bronze = spark.read.format("json").load("Files/raw/*.json")
df_bronze.write.format("delta").save("Tables/bronze")
# Silver
df_silver = spark.read.format("delta").load("Tables/bronze").dropDuplicates()
df_silver.write.format("delta").save("Tables/silver")
# Gold
df_gold = spark.sql("SELECT category, COUNT(*) FROM silver GROUP BY category")
df_gold.write.format("delta").save("Tables/gold")
No comments:
Post a Comment