Decathlon's Data Lake is organized into progressive layers that transform data through increasing levels of complexity to power reporting, visualization (e.g., interactive dashboards), and eventually advanced Machine Learning & AI (e.g., product recommendation, demand forecasting, dynamic pricing, …). To achieve this, we build and maintain complex, distributed pipelines written in SQL, and we leverage Apache Spark's engine to handle Big Data processing at scale on multi-node clusters. However, this complexity comes at a cost: as we stack more and more data transformations, manually tracing the exact origin of a specific data item becomes increasingly difficult and unmanageable, creating a critical need for an automated solution. We have recently recruited a Data Engineer intern and partnered with academic experts from ENS - PSL and Université Grenoble Alpes to prototype a (Fine-Grained) Data Provenance tool compatible with Apache Spark.

The ability to track the provenance (or lineage) of granular data portions is critical for:

- Trust & Reliability: guaranteeing the accuracy of results for data consumers.
- Root Cause Analysis: diagnosing anomalies (e.g., aberrant turnover figures) to pinpoint the exact source of a problem.
- Impact Analysis: predicting how data updates will propagate through our versioned datasets.
- GDPR Compliance: ensuring that sensitive data (PII) does not unintentionally "leak" into refined datasets.
- Testing: extracting representative subsets of data for lightweight integration tests and prototyping.

In this talk, we will present a few concepts of data provenance, describe where we currently stand, and outline what we plan to build next.
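To make the idea of fine-grained provenance concrete, here is a minimal, self-contained sketch (plain Python, not the actual Spark-based tool; all names and the sample data are illustrative): each row carries the set of input row identifiers it was derived from, and relational operators propagate those sets, so an anomalous output row can be traced back to the exact input rows that produced it.

```python
# Minimal sketch of fine-grained provenance propagation.
# Each row is paired with a provenance set of (source, index) tags.
# This is illustrative only; the real tool targets Apache Spark.

def tag(rows, source):
    """Tag each input row with a singleton provenance set."""
    return [(row, {(source, i)}) for i, row in enumerate(rows)]

def pfilter(tagged, predicate):
    """A filter keeps each surviving row's provenance unchanged."""
    return [(row, prov) for row, prov in tagged if predicate(row)]

def pjoin(left, right, on):
    """An inner join unions the provenance of the two joined rows."""
    return [
        ({**l, **r}, lp | rp)
        for l, lp in left
        for r, rp in right
        if l[on] == r[on]
    ]

# Hypothetical sample data: one aberrant turnover figure.
sales = tag([
    {"store": "Lille", "turnover": 120},
    {"store": "Lyon", "turnover": -5},  # anomaly
], "sales")
stores = tag([
    {"store": "Lille", "region": "North"},
    {"store": "Lyon", "region": "South"},
], "stores")

# Select the anomalous rows, then enrich them with store metadata.
result = pjoin(pfilter(sales, lambda r: r["turnover"] < 0), stores, "store")

# Each output row now points back to its exact input rows,
# which is the basis for root cause analysis.
for row, prov in result:
    print(row["store"], sorted(prov))
```

A production version would propagate such tags through Spark's own operators (or rewrite the query plan) rather than materialize them row by row, but the algebra of "union provenance on join, preserve it on filter" is the same.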