Opportunity
As we embarked on our journey to create Quality Metrics based on statistics and data science, our priority was speed and delivering insightful results. To provide some context, our main stakeholders were the value streams responsible for creating the map, and they lacked north star metrics to guide their improvements. This marked the dawn of Orbis Analytics, now known as Map Quality & Insights.
Like any other software engineering project, the trade-off for speed and quick deliverables was the accumulation of technical debt. In our case, we were rapidly developing complex pipelines within Databricks notebooks. While notebooks are convenient for data scientists and data engineers for exploration, experimentation, and proofs of concept, they are not ideal for building robust, production-ready data engineering pipelines. We found ourselves maintaining intricate production and development pipelines in Databricks, trying to keep the house of cards from collapsing. At times, even fixing a single bug in our code took 1-2 days of work. On top of that, we struggled with code repetition across our pipelines and with the version control issues inherent to working with notebooks. Simply put, the situation was unsustainable.
Solution
A bit of a digression is needed here... During my PhD, I was exploring data and running experiments in Matlab, and I did not know Python at all. After two years of working on my research, I looked to Coursera for ways to improve my scientific work and discovered IPython notebooks. I fell in love with these notebooks. Why? Because they addressed my daily pains: difficulty tracking the history of my experiments, no proper way of documenting my work, and the extra effort required to present any progress to my tutors (my stakeholders). So I invested a couple of months in learning a new programming language just because I really wanted to use these new "IPython notebooks" (by the way, just out of curiosity, the notebook concept was first implemented in Mathematica).
Fast forward 8 years, and I discovered Kedro.
Kedro is an open-source Python framework designed to help build robust, scalable, and maintainable data science and machine learning pipelines. It provides a standardized approach to structuring data science code, making projects easier to collaborate on, test, and deploy. In a nutshell, it was very different from notebooks. Even with a frown on my face, I gave it a try. It was challenging at first because you need to adapt to its core concepts: the data catalog, parameters, pipelines, and nodes. After a couple of days, I was able to run my pipelines from the terminal, inspect the results at any step, and even visualize the pipelines! I fell in love again; I guess I am a hopeless romantic.
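To make those concepts concrete, here is a minimal sketch of what a Kedro node and pipeline look like. The dataset names (`raw_roads`, `road_quality_scores`) and the `threshold` parameter are hypothetical placeholders, not our actual metrics:

```python
import pandas as pd
from kedro.pipeline import node, pipeline


def score_quality(roads: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """A node is just a plain Python function, testable in isolation."""
    roads = roads.copy()
    roads["passes"] = roads["score"] >= threshold
    return roads


# "raw_roads" and "road_quality_scores" are dataset names resolved through
# the data catalog (catalog.yml); "params:threshold" is read from parameters.yml.
quality_pipeline = pipeline(
    [
        node(
            func=score_quality,
            inputs=["raw_roads", "params:threshold"],
            outputs="road_quality_scores",
            name="score_quality_node",
        ),
    ]
)
```

Because inputs and outputs are just names resolved through the catalog, `kedro run` can execute the whole pipeline from the terminal, and the Kedro-Viz plugin can render it as a graph, which is exactly the workflow described above.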
Returning to the origins of Orbis Analytics: because our team was strong in data science but had no senior software engineers, we decided to play to our strengths and reinforce software engineering best practices by adopting a Python-based data science framework.
The framework forced us to organize our code better, made collaboration easier, and let us build production-ready quality metrics and data science projects. Slowly but surely, we migrated our original quality metrics from notebooks to Kedro projects.
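To illustrate the migration pattern, here is roughly how a typical notebook cell turned into a Kedro node. This is a hedged sketch: the dataset names, paths, and the 0.8 threshold are all invented for the example:

```python
from pyspark.sql import DataFrame

# Before (notebook cell): I/O, configuration, and logic tangled together.
# In a Databricks notebook, `spark` is the ambient SparkSession.
df = spark.read.parquet("s3://bucket/raw_roads")       # hard-coded path
df = df.filter(df["score"] >= 0.8)                     # hard-coded threshold
df.write.parquet("s3://bucket/road_quality_scores")    # hard-coded output


# After (Kedro node): pure logic only. The paths move into catalog.yml and
# the threshold into parameters.yml; Kedro performs the reads and writes.
def filter_quality(roads: DataFrame, threshold: float) -> DataFrame:
    return roads.filter(roads["score"] >= threshold)
```

The payoff is that the node is a pure function with no hidden I/O, so the same logic can be tested, reviewed, and reused across pipelines.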
Impact
At the beginning, we faced some hiccups. The temptation to copy-paste code from the notebooks straight into the Kedro project was too great, and with that approach you end up with the same issues you faced in the notebooks. Moreover, the learning curve for the data engineering best practices baked into the framework was steeper than we had anticipated. And, of course, we had to juggle this tech-debt reduction with the time pressure to deliver new metrics.
However, after a few months, the adoption of this framework had a profound impact on our workflow, both quantitatively and qualitatively. Quantitatively, we saw a significant reduction in the time required to fix bugs in our code. What used to take 1-2 days was now accomplished in just a few hours. This improvement in efficiency was complemented by a notable increase in code reusability. With the refactoring from notebooks to Kedro, we reduced code repetition by more than 50% in some pipelines, while achieving more efficient and maintainable code.
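One pattern that enables this kind of reuse is Kedro's modular pipelines: define a template once and instantiate it per dataset via namespaces, instead of copy-pasting near-identical notebook cells. A minimal sketch, with hypothetical names rather than our actual pipelines:

```python
from kedro.pipeline import node, pipeline


def drop_missing(df):
    """Shared cleaning step, written exactly once."""
    return df.dropna()


template = pipeline(
    [node(drop_missing, inputs="raw", outputs="cleaned", name="drop_missing")]
)

# Namespacing rewrites the dataset and node names ("raw" -> "roads.raw",
# etc.), so the same code serves multiple feature classes through
# different catalog entries.
roads_pipeline = pipeline(template, namespace="roads")
buildings_pipeline = pipeline(template, namespace="buildings")
```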
Qualitatively, the benefits were equally impressive. The structured approach led to more reliable and accurate data processing, significantly improving data quality. The standardized project structure made our codebase easier to understand, maintain, and extend, enhancing overall code maintainability. Furthermore, the framework facilitated smoother collaboration among team members, as we were all "speaking the same language." This improvement in communication and collaboration boosted our overall productivity. Lastly, the modular nature of the framework allowed us to scale our pipelines effortlessly as the project grew, ensuring that our solutions remained robust and adaptable.
Learnings
Our journey taught us several valuable lessons. First and foremost, we learned the importance of embracing change. Transitioning from notebooks to a structured framework required an initial adjustment period, but the long-term benefits far outweighed the challenges. This experience reinforced the value of investing in best practices. Adopting a framework highlighted the importance of adhering to sound software engineering principles, leading to more robust and maintainable projects.
We also learned to leverage our team's strengths. By focusing on our expertise in data science and complementing it with the framework's capabilities, we achieved significant improvements in our workflows. For those considering a similar framework for their own data science projects, we have a few recommendations. Start small: begin with a modest project to familiarize yourself with the framework's concepts, then gradually scale up. Provide adequate training so that team members can get the most out of it. Finally, commit to continuous improvement by regularly reviewing and refining your pipelines to maintain high standards of quality and efficiency.
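That continuous refinement is much easier when nodes are plain functions. As a hypothetical example (the module path and values are invented), the `score_quality` node sketched earlier can be covered by a fast unit test that needs no Spark cluster or notebook environment:

```python
import pandas as pd

# Hypothetical module path for the score_quality node sketched earlier.
from map_quality.nodes import score_quality


def test_score_quality_flags_low_scores():
    roads = pd.DataFrame({"score": [0.9, 0.3]})
    result = score_quality(roads, threshold=0.8)
    assert result["passes"].tolist() == [True, False]
```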
By adopting this framework, we transformed our approach to building data science pipelines, leading to more robust, scalable, and maintainable solutions. We hope our experience inspires others to explore the potential of such frameworks in their own projects.