Opportunity
As we embarked on our journey to create Quality Metrics based on statistics and data science, our priority was speed and delivering insightful results. To provide some context, our main stakeholders were the value streams responsible for creating the map, and they lacked north star metrics to guide their improvements. This marked the dawn of Orbis Analytics, now known as Map Quality & Insights.
Like any other software engineering project, the trade-off for speed and quick deliverables was the accumulation of technical debt. In our case, we were rapidly developing complex pipelines within Databricks notebooks. While notebooks are convenient for data scientists and data engineers for exploration, experimentation, and proofs of concept, they are not ideal for building robust, production-ready data engineering pipelines. We found ourselves maintaining intricate production and development pipelines in Databricks, trying to keep the house of cards from collapsing. At times, even fixing a single bug in our code took 1-2 days of work. On top of that, we struggled with code repetition across our pipelines and with the version control issues inherent to working with notebooks. Simply put, the situation was unsustainable.
Solution
A bit of a digression is needed here... During my PhD, I was exploring data and running experiments in Matlab, and I did not know Python at all. After two years of working on my research, I looked to Coursera for ways to improve my scientific work and discovered IPython notebooks. I fell in love with these notebooks. Why? Because they addressed my daily pains: difficulty tracking the history of my experiments, no proper way of documenting my work, and the extra effort required to present any progress to my tutors (my stakeholders). So I invested a couple of months in learning a new programming language just because I really wanted to use these new "IPython notebooks" (by the way, just out of curiosity, the notebook concept was first implemented in Mathematica).
Fast forward 8 years, and I discovered Kedro.
Kedro is an open-source Python framework designed to help build robust, scalable, and maintainable data science and machine learning pipelines. It provides a standardized approach to structuring data science code, making projects easier to collaborate on, test, and deploy. In a nutshell, it was very different from notebooks. Even with a frown on my face, I gave it a try. It was challenging at first because you need to adapt to its core concepts: the data catalog, parameters, pipelines, and nodes. After a couple of days, I was able to run my pipelines from the terminal, inspect the results at any step, and even visualize the pipelines! I fell in love again; I guess I am a hopeless romantic.
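To make those concepts concrete, here is a minimal sketch of what a Kedro node and pipeline look like. The dataset names (`raw_roads`, `road_quality_scores`) and the `threshold` parameter are hypothetical placeholders, not our actual metrics:

```python
import pandas as pd
from kedro.pipeline import node, pipeline


def score_quality(roads: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """A node is just a plain Python function, testable in isolation."""
    roads = roads.copy()
    roads["passes"] = roads["score"] >= threshold
    return roads


# "raw_roads" and "road_quality_scores" are dataset names resolved through
# the data catalog (catalog.yml); "params:threshold" is read from parameters.yml.
quality_pipeline = pipeline(
    [
        node(
            func=score_quality,
            inputs=["raw_roads", "params:threshold"],
            outputs="road_quality_scores",
            name="score_quality_node",
        ),
    ]
)
```

Because inputs and outputs are just names resolved through the catalog, `kedro run` can execute the whole pipeline from the terminal, and the Kedro-Viz plugin can render it as a graph, which is exactly the workflow described above.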
Returning to the origins of Orbis Analytics: because our team was strong in data science but had no senior software engineers, we decided to play to our strengths and reinforce software engineering best practices by adopting a Python-based data science framework.
The framework forced us to organize our code better, made collaboration easier, and let us build production-ready quality metrics and data science projects. Slowly but surely, we migrated our original quality metrics from notebooks to Kedro projects.
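To illustrate the migration pattern, here is roughly how a typical notebook cell turned into a Kedro node. This is a hedged sketch: the dataset names, paths, and the 0.8 threshold are all invented for the example:

```python
from pyspark.sql import DataFrame

# Before (notebook cell): I/O, configuration, and logic tangled together.
# In a Databricks notebook, `spark` is the ambient SparkSession.
df = spark.read.parquet("s3://bucket/raw_roads")       # hard-coded path
df = df.filter(df["score"] >= 0.8)                     # hard-coded threshold
df.write.parquet("s3://bucket/road_quality_scores")    # hard-coded output


# After (Kedro node): pure logic only. The paths move into catalog.yml and
# the threshold into parameters.yml; Kedro performs the reads and writes.
def filter_quality(roads: DataFrame, threshold: float) -> DataFrame:
    return roads.filter(roads["score"] >= threshold)
```

The payoff is that the node is a pure function with no hidden I/O, so the same logic can be tested, reviewed, and reused across pipelines.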
Impact
At the beginning, we faced some hiccups. The temptation to copy-paste code from the notebooks straight into the Kedro project was too great, and with that approach you end up with the same issues you faced in the notebooks. Moreover, the learning curve for the data engineering best practices baked into the framework was steeper than we had anticipated. And, of course, we had to juggle this tech-debt reduction with the time pressure to deliver new metrics.
However, after a few months, the adoption of this framework had a profound impact on our workflow, both quantitatively and qualitatively. Quantitatively, we saw a significant reduction in the time required to fix bugs in our code. What used to take 1-2 days was now accomplished in just a few hours. This improvement in efficiency was complemented by a notable increase in code reusability. With the refactoring from notebooks to Kedro, we reduced code repetition by more than 50% in some pipelines, while achieving more efficient and maintainable code.
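One pattern that enables this kind of reuse is Kedro's modular pipelines: define a template once and instantiate it per dataset via namespaces, instead of copy-pasting near-identical notebook cells. A minimal sketch, with hypothetical names rather than our actual pipelines:

```python
from kedro.pipeline import node, pipeline


def drop_missing(df):
    """Shared cleaning step, written exactly once."""
    return df.dropna()


template = pipeline(
    [node(drop_missing, inputs="raw", outputs="cleaned", name="drop_missing")]
)

# Namespacing rewrites the dataset and node names ("raw" -> "roads.raw",
# etc.), so the same code serves multiple feature classes through
# different catalog entries.
roads_pipeline = pipeline(template, namespace="roads")
buildings_pipeline = pipeline(template, namespace="buildings")
```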
Qualitatively, the benefits were equally impressive. The structured approach led to more reliable and accurate data processing, significantly improving data quality. The standardized project structure made our codebase easier to understand, maintain, and extend, enhancing overall code maintainability. Furthermore, the framework facilitated smoother collaboration among team members, as we were all "speaking the same language." This improvement in communication and collaboration boosted our overall productivity. Lastly, the modular nature of the framework allowed us to scale our pipelines effortlessly as the project grew, ensuring that our solutions remained robust and adaptable.
Learnings
Our journey taught us several valuable lessons. First and foremost, we learned the importance of embracing change. Transitioning from notebooks to a structured framework required an initial adjustment period, but the long-term benefits far outweighed the challenges. This experience reinforced the value of investing in best practices. Adopting a framework highlighted the importance of adhering to sound software engineering principles, leading to more robust and maintainable projects.
We also learned to leverage our team's strengths. By focusing on our expertise in data science and complementing it with the framework's capabilities, we achieved significant improvements in our workflows. For those considering a similar framework for their own data science projects, we have a few recommendations. Start small: begin with a modest project to familiarize yourself with the framework's concepts, then gradually scale up. Provide adequate training so that team members can get the most out of it. Finally, commit to continuous improvement by regularly reviewing and refining your pipelines to maintain high standards of quality and efficiency.
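That continuous refinement is much easier when nodes are plain functions. As a hypothetical example (the module path and values are invented), the `score_quality` node sketched earlier can be covered by a fast unit test that needs no Spark cluster or notebook environment:

```python
import pandas as pd

# Hypothetical module path for the score_quality node sketched earlier.
from map_quality.nodes import score_quality


def test_score_quality_flags_low_scores():
    roads = pd.DataFrame({"score": [0.9, 0.3]})
    result = score_quality(roads, threshold=0.8)
    assert result["passes"].tolist() == [True, False]
```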
By adopting this framework, we transformed our approach to building data science pipelines, leading to more robust, scalable, and maintainable solutions. We hope our experience inspires others to explore the potential of such frameworks in their own projects.