🔍 Introduction
Generative AI is transforming how we build intelligent applications - but with great power comes great complexity. From managing data pipelines to integrating LLMs and retrieval systems, keeping your project maintainable, reproducible, and production-ready can quickly become a challenge.
That’s where Kedro shines. Traditionally used for structuring data science projects, Kedro’s modular pipeline architecture is also a perfect fit for organizing GenAI workflows. In this project, we use Kedro to build a Retrieval-Augmented Generation (RAG) chatbot that answers questions about Kedro itself - using real Slack conversations as the knowledge base.
But we didn’t stop there. We also integrated a plain LLM baseline so you can compare answers side-by-side with and without context retrieval - to truly see the power of RAG.
This blog post isn’t just about building a chatbot - it’s a blueprint for how to structure GenAI use cases with Kedro, including tips and tricks for:
- Organizing prompts, models, and vector stores
- Managing credentials and configurations
- Keeping the pipeline modular and testable
Whether you’re working on a chatbot, summarization app, or document QA system, this project will show you how to use Kedro to keep your GenAI stack clean, flexible, and future-proof.
🔗 Resources
- Slack Channel: Join our Slack if you're not a member yet
- GitHub Repository: Explore the full project code on GitHub
⚙️ Architecture Overview
At a high level, this chatbot runs as two separate Kedro pipelines, each focused on a specific stage of the workflow.
1. `create_vector_store` pipeline

This pipeline prepares the data and builds the knowledge base for RAG. It transforms unstructured chat logs into a structured, searchable format ready for retrieval:
- Loads and formats real Slack conversations from the Kedro support channel
- Generates text embeddings using a lightweight hashing-based embedding function (a minimal sketch follows this list)
- Stores everything in a searchable Deep Lake vector store
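To make the embedding step concrete, here is a minimal sketch of what such a hashing-based embedding function might look like. The function name and details are hypothetical - the repo's actual implementation may differ:

```python
# Hypothetical sketch of a lightweight hashing-based embedding function
# (illustrative only; the project's real implementation may differ).
import hashlib

import numpy as np


def hash_embedding(texts, dim: int = 256) -> list[list[float]]:
    """Map each text to a fixed-size vector by hashing its tokens into buckets."""
    if isinstance(texts, str):
        texts = [texts]
    vectors = []
    for text in texts:
        vec = np.zeros(dim, dtype=np.float32)
        for token in text.lower().split():
            bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
            vec[bucket] += 1.0
        norm = np.linalg.norm(vec)
        vectors.append((vec / norm if norm else vec).tolist())
    return vectors
```

Hashing tokens into buckets needs no model download or API call, which keeps the demo lightweight at the cost of semantic quality.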
2. `agent_rag` pipeline

This pipeline runs the chatbot and handles the GenAI interaction:
- Initializes tools, prompts, and the language model
- Creates a RAG-enabled agent using LangChain
- Launches an interactive loop where users can:
  - Ask Kedro-related questions
  - Compare answers from a plain LLM vs. the RAG-enhanced agent
  - View the retrieved context used by RAG
  - Save questions, retrieved context, and responses to a markdown report
This separation into pipelines makes each stage modular, testable, and easy to run.
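Both pipelines are wired up through the standard Kedro pipeline registry; a minimal sketch, with the module paths assumed (only the pipeline names come from the post):

```python
# Sketch of src/kedro_rag_chatbot/pipeline_registry.py (structure and module
# paths assumed; pipeline names match the ones described above).
from kedro.pipeline import Pipeline

from kedro_rag_chatbot.pipelines import agent_rag, create_vector_store


def register_pipelines() -> dict[str, Pipeline]:
    """Expose each stage as its own runnable pipeline."""
    vector_store_pipeline = create_vector_store.create_pipeline()
    rag_pipeline = agent_rag.create_pipeline()
    return {
        "create_vector_store": vector_store_pipeline,
        "agent_rag": rag_pipeline,
        "__default__": vector_store_pipeline + rag_pipeline,
    }
```

Each stage can then be run on its own, e.g. `kedro run --pipeline create_vector_store`.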
🧠 Why Context Matters
If you’ve used LLMs to build apps before, you know they often "hallucinate" - confidently answering with something that sounds right but isn’t. Especially in domain-specific cases (like Kedro), this can be a deal-breaker.
By integrating a vector store, we ground the LLM in actual prior knowledge - in this case, questions and answers from the Kedro Slack. Here’s what the chatbot does differently when RAG is involved:
| Approach | Behavior | Example Answer |
| --- | --- | --- |
| ❌ LLM only | Relies purely on model knowledge | “Datasets in Kedro can be structured in any way that works for your project.” |
| ✅ LLM + RAG | Retrieves Slack discussions and uses them to inform the answer | “You can define datasets in …” |
This means the LLM combined with RAG can provide more relevant, precise, and up-to-date answers than an LLM alone.
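Conceptually, the grounding step boils down to a similarity search over the vector store followed by prompt construction. Here is a simplified sketch - the function names are hypothetical, and the `search` call follows the Activeloop vector-store tutorial rather than the repo's exact code:

```python
# Simplified sketch of the grounding step (not the repo's exact code; the
# Deep Lake search() signature may vary across versions).
def retrieve_context(vector_store, question: str, k: int = 4) -> str:
    """Fetch the k most similar Slack snippets and join them into one context block."""
    results = vector_store.search(
        embedding_data=question,
        embedding_function=hash_embedding,  # must match the function used at indexing time
        k=k,
    )
    return "\n\n".join(results["text"])


def build_grounded_question(question: str, context: str) -> str:
    """Constrain the LLM to answer from the retrieved Slack context."""
    return (
        "Answer the question using the context from Kedro Slack threads below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```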
🧪 Kedro Tricks for GenAI Workflows
Kedro’s flexibility allowed us to integrate GenAI components smoothly into a modular and maintainable pipeline. Here are some of the techniques and dataset patterns we used to streamline the process.
1. Creating a Custom Dataset for the Vector Store
To interface with the Deep Lake vector store, we created a custom dataset: `DeeplakeVectorStoreDataset`.
```python
from typing import Any, NoReturn

# VectorStore import path may vary by Deep Lake version.
from deeplake.core.vectorstore import VectorStore
from kedro.io import AbstractDataset, DatasetError


class DeeplakeVectorStoreDataset(AbstractDataset[None, VectorStore]):
    """A Kedro dataset for interacting with DeepLake VectorStore.

    This dataset provides a read-only interface for loading a
    VectorStore instance from a specified Deep Lake path.
    It is useful for integrating vector search capabilities
    into a Kedro pipeline, such as for Retrieval-Augmented
    Generation (RAG) applications.

    More details: https://docs.activeloop.ai/examples/rag/tutorials/vector-store-basics
    """

    def __init__(self, path: str, **kwargs):
        """Initializes the dataset with the given Deep Lake path and
        optional parameters.

        Args:
            path: Path to the DeepLake VectorStore.
            **kwargs: Additional arguments for the VectorStore
                initialization.
        """
        self._path = path
        self.kwargs = kwargs or {}

    def load(self) -> VectorStore:
        """Loads and returns the DeepLake VectorStore from the
        specified path.

        Returns:
            VectorStore: An instance of the Deep Lake Vector Store.
        """
        return VectorStore(path=self._path, **self.kwargs)

    def save(self, data: None) -> NoReturn:
        """Raises an error because this dataset type is read-only.

        Args:
            data: This argument is unused as saving is not supported.

        Raises:
            DatasetError: Always raised since saving is not allowed.
        """
        raise DatasetError(f"{self.__class__.__name__} is a read-only dataset type")

    def _describe(self) -> dict[str, Any]:
        """Returns a dictionary describing the dataset configuration.

        Returns:
            A dictionary containing the dataset path and additional
            parameters.
        """
        return {"filepath": self._path, **self.kwargs}
```
This allows us to define the vector store in `catalog.yml` and use it just like any other dataset, making the vector store reusable across both pipelines.
```yaml
vector_store_init:
  type: kedro_rag_chatbot.datasets.DeeplakeVectorStoreDataset
  path: data/02_intermediate/vector_store/
  overwrite: True
```
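With this catalog entry in place, any node can take the vector store as a regular input. A hedged sketch of how the `create_vector_store` pipeline might populate it - the function and dataset names are assumed, and the `add()` call follows the Activeloop tutorial rather than the repo's exact code:

```python
# Hypothetical node from the create_vector_store pipeline; the add() signature
# follows the Activeloop tutorial and may vary across Deep Lake versions.
from kedro.pipeline import node


def populate_vector_store(vector_store, slack_messages: list[str]) -> None:
    """Embed the formatted Slack messages and add them to the store."""
    vector_store.add(
        text=slack_messages,
        embedding_data=slack_messages,
        embedding_function=hash_embedding,  # same hashing function sketched earlier
    )


node(
    func=populate_vector_store,
    inputs=["vector_store_init", "formatted_slack_messages"],
    outputs=None,
    name="populate_vector_store_node",
)
```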
2. Managing Prompts, Models and Credentials
One of Kedro's most powerful aspects is its ability to manage configuration and dependencies cleanly through ready-to-use datasets, and this extends beautifully into GenAI workflows.
Instead of embedding prompts directly in code, we used Kedro’s built-in `TextDataset` to load system prompts from a simple `.txt` file.
```yaml
system_prompt:
  type: text.TextDataset
  filepath: data/01_raw/system_prompt.txt
```
This makes it super easy to version, tweak, and experiment with different prompt designs without touching the pipeline code.
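Downstream, the loaded prompt is just a string input to a node, where it can be combined with the user's question into a LangChain chat prompt. A small sketch - the function and variable names are assumptions, not the repo's exact code:

```python
# Hypothetical node turning the catalog-managed system prompt into a chat prompt.
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder


def create_chat_prompt(system_prompt: str) -> ChatPromptTemplate:
    """Combine the versioned system prompt with the user's question."""
    return ChatPromptTemplate.from_messages(
        [
            ("system", system_prompt),
            ("human", "{question}"),
            MessagesPlaceholder("agent_scratchpad"),  # filled in by the agent later
        ]
    )
```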
Using the `ChatOpenAIDataset` from `kedro_datasets_experimental`, we were able to configure the model like any other Kedro dataset - cleanly separated from code.
```yaml
openai_llm:
  type: kedro_datasets_experimental.langchain.ChatOpenAIDataset
  kwargs:
    model: "gpt-3.5-turbo"
    temperature: 0.0
  credentials: openai
```
Credentials go in `conf/local/credentials.yml`:
```yaml
openai:
  openai_api_base: <openai-api-base>
  openai_api_key: <openai-api-key>
```
This setup keeps credentials secure and supports easy model switching between environments.
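At run time, Kedro loads this entry and injects a ready-to-use LangChain `ChatOpenAI` client into your nodes, so the plain-LLM baseline becomes a one-liner. A minimal sketch (node name assumed):

```python
# Hypothetical baseline node: `llm` is the ChatOpenAI client loaded from the
# `openai_llm` catalog entry above.
from langchain_openai import ChatOpenAI


def ask_plain_llm(llm: ChatOpenAI, question: str) -> str:
    """Answer without any retrieved context, for side-by-side comparison."""
    return llm.invoke(question).content
```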
3. Passing Custom Objects Between Nodes
By default, Kedro expects inputs and outputs to be serializable (e.g. saved as files). But for things like `ChatOpenAIDataset` models or LangChain agents, serialization doesn’t make sense.
To enable smooth in-memory handoff of objects like an `AgentExecutor` between nodes, we used Kedro’s ability to override the default dataset creation with dataset factories. Kedro’s `MemoryDataset` is designed to store objects in memory during a pipeline run; by default, in most cases it makes copies of those objects. We redefined the default dataset behavior by customizing the dataset factory and setting `copy_mode: assign`, which passes objects by reference between nodes without any copying or serialization:
1"{default_dataset}":
2 type: MemoryDataset
3 copy_mode: "assign"
```python
# Excerpt from the agent_rag pipeline: both nodes exchange live LangChain
# objects, which the assign copy mode passes along by reference.
from kedro.pipeline import node

[
    node(
        func=create_agent,
        inputs=["llm_with_tools", "chat_prompt"],
        outputs="agent",
        name="create_agent_node",
    ),
    node(
        func=create_agent_executor,
        inputs=["agent", "tools"],
        outputs="agent_executor",
        name="create_agent_executor_node",
    ),
]
```
This change lets Kedro treat in-memory objects like first-class citizens, making it possible to:
- Chain custom objects through multiple nodes
- Avoid wrapping everything in custom datasets
- Keep the pipeline declarative and clean
This pattern was key to making Kedro work naturally with LangChain’s dynamic objects.
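For reference, the `create_agent` and `create_agent_executor` node functions from the snippet above can be thin wrappers around LangChain's agent APIs. A hedged sketch, assuming an OpenAI tool-calling agent built with LCEL - the repo's actual implementation may differ:

```python
# Hedged sketch, not the repo's exact code. `llm_with_tools` is assumed to be
# the chat model with the retrieval tool bound via `llm.bind_tools(tools)`.
from langchain.agents import AgentExecutor
from langchain.agents.format_scratchpad.openai_tools import format_to_openai_tool_messages
from langchain.agents.output_parsers.openai_tools import OpenAIToolsAgentOutputParser


def create_agent(llm_with_tools, chat_prompt):
    """Wire prompt -> tool-bound LLM -> output parser into an agent runnable."""
    return (
        {
            "question": lambda x: x["question"],
            "agent_scratchpad": lambda x: format_to_openai_tool_messages(x["intermediate_steps"]),
        }
        | chat_prompt
        | llm_with_tools
        | OpenAIToolsAgentOutputParser()
    )


def create_agent_executor(agent, tools) -> AgentExecutor:
    """Wrap the agent so it can call the retrieval tool in a loop at run time."""
    return AgentExecutor(agent=agent, tools=tools, verbose=True)
```

Because both functions return live objects, the `copy_mode: assign` factory above is what lets them flow between nodes untouched.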
🧩 Agent vs. LLM Output Comparison
This side-by-side comparison highlights the real value of RAG. While the base LLM often provides generic, incomplete and sometimes incorrect answers, the RAG-based agent delivers responses that are grounded in actual Slack conversations, tailored to real Kedro use cases. The retrieved context not only improves accuracy but also builds trust - users can trace answers back to known sources.
User Input
How can I force node execution order in the Kedro pipeline?
LLM Output ❌
To force node execution order in the Kedro pipeline, you can use the `before` and `after` arguments when defining your nodes in the `pipeline.py` file.
For example, if you have two nodes `node1` and `node2`, and you want `node1` to run before `node2`, you can define them like this:
```python
from kedro.pipeline import node

def create_pipeline():
    return Pipeline(
        [
            node(func=node1, inputs="input_data", outputs="output_data", name="node1"),
            node(func=node2, inputs="output_data", outputs="final_output", name="node2", before="node1")
        ]
    )
```
In this example, `node2` will run after `node1` because of the `before="node1"` argument. You can also use the `after` argument to specify that a node should run after another node.
By using these arguments, you can control the execution order of nodes in your Kedro pipeline.
The generated response is incorrect, as Kedro doesn't support setting the `before` and `after` arguments for nodes.
Agent Output ✅
In Kedro, the recommended approach is to rely on the topological sorting of nodes in the pipeline to determine the execution order. However, if you need to force a specific node execution order, you can create "pass-through" nodes to achieve this. Here is an example of how you can force node execution order by creating pass-through nodes:
```python
from kedro.pipeline import node

def pass_through_function(input_data):
    # Perform any necessary operations here
    return input_data

pass_through_node = node(pass_through_function, inputs="input_data", outputs="output_data")

# Define your pipeline with the pass-through nodes to enforce execution order
pipeline = Pipeline([
    pass_through_node,
    node(process_data, inputs="output_data", outputs="processed_data"),
    node(analyze_data, inputs="processed_data", outputs=None)
])
```
By inserting pass-through nodes between the nodes that need to be executed in a specific order, you can enforce the desired execution sequence in the pipeline.
It’s important to note that in a real-world application, the quality of the retrieved context and the size of the vector store play a major role in performance. In this demo, we intentionally used a small vector store to clearly illustrate the difference between vanilla LLM output and context-augmented responses. As the knowledge base scales, retrieval quality becomes even more critical to delivering relevant, high-quality answers.
🚀 Future Improvements
This project is a toy example, designed to explain how you can use Kedro to structure and manage GenAI workflows. While not production-ready, it provides a strong foundation for more advanced implementations.
Ideas for Future Enhancement:
- Expand the knowledge base by pulling in Kedro GitHub discussions, documentation, or blog content
- Upgrade retrieval quality using advanced embeddings (e.g. OpenAI, SentenceTransformers)
- Use production-ready vector stores and search libraries like FAISS for faster, more scalable search
- Run sentiment analysis on Slack threads to enrich metadata or improve filtering
- Deploy as a Slack bot to enable direct user interaction and real-time answers
🧭 Final Thoughts
If you're building LLM-based applications, consider using Kedro as your framework - not just for data pipelines, but for structuring everything around GenAI workflows.
And with a RAG setup like this, you’ll drastically improve accuracy by grounding your chatbot in real-world knowledge - not just whatever the model “thinks” is right.