🔍 Introduction
Generative AI is transforming how we build intelligent applications - but with great power comes great complexity. From managing data pipelines to integrating LLMs and retrieval systems, keeping your project maintainable, reproducible, and production-ready can quickly become a challenge.
That’s where Kedro shines. Traditionally used for structuring data science projects, Kedro’s modular pipeline architecture is also a perfect fit for organizing GenAI workflows. In this project, we use Kedro to build a Retrieval-Augmented Generation (RAG) chatbot that answers questions about Kedro itself - using real Slack conversations as the knowledge base.
But we didn’t stop there. We also integrated a plain LLM baseline so you can compare answers side-by-side with and without context retrieval - to truly see the power of RAG.
This blog post isn’t just about building a chatbot - it’s a blueprint for how to structure GenAI use cases with Kedro, including tips and tricks for:
- Organizing prompts, models, and vector stores
- Managing credentials and configurations
- Keeping the pipeline modular and testable
Whether you’re working on a chatbot, summarization app, or document QA system, this project will show you how to use Kedro to keep your GenAI stack clean, flexible, and future-proof.
🔗 Resources
- Slack Channel: Join our Slack if you're not a member yet
- GitHub Repository: Explore the full project code on GitHub
⚙️ Architecture Overview
At a high level, this chatbot runs as two separate Kedro pipelines, each focused on a specific stage of the workflow.
1. `create_vector_store` pipeline

This pipeline prepares the data and builds the knowledge base for RAG. It transforms unstructured chat logs into a structured, searchable format ready for retrieval:
- Loads and formats real Slack conversations from the Kedro support channel
- Generates text embeddings using a lightweight hashing-based embedding function (a minimal sketch follows this list)
- Stores everything in a searchable Deep Lake vector store
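To make the embedding step concrete, here is a minimal sketch of what such a hashing-based embedding function might look like. The function name and details are hypothetical - the repo's actual implementation may differ:

```python
# Hypothetical sketch of a lightweight hashing-based embedding function
# (illustrative only; the project's real implementation may differ).
import hashlib

import numpy as np


def hash_embedding(texts, dim: int = 256) -> list[list[float]]:
    """Map each text to a fixed-size vector by hashing its tokens into buckets."""
    if isinstance(texts, str):
        texts = [texts]
    vectors = []
    for text in texts:
        vec = np.zeros(dim, dtype=np.float32)
        for token in text.lower().split():
            bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
            vec[bucket] += 1.0
        norm = np.linalg.norm(vec)
        vectors.append((vec / norm if norm else vec).tolist())
    return vectors
```

Hashing tokens into buckets needs no model download or API call, which keeps the demo lightweight at the cost of semantic quality.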
2. `agent_rag` pipeline

This pipeline runs the chatbot and handles the GenAI interaction:
- Initializes tools, prompts, and the language model
- Creates a RAG-enabled agent using LangChain
- Launches an interactive loop where users can:
  - Ask Kedro-related questions
  - Compare answers from a plain LLM vs. the RAG-enhanced agent
  - View the retrieved context used by RAG
  - Save questions, retrieved context, and responses to a markdown report
This separation into pipelines makes each stage modular, testable, and easy to run.
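Both pipelines are wired up through the standard Kedro pipeline registry; a minimal sketch, with the module paths assumed (only the pipeline names come from the post):

```python
# Sketch of src/kedro_rag_chatbot/pipeline_registry.py (structure and module
# paths assumed; pipeline names match the ones described above).
from kedro.pipeline import Pipeline

from kedro_rag_chatbot.pipelines import agent_rag, create_vector_store


def register_pipelines() -> dict[str, Pipeline]:
    """Expose each stage as its own runnable pipeline."""
    vector_store_pipeline = create_vector_store.create_pipeline()
    rag_pipeline = agent_rag.create_pipeline()
    return {
        "create_vector_store": vector_store_pipeline,
        "agent_rag": rag_pipeline,
        "__default__": vector_store_pipeline + rag_pipeline,
    }
```

Each stage can then be run on its own, e.g. `kedro run --pipeline create_vector_store`.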
🧠 Why Context Matters
If you’ve used LLMs to build apps before, you know they often "hallucinate" - confidently answering with something that sounds right but isn’t. Especially in domain-specific cases (like Kedro), this can be a deal-breaker.
By integrating a vector store, we ground the LLM in actual prior knowledge - in this case, questions and answers from the Kedro Slack. Here’s what the chatbot does differently when RAG is involved:
| Approach | Behavior | Example Answer |
| --- | --- | --- |
| ❌ LLM only | Relies purely on model knowledge | “Datasets in Kedro can be structured in any way that works for your project.” |
| ✅ LLM + RAG | Retrieves Slack discussions and uses them to inform the answer | “You can define datasets in …” |
This means the LLM combined with RAG can provide more relevant, precise, and up-to-date answers than an LLM alone.
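Conceptually, the grounding step boils down to a similarity search over the vector store followed by prompt construction. Here is a simplified sketch - the function names are hypothetical, and the `search` call follows the Activeloop vector-store tutorial rather than the repo's exact code:

```python
# Simplified sketch of the grounding step (not the repo's exact code; the
# Deep Lake search() signature may vary across versions).
def retrieve_context(vector_store, question: str, k: int = 4) -> str:
    """Fetch the k most similar Slack snippets and join them into one context block."""
    results = vector_store.search(
        embedding_data=question,
        embedding_function=hash_embedding,  # must match the function used at indexing time
        k=k,
    )
    return "\n\n".join(results["text"])


def build_grounded_question(question: str, context: str) -> str:
    """Constrain the LLM to answer from the retrieved Slack context."""
    return (
        "Answer the question using the context from Kedro Slack threads below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```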
🧪 Kedro Tricks for GenAI Workflows
Kedro’s flexibility allowed us to integrate GenAI components smoothly into a modular and maintainable pipeline. Here are some of the techniques and dataset patterns we used to streamline the process.
1. Creating a Custom Dataset for the Vector Store
To interface with the Deep Lake vector store, we created a custom dataset: `DeeplakeVectorStoreDataset`.
```python
from typing import Any, NoReturn

# VectorStore import path may vary by Deep Lake version.
from deeplake.core.vectorstore import VectorStore
from kedro.io import AbstractDataset, DatasetError


class DeeplakeVectorStoreDataset(AbstractDataset[None, VectorStore]):
    """A Kedro dataset for interacting with DeepLake VectorStore.

    This dataset provides a read-only interface for loading a
    VectorStore instance from a specified Deep Lake path.
    It is useful for integrating vector search capabilities
    into a Kedro pipeline, such as for Retrieval-Augmented
    Generation (RAG) applications.

    More details: https://docs.activeloop.ai/examples/rag/tutorials/vector-store-basics
    """

    def __init__(self, path: str, **kwargs):
        """Initializes the dataset with the given Deep Lake path and
        optional parameters.

        Args:
            path: Path to the DeepLake VectorStore.
            **kwargs: Additional arguments for the VectorStore
                initialization.
        """
        self._path = path
        self.kwargs = kwargs or {}

    def load(self) -> VectorStore:
        """Loads and returns the DeepLake VectorStore from the
        specified path.

        Returns:
            VectorStore: An instance of the Deep Lake Vector Store.
        """
        return VectorStore(path=self._path, **self.kwargs)

    def save(self, data: None) -> NoReturn:
        """Raises an error because this dataset type is read-only.

        Args:
            data: This argument is unused as saving is not supported.

        Raises:
            DatasetError: Always raised since saving is not allowed.
        """
        raise DatasetError(f"{self.__class__.__name__} is a read-only dataset type")

    def _describe(self) -> dict[str, Any]:
        """Returns a dictionary describing the dataset configuration.

        Returns:
            A dictionary containing the dataset path and additional
            parameters.
        """
        return {"filepath": self._path, **self.kwargs}
```
This allows us to define the vector store in `catalog.yml` and use it just like any other dataset, making the vector store reusable across both pipelines.
```yaml
vector_store_init:
  type: kedro_rag_chatbot.datasets.DeeplakeVectorStoreDataset
  path: data/02_intermediate/vector_store/
  overwrite: True
```
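With this catalog entry in place, any node can take the vector store as a regular input. A hedged sketch of how the `create_vector_store` pipeline might populate it - the function and dataset names are assumed, and the `add()` call follows the Activeloop tutorial rather than the repo's exact code:

```python
# Hypothetical node from the create_vector_store pipeline; the add() signature
# follows the Activeloop tutorial and may vary across Deep Lake versions.
from kedro.pipeline import node


def populate_vector_store(vector_store, slack_messages: list[str]) -> None:
    """Embed the formatted Slack messages and add them to the store."""
    vector_store.add(
        text=slack_messages,
        embedding_data=slack_messages,
        embedding_function=hash_embedding,  # same hashing function sketched earlier
    )


node(
    func=populate_vector_store,
    inputs=["vector_store_init", "formatted_slack_messages"],
    outputs=None,
    name="populate_vector_store_node",
)
```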
2. Managing Prompts, Models and Credentials
One of Kedro's most powerful aspects is its ability to manage configuration and dependencies cleanly through ready-to-use datasets, and this extends beautifully into GenAI workflows.
Instead of embedding prompts directly in code, we used Kedro’s built-in `TextDataset` to load system prompts from a simple `.txt` file.
```yaml
system_prompt:
  type: text.TextDataset
  filepath: data/01_raw/system_prompt.txt
```
This makes it super easy to version, tweak, and experiment with different prompt designs without touching the pipeline code.
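Downstream, the loaded prompt is just a string input to a node, where it can be combined with the user's question into a LangChain chat prompt. A small sketch - the function and variable names are assumptions, not the repo's exact code:

```python
# Hypothetical node turning the catalog-managed system prompt into a chat prompt.
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder


def create_chat_prompt(system_prompt: str) -> ChatPromptTemplate:
    """Combine the versioned system prompt with the user's question."""
    return ChatPromptTemplate.from_messages(
        [
            ("system", system_prompt),
            ("human", "{question}"),
            MessagesPlaceholder("agent_scratchpad"),  # filled in by the agent later
        ]
    )
```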
Using the `ChatOpenAIDataset` from `kedro_datasets_experimental`, we were able to configure the model like any other Kedro dataset - cleanly separated from code.
```yaml
openai_llm:
  type: kedro_datasets_experimental.langchain.ChatOpenAIDataset
  kwargs:
    model: "gpt-3.5-turbo"
    temperature: 0.0
  credentials: openai
```
Credentials go in `conf/local/credentials.yml`:
```yaml
openai:
  openai_api_base: <openai-api-base>
  openai_api_key: <openai-api-key>
```
This setup keeps credentials secure and supports easy model switching between environments.
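At run time, Kedro loads this entry and injects a ready-to-use LangChain `ChatOpenAI` client into your nodes, so the plain-LLM baseline becomes a one-liner. A minimal sketch (node name assumed):

```python
# Hypothetical baseline node: `llm` is the ChatOpenAI client loaded from the
# `openai_llm` catalog entry above.
from langchain_openai import ChatOpenAI


def ask_plain_llm(llm: ChatOpenAI, question: str) -> str:
    """Answer without any retrieved context, for side-by-side comparison."""
    return llm.invoke(question).content
```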
3. Passing Custom Objects Between Nodes
By default, Kedro expects inputs and outputs to be serializable (e.g. saved as files). But for things like `ChatOpenAIDataset` models or LangChain agents, serialization doesn’t make sense.
To enable smooth in-memory handoff of objects like an `AgentExecutor` between nodes, we used Kedro’s ability to override the default dataset creation with dataset factories. Kedro’s `MemoryDataset` is designed to store objects in memory during a pipeline run; by default, in most cases it makes copies of those objects. We redefined the default dataset behavior by customizing the dataset factory and setting `copy_mode: assign`, which passes objects by reference between nodes without any copying or serialization:
1"{default_dataset}":
2 type: MemoryDataset
3 copy_mode: "assign"
```python
# Excerpt from the agent_rag pipeline: both nodes exchange live LangChain
# objects, which the assign copy mode passes along by reference.
from kedro.pipeline import node

[
    node(
        func=create_agent,
        inputs=["llm_with_tools", "chat_prompt"],
        outputs="agent",
        name="create_agent_node",
    ),
    node(
        func=create_agent_executor,
        inputs=["agent", "tools"],
        outputs="agent_executor",
        name="create_agent_executor_node",
    ),
]
```
This change lets Kedro treat in-memory objects like first-class citizens, making it possible to:
- Chain custom objects through multiple nodes
- Avoid wrapping everything in custom datasets
- Keep the pipeline declarative and clean
This pattern was key to making Kedro work naturally with LangChain’s dynamic objects.
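For reference, the `create_agent` and `create_agent_executor` node functions from the snippet above can be thin wrappers around LangChain's agent APIs. A hedged sketch, assuming an OpenAI tool-calling agent built with LCEL - the repo's actual implementation may differ:

```python
# Hedged sketch, not the repo's exact code. `llm_with_tools` is assumed to be
# the chat model with the retrieval tool bound via `llm.bind_tools(tools)`.
from langchain.agents import AgentExecutor
from langchain.agents.format_scratchpad.openai_tools import format_to_openai_tool_messages
from langchain.agents.output_parsers.openai_tools import OpenAIToolsAgentOutputParser


def create_agent(llm_with_tools, chat_prompt):
    """Wire prompt -> tool-bound LLM -> output parser into an agent runnable."""
    return (
        {
            "question": lambda x: x["question"],
            "agent_scratchpad": lambda x: format_to_openai_tool_messages(x["intermediate_steps"]),
        }
        | chat_prompt
        | llm_with_tools
        | OpenAIToolsAgentOutputParser()
    )


def create_agent_executor(agent, tools) -> AgentExecutor:
    """Wrap the agent so it can call the retrieval tool in a loop at run time."""
    return AgentExecutor(agent=agent, tools=tools, verbose=True)
```

Because both functions return live objects, the `copy_mode: assign` factory above is what lets them flow between nodes untouched.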
🧩 Agent vs. LLM Output Comparison
This side-by-side comparison highlights the real value of RAG. While the base LLM often provides generic, incomplete and sometimes incorrect answers, the RAG-based agent delivers responses that are grounded in actual Slack conversations, tailored to real Kedro use cases. The retrieved context not only improves accuracy but also builds trust - users can trace answers back to known sources.
User Input
How can I force node execution order in the Kedro pipeline?
LLM Output ❌
To force node execution order in the Kedro pipeline, you can use the `before` and `after` arguments when defining your nodes in the `pipeline.py` file.
For example, if you have two nodes `node1` and `node2`, and you want `node1` to run before `node2`, you can define them like this:
```python
from kedro.pipeline import node

def create_pipeline():
    return Pipeline(
        [
            node(func=node1, inputs="input_data", outputs="output_data", name="node1"),
            node(func=node2, inputs="output_data", outputs="final_output", name="node2", before="node1")
        ]
    )
```
In this example, `node2` will run after `node1` because of the `before="node1"` argument. You can also use the `after` argument to specify that a node should run after another node.
By using these arguments, you can control the execution order of nodes in your Kedro pipeline.
The generated response is incorrect, as Kedro doesn't support setting the `before` and `after` arguments for nodes.
Agent Output ✅
In Kedro, the recommended approach is to rely on the topological sorting of nodes in the pipeline to determine the execution order. However, if you need to force a specific node execution order, you can create "pass-through" nodes to achieve this. Here is an example of how you can force node execution order by creating pass-through nodes:
```python
from kedro.pipeline import node

def pass_through_function(input_data):
    # Perform any necessary operations here
    return input_data

pass_through_node = node(pass_through_function, inputs="input_data", outputs="output_data")

# Define your pipeline with the pass-through nodes to enforce execution order
pipeline = Pipeline([
    pass_through_node,
    node(process_data, inputs="output_data", outputs="processed_data"),
    node(analyze_data, inputs="processed_data", outputs=None)
])
```
By inserting pass-through nodes between the nodes that need to be executed in a specific order, you can enforce the desired execution sequence in the pipeline.
It’s important to note that in a real-world application, the quality of the retrieved context and the size of the vector store play a major role in performance. In this demo, we intentionally used a small vector store to clearly illustrate the difference between vanilla LLM output and context-augmented responses. As the knowledge base scales, retrieval quality becomes even more critical to delivering relevant, high-quality answers.
🚀 Future Improvements
This project is a toy example, designed to explain how you can use Kedro to structure and manage GenAI workflows. While not production-ready, it provides a strong foundation for more advanced implementations.
Ideas for Future Enhancement:
- Expand the knowledge base by pulling in Kedro GitHub discussions, documentation, or blog content
- Upgrade retrieval quality using advanced embeddings (e.g. OpenAI, SentenceTransformers)
- Use production-ready vector stores and search libraries like FAISS for faster, more scalable search
- Run sentiment analysis on Slack threads to enrich metadata or improve filtering
- Deploy as a Slack bot to enable direct user interaction and real-time answers
🧭 Final Thoughts
If you're building LLM-based applications, consider using Kedro as your framework - not just for data pipelines, but for structuring everything around GenAI workflows.
And with a RAG setup like this, you’ll drastically improve accuracy by grounding your chatbot in real-world knowledge - not just whatever the model “thinks” is right.