The Data Backbone of LLM Systems

Any LLM application has four dimensions you must carefully engineer: code, data, models, and prompts. Each dimension influences the others, which is why you must learn how to track and manage all of them. The catch is that each dimension has particularities that require unique strategies and tooling, so the SWE and DevOps principles that work for code cannot be applied directly to the others.

This presentation digs into the data dimension and what it looks like when building LLM applications. We will start by exploring a general framework that acts as the foundation. We will look at how the data flows, focusing on the data and feature pipelines and on how the data should be stored so it can be correctly shared, versioned, processed, and analyzed for RAG, training, and inference. Next, we will zoom into how the data is accessed during LLM fine-tuning and within the inference pipeline, which can be implemented as a RAG workflow or as something more complex, such as an agent.
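The feature pipeline described above (ingest, chunk, embed, store) and the inference-time retrieval step can be sketched in a few lines. This is a minimal illustration, not the speaker's actual implementation: the hashing-based embedding and the in-memory `VectorStore` class are stand-ins for a real embedding model and vector database.

```python
import hashlib
import math


def chunk(document: str, size: int = 40) -> list[str]:
    """Split a raw document into fixed-size character chunks."""
    return [document[i:i + size] for i in range(0, len(document), size)]


def embed(text: str, dim: int = 512) -> list[float]:
    """Toy embedding: hash each token into a normalized bag-of-words vector."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[int(hashlib.md5(token.encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class VectorStore:
    """Minimal in-memory stand-in for a real vector database."""

    def __init__(self) -> None:
        self.records: list[tuple[str, list[float]]] = []

    def add(self, text: str) -> None:
        self.records.append((text, embed(text)))

    def search(self, query: str, k: int = 1) -> list[str]:
        # Rank stored chunks by cosine similarity to the query embedding.
        q = embed(query)
        ranked = sorted(
            self.records,
            key=lambda rec: -sum(a * b for a, b in zip(q, rec[1])),
        )
        return [text for text, _ in ranked[:k]]


# Feature pipeline: ingest -> chunk -> embed -> load into the vector store.
store = VectorStore()
for doc in [
    "Fine-tuning adapts an LLM to a domain.",
    "RAG retrieves context at inference time.",
]:
    for piece in chunk(doc):
        store.add(piece)

# Inference pipeline: retrieve context, then assemble the prompt.
question = "What does RAG do?"
context = store.search(question, k=1)
prompt = f"Context: {context[0]}\nQuestion: {question}"
```

The point of the sketch is the separation of concerns the talk argues for: the feature pipeline writes versioned, embedded chunks into shared storage once, and the inference pipeline only reads from that storage at query time.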

To fully understand how the framework works for building LLM applications, we will examine two concrete use cases, architecting the data layer of the following systems:

  • An LLM Twin: Your Digital AI Replica (Use case used in our LLM Engineer's Handbook)
  • A Second Brain AI Assistant (Use case used in our latest open-source course, freely available on Decoding ML)

We will present the specific implementation details, tooling, and problems encountered in these two use cases to fully understand the data-related generalities and particularities of an LLM system.
 

Interview:

What is the focus of your work?

I actively work on building LLM, RAG, and information retrieval systems.

What’s the motivation for your talk?

I want to show people a framework for designing the data layer of RAG and LLM systems using MLOps/LLMOps best practices.

Who is your talk for?

Software/ML/AI/Data engineers or data scientists.

What do you want someone to walk away with from your presentation?

The ability to architect the data layer of an LLM/RAG system.

What do you think is the next big disruption in software?

Workflows and agents


Speaker

Paul-Emil Iusztin

Senior ML/AI Engineer, MLOps, Founder @Decoding ML

Paul Iusztin is a senior AI/ML engineer with over seven years of experience building GenAI, Computer Vision and MLOps solutions. His latest contribution was at Metaphysic, where he was one of the core AI engineers who took large GPU-heavy models to production. He previously worked at CoreAI, Everseen, and Continental.

He is the co-author of the LLM Engineer's Handbook, a bestseller on Amazon, which presents a hands-on framework for building LLM applications.

Paul is the Founder of Decoding ML, an educational channel on production-grade AI that provides code, posts, articles, and courses, inspiring others to build real-world AI systems. Through Decoding ML, he collaborated with companies such as MongoDB, Comet, Qdrant, ZenML and 11 other AI companies. 

Connect with him on LinkedIn.

Subscribe to Decoding ML for weekly content on AI.


Date

Wednesday Apr 9 / 02:45PM BST (50 minutes)

Location

Fleming (3rd Fl.)


From the same track

Session: Data Architecture

Reliable Data Flows and Scalable Platforms: Tackling Key Data Challenges

Wednesday Apr 9 / 10:35AM BST

There are a few common and mostly well-known challenges when architecting for data. For example, many data teams struggle to move data in a stable and reliable way from operational systems to analytics systems.


Matthias Niehoff

Head of Data and Data Architecture @codecentric AG, iSAQB Certified Professional for Software Architecture

Session: Data Engineering

Building a Global Scale Data Platform with Cloud-Native Tools

Wednesday Apr 9 / 01:35PM BST

As businesses increasingly operate in hybrid and multi-cloud environments, managing data across these complex setups presents unique challenges and opportunities. This presentation provides a comprehensive guide to building a global-scale data platform using cloud-native tools.


George Hantzaras

Engineering Director, Core Platforms @MongoDB, Open Source Ambassador, Published Author

Session

Achieving Precision in AI: Retrieving the Right Data Using AI Agents

Wednesday Apr 9 / 11:45AM BST

In the race to harness the power of generative AI, organizations are discovering a hidden challenge: precision.


Adi Polak

Director, Advocacy and Developer Experience Engineering @Confluent, Author of "Scaling Machine Learning with Spark" and "High Performance Spark 2nd Edition"

Session: Data Architecture

Beyond the Warehouse: Why BigQuery Alone Won’t Solve Your Data Problems

Wednesday Apr 9 / 03:55PM BST

Many organizations mistake the adoption of a data warehouse, like BigQuery, as the golden ticket to solving all their data challenges. But without a robust data strategy and architecture, you’re simply shifting chaos into the cloud.


Sarah Usher

Data & Backend Engineer, Community Director, Mentor