Disclaimer: This summary has been generated by AI. It is experimental, and feedback is welcomed. Please reach out to info@qconlondon.com with any comments or concerns.
The presentation titled Building Embedding Models for Large-Scale Real-World Applications by Sahil Dua covers the fundamentals of embedding models and offers practical insights into deploying them in large-scale applications. Here is a structured summary of the key points from the presentation:
Introduction to Embedding Models:
- Embedding models transform data into meaningful vector representations, useful in various applications like search, recommendations, and retrieval-augmented generation (RAG).
- Embeddings for similar inputs lie close to each other in the vector space, while embeddings for dissimilar inputs lie far apart (see the sketch below).
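To make the notion of closeness concrete, here is a minimal sketch (not from the talk) using the open-source sentence-transformers library; the model name is a common public default chosen for illustration, not necessarily what the speaker's team uses.

```python
# Minimal sketch (not from the talk): embed two related sentences and one
# unrelated sentence, then compare them with cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

# "all-MiniLM-L6-v2" is a common open-source default, used here purely
# for illustration; production systems may use very different models.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "How do I reset my password?",
    "I forgot my password and need to change it.",
    "Best hiking trails near Geneva.",
]
embeddings = model.encode(sentences, normalize_embeddings=True)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Vectors are L2-normalized above, so the dot product is the cosine.
    return float(np.dot(a, b))

print(cosine(embeddings[0], embeddings[1]))  # high: similar meaning
print(cosine(embeddings[0], embeddings[2]))  # low: unrelated topics
```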
Applications:
- Retrieving the best-matching documents, passages, images, or videos from vast data collections (see the retrieval sketch after this list).
- Generating personalized recommendations based on user preferences.
- RAG applications for enhancing language model outputs with factual accuracy.
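As a concrete reference for the retrieval use case above, a hedged sketch of brute-force nearest-neighbor search over precomputed document embeddings; at real scale this linear scan would be replaced by an approximate index such as FAISS or ScaNN.

```python
import numpy as np

def top_k_documents(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 5):
    """Return indices of the k documents most similar to the query.

    Assumes all embeddings are L2-normalized, so the dot product equals
    cosine similarity. This brute-force scan is for illustration only;
    large collections call for an approximate nearest-neighbor index.
    """
    scores = doc_embs @ query_emb            # (num_docs,) similarity scores
    return np.argsort(-scores)[:k].tolist()

# Toy usage with random stand-in embeddings (hypothetical data).
rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 384))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[42] + 0.05 * rng.normal(size=384)
query /= np.linalg.norm(query)
print(top_k_documents(query, docs, k=3))     # index 42 should rank first
```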
Model Lifecycle:
- Designing architectures that cater to specific serving requirements.
- Distilling large models into smaller, efficient production models.
- Optimizing model serving with techniques like post-training quantization (see the sketch below).
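As one illustration of post-training quantization, a minimal PyTorch sketch applying dynamic int8 quantization to a stand-in encoder; this is a generic starting point, not the specific optimization pipeline described in the talk.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in encoder: a real embedding model would be a full
# transformer, but any module with Linear layers illustrates the idea.
encoder = nn.Sequential(
    nn.Linear(384, 384),
    nn.ReLU(),
    nn.Linear(384, 384),
)

# Post-training dynamic quantization: Linear weights are stored as int8
# and dequantized on the fly, shrinking the model and often speeding up
# CPU inference, with no retraining required.
quantized = torch.quantization.quantize_dynamic(
    encoder, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 384)
print(quantized(x).shape)  # same interface, smaller and faster model
```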
Challenges and Solutions:
- Addressing query latency and document-retrieval efficiency using dynamic batching and model quantization.
- Training models effectively with techniques like contrastive learning (see the loss sketch after this list).
- Practical strategies: transitioning embedding models from research to production while ensuring high performance and scalability.
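As a concrete reference for the contrastive-learning point, a hedged sketch of an in-batch InfoNCE-style loss, a standard formulation for training embedding models on (query, positive document) pairs; the exact objective used by the speaker's team may differ.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_embs: torch.Tensor,
                  doc_embs: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss for (query, positive-document) pairs.

    Row i of doc_embs is the positive for row i of query_embs; every
    other row in the batch serves as a negative ("in-batch negatives").
    """
    q = F.normalize(query_embs, dim=-1)
    d = F.normalize(doc_embs, dim=-1)
    logits = q @ d.T / temperature           # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)
    # Cross-entropy pulls each query toward its positive document and
    # pushes it away from the in-batch negatives.
    return F.cross_entropy(logits, labels)

# Toy usage with random stand-in embeddings (hypothetical data).
loss = info_nce_loss(torch.randn(8, 384), torch.randn(8, 384))
print(loss.item())
```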
This is the end of the AI-generated content.
Embedding models are at the core of search, recommendation, and retrieval-augmented generation (RAG) systems, transforming data into meaningful representations. We can adapt state-of-the-art large language models (LLMs) into embedding models that generate high-quality embeddings, but deploying these models in large-scale applications presents significant challenges.
This talk explores the end-to-end lifecycle of embedding systems, including:
- Leveraging LLMs for high-quality embeddings and adapting them for domain-specific use cases using contrastive learning.
- Designing custom architectures optimized for use-case specific serving requirements.
- Distilling large embedding models into smaller, production-friendly sizes (see the sketch after this list).
- Serving embeddings efficiently with optimization strategies like variable batch sizes and post-training quantization.
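To illustrate the distillation step, a minimal sketch assuming embedding-space distillation with an MSE objective, one common recipe for compressing embedding models; the talk may describe a different setup.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: a frozen "teacher" embedding model and a much
# smaller "student" trained to reproduce the teacher's output vectors.
teacher = nn.Sequential(nn.Linear(384, 1024), nn.ReLU(), nn.Linear(1024, 256))
student = nn.Sequential(nn.Linear(384, 128), nn.ReLU(), nn.Linear(128, 256))
teacher.eval()

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3)

for step in range(200):
    batch = torch.randn(32, 384)        # stand-in for real input features
    with torch.no_grad():
        target = teacher(batch)         # teacher embeddings, no gradients
    loss = nn.functional.mse_loss(student(batch), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The student now approximates the teacher's embedding space at a
# fraction of the serving cost.
```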
Attendees will leave with practical strategies for scaling embedding models from research to production, ensuring high performance and efficiency in real-world applications such as retrieving the best-matching documents, passages, or images, data de-duplication, generating personalized recommendations, content clustering, and grounding GenAI responses with a RAG approach.
Speaker

Sahil Dua
Senior Software Engineer, Machine Learning @Google, Stanford AI, Co-Author of “The Kubernetes Workshop”, Open-Source Enthusiast
Sahil Dua is a Tech Lead focused on developing and adapting large language models (LLMs), with expertise in representation learning. He oversees the full LLM lifecycle, from designing data pipelines and model architectures to optimizing models for highly efficient serving. Before Google, Sahil worked on the ML platform at Booking.com, scaling machine learning model development and deployment.
A co-author of “The Kubernetes Workshop” and an open-source enthusiast, Sahil has contributed to projects like Git, Pandas, and Linguist. As a frequent speaker at global conferences, he shares insights on AI, machine learning, and tech innovation, inspiring professionals across the industry.