Abstract
As LLM systems move from prototypes to production, the gap between benchmark performance and real-world reliability becomes impossible to ignore. Models that score well on benchmarks can still fail unpredictably when facing the complexity, ambiguity, and edge cases of real users. So how do we actually know if our AI systems are working?
In this practical, example-driven talk, we'll explore why robust evaluation, both before and after deployment, is the key to building trustworthy AI systems. We'll cover the full evaluation lifecycle: offline evaluation before release, from automated metrics to human evaluation; and online evaluation in production, from observability to A/B testing. Drawing on examples from health AI, where safety, consistency, and reliability are non-negotiable, we'll show how these practices apply to any domain where AI needs to work reliably at scale.
By the end of this session, you'll walk away with an end-to-end framework for building a robust feedback flywheel that supports continuous, evaluation-driven development of LLM-powered products.
Speaker
Clara Matos
Director of Applied AI @Sword Health, focused on building and scaling machine learning systems
Clara enjoys working at the intersection of Machine Learning, Product, and Engineering, solving problems in a pragmatic and iterative way. She currently leads Applied AI at Sword Health, where her team is reinventing how patients access and receive care by creating a more human, more clinically effective, and more scalable way to treat patients. She is focused on building and scaling machine learning systems that help achieve Sword's mission of freeing 2 million people from pain.