Abstract
Can you throw an LLM at a production incident and expect useful results? A candid look from someone who runs a distributed AI system and reaches for Claude before reaching for a dashboard. Surprises, failures, and why the answer matters for every engineer carrying a pager.
Speaker
Alex Palcuie
Member of Technical Staff in AI Reliability Engineering @Anthropic, Previously Staff Site Reliability Engineer on Google Cloud Platform
Alex Palcuie is a Member of Technical Staff in AI Reliability Engineering at Anthropic, where he works on keeping Claude reliable at scale. He has the unenviable task of fixing Claude without Claude when it goes down. Previously, he was a Staff Site Reliability Engineer on Google Cloud Platform (GCP) and a member of Google's Tech IRT (Incident Response Team), handling large-scale infrastructure incidents, including the kind where datacentres flood.