Postmortem reveals overlapping issues and outlines improvements for enhanced reliability
HM Journal • about 2 months ago
The technical report, released yesterday, aims to provide transparency into the complex engineering challenges Anthropic faces as it scales its AI services. Claude is served to millions of users through various platforms, including Anthropic's own API, Amazon Bedrock, and Google Cloud's Vertex AI, across diverse hardware like AWS Trainium, NVIDIA GPUs, and Google TPUs. Maintaining consistent response quality across these varied environments is a significant undertaking, and these recent incidents highlight areas where that quality bar wasn't met.
Anthropic's investigation uncovered three separate but overlapping infrastructure bugs. The first, a "context window routing error," was introduced on August 5th. This bug misrouted some requests for Claude's Sonnet 4 model to servers configured for the upcoming 1 million token context window. The bug initially affected only 0.8% of Sonnet 4 requests, but its impact grew sharply on August 29th when a load balancing change inadvertently increased the volume of short-context requests sent to the long-context servers, peaking at 16% of Sonnet 4 requests affected during a single hour on August 31st. Notably, some users experienced more severe degradation because the routing was "sticky": once a user's request was misrouted, subsequent requests from the same user were likely to be misrouted as well. The fix involved correcting the routing logic, with deployment completed across most platforms by September 16th.
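That stickiness is what turned a small misrouting rate into a consistently bad experience for an unlucky subset of users: once a session was pinned to the wrong pool, it stayed there. The sketch below is purely illustrative (the pool names, the session-hash routing, and the function names are assumptions for this article, not Anthropic's serving code), but it shows how deterministic, session-keyed routing repeats a single bad decision on every request.

```python
import hashlib

POOLS = ["standard-context", "long-context-1m"]

def intended_route(long_context_requested: bool) -> str:
    # Intended behavior: only requests asking for the long context window
    # should land on the 1M-token pool.
    return "long-context-1m" if long_context_requested else "standard-context"

def sticky_route(session_id: str) -> str:
    # Sticky behavior (illustrative): the pool is derived from a stable hash
    # of the session id, so whichever pool a user lands on first, every later
    # request from that user lands on it too.
    digest = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
    return POOLS[digest % len(POOLS)]

if __name__ == "__main__":
    # The same user gets the same pool on all three requests; if that pool is
    # the wrong one, the degradation repeats rather than averaging out.
    for _ in range(3):
        print("user-42 ->", sticky_route("user-42"))
```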
The second issue, an "output corruption" bug, stemmed from a misconfiguration deployed to Claude API TPU servers on August 25th. This error caused problems during token generation, where a runtime performance optimization would occasionally assign an inappropriately high probability to tokens that should rarely appear. This could manifest as unexpected characters or syntax errors in responses, even for simple English prompts. The corruption affected Opus 4.1, Opus 4, and Sonnet 4 models between late August and early September. The resolution involved rolling back the faulty configuration on September 2nd and implementing new detection tests for unusual character outputs.
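The postmortem describes those detection tests only at a high level, but the basic shape of such a check is straightforward. Below is a minimal sketch, assuming a simple heuristic that flags responses to plain-English prompts containing an unusually high share of non-ASCII characters; the threshold and the helper names are assumptions, not Anthropic's monitoring code.

```python
def unexpected_char_ratio(text: str) -> float:
    """Fraction of characters outside printable ASCII (whitespace excluded)."""
    if not text:
        return 0.0
    unexpected = sum(
        1 for ch in text
        if not (ch.isascii() and ch.isprintable()) and not ch.isspace()
    )
    return unexpected / len(text)

def looks_corrupted(response: str, threshold: float = 0.05) -> bool:
    # Flag a reply to an English-language prompt if more than ~5% of its
    # characters fall outside printable ASCII. The 5% cutoff is illustrative.
    return unexpected_char_ratio(response) > threshold

print(looks_corrupted("Here is the refactored function you asked for."))            # False
print(looks_corrupted("Here is the refactored функция ฟังก์ชัน you asked for."))      # True
```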
The third bug, an "approximate top-k XLA:TPU miscompilation," also emerged from a code deployment on August 25th aimed at improving token selection. This change inadvertently triggered a latent bug in the XLA:TPU compiler, impacting Claude Haiku 3.5 and potentially other models. The bug's behavior was highly inconsistent, varying with factors such as what operations ran before it and whether debugging tools were in use, which made it exceptionally difficult to pinpoint. The complexity was compounded by a previous workaround for a related issue that had been masking this deeper problem. Anthropic eventually rolled back the affected code and is collaborating with the XLA:TPU team on a compiler fix, while also transitioning to an exact top-k implementation to guarantee precision.
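For readers unfamiliar with the trade-off, the difference between exact and approximate top-k is easy to show. The sketch below contrasts an exact selection with a generic bucketed approximation; it illustrates the general technique only and is not the XLA:TPU kernel involved in the incident.

```python
import numpy as np

def exact_top_k(logits: np.ndarray, k: int) -> set:
    # Indices of the k largest logits -- always correct, at the cost of a
    # full partial sort over the vocabulary.
    return set(np.argpartition(logits, -k)[-k:].tolist())

def approx_top_k(logits: np.ndarray, k: int, num_buckets: int = 64) -> set:
    # A common approximation: split the vocabulary into buckets, keep only
    # each bucket's maximum, then take the top k survivors. If two of the
    # true top-k tokens share a bucket, one of them is silently dropped.
    pad = (-len(logits)) % num_buckets
    padded = np.concatenate([logits, np.full(pad, -np.inf)])
    buckets = padded.reshape(num_buckets, -1)
    winners = buckets.argmax(axis=1) + np.arange(num_buckets) * buckets.shape[1]
    survivors = winners[np.argpartition(padded[winners], -k)[-k:]]
    return set(survivors.tolist())

rng = np.random.default_rng(0)
logits = rng.normal(size=50_000)
overlap = exact_top_k(logits, 5) & approx_top_k(logits, 5)
print(len(overlap))  # 5 when no bucket collisions occur; fewer when they do
```

An exact implementation removes this whole class of error, which is the precision argument behind the switch the report describes.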
What made these incidents particularly challenging was their overlapping nature and the subtle ways they manifested. Initial user reports were difficult to distinguish from normal feedback variations. The load balancing change on August 29th, a seemingly routine update, acted as an accelerant, causing more users to experience issues while others saw normal performance, leading to confusing and contradictory feedback.
Anthropic's standard validation processes, which include benchmarks, safety evaluations, and performance metrics, proved insufficient to catch these specific degradations. The company acknowledges that its evaluations didn't fully capture the user-reported issues, partly because Claude can often recover from isolated errors. Furthermore, its own stringent privacy practices, which limit engineer access to user interactions unless explicitly reported as feedback, made the bugs harder to reproduce and diagnose: engineers couldn't directly examine problematic interactions without user consent, slowing the investigation.
The inconsistent symptoms across different platforms and models painted a picture of random degradation, obscuring the underlying infrastructure problems. And because the evaluations were "noisy," even awareness of an uptick in negative reports online didn't immediately point to any specific recent change.
In response to these events, Anthropic is implementing several key changes to bolster its infrastructure and evaluation processes. The company is developing more sensitive evaluations designed to reliably differentiate between correct and incorrect model implementations, aiming for closer monitoring of overall quality. It will also expand the scope and frequency of quality evaluations, running them continuously on production systems to catch issues like the context window load balancing error in real time.
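The report doesn't detail how those continuous production evaluations will work, but the general pattern is well known: periodically send fixed prompts with known-good answers through the production serving path and alert when the pass rate dips. A minimal sketch under those assumptions follows; the canary prompts, the client hook, and the alert threshold are placeholders, not Anthropic's tooling.

```python
import time
from typing import Callable

# Hypothetical canary prompts a healthy model should answer correctly
# essentially every time.
CANARIES = [
    ("What is 2 + 2? Reply with just the number.", "4"),
    ("Spell the word 'cat' backwards.", "tac"),
]

def run_canary_pass(ask_model: Callable[[str], str]) -> float:
    # Fraction of canary prompts answered correctly on this pass.
    passed = sum(
        1 for prompt, expected in CANARIES
        if expected.lower() in ask_model(prompt).lower()
    )
    return passed / len(CANARIES)

def monitor(ask_model: Callable[[str], str],
            interval_s: int = 300, alert_below: float = 1.0) -> None:
    # Poll production continuously; a sustained dip below the threshold is
    # the kind of signal that could surface a misrouting or corruption bug
    # well before user reports pile up.
    while True:
        if run_canary_pass(ask_model) < alert_below:
            print("quality regression detected -- alert the serving team")
        time.sleep(interval_s)

if __name__ == "__main__":
    # Stand-in client for demonstration; a real deployment would call the
    # production API here.
    fake_model = lambda prompt: "4" if "2 + 2" in prompt else "tac"
    print(run_canary_pass(fake_model))  # 1.0 -> healthy
```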
Crucially, Anthropic is investing in faster debugging tooling that will allow for better analysis of community-sourced feedback without compromising user privacy. This includes developing bespoke tools to expedite remediation in future incidents.
Users are encouraged to keep reporting issues directly, for instance via the /bug command in Claude Code or the "thumbs down" button in the Claude apps. Developers and researchers with novel evaluation methods are also invited to share their insights by contacting feedback@anthropic.com. This collaborative approach, Anthropic believes, is essential for maintaining the high standards users expect from Claude.