Claude's newest model is one step forward and two steps back, and it's infuriating
At a glance:
- Anthropic's Opus 4.7 improves instruction following but reduces creative problem-solving capabilities
- The model improves on 12 of 14 reported benchmarks but consumes up to 35% more tokens
- Users report increased hallucinations despite higher honesty scores
The Model's New Capabilities
Anthropic's Opus 4.7 represents a significant leap in software engineering and agentic safety, with the company reporting improvements on 12 of the 14 benchmarks it published for the release, relative to Opus 4.6. The model follows user instructions more precisely; Anthropic notes that 4.7 now takes directions literally rather than interpreting them loosely. This is particularly notable for developers, as the model now executes specific code modifications without unnecessary refactoring. Asked to change a single line of code, for instance, Opus 4.7 changes that line and nothing else, a marked improvement over previous versions, which often rewrote surrounding code or added unsolicited suggestions.
However, this increased precision comes with notable trade-offs. The model's strict adherence to instructions has reduced its capacity for independent problem-solving. In testing, Opus 4.7 frequently failed to perform web searches even when the context clearly required one, such as when asked about OpenClaw, a 1997 platformer game. Where Opus 4.6 would have requested clarification, 4.7 gave an incorrect answer without attempting to verify the information. This behavior aligns with Anthropic's own admission that the model's honesty score (91.7% on the MASK benchmark) reflects technical accuracy rather than real-world knowledge verification.
The Trade-Offs of Precision
The most striking criticism of Opus 4.7 is its apparent 'laziness' in task execution: users report that the model often does the bare minimum the instructions require, avoiding deeper analysis unless explicitly prompted. This was evident in one Reddit user's experience, where Opus 4.7 incorrectly denied the existence of a Claude for Excel add-in even though the tool is available. When confronted, the model admitted it should have searched the web, then repeated the same error in subsequent queries; the user counted seven such admissions of self-correction in a single conversation.
The model's token consumption has also risen significantly, with Anthropic reporting usage of 1.0x to 1.35x that of Opus 4.6. The same prompt can therefore cost up to 35% more, compounded by the extra back-and-forth interactions needed as users guide the model through complex tasks. The updated tokenizer, while improving text mapping, makes for a more resource-intensive experience.
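To make the multiplier concrete, here is a rough back-of-the-envelope calculation. The token counts and per-million-token price below are made-up placeholders for illustration, not Anthropic's actual rates; only the 1.35x upper bound comes from the reported figures:

```python
# Illustrative cost impact of a token-usage multiplier.
# Prices and token counts are hypothetical placeholders, not real rates.

def prompt_cost(tokens: int, price_per_million: float, multiplier: float = 1.0) -> float:
    """Dollar cost of one response, scaled by a token-usage multiplier."""
    return tokens * multiplier * price_per_million / 1_000_000

baseline = prompt_cost(2_000, price_per_million=15.0)                      # Opus 4.6-style usage
worst_case = prompt_cost(2_000, price_per_million=15.0, multiplier=1.35)   # reported upper bound

print(f"baseline:   ${baseline:.4f}")
print(f"worst case: ${worst_case:.4f}  (+{worst_case / baseline - 1:.0%})")
```

At these placeholder numbers, the worst case works out to $0.0405 versus $0.0300, the full 35% premium before any extra clarification turns are counted.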
User Experience and Reception
Despite its technical advancements, Opus 4.7 has sparked mixed reactions from the developer community. Many users appreciate the improved instruction following, particularly for code modification tasks where precision is critical. However, the model's reduced capacity for creative problem-solving has frustrated users who rely on AI for exploratory tasks. One developer noted that Opus 4.7's strict adherence to instructions made it less effective for brainstorming sessions, where previous models would offer multiple solution paths.
The model's increased token costs have also raised concerns about accessibility. For users with limited API budgets, an increase of up to 35% in token usage can significantly impact their workflow. This is particularly problematic for small businesses and independent developers, who may struggle to absorb the higher cost of using the model for routine tasks.
Broader Implications for AI Development
Opus 4.7's release highlights a growing tension in AI development between precision and flexibility. While the model's improvements in instruction following and safety are technically impressive, they come at the cost of reduced adaptability. This trade-off raises important questions about the future direction of AI models - should they prioritize strict compliance with user instructions, or should they maintain the ability to think creatively and independently?
The controversy surrounding Opus 4.7 also underscores the challenges of benchmarking AI models. While Anthropic's MASK honesty score suggests technical improvements, real-world performance metrics tell a different story. The model's increased hallucinations and reduced problem-solving capacity demonstrate that benchmark scores don't always translate to practical usability.
What's Next for Anthropic
Anthropic has not yet announced plans to address the criticisms of Opus 4.7, but the company's history of rapid iteration suggests updates are likely. Potential improvements could include better web search integration, more flexible instruction handling, and optimized token usage. However, any changes would need to balance the model's strengths in precision with the need for creative problem-solving capabilities.
The Opus 4.7 controversy may also influence how other AI companies approach model development. As the industry continues to push for more capable models, the debate over precision versus flexibility will likely intensify. For now, Opus 4.7 serves as a cautionary tale about the complexities of AI development and the challenges of balancing technical capabilities with user needs.
FAQ
What are the main improvements in Opus 4.7?
Anthropic reports gains on 12 of 14 published benchmarks, stricter and more literal instruction following, and precise code modifications without unsolicited refactoring.
What are the main criticisms of Opus 4.7?
Users cite 'lazy' minimum-effort task execution, a failure to search the web when context requires it, increased hallucinations despite a high MASK honesty score, and token usage up to 35% higher than Opus 4.6.
How does Opus 4.7 compare to previous models?
It is more precise and literal than Opus 4.6, but less willing to ask for clarification or problem-solve independently, and more expensive to run.