AI can't even run a vending machine -- Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity