What?! o3, DeepSeek-R1, Claude, and Gemini Are Just Pretending to Think? — AI Innovations and Insights 51
Welcome back, let’s dive into Chapter 51 of this insightful series!
Since the second half of 2024, we've seen a wave of Large Reasoning Models (LRMs) (AI Exploration Journey: LLM Reasoning) — like OpenAI o1/o3, DeepSeek-R1, and Gemini Thinking — making impressive strides in reasoning tasks. They've even started becoming part of our everyday digital lives.
But have we really paused to ask a critical question: Are these models capable of generalizable reasoning, or are they leveraging different forms of pattern matching?
In addition, most of the evaluations focus on math and coding benchmarks, with an emphasis on whether the final answer is right or wrong. The problem? These benchmarks are often contaminated by training data and don’t tell us much about how the model gets to its answer — or how “reasoned” that process actually is.
"The Illusion of Thinking" makes a compelling case: if we really want to understand how LRMs reason, we need controlled environments with adjustable complexity — and a way to actually look inside the model's thinking process, not just the final result.

Figure 1 breaks the four puzzle environments down visually.
Each column shows how the puzzle evolves, from the initial state (top), through an intermediate state, to the target state (bottom). The examples include Tower of Hanoi (moving disks across pegs), Checkers Jumping (swapping colored tokens), River Crossing (ferrying entities across the river), and Blocks World (rearranging stacked blocks).
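To make "controllable complexity" concrete, here is a minimal Python sketch of one such environment, Tower of Hanoi. This is my own illustration, not the paper's code: the number of disks `n` is the single complexity knob, and a move validator lets you score every intermediate move a model proposes rather than just its final answer.

```python
# Minimal Tower of Hanoi environment; names and structure are illustrative.
# Complexity is controlled by a single knob: the number of disks n.

def initial_state(n):
    """All n disks on peg 0, largest (n) at the bottom, smallest (1) on top."""
    return [list(range(n, 0, -1)), [], []]

def is_legal(state, move):
    """A move (src, dst) is legal if src is non-empty and its top disk
    is smaller than the top disk of dst (or dst is empty)."""
    src, dst = move
    if not state[src]:
        return False
    return not state[dst] or state[src][-1] < state[dst][-1]

def score_solution(n, moves):
    """Replay a model-proposed move list step by step.
    Returns (number of valid moves before the first illegal one, solved?)."""
    state, goal = initial_state(n), list(range(n, 0, -1))
    for i, (src, dst) in enumerate(moves):
        if not is_legal(state, (src, dst)):
            return i, False          # first illegal move ends the episode
        state[dst].append(state[src].pop())
    return len(moves), state[2] == goal

# Example: 3 disks, the optimal 2**3 - 1 = 7-move solution.
optimal = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
print(score_solution(3, optimal))    # -> (7, True)
```

The same pattern of a compact state, a legality check, and a step-by-step scorer carries over to the other three puzzles, which is what makes it possible to dial difficulty smoothly and inspect the reasoning trace move by move.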
Key Findings
We can skip the experimental results tables and jump straight into the interesting findings.
Three performance regimes: On low-complexity tasks, it's surprisingly the standard LLMs that do better. On medium-complexity tasks, LRMs show a clear edge thanks to their additional reasoning power. But once complexity gets high, both types of models break down completely.
Overthinking phenomenon: For simple problems, LRMs often find the correct answer early, then keep reasoning anyway, burning tokens on incorrect alternatives. It's a classic case of overthinking.
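One rough way to quantify this, assuming you can extract the intermediate candidate answers and running token counts from the thinking trace (the `trace` format and `is_correct` checker below are hypothetical stand-ins, not the paper's instrumentation): locate the first correct candidate and measure how much of the thinking budget is spent after it.

```python
def overthinking_ratio(trace, is_correct):
    """trace: non-empty list of (candidate_solution, tokens_used_so_far) pairs
    extracted from the model's thinking trace, in order of appearance.
    is_correct: task-specific checker for a candidate solution.
    Returns the fraction of thinking tokens spent AFTER the first correct
    candidate (0.0 = no wasted effort), or None if no candidate is correct."""
    total_tokens = trace[-1][1]
    for candidate, tokens_so_far in trace:
        if is_correct(candidate):
            return (total_tokens - tokens_so_far) / total_tokens
    return None
```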
Reasoning collapse point: As task complexity ramps up, there's a sudden drop: accuracy doesn't just decline, it crashes to zero, and LRMs fall apart completely. This suggests they've failed to develop a problem-solving ability that truly generalizes.
Counterintuitive limits on reasoning effort: As task complexity increases beyond a certain point, LRMs start doing something strange — they actually put in less reasoning effort, measured by the number of tokens used, even when they still have plenty of budget left. This suggests a fundamental ceiling in how well current models can scale their reasoning as problems get harder.
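This is easy to check in your own sweeps, assuming each run logs its complexity level and its thinking-token count (the field names below are hypothetical): average the tokens per level and look for the curve bending downward before accuracy collapses.

```python
from collections import defaultdict

def effort_by_complexity(runs):
    """runs: iterable of dicts like {"complexity": 5, "thinking_tokens": 8192}.
    Returns {complexity: mean thinking tokens}, sorted by complexity.
    The reported pattern: effort rises with complexity, then drops
    past a point even though budget remains."""
    buckets = defaultdict(list)
    for run in runs:
        buckets[run["complexity"]].append(run["thinking_tokens"])
    return {c: sum(v) / len(v) for c, v in sorted(buckets.items())}
```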
Limitations in exact computation: LRMs fail to use explicit algorithms and reason inconsistently across puzzles.
Lack of general algorithmic ability: Even when the exact solution algorithm is handed to the model in the prompt, LRMs still fail to follow through, showing they can't reliably execute rule-based procedures in a systematic way.
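For context, the exact algorithm in question can be tiny. The standard Tower of Hanoi recursion below (my own rendering, not the paper's prompt text) produces the optimal 2**n - 1 move sequence; executing it is pure bookkeeping with no search involved, which is what makes the reported failure so striking.

```python
def hanoi_moves(n, src=0, aux=1, dst=2):
    """Standard recursion: move n-1 disks out of the way, move the largest
    disk, then move the n-1 disks on top of it. Returns 2**n - 1 moves."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, dst, aux)     # n-1 disks: src -> aux
            + [(src, dst)]                        # largest disk: src -> dst
            + hanoi_moves(n - 1, aux, src, dst))  # n-1 disks: aux -> dst

print(len(hanoi_moves(10)))  # 1023 moves, i.e. 2**10 - 1
```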
Weak generalization across tasks: Performance varies significantly from task to task, likely a sign of gaps in training data or a lack of real structural understanding.
Final Thoughts
Despite how powerful today's large reasoning models might look on benchmarks, this study shows there are deeper structural flaws in how they actually "reason." When faced with more complex problems, these models often fail to scale their reasoning strategies or reliably apply even explicitly provided algorithms — revealing a fundamental gap between current AI models and human-like logical thinking.
True reasoning, it seems, isn't just about throwing more tokens at the problem or fine-tuning harder. What's likely needed is a deeper integration of symbolic manipulation and a more general way of representing problems — and that may become a major design challenge for the next generation of reasoning models.
Interestingly, not everyone agrees with the findings. After the study was published, it sparked some debate. One notable response — "The Illusion of the Illusion of Thinking" — argues that many of these so-called reasoning failures actually stem from issues in task design, scoring methods, or evaluation frameworks — not from the models themselves. Interested readers can check it out.