If you’ve been following AI lately, you know reasoning is the next big thing. AI models aren’t just completing sentences anymore — they’re solving problems, making decisions, and thinking through complex scenarios. Now, Google’s new Gemini 2.5 Pro has entered the arena, claiming to outthink every other reasoning AI model out there.

So, what’s new in Gemini 2.5 Pro? And how does it actually stack up against other top models like OpenAI’s o3 Mini, DeepSeek R1, and Grok 3 Thinking? I tested them all using real-world prompts.
What is Gemini 2.5 Pro?
Google just unveiled its most powerful AI model — Gemini 2.5 Pro. It’s a reasoning model, so it can solve complex problems by thinking step-by-step to reach a logical answer. It understands multimodal inputs — like text, images, audio, and video. It’s currently available to Gemini Advanced users and free to try in Google’s AI Studio.
Gemini 2.5 Pro scored 18.8% on Humanity’s Last Exam, the highest among all reasoning models without using tools or search. HLE is a rigorous benchmark designed to assess expert-level reasoning across a wide range of subjects. For context, o3 Mini scored 14% and DeepSeek R1 scored 8.6%.
Gemini 2.5 Pro also leads in multiple other benchmarks, and Google claims it is significantly better at reasoning and coding. On LMArena, where users vote for the better answer, Gemini 2.5 Pro topped the chart with a score of 1,443, higher than any other AI model at the time. The only model with a higher HLE score was OpenAI’s Deep Research at 26.6%, but that agentic system relies on tools and web search, so it falls outside the no-tools comparison.

Here’s what you need to get excited about in Gemini 2.5 Pro:
- Excels at coding: Gemini 2.5 Pro reasons noticeably better about code. It can build complex simulations, games, and demos with ease.
- Stable 1M-token support: The model supports a 1-million-token context window, going up to 2 million soon. Older models technically supported 1M tokens, but 2.5 Pro handles that much context more reliably and at scale. So if you’re uploading long documents or code projects, this model will perform better.
- Multimodal understanding: It doesn’t just work with images and audio — it understands them better than previous models. When generating code for simulations and demos, its visual understanding plays a big role.
As you can see, the model’s major advantage is coding — especially where logic and multimodal understanding are involved. So, let’s see how it performs in real-world tests compared to other popular reasoning models out there.
Gemini 2.5 Pro vs Other AI Reasoning Models
Since the model is strong in multimodal understanding and coding, I started by testing those areas.
1. Rubik’s Cube Simulation (Code Test)
First, I provided a detailed prompt to create a Rubik’s Cube simulation with scramble and solve options. I asked for it in p5.js without HTML and listed all the features, functions, and technical tools needed to create the animations.
To my surprise, Gemini delivered. While the solve option didn’t work perfectly, I was able to rotate the cube manually and use the scramble option successfully.

I also ran the same prompt through the other models, but none of them produced a working result. Frankly, Gemini 2.5 Pro is the first model I’ve seen get simulations and demos like this right; it simply wasn’t possible with any other AI model before.
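To give a sense of the structure the prompt asked for, here’s a minimal p5.js scaffold of my own (a rough sketch, not Gemini’s output): 27 cubies rendered in WEBGL, with mouse rotation handled by orbitControl(). The sticker colors and the scramble/solve animation logic are exactly the parts the models had to work out.

```javascript
// Minimal p5.js scaffold for a Rubik's Cube render (illustrative only, not the full simulation).
// Run it in the p5.js web editor or any page that loads the p5.js library.

const SPACING = 55; // distance between cubie centres
const SIZE = 50;    // edge length of each cubie
let cubies = [];

function setup() {
  createCanvas(600, 600, WEBGL);
  // Build the 3x3x3 grid of cubie positions.
  for (let x = -1; x <= 1; x++) {
    for (let y = -1; y <= 1; y++) {
      for (let z = -1; z <= 1; z++) {
        cubies.push({ x, y, z });
      }
    }
  }
}

function draw() {
  background(30);
  orbitControl(); // drag with the mouse to rotate the whole cube
  for (const c of cubies) {
    push();
    translate(c.x * SPACING, c.y * SPACING, c.z * SPACING);
    stroke(0);
    fill(200, 60, 60); // one flat colour; per-face stickers are left to the model
    box(SIZE);
    pop();
  }
}
```

Gemini’s actual answer went well beyond this skeleton, adding sticker colors, animated face turns, and the scramble and solve controls.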
2. Logic Puzzle (Reasoning Test)
I also tested some reasoning-based prompts. Here’s one; it doesn’t have a definitive answer:
On an island, every inhabitant is either a knight, who always tells the truth, or a knave, who always lies.
You meet three inhabitants: A, B, and C.
A says, “B is a knave.”
B says, “C is a knight.”
C says, “A is a knight.”
Who is what?
Let’s see which model can figure out that this is a paradox. Gemini took just 24 seconds to identify that it’s a paradoxical situation with no consistent answer. OpenAI’s o3 Mini and Grok both took around 40 seconds and reached the same conclusion. DeepSeek R1, however, took 434 seconds and got it wrong on the first attempt, though it did get it right when asked again.
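For the curious, the puzzle really is contradictory. Here’s a quick brute-force check I put together to confirm it (my own sketch, not any model’s output):

```javascript
// Brute-force check of the knight/knave puzzle.
// true = knight (always tells the truth), false = knave (always lies).
const consistent = [];
for (const A of [true, false]) {
  for (const B of [true, false]) {
    for (const C of [true, false]) {
      const aOk = A === (B === false); // A says "B is a knave"
      const bOk = B === (C === true);  // B says "C is a knight"
      const cOk = C === (A === true);  // C says "A is a knight"
      if (aOk && bOk && cOk) consistent.push({ A, B, C });
    }
  }
}
console.log(consistent); // prints [], since no assignment satisfies all three statements
```

Every one of the eight possible knight/knave assignments breaks at least one statement, which is exactly the paradox Gemini identified.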



This isn’t just a one-off case; DeepSeek tends to stumble on more complex questions. That said, the overall difference isn’t huge, as all of these models reasoned their way to the correct answer on most of my logic prompts.
3. Physics Problem (Math Test)
Next, I tested all the models with some math problems. o3 Mini has led in math until now, but Gemini 2.5 Pro posts better benchmark scores. Here’s one example:
A high-speed train moves at a constant speed of 0.9c through a tunnel that is 2 kilometers long (as measured by a stationary observer). How much time does the train take to pass through the tunnel according to the stationary observer? How much time passes for a passenger sitting on the train (considering time dilation)?
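For reference, here’s the arithmetic the models should converge on. This is my own quick check (assuming c ≈ 3 × 10^8 m/s), not any model’s output:

```javascript
// Quick sanity check of the relativity problem.
const c = 3e8;     // speed of light in m/s (rounded)
const v = 0.9 * c; // train speed
const L = 2000;    // tunnel length in metres, measured in the stationary frame

// Time in the stationary observer's frame: t = L / v
const tObserver = L / v; // ~7.41e-6 s

// Lorentz factor: gamma = 1 / sqrt(1 - v^2/c^2)
const gamma = 1 / Math.sqrt(1 - (v / c) ** 2); // ~2.294

// Proper time for the passenger: t' = t / gamma
const tPassenger = tObserver / gamma; // ~3.23e-6 s

console.log(`Observer: ${(tObserver * 1e6).toFixed(2)} µs, Passenger: ${(tPassenger * 1e6).toFixed(2)} µs`);
```

So the passage takes roughly 7.4 µs for the stationary observer and roughly 3.2 µs for the passenger.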



All models solved this accurately and provided clear, step-by-step explanations. While Gemini leads in math benchmarks, the actual performance gap is minimal — all models handled most problems well.
Gemini 2.5 Pro
Gemini 2.5 Pro is a massive improvement over 2.0 Flash Thinking. In pure reasoning, it’s more or less on the same level as o3 Mini, Grok 3, and DeepSeek R1, but its multimodal understanding finally delivers clearly better results. Either way, Gemini has now officially joined the top tier of reasoning models.