
OpenAI o3-Mini-High vs Gemini 2.0 Flash Thinking: We Tested With 5 Prompts

by Ravi Teja KNTS

AI models usually spit out answers instantly, but reasoning models like o3-Mini High and Gemini 2.0 Flash Thinking take a different approach—they think before they respond. Instead of rushing to a conclusion, they work through problems step by step to give more logical answers. Both are the latest lightweight versions of the reasoning models from OpenAI and Google, respectively.

But here’s the catch—Google made Gemini 2.0 Flash Thinking completely free, while OpenAI locked o3-Mini High behind a ChatGPT Plus subscription. So, does paying for o3-Mini High actually make a difference, or is Gemini’s free offering good enough?

On paper, o3-Mini High performs marginally better in a few benchmarks, but is that gap noticeable in real-world use? To find out, we put them to the test with five tough challenges, from complex math to tricky logic puzzles. The goal is to see which AI explains its reasoning better, gets more accurate answers, and responds faster. So let's begin.

1. Puzzle-Based Reasoning

I started the test with the same puzzle prompt I used to evaluate the DeepSeek R1 and OpenAI o1 models. This question has no valid answer, so the goal is to see which model can correctly identify that.

On an island, every inhabitant is either a knight, who always tells the truth, or a knave, who always lies. You meet three inhabitants: A, B, and C.
A says, 'B is a knave.'
B says, 'C is a knight.'
C says, 'A is a knight.'
Who is what?

OpenAI models do not show their entire reasoning process—they simply think and provide a final answer. In contrast, Gemini reveals its reasoning, though it is not as user-friendly as DeepSeek R1’s approach. Still, it offers some insight into how it arrives at its conclusions.

Coming to the results, OpenAI takes a clear lead. It was able to identify that the question has no valid answer in under 15 seconds.

Gemini, on the other hand, spent roughly three times as long and generated a wrong answer. Reading through its reasoning process, I can see that Gemini's first conclusion was that the question has no answer; however, it kept thinking and talked itself into an incorrect one.
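For the curious, the unsolvability is easy to verify by brute force. Here is a minimal Python sketch that checks all eight knight/knave assignments:

```python
from itertools import product

# True = knight (always tells the truth), False = knave (always lies).
# A statement's truth value must match the speaker's type.
for a, b, c in product([True, False], repeat=3):
    consistent = (
        (not b) == a    # A says: "B is a knave."
        and c == b      # B says: "C is a knight."
        and a == c      # C says: "A is a knight."
    )
    if consistent:
        print(a, b, c)
# Prints nothing: no assignment is consistent, so the puzzle has no answer.
```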

Verdict: OpenAI o3 Mini High for providing the correct answer in less time.

2. Math Problem

Next, I asked both models a math question. It's a reasonably simple probability problem.

You have a deck of 52 playing cards (standard deck). You draw five cards at random. What is the probability that you have exactly three aces in your hand? Clearly show each step of your reasoning, including combinatorial calculations

As expected, both models delivered the correct answer within 10 seconds, and both provided a clear step-by-step process in the output, as asked. However, while Gemini clearly explained the formula and what exactly happens in each step, ChatGPT skipped through some of them to give a more skimmable solution.
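For reference, the calculation both models had to produce is C(4,3) ways to choose the aces times C(48,2) ways to choose the remaining two cards, divided by C(52,5) possible hands. It can be checked in a few lines of Python:

```python
from math import comb

favorable = comb(4, 3) * comb(48, 2)  # 4 * 1128 = 4512
total = comb(52, 5)                   # 2,598,960 five-card hands
print(favorable / total)              # ≈ 0.001736, i.e. about 0.17%
```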

Verdict: Gemini for explaining each step in detail, ChatGPT for the easier-to-skim answer.

3. Solving a Sudoku Puzzle

To test the visual and image-understanding capabilities of these models, we gave both a Sudoku puzzle. We picked a fairly easy one, since we have seen most models fail miserably here.

Solve this Sudoku Puzzle (with the image of the puzzle) 

The hardest part of solving Sudoku for AI models is reading the image itself; they often misplace the numbers. As expected, ChatGPT claimed there were two 1s in column 4 and two 9s in column 5, even though there aren't. Gemini, on the other hand, created a table with 12 columns instead of 9, got stuck in a loop, and crashed. On a second attempt, it stalled while generating the output.

Both models failed because of this visual limitation. While they are good at identifying objects and text in images, they are far from perfect at reading an entire Sudoku grid. So, to test the reasoning on its own, I gave them the Sudoku in text format this time.

Solve this Sudoku Puzzle 

000000907
000420180
000705026
100904000
050000040
000507009
920108000
034059000
507000000

This time, Gemini generated an answer that is almost right, apart from a couple of placements. For example, the last column has two 3s while the 7th column has none. Beyond that, most of the grid is correct.

ChatGPT, on the other hand, worked out an answer in its thought process that made similar mistakes to Gemini's. However, it recognized them and said it was having trouble finding a solution to this Sudoku.
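For context, this is exactly the kind of problem a classical algorithm handles deterministically. Here is a minimal backtracking solver in Python (assuming the grid is entered as the nine digit strings above, with 0 for an empty cell):

```python
def valid(grid, r, c, n):
    # n must not already appear in the row, column, or 3x3 box.
    if n in grid[r] or any(grid[i][c] == n for i in range(9)):
        return False
    br, bc = 3 * (r // 3), 3 * (c // 3)
    return all(grid[br + i][bc + j] != n for i in range(3) for j in range(3))

def solve(grid):
    # Fill the first empty cell with each candidate digit and recurse.
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                for n in range(1, 10):
                    if valid(grid, r, c, n):
                        grid[r][c] = n
                        if solve(grid):
                            return True
                        grid[r][c] = 0  # dead end: undo and try the next digit
                return False            # no digit fits here: backtrack
    return True                         # no empty cells left: solved

rows = ["000000907", "000420180", "000705026",
        "100904000", "050000040", "000507009",
        "920108000", "034059000", "507000000"]
grid = [[int(ch) for ch in row] for row in rows]
if solve(grid):
    print("\n".join("".join(map(str, row)) for row in grid))
```

Once the grid is read correctly, solving it is trivial for a deterministic algorithm, which is why the image-parsing step is where both models actually lose.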

Verdict: Neither model could scan the Sudoku image properly, and neither provided a correct answer even when given the Sudoku in text format.

4. Hypothetical Scenario

For the next prompt, I gave a hypothetical scenario and asked both models to predict the outcome. There is no right or wrong answer here; the point is to see which model does a better job of weaving in historical events and reasoning toward its predicted outcome.

If the internet had been invented 50 years earlier—in the 1940s instead of the 1990s—explain logically three major impacts on society today, covering technological, cultural, and geopolitical aspects. Support each impact with reasoning based on historical context.

Both models discussed the technological, cultural, and geopolitical impacts of this scenario and provided similar predictions. They suggested that other technologies, particularly in communication, would have evolved differently and could have significantly influenced World War II. Additionally, they predicted that the internet would have accelerated cultural exchange, leading to faster progress in civil rights movements and artistic trends. Most notably, both models highlighted how the internet could have been a powerful tool for governments during the Cold War—facilitating secret communication, espionage, and the rapid spread of propaganda.

While these predictions may be broadly plausible, both models stayed at the surface level, essentially saying things would have happened faster. Neither dug into how the internet would have changed the war, which government policies would have been different, or what the major changes would be compared to now. When I asked them this directly, both chose a safe approach and mostly repeated the same information with minor differences rather than offering anything genuinely new. Models like Grok excel here.

Verdict: Both models were able to predict the outcome; however, both chose a safe approach.

5. Programming

Since these reasoning models are strong at logic, they also tend to be good at coding in general.

Write a Python program that determines whether a given sentence is positive, negative, or neutral. For each classification, provide an explanation for why the sentence was categorized that way. Handle complex sentences, such as those with sarcasm, double negatives, or mixed sentiments. Create a graphical interface where users can input a sentence and see the sentiment analysis results in real-time.

Both ChatGPT and Gemini wrote the Python script using third-party modules, which is expected. However, both missed a few details from the prompt. ChatGPT did not explain why a sentence was classified as positive or negative, while Gemini did not create a real-time app; we had to click a button every time to generate a result. Although it used the VADER sentiment module, which is fast enough for real-time analysis, the code it wrote did not take advantage of that.
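To illustrate the real-time piece both models fumbled, here is a minimal sketch, assuming the vaderSentiment package is installed. It re-classifies on every keystroke instead of waiting for a button click, though it makes no attempt at the sarcasm handling or detailed explanations the prompt also asked for:

```python
import tkinter as tk
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def classify(event=None):
    # Re-run VADER on every keystroke for a real-time result.
    score = analyzer.polarity_scores(entry.get())["compound"]
    if score >= 0.05:                 # standard VADER cutoffs
        label = "positive"
    elif score <= -0.05:
        label = "negative"
    else:
        label = "neutral"
    result.config(text=f"{label} (compound score: {score:+.3f})")

root = tk.Tk()
root.title("Real-Time Sentiment Analysis")
entry = tk.Entry(root, width=60)
entry.pack(padx=10, pady=10)
entry.bind("<KeyRelease>", classify)  # update as the user types, no button
result = tk.Label(root, text="neutral (compound score: +0.000)")
result.pack(padx=10, pady=(0, 10))
root.mainloop()
```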

With multiple follow-up prompts, we were able to fix all the issues, but judging by the first result, there is no winner in this segment.

Verdict: Both did a decent job; however, both Gemini and ChatGPT missed a few details from the prompt.

Final Verdict: Is Free Gemini Model Good Enough?

Well, for most tasks Gemini did just as well as the paid ChatGPT model. It handled the text Sudoku better than ChatGPT, the app it built is on par with ChatGPT's, and its scenario prediction was similar too. For the math question, I actually prefer Gemini's response for its more detailed explanation. The only test Gemini failed is the logic puzzle. In fact, we tried many more prompts beyond these, and the results in each category were fairly consistent.

So you can absolutely use the free Gemini 2.0 Flash Thinking instead of the paid ChatGPT o3-Mini High model. You don't have to pay just to get a reasoning model. But if you are already a ChatGPT Plus user, then o3-Mini High is the better choice overall, as it didn't fail any question except the Sudoku one.
