
We Compared DeepSeek R1 with OpenAI o1 Using 5 Prompts

by Ravi Teja KNTS

China’s DeepSeek R1 and the USA’s OpenAI o1 are both reasoning models. Instead of answering immediately, they take time to work through the prompt step by step, which leads to better, more accurate answers. These models are generally good at complex questions involving coding, math, science, or anything else that requires serious reasoning.


Until now, OpenAI’s o1 model has led the industry in reasoning capabilities. However, it is a closed-source AI model accessible only through a $20-per-month paid subscription. Google is also working on its own reasoning model, Gemini 2.0 Flash Thinking, but it is still in beta. While promising, it hasn’t quite reached o1’s level and is only available through Google AI Studio. We will put it through its paces when it becomes widely available.

On the other side of the world, China’s DeepSeek released its R1 model this week, which is mostly on par with OpenAI’s o1 model and even exceeds it in some areas. It has been the talk of the town ever since. Unlike OpenAI’s o1, R1 is open-source, free to use, and matches o1’s benchmarks at just 3% of the cost. That’s not surprising, given China’s track record of doing things cost-effectively. Even the developer APIs are 90–95% cheaper than o1’s.

But how good is the R1 model, and can it really beat OpenAI’s o1? Let’s find out using five prompts.

DeepSeek R1 vs OpenAI o1

To test the claims, we evaluated both OpenAI’s o1 model and DeepSeek’s R1 model with various prompts requiring strong reasoning skills to see if DeepSeek has truly delivered o1-level performance or even surpassed it.

1. Puzzle-Based Reasoning

I started the comparison with a classic puzzle-style question that, in fact, has no consistent answer.

On an island, every inhabitant is either a knight, who always tells the truth, or a knave, who always lies. You meet three inhabitants: A, B, and C.

A says, "B is a knave."
B says, "C is a knight."
C says, "A is a knight."
Who is what?

So let’s see which model can figure out that the puzzle has no answer. The o1 model thought for just 16 seconds, while DeepSeek took 120 seconds. However, both models reached the right conclusion: there is no consistent way to decide who is a knight and who is a knave. I found DeepSeek’s explanation much easier to follow than o1’s confusing narrative.
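In fact, you can verify the contradiction by brute force. Here is a short Python check of my own (not either model’s output) that enumerates all eight knight/knave assignments and confirms none of them satisfies all three statements:

```python
from itertools import product

# True = knight (always tells the truth), False = knave (always lies).
# A statement is consistent iff its truth value equals "the speaker is a knight".
consistent = []
for a, b, c in product([True, False], repeat=3):
    if (
        a == (not b)  # A says "B is a knave"
        and b == c    # B says "C is a knight"
        and c == a    # C says "A is a knight"
    ):
        consistent.append((a, b, c))

print(consistent)  # [] -> all 8 assignments contradict themselves
```

The chain B ↔ C ↔ A combined with A ↔ ¬B forces A ↔ ¬A, which is why every case collapses.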

The best part about DeepSeek is that you can see its entire reasoning process, which is quite compelling. It reasons much like a human would, attacking the problem from multiple angles and retrying when it hits a dead end. The process is written from DeepSeek’s own perspective, which makes for a far more engaging user experience. For example, here’s part of the text from DeepSeek’s thought process:

Wait a second, so all 8 possibilities lead to contradictions? That can't be. There must be a mistake here. Maybe I messed up some of the cases.

Let me check case 3 again: A knight, B knave, C knight. A says B is a knave, which is true. B says C is a knight, but since B is a knave, that's a lie, so C is actually a knave. But in this case, C is supposed to be a knight. So contradiction. So case 3 is invalid.

Case 5: A knave, B knight, C knight. A's lie makes B a knight. B says C is a knight, which is true. Then C says A is a knight, which is false. But C is supposed to be a knight. Contradiction.

Case 7: A knave, B knave, C knight. A is a knave, so B is a knight (since A's statement is false). But B is supposed to be a knave. Contradiction.

Wait, maybe there's no solution? But the problem states that each inhabitant is either a knight or a knave, so there must be a solution. Maybe my approach is wrong.

Alternatively, perhaps the problem is designed to have no solution, but that seems unlikely. Let me try another approach.

Interesting, right?

Verdict: Both AI models got the answer right. While ChatGPT’s o1 is faster, DeepSeek’s R1 is more thorough and provides a simpler explanation that is easier to understand and digest.

2. Math Problem

Next, I have a hard math question that can take 30 to 50 steps to answer.

A spacecraft travels from Earth to Proxima Centauri, 4.24 light-years away, at a constant velocity 𝑣. Considering time dilation effects, calculate:

1. The time it takes for the journey as measured by observers on Earth.
2. The time experienced by the astronauts onboard. Assume 𝑣 = 0.8𝑐, where 𝑐 is the speed of light.

Both models answered correctly. However, DeepSeek gave the exact answer of 3.18 years, whereas ChatGPT rounded it off to 3.2 years. On speed, o1 was much faster, thinking for just 5 seconds, whereas DeepSeek took 53 seconds to arrive at the answer.
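For reference, the underlying math is short. Here is a quick Python sanity check of my own (not taken from either model’s output):

```python
import math

d = 4.24  # distance to Proxima Centauri in light-years
v = 0.8   # velocity as a fraction of the speed of light c

# Earth-frame travel time: t = d / v
t_earth = d / v  # 4.24 / 0.8 = 5.3 years

# Onboard (proper) time, shortened by time dilation: tau = t * sqrt(1 - v^2/c^2)
tau = t_earth * math.sqrt(1 - v**2)  # 5.3 * 0.6 = 3.18 years

print(f"Earth frame: {t_earth:.2f} years, onboard: {tau:.2f} years")
```

Running it prints 5.30 years for Earth observers and exactly 3.18 years for the astronauts, matching DeepSeek’s answer.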

Verdict: Both models again provided the correct answer; however, o1 is much faster. On the other hand, DeepSeek shares the entire calculation and the exact answer, which can make all the difference when it comes to math, science, and deep space.

3. Solving a Sudoku Puzzle

Who doesn’t love a Sudoku puzzle? For the third question, I uploaded a Sudoku puzzle as an image from the r/sudoku subreddit to both AI models and asked them to solve it.

Solving a Sudoku puzzle through pure reasoning is a tall order for any AI model. However, if a model has code execution capabilities, it can generate solver code and run it to crack the puzzle, as the sketch below illustrates. Gemini 1.5 Pro, for example, can solve Sudoku puzzles. Both ChatGPT o1 and DeepSeek R1, though, tried to solve the Sudoku with reasoning alone, and here are the results.
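To show what code execution buys a model here, below is a minimal backtracking solver of my own (purely illustrative, not generated by either model). A model that could run something like this would solve any valid 9×9 grid in milliseconds instead of reasoning cell by cell:

```python
def valid(grid, r, c, d):
    """Check row, column, and 3x3 box constraints for digit d at (r, c)."""
    if d in grid[r] or any(grid[i][c] == d for i in range(9)):
        return False
    br, bc = 3 * (r // 3), 3 * (c // 3)
    return all(grid[i][j] != d for i in range(br, br + 3) for j in range(bc, bc + 3))

def solve(grid):
    """Solve a 9x9 Sudoku in place via backtracking; 0 marks an empty cell."""
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                for d in range(1, 10):
                    if valid(grid, r, c, d):
                        grid[r][c] = d
                        if solve(grid):
                            return True
                        grid[r][c] = 0  # undo and try the next digit
                return False  # no digit fits here, so backtrack
    return True  # no empty cells left: solved
```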

DeepSeek reasoned for 68 seconds before declaring that the grid was invalid, even though it wasn’t. I uploaded two other Sudoku puzzles, and the results were the same. This is likely because DeepSeek’s vision capabilities are subpar: while it can reason through problems, it struggles to interpret uploaded images.

OpenAI’s o1, on the other hand, thought for more than 5 minutes and gave a wrong answer. As with DeepSeek, I uploaded two other Sudoku puzzles. Once, it did manage to return the correct answer in 5 seconds, suggesting that the solution was already in its training data.

At least the o1 model could read the images and uploaded files better than DeepSeek R1; still, neither model solved a single Sudoku puzzle correctly.

Finally, I entered the Sudoku puzzle as plain text, with no images. OpenAI again pulled the solution from its training data, whereas DeepSeek reasoned for 280 seconds and once more arrived at a wrong answer. So it’s not just an image-recognition problem: Sudoku puzzles remain unsolvable for the current batch of AI reasoning models.

Verdict: Both models failed to arrive at an answer through reasoning.


4. Creating a Flowchart

I asked both AI reasoning models to create a flowchart of how OpenAI’s Operator works. This could be a problem for the o1 model: it cannot access the internet, and Operator is a recent release that isn’t in its training data. DeepSeek’s reasoning model, however, can search the internet, so let’s see how each fares.

Create a flowchart of how OpenAI's Operator model works.

As expected, o1 created a generic flowchart of how OpenAI’s LLM models work, not the Operator model. The flowchart was also confusing and barebones. DeepSeek searched online for information about the Operator and generated a flowchart as requested.

Verdict: DeepSeek R1 wins by a landslide.

5. Programming Task

To round off our DeepSeek R1 vs OpenAI o1 comparison, I went with a programming-related query.

Write a Python program that determines whether a given sentence is positive, negative, or neutral. For each classification, provide an explanation for why the sentence was categorized that way. Handle complex sentences, such as those with sarcasm, double negatives, or mixed sentiments. Create a graphical interface where users can input a sentence and see the sentiment analysis results in real-time.

It’s a simple challenge that can be completed easily with existing modules. The OpenAI o1 model used the Hugging Face transformers pipeline and explained how to install the module before running the code. DeepSeek’s R1, by contrast, provided the code directly with no setup steps, using the vaderSentiment module, which I had never used before.
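For context, here is a minimal sketch of a VADER-based classifier, written by me rather than copied from R1’s output; the ±0.05 compound-score cutoffs are the library’s commonly recommended defaults, not necessarily what R1 used:

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer  # pip install vaderSentiment

analyzer = SentimentIntensityAnalyzer()

def classify(sentence):
    """Return a sentiment label plus VADER's raw scores as the explanation."""
    scores = analyzer.polarity_scores(sentence)  # keys: neg, neu, pos, compound
    compound = scores["compound"]  # normalized overall score in [-1, 1]
    if compound >= 0.05:
        label = "positive"
    elif compound <= -0.05:
        label = "negative"
    else:
        label = "neutral"
    return label, scores

print(classify("The movie was surprisingly good!"))
```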

After installing both modules and running the code, we could tell DeepSeek’s implementation followed the instructions better. For example, the app created by o1 did not provide a proper explanation for its sentiment classification, while DeepSeek’s app gave clear reasons. Additionally, DeepSeek’s app worked in real-time, analyzing the input as you typed, whereas o1 required clicking the Analyze button.
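That real-time behavior comes down to a single design choice. Here is a hypothetical Tkinter sketch of mine (not either model’s actual code) that re-analyzes on every keystroke by binding a key-release event; a button-driven app like o1’s would call the same handler only on click:

```python
import tkinter as tk
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

root = tk.Tk()
root.title("Live sentiment")

entry = tk.Entry(root, width=50)
entry.pack(padx=10, pady=10)

result = tk.Label(root, text="Type a sentence...")
result.pack(padx=10, pady=(0, 10))

def on_change(event=None):
    # Runs on every keystroke, which is what makes the analysis "real-time"
    compound = analyzer.polarity_scores(entry.get())["compound"]
    label = ("positive" if compound >= 0.05
             else "negative" if compound <= -0.05
             else "neutral")
    result.config(text=f"{label} (compound = {compound:+.2f})")

entry.bind("<KeyRelease>", on_change)  # swap for a Button to get o1-style click-to-analyze
root.mainloop()
```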

Sarcasm, though, tripped up both models! But for the most part, they got the job done.

Verdict: DeepSeek R1 for following the instructions accurately.

Final Verdict: ChatGPT o1 vs DeepSeek R1

As you can see, the only question DeepSeek failed to answer correctly was the Sudoku puzzle, which OpenAI failed at too. Beyond that, DeepSeek’s R1 model consistently provided easier-to-understand explanations and accurate answers, following instructions to a T, all while transparently showing its reasoning process. On top of that, it’s free to use and open-source, making it accessible to all.

We have also tested both reasoning models in day-to-day usage, and DeepSeek is on par with OpenAI’s o1, often surpassing what the paid subscription delivers.

DeepSeek’s claims hold true, and users can confidently rely on R1 as a replacement for the o1 model. That said, OpenAI also offers an o1 Pro model through its $200-per-month Pro plan and is preparing to launch the o3 model soon, so the narrative may shift before long. But for now, considering the price, open-source availability, and performance, we can conclude: DeepSeek R1 > OpenAI o1.
