The new ChatGPT o1 model from OpenAI focuses on reasoning to solve tough coding and math problems, areas where previous OpenAI models struggled. OpenAI says the o1 model (codenamed Strawberry) is designed to spend more time thinking before it responds. In this article, we explore what the new o1 model offers, how it can be useful, and, most importantly, how it compares with other top-tier models like GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet. Let’s begin.
What Is the OpenAI o1 Model
Until now, OpenAI’s language models have been part of the GPT series, such as GPT-3.5, GPT-4, and GPT-4o. The new o1 model marks the beginning of a new “o” series, designed to enhance reasoning and complex thinking before generating a response. Unlike previous models, o1 uses a “chain of thought” approach, internally breaking a problem down step by step to give more accurate answers to much harder problems. OpenAI names PhD students as one of its target user bases.
Here’s a graph OpenAI shared comparing the o1 (Strawberry) model with the earlier GPT-4o model on PhD-level science questions.
Complex problems require multiple steps. As the number of steps increases, previous models produce inaccurate answers unless users guide them through each step with a series of prompts. In contrast, the o1 model is claimed to handle this chain of thought on its own, as if it were holding an internal dialogue to arrive at the correct answer.
However, because it spends more time processing and thinking, Strawberry is much slower than its rivals. In many cases, it hasn’t even begun answering a prompt by the time models like GPT-4o have finished their response.
Highlights of OpenAI o1 Model
Reasoning
Being better at reasoning and complex tasks makes the new o1 model strong at math, science, coding, and other advanced work. OpenAI tested these models, alongside GPT-4o, on a diverse set of human exams and ML benchmarks covering math, code, and science.
Where GPT-4o managed only 13% accuracy on one such math exam, the full o1 model solved 83% of the problems, and o1-preview scored around 56%.
Chain of Thought
The o1 model uses a chain-of-thought approach. You can review a summary of this thought process by clicking the “Thought” option above the response. Although you cannot see the raw reasoning that produced it, you can follow the direction the reasoning took and see what ChatGPT considered before responding.
How to Access ChatGPT o1 Model
The new o1 model lineup includes OpenAI o1, OpenAI o1-preview, and OpenAI o1-mini. Starting today, the preview and mini models are available to paid ChatGPT Plus users, with usage limits of 30 messages per week for o1-preview and 50 messages per week for o1-mini.
Given these limits, use the models only when a task genuinely needs them. To access o1-preview or o1-mini, open ChatGPT, tap the model selector at the top, and pick the o1-preview or o1-mini option to begin using them.
Comparing ChatGPT o1 With GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro
Since the ChatGPT o1 model is focused on math and coding, we tested its performance in real-world scenarios against other language models, including GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro.
1. Math Question
I began the test by giving this math question to all the AI models.
Consider a grid of size n×n where n≥2. You start at the bottom-left corner of the grid and want to reach the top-right corner. You can move only to the right or upward. However, you are not allowed to pass through any point on the diagonal y=x.
Determine the number of distinct paths from the bottom-left to the top-right corner that do not cross or touch the diagonal y=x.
The output from the o1 model was detailed and, more importantly, correct, as shown below.
GPT-4o ignored the instruction to avoid touching or crossing the diagonal, which led it to an incorrect answer.
Surprisingly, Gemini 1.5 Pro’s output was hard to follow; it inexplicably brought Python into the discussion even though the question never mentioned it. Still, GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet all solved the problem correctly once I manually guided them through the steps.
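If you want to sanity-check the answer yourself, here is a minimal brute-force sketch in Python. It assumes the endpoints are exempt from the restriction (they sit on y = x by definition, so the question is only answerable under that reading), and it compares the count against 2 · Catalan(n − 1), the classic closed form for monotone paths confined strictly to one side of the diagonal.

```python
from functools import lru_cache
from math import comb

def count_paths(n: int) -> int:
    """Count right/up paths from (0, 0) to (n, n) that avoid every
    interior diagonal point (k, k) for 0 < k < n. The endpoints lie
    on y = x by definition, so we assume they are allowed."""
    @lru_cache(maxsize=None)
    def walk(x: int, y: int) -> int:
        if (x, y) == (n, n):          # reached the top-right corner
            return 1
        if x == y and 0 < x < n:      # interior diagonal point: forbidden
            return 0
        total = 0
        if x < n:
            total += walk(x + 1, y)   # step right
        if y < n:
            total += walk(x, y + 1)   # step up
        return total

    return walk(0, 0)

for n in range(2, 9):
    catalan = comb(2 * (n - 1), n - 1) // n   # Catalan(n - 1)
    assert count_paths(n) == 2 * catalan
    print(n, count_paths(n))
```

For n = 2 this prints 2 (one path hugging each side of the grid), and the brute-force count matches the closed form for every n tested.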
2. Coding Question
When it comes to coding, I ran multiple tests that I’m familiar with, and all the models performed similarly. Here is one of the examples I tried:
Write a Python function that takes a string representing a series of tasks and their dependencies in the format "A->B, B->C, C->D" and returns the order in which the tasks should be completed.
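For reference, this is a dependency-ordering (topological sort) problem. Below is a minimal sketch of one correct way to solve it, assuming "A->B" means task A must be completed before task B (the prompt leaves the arrow’s direction open):

```python
from collections import defaultdict, deque

def task_order(spec: str) -> list[str]:
    """Return a valid completion order for tasks given as 'A->B, B->C',
    reading 'A->B' as 'A must finish before B'. Uses Kahn's algorithm."""
    successors = defaultdict(list)
    indegree = defaultdict(int)
    tasks = set()

    for edge in spec.split(","):
        before, after = (part.strip() for part in edge.split("->"))
        successors[before].append(after)
        indegree[after] += 1
        tasks.update((before, after))

    # Start with tasks that have no unmet dependencies.
    ready = deque(sorted(t for t in tasks if indegree[t] == 0))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for nxt in successors[task]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(tasks):
        raise ValueError("dependency cycle detected")
    return order

print(task_order("A->B, B->C, C->D"))  # ['A', 'B', 'C', 'D']
```

With the opposite reading of the arrow, you would simply swap the edge direction when building the graph.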
All the models, not just ChatGPT o1, produced correct code. In fact, we tried the example OpenAI provided on its website, and the results were similar. GPT-4o generally struggles with UI-based coding, and the same is true of ChatGPT o1. When it comes to front-end development, Claude 3.5 Sonnet takes the top spot; for back-end and logic-based coding, all the models perform similarly.
However, ChatGPT o1 might outperform the other models on truly unique problems, though that is something we have yet to observe.
ChatGPT o1 Model – How It Is Useful in the Real World
ChatGPT o1 is particularly effective at tasks that require advanced reasoning, such as PhD-level math, science, and coding, which may not be relevant for everyday use or regular folks. However, if you are looking for help with business planning, managing finances, or scheduling, tasks that reward strong reasoning and decision-making, we found that the ChatGPT o1 model performs exceptionally well compared to other models. And since it is included with the ChatGPT Plus subscription at no extra cost, it adds real value for Plus users.