Recently, OpenAI released its new o1 model (also known as Strawberry), which focuses on reasoning and logic. In some areas, such as math, science, and coding, it outperforms GPT-4o by a large margin. However, GPT-4o still has its own advantages and strengths compared to the new o1 model. Both models require a ChatGPT Plus subscription to access. This article puts the GPT-4o and o1 models through different prompts across various problems to determine which model is better suited for specific tasks. So let’s begin.
1. Coding
Let’s kick-start our tests with coding. As an example, I provided a Python script with several errors, inefficient methods for solving the task, and issues preventing it from generating an output. I gave this code to both o1 (Strawberry) and GPT-4o using the following prompt.
Review the code and correct any errors or omissions. Optimize all functions for better efficiency, using comments to understand and implement any missing functionality. Ensure the purpose of the {main} function is clear and fully realized. Focus strictly on code improvements without adding extra documentation or deviating from the original code's intent.
The results were quite surprising. The code generated by GPT-4o couldn’t produce an output, but it managed to fix 90% of the errors. In contrast, the o1 model generated a perfectly working solution. Additionally, the code from the o1 model was more concise, making use of list comprehensions and augmented assignments.
Notably, it also automatically added a main function, which the GPT-4o version did not. Interestingly, while GPT-4o imported only the necessary components, the o1 model imported the entire heapq module. Although this approach is still efficient, it is less elegant.
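To make the comparison concrete, here is a small, hypothetical sketch of the stylistic differences described above (the actual test script isn’t reproduced in this article, and the function names below are illustrative): a whole-module import versus a targeted one, the list-comprehension and augmented-assignment style, and the kind of main guard o1 added on its own.

```python
import heapq  # o1-style: import the whole module
# from heapq import nsmallest  # GPT-4o-style: import only what is needed

def cheapest_items(prices, n=2):
    # nsmallest avoids sorting the entire list just to pick a few items
    return heapq.nsmallest(n, prices)

def discounted(prices, rate=0.10):
    # List comprehension instead of an explicit loop with append()
    return [round(p * (1 - rate), 2) for p in prices]

def order_total(prices, tax_rate=0.08):
    total = sum(prices)
    total += total * tax_rate  # augmented assignment
    return round(total, 2)

if __name__ == "__main__":  # the kind of main guard o1 added unprompted
    prices = [19.99, 4.50, 12.00, 7.25]
    print(cheapest_items(prices))  # [4.5, 7.25]
    print(discounted(prices))      # [17.99, 4.05, 10.8, 6.52]
    print(order_total(prices))     # 47.24
```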
These differences in polish could be because we initially tested the AI models with a simple shopping cart program. To further evaluate their capabilities, we tested them again with more complex code that includes multi-threading, machine learning, and complex data structures like graphs and trees. This code had even more errors and was highly inefficient.
This is where the o1 model truly shined. While GPT-4o managed to fix around 40-50% of the errors, the o1 model again resolved all of them. Additionally, GPT-4o did not improve efficiency in any way: the generated code still used inefficient threading techniques, relied on a basic model like MLPClassifier for fraud detection, and didn’t tune any machine learning models. In contrast, the o1 model handled all of these aspects properly.
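The full test script isn’t shown here either, but a rough, hypothetical sketch of the kinds of improvements o1 made might look like the following: a thread pool instead of hand-rolled threads, and a tuned classifier instead of a default MLPClassifier (the function names and parameters are illustrative assumptions, not taken from the actual test code).

```python
from concurrent.futures import ThreadPoolExecutor
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def fetch_features(record_id):
    # Placeholder for I/O-bound work, e.g. loading one transaction record
    return [record_id % 7, record_id % 13]

def load_dataset(record_ids):
    # Thread pool instead of manually creating, starting, and joining threads
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(fetch_features, record_ids))

def train_fraud_model(X, y):
    # A stronger baseline than an untuned MLPClassifier, with basic
    # hyperparameter tuning via cross-validated grid search
    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
        cv=3,
    )
    search.fit(X, y)
    return search.best_estimator_
```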
We have some cool ChatGPT tips for programmers that will help you get more out of AI prompts.
2. Generating Emails, Assignments, Articles, etc.
In the second testing phase, we focused on generating various texts, ranging from simple emails to 2,000-word articles. In this case, both models produced similar outputs, making it difficult to rank one over the other. The reason is straightforward: the o1 model excels at tasks that require high-level reasoning, whereas generating emails and assignments can be efficiently handled by standard language models. For example, you can see the test results in the screenshot below.
While the output was similar, GPT-4o generated the text three times faster than the o1 model. The o1 model may have conducted a chain of thought internally, spending more time on thinking and analyzing, but for tasks like generating text, GPT-4o is the better choice in terms of speed. Additionally, with only 30 messages per week available on the o1 model, it is more practical to reserve it for more advanced tasks rather than routine text generation.
3. Generating Scripts, Social Media Posts, and Ideas
While generating plain emails and articles may not require heavy reasoning, one might assume that creative content would benefit from it. However, that’s not necessarily the case. For example, when generating a random script or a social media post, the o1 model does not show any significant advantage, aside from being slower. That said, if your requirements are precise and involve a lengthy list of instructions, the o1 model performs marginally better.
For example, I provided a 2,000-word article to both models and asked them to create a Twitter thread. I also asked them to follow the character limit, use Twitter short forms, and adopt a conversational, friendly tone to generate more clicks on the link. There were several other minor instructions too.
As you can see, the GPT-4o model completely ignored the Twitter character limit. I also specified not to include any hashtags, but GPT-4o didn’t follow this instruction either. The o1 version, meanwhile, added the image tags needed to keep the audience engaged. While these might not seem like reasoning-related issues, the o1 model takes time to conduct a chain of thought in the background, giving more weight to all your instructions in its response.
When you review its chain of thought, you can see that it considered how to write in a way that could generate more clicks. So even if you’re just generating text, if you have a long list of instructions that GPT-4o isn’t fully following, the o1 model can definitely come to the rescue.
4. Documents, PDFs, Images and Other Files
GPT-4o can identify objects and elements in images, summarize documents and PDFs, and handle various types of file uploads with ease. However, the o1 model currently lacks the capability to upload files. As soon as you switch to the o1 model, the option to upload files disappears. This limitation means that tasks involving visual recognition or document analysis can’t be performed directly with the o1 model. In this aspect, GPT-4o is the clear winner.
5. Solving Math Problems
I tested both models with some basic math questions, and GPT-4o got a few of them wrong. GPT-4o seems to be more focused on retrieving information from its training data. Whenever I posed a complicated question that wasn’t directly available on the internet, there was at least a 30% chance (limited sample size) that it would make a mistake.
The o1 model also made an error in a graph-related question. But overall, I asked both models around 12 math questions, and o1’s math-solving skills were impressive — a significant upgrade over the 4o model. In a math Olympiad test, the o1 model scored around 83%, while the 4o model only scored 13%.
6. Complicated Finance Split
If the o1 model excels at math, it’s likely to perform well in finance-related tasks too. To test this, I presented a scenario where my two friends and I were renting a new room and had spent money unevenly on various expenses like advance payment, rent, brokerage fees, and other purchases.
I provided all the details to both models and asked them to calculate how much each person would need to pay to ensure a fair split of all the money spent. In this situation, the model needed to understand both the mathematical calculations and the context to provide an accurate answer.
Three friends, Alice, Bob, and Charlie, are renting a new room together and have made several payments for different expenses. Alice paid $800 for the advance payment, Bob paid $500 for the rent, and Charlie paid $200 for brokerage fees. Additionally, Alice spent $150 on groceries, Bob bought furniture for $300, and Charlie spent $100 on kitchen supplies. I want to ensure that the total expenses are evenly split among the three friends.
Calculate how much each person needs to pay or be reimbursed to achieve a fair division of all the expenses. Provide a breakdown of the amount each person owes or should receive.
Both the GPT-4o and o1 models provided the right answer, as the math was simple enough. Both models showed the same level of understanding of the context, so o1’s reasoning isn’t at a huge advantage here. However, we like the o1 model’s response better because it explains the solution with a table; you can easily get the same from GPT-4o with a follow-up prompt. So in this round, it’s a tie.
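For reference, the settlement both models arrived at is easy to verify with a few lines of Python:

```python
# Each person's total outlay, taken from the prompt above
paid = {"Alice": 800 + 150, "Bob": 500 + 300, "Charlie": 200 + 100}
share = sum(paid.values()) / len(paid)  # 2050 / 3, about 683.33 each

for name, amount in paid.items():
    balance = amount - share
    status = "should receive" if balance > 0 else "owes"
    print(f"{name} {status} ${abs(balance):.2f}")
# Alice should receive $266.67, Bob should receive $116.67, Charlie owes $383.33
```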
OpenAI’s GPT-4o vs o1 Model
We compared both models across various tests, such as scheduling a timetable, creating a financial plan for a business, and solving riddles. The o1 model excelled, particularly in tasks that required reasoning. However, for tasks that do not require much reasoning, like generating text or researching information, both models provided similar results, with the main difference being that o1 was much slower.