Is Altman's Method of Using Large Models Wrong?
A recent study from the Wharton School and other institutions found that, surprisingly, the "direct answer" style of prompting that Altman favors significantly reduces model accuracy.
At the same time, the study found that adding a Chain of Thought (CoT) instruction to the prompt is not necessarily effective either.
For reasoning models, CoT prompts barely improve results while increasing time and computational costs.
For some cutting-edge non-reasoning models, CoT prompts can improve performance, but they also make answers noticeably less stable.
The research team used the GPQA Diamond dataset to test mainstream reasoning and non-reasoning models with and without CoT enabled.
The result: for reasoning models, the effect of CoT is very limited. For o3-mini, for example, CoT improved accuracy by only 4.1% while increasing response time by about 80%.
The results for non-reasoning models are more complex, but in any case, whether to use CoT requires carefully weighing the benefits against the costs.
So, should CoT be used or not?
Note that this study focuses on CoT instructions in user prompts; it does not cover system prompt settings, nor does it evaluate CoT as a technique in itself.
CoT Prompts Have Limited Effect, and Can Even Backfire
The study used the GPQA Diamond dataset as its benchmark, which contains graduate-level expert reasoning problems.
During the experiment, the research team tested these models:
- Reasoning models: o4-mini, o3-mini, Gemini 2.5 Flash
- Non-reasoning models: Claude 3.5 Sonnet, Gemini 2.0 Flash, GPT-4o-mini, GPT-4o, Gemini 1.5 Pro
For each model, the research team set up three experimental conditions (a sketch of the corresponding prompt suffixes follows this list):
- Forced Reasoning: Instructing the model to think step by step before providing an answer;
- Direct Answer: Explicitly instructing the model not to provide any explanation or thinking, just provide the answer;
- Default: No specific suffix instructions, letting the model choose how to answer the question.
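To make the three conditions concrete, here is a minimal sketch of how such prompt suffixes could be appended to each question. The exact wording used in the study is not quoted here, so the suffix text, names, and helper function below are illustrative assumptions.

```python
# Illustrative sketch only: the suffix wording is assumed, not quoted from the paper.
QUESTION = "Which of the following best explains ...?"  # a GPQA-style multiple-choice question

PROMPT_SUFFIXES = {
    "forced_reasoning": "Think through the problem step by step before giving your final answer.",
    "direct_answer": "Do not explain or show any reasoning; reply with the answer only.",
    "default": "",  # no suffix: the model chooses how to answer
}

def build_prompt(question: str, condition: str) -> str:
    """Append the condition-specific suffix (if any) to the question."""
    suffix = PROMPT_SUFFIXES[condition]
    return f"{question}\n\n{suffix}".strip()

for condition in PROMPT_SUFFIXES:
    print(f"--- {condition} ---\n{build_prompt(QUESTION, condition)}\n")
```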
To ensure reliable results, each question was tested 25 times under each condition, meaning each model answered the same question 75 times in total.
For each experimental setting, the research team computed four metrics (a minimal sketch of how they can be computed follows this list):
- 100% Correct Rate: Only counted as a "success" if all 25 trials of the same question are correct, with the number of "successes" divided by the number of questions;
- 90% Correct Rate: At least 23 correct answers out of 25 trials, close to an acceptable human error rate;
- 51% Correct Rate: Using a simple majority principle, success if at least 13 out of 25 trials are correct;
- Average Score: The number of correct answers divided by the total number of trials, i.e., the overall correct rate.
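These four metrics are straightforward to compute once the per-question trial results are collected. The sketch below uses assumed data structures (a mapping from question ID to a list of 25 booleans), not anything taken from the paper.

```python
from typing import Dict, List

def score(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Compute the four metrics from per-question trial outcomes (True = correct)."""
    n_questions = len(results)
    all_correct = sum(all(r) for r in results.values())            # 100% correct: 25/25 trials right
    near_perfect = sum(sum(r) >= 23 for r in results.values())     # 90% correct: at least 23/25 right
    majority = sum(sum(r) > len(r) / 2 for r in results.values())  # 51% correct: simple majority right
    total_trials = sum(len(r) for r in results.values())
    total_correct = sum(sum(r) for r in results.values())
    return {
        "100%_correct_rate": all_correct / n_questions,
        "90%_correct_rate": near_perfect / n_questions,
        "51%_correct_rate": majority / n_questions,
        "average_score": total_correct / total_trials,             # overall correct rate
    }

# Toy example (not real data): one question answered perfectly, one answered correctly 20 of 25 times
toy = {"q1": [True] * 25, "q2": [True] * 20 + [False] * 5}
print(score(toy))  # {'100%_correct_rate': 0.5, '90%_correct_rate': 0.5, '51%_correct_rate': 1.0, 'average_score': 0.9}
```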
The results: for non-reasoning models, every model's average score and "51% correct" metric improved with CoT compared with direct answers.
Gemini 2.0 Flash showed the most significant improvement, followed closely by Claude 3.5 Sonnet, while GPT-4o and GPT-4o-mini improved less.
However, on the 100% and 90% correct rate metrics, the two Gemini models and GPT-4o-mini actually declined after adding CoT prompts.
This means that while CoT generally improved model accuracy, it also increased the instability of answers.
Comparing forced CoT with the default mode, the gain from CoT is much smaller than when comparing with direct answers, possibly because some models already perform chain-of-thought reasoning by default.
For reasoning models, the effect of CoT prompts is even more limited—
For o3-mini and o4-mini, CoT prompts brought only marginal gains over direct answers, and for Gemini 2.5 Flash every metric declined across the board.
For example, in average score, o3-mini improved by only 2.9 percentage points, and o4-mini by 3.1 percentage points.
In contrast, the time consumed increased significantly, with o4-mini rising by about 20% and o3-mini increasing by over 80%.
For better-performing non-reasoning models, the time increase was even more noticeable.
Taken together with the author's tweet mentioned at the beginning, models still perform best when they "think". But among the most cutting-edge models, reasoning models already have built-in reasoning processes, and some non-reasoning models' built-in prompts already include CoT-related content, so this "thinking" no longer needs to be triggered by adding extra prompts.
So for users who use these models through their applications directly, the default setting is already a good way to go.
Paper link:
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5285532
This article is from the WeChat public account "Quantum Bit", author: Krecy, published by 36kr with authorization.