ChatGPT Accuracy Rate Statistics 2025

From multiple usages of Large Language Models, all of us have seen the disclaimers that say they can make mistakes.

Accuracy in models like ChatGPT isn’t a fixed value either, it is known to change with user interactions and model updates.

Let’s dig further to determine how accurate the various GPT models can be by looking at multiple studies done on this topic.

Key ChatGPT Accuracy Statistics
OpenAI-o1-Mini Has Second Lowest Hallucination Rate in Top 25 LLMs
ChatGPT Versus Other Models
ChatGPT’s Accuracy in Medicine
OpenAI o1 Solved 83% of IMO Problems, Compared to 13% by the Previous GPT-4o Model
ChatGPT's Accuracy Can Fall Over Time
ChatGPT's Accuracy Can Also Improve Over Time in Some Cases
Statistics on ChatGPT’s Performance in Scholastic Exams
Open AI Shut Down Its AI Classifier Within Six Months of Launch Due to Low Accuracy Rate
Final Thoughts on The Future of LLMs

Key ChatGPT Accuracy Statistics

1. OpenAI o1 solved 83% of IMO problems.

2. In creating images for medical research papers, ChatGPT-4 performed poorly.

3. O1 performs at the same level as PhD students on various physics, chemistry, and biology benchmarks.

4. OpenAI-o1-Mini has one of the lower hallucination rates at 1.4%.

5. ChatGPT was 77% accurate in making the final diagnosis for patients.

OpenAI-o1-Mini Has Second Lowest Hallucination Rate in Top 25 LLMs

According to a GitHub post, where Vectera evaluated many large language models concerning their hallucination rate, OpenAI-o1-Mini came second.

It was found to have a hallucination rate of 1.4%.

GLM-4-98-Chat had the lowest hallucination rate at 1.3%, and is likely to give less inaccurate answers.

ChatGPT’s “False Citation Rates” Across Various Psychology Subfields Range From 6%-60%

The fake citations used legitimate researchers’ names and credentials, giving a false sense of legitimacy and confusing students/researchers. (Canadian Psychological Association, 2023)

Aritificial intelligence hence can’t be trusted, and it is better to rely on expert human feedback when it comes to serious matters.

ChatGPT Versus Other Models

How does ChatGPT, or the latest model GPT-4o, compare with other models like Anthropic’s Claude or Google’s Gemini?

It turns out that GPT-4o stands out in a league of its own when compared with its previous iterations of ai models.

GPT-4o Stands Out for Its Remarkable Performance Across a Broad Range of Languages

The languages where the LLMs- GPT-4o, Gemini 1.5, and Claude Opus were tested were Spanish, German, French, Portuguese, and Russian, as well as more niche languages.

Here is the list of languages, with their abbreviations:

Here’s the list extracted from the image:

Language Code	Language
de	German
en	English
es	Spanish
fi	Finnish
fr	French
no	Norwegian
pt	Portuguese
ru	Russian
sv	Swedish
nl	Dutch

GPT-4o outshines its previous models- GPT-4 and GPT-4 turbo. (Archive, 2024)

Claude 3.5 Sonnet (72%) Outperforms GPT-4o (65%) in Mean Accuracy

These models were compared to each other on tasks like data extraction, ticket classification, and verbal reasoning. (Vellum, 2024)

GPT-4o Has a Small Lead in Precision (86.21%), Compared to Claude 3.5 Sonnet (85%)

In model comparison tests done by Vellum, GPT-4o had a small lead in precision at 86.21%, compared to Claude 3.5 Sonnet (85%). (Vellum, 2024)

Here’s a table comparing the performance of Claude 3.5 Sonnet, Claude 3 Opus, and GPT-4o across the listed categories:

Category	Claude 3.5 Sonnet	Claude 3 Opus	GPT-4o
Graduate level reasoning (GPAQ, Diamond)	59.4% (0-shot CoT)	50.4% (0-shot CoT)	53.6% (0-shot CoT)
Undergraduate level knowledge (MMLU)	88.7% (5-shot), 88.3% (0-shot CoT)	86.8% (5-shot), 85.7% (0-shot CoT)	88.7% (0-shot CoT)
Code (HumanEval)	92.0% (0-shot)	84.9% (0-shot)	90.2% (0-shot)
Multilingual math (MGSM)	91.6% (0-shot CoT)	90.7% (0-shot CoT)	90.5% (0-shot CoT)
Reasoning over text (DROP, F1 score)	87.1% (3-shot)	83.1% (3-shot)	83.4% (3-shot)
Mixed evaluations (BIG-Bench-Hard)	93.1% (3-shot CoT)	86.8% (3-shot CoT)	—
Math problem-solving (MATH)	71.1% (0-shot CoT)	60.1% (0-shot CoT)	76.6% (0-shot CoT)
Grade school math (GSM8K)	96.4% (0-shot CoT)	95.0% (0-shot CoT)	—

Let me know if you need further modifications!

ChatGPT’s Accuracy in Medicine

There has been a lot of buzz about AI being used in healthcare, especially since there is a healthcare crisis where many regions face staffing shortages.

Although ChatGPT can’t replace actual doctors, many people have been consulting it for clinical situations.

Let’s look at some of the stats on this.

The Median Accuracy Score of ChatGPT in Medical Tests Was 5.5

This translates to almost or completely correct in terms of accuracy. (NCBI, 2023)

ChatGPT Was 77% Accurate in Making the Final Diagnosis

The author of the paper said that ChatGPT was comparable to the level of medical students just graduating from medical school.

It performed more poorly in making differential diagnosis, with an accuracy of only 60 percent. (JMIR, 2023)

GPT-4 Solves All Common Cases Within Two Suggested Diagnoses

Despite the good diagnostic accuracy, the authors of the paper claim that it can be used as a tool for doctors, but non-doctors should use it carefully. (medRxiv, 2023)

For Rare Disease Conditions, ChatGPT-4 Needed Eight or More Suggestions

It was only after 8 or more suggestions that it got an accuracy of 90%, for rare diseases.

This shows that suggestions (more data) can help ChatGPT give a more accurate diagnosis.

Yet the number of suggestions is still on the larger side, with 8 or more. (Research Gate, 2023)

Yet, ChatGPT can’t be blindly used in clinical decision making yet, as in another study ChatGPT had an 83% error rate.

ChatGPT Version 4 Rendered a Correct Final Diagnosis in 39% of (NEJM) Case Challenges

NEJM stands for New England Journal of Medicine.

Apart from answering medical questions, ChatGPT could also make admin work more efficient.

ChatGPT responses could even help in writing research articles or coming up with instruction sheets for patient aftercare. (JAMA Network, 2023)

ChatGPT-4 Demonstrated a 76.4% Accuracy Rate on the Korean General Surgery Board Exam

This was significantly higher than GPT-3.5’s 46.8%.

The testdataset contained 280 questions taken from the Korean General Surgery Board exams, which were administered during 2020-2022.

Across subspecialties, GPT-4 showed accuracy rates ranging from 63%-83%. (ASTR, 2023)

OpenAI o1 Solved 83% of IMO Problems, Compared to 13% by the Previous GPT-4o Model

This corresponds to the top 500 students in the US for math olympiads.

OpenAI’s o1 has recently been released and is already breaking records in quite a few LLM evals.

Let’s look at some of the main ones.

It Performs at the Same Level as PhD Students on Various Benchmarks in Physics, Chemistry, and Biology

According to OpenAI, this represents a statistically significant difference in improvement in AI capabilities, which is why they’ve restarted the level at 1 in their model name-OpenAI o1.

OpenAI’s o1 Reached the 89th Percentile in Codeforces Competitions

Codeforces is a platform that is used to practice for competitive coding tests, it also hosts its own contests and has 2+ million users.

The New Model Also Scored 84/100 in Important Jailbreaking Tests

Jailbreaking is basically a way to bypass the limitations that are set on LLMS, which can render them prone to manipulation.

On OpenAI’s hardest jailbreaking test, the previous model GPT-4o scored 22, while o1 scored 84.

Open AI says it will make its AI model o1-mini accessible to all ChatGPT Free users in the future.

The model’s history hasn’t been very rosy though, with some of their previous models losing their accuracy over time.

Here is the test-time accuracy graph for o1, note that the time-axis is a log scale.

ChatGPT’s Accuracy Can Fall Over Time

According to an Arxiv paper, the LLm drift is due to changes in ChatGPT’s

ability to follow user instructions.

GPT-4, in March 2023, was very good at following user instructions, but in June, it failed to follow even simple instructions.

Hence even after a model is released, it is important to monitor its usability over time, and until now this issue hasn’t been properly solved.

GPT-4’s Accuracy Dropped from 84.0% in March to 51.1% in June 2023

GPT-4 couldn’t follow Chain-of-Thought prompting anymore, which partially explains the accuracy drop.

In Creating Images for Medical Research Papers, ChatGPT-4 Performed Poorly

All the images were rated below average and couldn’t be used in the papers yet. (PubMed, 2024)

A user on the OpenAI Forum reported that GPT-4’s accuracy rate is much lower with image queries, at around 50%.

ChatGPT’s Accuracy Can Also Improve Over Time in Some Cases

LLM drift isn’t always negative, there can be positive changes in accuracy too.

ChatGPT-4o Is 99% Accurate in English

Claude-3 Opus outperforms it though, and scores an almost 100% accuracy in Swedish. (Archive, 2024)

Statistics on ChatGPT’s Performance in Scholastic Exams

Let’s see how ChatGPT performs on famous Scholastic exams such as the SAT and the USMLE.

ChatGPT’s performance on the USMLE Exams Showed accuracy rates ranging from 42% to 64.4%

USMLE stands for US Medical Licensing Examination.

The accuracy had a notable decrease in performance as question difficulty increased. (JMIR, 2023)

How does the GPT-4o model compare?

Let’s look at an Arxiv paper that explores GPT-4o’s accuracy on various scholastic metrics.

The Model Scores 90% Accuracy on USMLE

The USMLE is a three-step test series that medical graduates must pass to practice medicine in the U.S. It evaluates medical knowledge, principles, and patient-centered skills.

Here, GPT-4 dominates with an accuracy of 90.00% on USMLE, with GPT-4o at 83.05% follows.

GPT-4o Scores Around the 90th Percentile on the SAT

The SAT is the most important entrance exam used by most colleges and universities to make admissions decisions in the US.

GPT-4o is the best-performing model on the SAT with an accuracy 90.1% in Reading and Writing, while GPT-4 is the best in the Math Section with 87.50%.

GPT-4o Is Around 85% Accurate for the CFA Level 1 Exam

The CFA Program is a three-part exam that tests the fundamentals of investment tools, valuing assets, portfolio management, and wealth planning.

The CFA Program is typically completed by those with backgrounds in finance, accounting, economics, or business.

GPT-4o has the highest accuracy of 85.39% on the CFA level 1 Exam, followed by GPT-4 at 73.2% and GPT 3.5 at a much lower 58.8%.

Open AI Shut Down Its AI Classifier Within Six Months of Launch Due to Low Accuracy Rate

The classifier had a true positive rate of only 26%, where the reliability of the output to be correctly identified increased with input length. (OpenAI, 2023)

This shows that even AI detectors made from frontier companies like OpenAI, which themselves create AI generators, find it difficult to distinguish human versus AI written content.

OpenAI Claimed to Have an Internal Tool That Can Detect AI-generated Images With 99% Accuracy

Mira Murati, CTO of OpenAI, mentioned that OpenAI’s tool is 99% reliable for detecting AI images. (Business Standard, 2023)

Final Thoughts on The Future of LLMs

Here are some quick tips from Aidbase AI on how you can improve get more accurate responses from your large language model:

Be specific with prompts and guidelines, giving a few examples to learn from is also great.
Add context and background information, which would help set the perspective required for the scenario.
You can use chain-of-thought prompting, which goes through the reasoning step by step for more complex questions.

ChatGPT Accuracy Rate Statistics 2025

On this page

Key ChatGPT Accuracy Statistics

OpenAI-o1-Mini Has Second Lowest Hallucination Rate in Top 25 LLMs

ChatGPT’s “False Citation Rates” Across Various Psychology Subfields Range From 6%-60%

ChatGPT Versus Other Models

GPT-4o Stands Out for Its Remarkable Performance Across a Broad Range of Languages

Claude 3.5 Sonnet (72%) Outperforms GPT-4o (65%) in Mean Accuracy

GPT-4o Has a Small Lead in Precision (86.21%), Compared to Claude 3.5 Sonnet (85%)

ChatGPT’s Accuracy in Medicine

The Median Accuracy Score of ChatGPT in Medical Tests Was 5.5

ChatGPT Was 77% Accurate in Making the Final Diagnosis

GPT-4 Solves All Common Cases Within Two Suggested Diagnoses

For Rare Disease Conditions, ChatGPT-4 Needed Eight or More Suggestions

ChatGPT Version 4 Rendered a Correct Final Diagnosis in 39% of (NEJM) Case Challenges

ChatGPT-4 Demonstrated a 76.4% Accuracy Rate on the Korean General Surgery Board Exam

OpenAI o1 Solved 83% of IMO Problems, Compared to 13% by the Previous GPT-4o Model

It Performs at the Same Level as PhD Students on Various Benchmarks in Physics, Chemistry, and Biology

OpenAI’s o1 Reached the 89th Percentile in Codeforces Competitions

The New Model Also Scored 84/100 in Important Jailbreaking Tests

ChatGPT’s Accuracy Can Fall Over Time

GPT-4’s Accuracy Dropped from 84.0% in March to 51.1% in June 2023

In Creating Images for Medical Research Papers, ChatGPT-4 Performed Poorly

ChatGPT’s Accuracy Can Also Improve Over Time in Some Cases

ChatGPT-4o Is 99% Accurate in English

Statistics on ChatGPT’s Performance in Scholastic Exams

ChatGPT’s performance on the USMLE Exams Showed accuracy rates ranging from 42% to 64.4%

The Model Scores 90% Accuracy on USMLE

GPT-4o Scores Around the 90th Percentile on the SAT

GPT-4o Is Around 85% Accurate for the CFA Level 1 Exam

Open AI Shut Down Its AI Classifier Within Six Months of Launch Due to Low Accuracy Rate

OpenAI Claimed to Have an Internal Tool That Can Detect AI-generated Images With 99% Accuracy

Final Thoughts on The Future of LLMs

Keep Reading

AI Startup Funding Q1 2026: $290B Across 1,677 Deals

AI Funding Report March 2026: $174.2B Across 581 Deals

ChatGPT Accuracy Rate Statistics 2025

On this page

Key ChatGPT Accuracy Statistics

OpenAI-o1-Mini Has Second Lowest Hallucination Rate in Top 25 LLMs

ChatGPT’s “False Citation Rates” Across Various Psychology Subfields Range From 6%-60%

ChatGPT Versus Other Models

GPT-4o Stands Out for Its Remarkable Performance Across a Broad Range of Languages

Claude 3.5 Sonnet (72%) Outperforms GPT-4o (65%) in Mean Accuracy

GPT-4o Has a Small Lead in Precision (86.21%), Compared to Claude 3.5 Sonnet (85%)

ChatGPT’s Accuracy in Medicine

The Median Accuracy Score of ChatGPT in Medical Tests Was 5.5

ChatGPT Was 77% Accurate in Making the Final Diagnosis

GPT-4 Solves All Common Cases Within Two Suggested Diagnoses

For Rare Disease Conditions, ChatGPT-4 Needed Eight or More Suggestions

ChatGPT Version 4 Rendered a Correct Final Diagnosis in 39% of (NEJM) Case Challenges

ChatGPT-4 Demonstrated a 76.4% Accuracy Rate on the Korean General Surgery Board Exam

OpenAI o1 Solved 83% of IMO Problems, Compared to 13% by the Previous GPT-4o Model

It Performs at the Same Level as PhD Students on Various Benchmarks in Physics, Chemistry, and Biology

OpenAI’s o1 Reached the 89th Percentile in Codeforces Competitions

The New Model Also Scored 84/100 in Important Jailbreaking Tests

ChatGPT’s Accuracy Can Fall Over Time

GPT-4’s Accuracy Dropped from 84.0% in March to 51.1% in June 2023

In Creating Images for Medical Research Papers, ChatGPT-4 Performed Poorly

ChatGPT’s Accuracy Can Also Improve Over Time in Some Cases

ChatGPT-4o Is 99% Accurate in English

Statistics on ChatGPT’s Performance in Scholastic Exams

ChatGPT’s performance on the USMLE Exams Showed accuracy rates ranging from 42% to 64.4%

The Model Scores 90% Accuracy on USMLE

GPT-4o Scores Around the 90th Percentile on the SAT

GPT-4o Is Around 85% Accurate for the CFA Level 1 Exam

Open AI Shut Down Its AI Classifier Within Six Months of Launch Due to Low Accuracy Rate

OpenAI Claimed to Have an Internal Tool That Can Detect AI-generated Images With 99% Accuracy

Final Thoughts on The Future of LLMs

Keep Reading

AI Startup Funding Q1 2026: $290B Across 1,677 Deals

AI Funding Report March 2026: $174.2B Across 581 Deals

Subscribe to Monday AI Signal