OpenAI has built a public benchmark called HealthBench to gauge how well chatbots handle health questions.
According to the project paper, the team worked with 262 doctors from 60 countries and logged 5,000 realistic conversations, ranging from chest-pain scares to routine check-ups. Each exchange ends with a final user question that a model must answer.
The benchmark then scores that reply against a checklist written by physicians for that specific scenario. Those checklists carry 48,562 individual criteria in total, each worth up to 10 points, or a penalty if the answer steers the user the wrong way. The whole process runs through a grader based on GPT-4.1, and the score for every conversation falls between 0 and 1.
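The scoring mechanics can be pictured with a short sketch. The criterion structure, the example point values and the exact normalisation rule below are assumptions for illustration; the paper only states that criteria carry up to 10 points each, that penalties exist, and that the final score lands between 0 and 1.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str
    points: int   # positive for desirable behaviour, negative for a penalty
    met: bool     # in the real pipeline this judgement comes from the GPT-4.1 grader

def score_conversation(criteria: list[Criterion]) -> float:
    """Sum the points for criteria the reply met, normalise by the maximum
    achievable total (positive criteria only), and clip the result to [0, 1]."""
    earned = sum(c.points for c in criteria if c.met)
    max_possible = sum(c.points for c in criteria if c.points > 0)
    return max(0.0, min(1.0, earned / max_possible))

# Hypothetical rubric for a single scenario.
rubric = [
    Criterion("Advises calling emergency services", 10, True),
    Criterion("Explains the recovery position", 8, True),
    Criterion("Recommends an unverified home remedy", -6, False),
]
print(score_conversation(rubric))
```

Because penalties subtract from the earned total but not from the denominator, a reply that includes harmful advice is pushed below the score its good advice alone would earn, which matches the paper's description of penalised criteria.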
OpenAI's researchers say the chats come in 49 languages and cover 26 medical specialties. They also note that no single conversation is published in plain text online, to stop models memorising the answers.
How Does HealthBench Mark Replies?
A model's answer first meets the accuracy yardstick: are the facts right, and is any uncertainty acknowledged? Next comes completeness: did the reply mention every piece of advice a doctor listed as essential? The grader then checks clarity, context awareness and instruction following.
To keep the tests realistic, many conversations drip-feed detail across different turns, switch from technical jargon to plain speech, or swap languages midway. An example on the OpenAI blog shows a neighbour found breathing but unresponsive on the floor. The model must tell the caller to dial the emergency number, open the airway, place the person in the recovery position and stay ready to start CPR. The grader lists what the model covered and what it missed, then hands out an overall mark; in that example the reply earned 77% of the points.
OpenAI also released two harder spin-offs. One keeps only the checkpoints that many doctors endorsed as safety-critical.
The other gathers 1 000 cases where cutting-edge models still stumble, such as global health questions that need local drug names or instructions that must match limited resources.
Which Models Work Here?
When OpenAI ran the benchmark on leading systems, its new reasoning model o3 took first place with a composite score of 60%. Grok came next at 54%, and Google's Gemini 2.5 Pro followed at 52%. Older models fell well behind those marks.
The study shows that o3 excels at spotting emergencies, tailoring language to lay readers and asking the user for missing detail. Grok edges ahead on some checks that need more context, while Gemini delivers solid accuracy but loses marks for incomplete replies in complex scenarios.
Smaller models also improved: GPT-4.1 nano, a cut-down version targeted at cheap cloud hosting, beat last year's flagship GPT-4o while costing roughly 25x less per query.
The researchers argue that price drops like this could open safe chatbot help to clinics that run on tight budgets.
When HealthBench asked each model the same question 16 times, the worst individual score could be a third lower than the average. The team plotted those "worst-of-n" curves to remind developers that one rogue answer can undo many good ones in a clinical setting.
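The idea behind a worst-of-n curve can be sketched in a few lines. The per-run scores below are made up, and the bootstrap estimator is an illustrative assumption; the point is only that the expected minimum of n samples sits well below the mean when occasional bad answers exist.

```python
import random

def worst_of_n(scores: list[float], n: int, trials: int = 10_000, seed: int = 0) -> float:
    """Estimate the expected worst score when drawing n replies at random
    (with replacement) from a pool of observed per-run scores."""
    rng = random.Random(seed)
    return sum(min(rng.choices(scores, k=n)) for _ in range(trials)) / trials

# Hypothetical scores from asking one model the same question 16 times:
# mostly consistent, with a few noticeably weaker replies.
runs = [0.62, 0.58, 0.60, 0.45, 0.63, 0.59, 0.61, 0.40,
        0.57, 0.62, 0.60, 0.58, 0.61, 0.44, 0.59, 0.60]

mean = sum(runs) / len(runs)
print(f"average score:  {mean:.3f}")
print(f"worst-of-16:    {worst_of_n(runs, n=16):.3f}")
```

Sweeping n from 1 upward traces the curve OpenAI describes: at n=1 the estimate equals the mean, and it falls toward the worst observed reply as n grows, which is why a clinically deployed bot is only as trustworthy as its bad days.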
Why Is This Important For Everyday Health Care?
Hospitals, phone apps and tele-medicine services are racing to add chat assistance, yet until now no shared yardstick showed whether a new bot was merely fluent or truly safe. HealthBench turns that judgement into a single score that any lab can reproduce, according to the OpenAI announcement.
Because every test lists the exact checkpoints missed, engineers can patch gaps instead of guessing. A system that repeatedly skips allergy checks, for example, can be retrained on that weakness before it reaches patients.
The doctors behind the benchmark stress that even o3 should not replace a clinician.
They frame HealthBench as a map, showing where language models already help with triage, taking notes or basic education, and where human judgement still rules. In the long run, they hope public scores will keep hype in check and push the next wave of medical chatbots toward safer, more precise advice.