A new study found that ChatGPT Health failed to recommend hospital visits when medically necessary in more than 50% of test cases. Medical experts are calling it "unbelievably dangerous" as OpenAI pushes deeper into healthcare.
Let me be clear about what this means: people describe legitimate medical emergencies to an AI chatbot, and more than half the time, that chatbot tells them they're probably fine.
This isn't a minor accuracy issue. This is "chest pain radiating down your arm? Try some breathing exercises" territory. The kind of miss that kills people.
OpenAI has been racing into healthcare with flashy demos and partnerships. The pitch is compelling: instant medical advice, available 24/7, accessible to anyone with a smartphone. Healthcare is expensive and inconvenient—an AI that can triage symptoms sounds transformative.
Except the technology isn't ready. This study proves it.
The researchers tested ChatGPT Health with scenarios that should trigger immediate medical attention. Heart attacks. Strokes. Severe allergic reactions. Appendicitis. The kinds of emergencies where delay means death or permanent disability.
And the AI missed them. Consistently. More than half the time.
Here's what makes this particularly frustrating: large language models are pattern-matching systems trained on text from the internet. They're phenomenally good at sounding confident and authoritative. They're not good at the kind of judgment calls that separate "you need the ER right now" from "this can wait until morning."
Medical diagnosis isn't just pattern matching—it's understanding context, reading between the lines, picking up on subtle cues that something is seriously wrong even when the symptoms seem mild. Human doctors train for years to develop this judgment. AI systems don't have it yet.
What OpenAI has built is a system that can discuss medical topics fluently and provide generally reasonable health information. What they're selling is something people will use to make life-or-death decisions.
