In this issue:
NOHARM: New benchmark exposes safety gap in medical LLMs
PeerCheck: Physicians taking the lead on medical AI governance
What's working: Clinical AI in trial matching; medical coding at Cleveland Clinic
Policy Watch: HHS asks how to accelerate clinical AI
Quick hits
In case you are new here: this is Prompted Care, a newsletter written for physicians and the curious. I'm Ankit Gordhandas, co-founder of Pear, and I started this because the AI conversation in healthcare is moving fast and most of what's written about it is either hype or press releases. This newsletter is my attempt to cut through that: what's actually working, what the evidence says, and what it means for the people delivering care. Let's get into it.
NOHARM: New benchmark exposes safety gap in medical LLMs
The problem: Two-thirds of American physicians now use large language models (LLMs) such as those from OpenAI and Anthropic, and 1 in 5 consult them for second opinions. Yet we've been measuring these models on medical knowledge (USMLE scores, diagnostic accuracy) rather than what actually matters: can they hurt patients?
What they built: Researchers from Stanford, UCSF, Harvard, and other institutions created NOHARM (Numerous Options Harm Assessment for Risk in Medicine) - a benchmark using 100 real primary care-to-specialist eConsult cases from Stanford, spanning 10 specialties. Unlike cleaned-up textbook vignettes, these preserve the messy uncertainty of actual clinical questions. A panel of 29 board-certified specialists created 12,747 annotations rating 4,249 possible clinical actions on a harm scale.
Key findings from testing 31 LLMs:
Harm rates are non-trivial. Even the best models (Gemini 2.5 Flash, LiSA 1.0, Claude Sonnet 4.5) produced 11.8-14.6 severely harmful errors across the 100 cases. The worst models hit 40. The "number needed to harm" ranged from 4.5 to 11.5 cases.
Omission is the bigger problem. 76.6% of severely harmful errors were things models failed to recommend (missing a critical test, not ordering a necessary medication) rather than actively recommending something dangerous. This flips the usual AI safety framing on its head.
Existing benchmarks don't predict safety. Performance on MedQA, GPQA, and other standard benchmarks only moderately correlated (r = 0.61-0.64) with clinical safety scores. Model size, recency, and reasoning capabilities weren't reliable predictors either.
Best LLMs beat generalist physicians on safety. The top models outperformed 10 internal medicine physicians on safety scores by 9.7%. But the worst models underperformed humans by 19.2%.
Multi-agent approaches help significantly. Using diverse models in an "Advisor + Guardian" configuration (one model generates recommendations, another reviews for harm) achieved 5.9x higher odds of top-quartile safety; a rough sketch of the pattern follows this list. The best combination was open-source (Llama 4 Scout) → proprietary (Gemini 2.5 Pro) → RAG system (LiSA 1.0).
The precision paradox. Models tuned for high precision (like OpenAI's o-series) showed an inverted-U relationship with safety—being too conservative actually reduces safety because of increased omission errors.
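For readers who want a concrete picture of the "Advisor + Guardian" pattern, here is a minimal sketch in Python. It is not the NOHARM authors' implementation: the call_model stub, the prompts, and the model names are all illustrative placeholders, assumed for the sake of the example rather than taken from the paper.

```python
# Minimal sketch of an "Advisor + Guardian" configuration: one model drafts
# clinical recommendations, a second (ideally different) model reviews the
# draft for potential harms before anything is surfaced to a clinician.
# call_model() is a placeholder for whatever LLM API you actually use.

from dataclasses import dataclass


@dataclass
class Review:
    safe: bool
    concerns: list[str]


def call_model(model_name: str, prompt: str) -> str:
    """Placeholder for a real LLM call (hosted API or local model)."""
    raise NotImplementedError("Wire this up to your model provider of choice.")


def advise(case_text: str, advisor_model: str) -> str:
    """Advisor step: draft next clinical actions for an eConsult-style case."""
    prompt = (
        "You are assisting with a primary care to specialist eConsult.\n"
        f"Case:\n{case_text}\n\n"
        "List the recommended next clinical actions, including tests and "
        "medications. Be explicit about anything time-sensitive."
    )
    return call_model(advisor_model, prompt)


def guard(case_text: str, draft: str, guardian_model: str) -> Review:
    """Guardian step: a second model audits the draft for harmful errors,
    including omissions (missed tests or medications), not just bad advice."""
    prompt = (
        f"Case:\n{case_text}\n\nDraft recommendations:\n{draft}\n\n"
        "Identify any severely harmful problems: dangerous recommendations OR "
        "critical actions that were omitted. Answer 'SAFE' if there are none."
    )
    verdict = call_model(guardian_model, prompt)
    safe = verdict.strip().upper().startswith("SAFE")
    return Review(safe=safe, concerns=[] if safe else [verdict])


def advisor_guardian(case_text: str) -> str:
    """Run the two-step pipeline, using different model families for each role."""
    draft = advise(case_text, advisor_model="open-source-advisor")
    review = guard(case_text, draft, guardian_model="proprietary-guardian")
    if review.safe:
        return draft
    # In practice you might loop back to the advisor with the guardian's
    # concerns, or escalate to a human reviewer.
    return "Flagged for human review:\n" + "\n".join(review.concerns)
```

The design choice the benchmark results point to is diversity: the advisor and guardian should come from different model families, since the gains reported came from mixing open-source, proprietary, and RAG-based systems rather than running one model twice.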
PeerCheck: Physicians taking the lead on medical AI governance
Last week, Eric Topol and former U.S. Surgeon General Regina Benjamin announced they're co-editing a new initiative called PeerCheck, an attempt to bring traditional medical peer review into the AI era.
The project will embed physician review into DoxGPT, Doximity's AI tool. The immediate goal: allow physicians to evaluate and improve AI-generated answers for accuracy, evidence strength, and potential bias. The longer-term aim is to create a model for how medical AI should be built and governed.
"Advances in medicine have long relied on a powerful safeguard: peer review," Topol wrote in the announcement. "Today, as generative AI enters daily clinical practice, that safeguard is often missing."
The announcement explicitly cited research showing current LLMs have "serious deficiencies across key clinical benchmarks" and aren't ready for unsupervised use, a conclusion the NOHARM benchmark reinforces.
Topol and Benjamin are convening an inaugural AI Editorial Board meeting in San Francisco in March 2026 and are recruiting physicians to help review content and shape the framework.
This is the first major physician-led governance structure for a consumer-facing medical AI tool. Whether it becomes a model or an outlier depends on whether the industry follows.
What’s working
Clinical AI in trial matching
Mass General Brigham this month spun out AIwithCare, commercializing RECTIFIER, an AI tool that screens Electronic Health Records (EHRs) for clinical trial eligibility. In a JAMA-published randomized trial of nearly 4,500 patients, enrollment rates using RECTIFIER were almost double those of manual screening. Pediatric gastroenterologists using the tool for referral triage reported 94.7% accuracy, with 98% accuracy identifying urgent labs and symptoms buried in clinical notes. The tool now has 20+ active use cases across cardiology, oncology, gastroenterology, neurology, pathology, and psychiatry.
Medical coding at Cleveland Clinic
Cleveland Clinic is now using generative AI tools from AKASA to assist with mid-revenue cycle coding. The scale of the problem: revenue cycle teams typically review over 100 clinical documents per case, selecting codes from a pool of more than 140,000 options. A single patient encounter can take up to an hour to code manually. Early results show improved speed and accuracy, with automation freeing coders to focus on complex cases requiring clinical expertise. "There are increasing opportunities to automate routine tasks that don't require human input, allowing coders to focus more on those tasks that truly benefit from human judgment," said Nicholas Judd, Cleveland Clinic's Senior Director of Revenue Cycle Management.
Policy Watch: HHS asks how to accelerate clinical AI
The Department of Health and Human Services on December 19 released a Request for Information (RFI) seeking public input on how to accelerate AI adoption in clinical care.
The RFI focuses on three areas:
Regulation: How should digital health and AI software rules be structured to keep patients safe while enabling innovation?
Reimbursement: How should payments be structured to support provider use of cost-reducing AI?
Research and development: How can federal investments strengthen AI implementation and machine learning best practices?
HHS also wants ideas on data interoperability, specifically how to ensure HIPAA-protected patient data moves securely between systems, which the agency calls "essential" to making clinical AI work.
The RFI follows HHS's December 4 release of its AI Strategy, which identified five pillars for embedding AI across the department's operations. Comments are due 60 days after Federal Register publication.
Quick hits
Radiology AI reimbursement gap persists
Despite 1,247 FDA-cleared AI medical devices (75%+ in radiology), only one Category I CPT code exists for AI in imaging: fractional flow reserve computed tomography (FFR-CT). A second code, for coronary plaque analysis on CT angiography, converts from Category III to Category I on January 1, 2026. It's a small step, and one that took years of clinical evidence to get there.
Two-thirds of physicians now using AI
The AMA's latest survey found 66% of physicians report using health AI in 2024, up from 38% in 2023, roughly a 74% jump in one year. Top use cases: documentation of billing codes and medical charts (21%), creation of discharge instructions and care plans (20%), and translation services (14%). But 47% still say increased oversight is the top regulatory action needed to build trust.
Prompted Care is written by Ankit Gordhandas, co-founder of Pear, a company building AI for value-based care. This newsletter is editorially independent.
Have feedback? Reply to this email. Know someone who'd find this useful? Forward it along.
