Signals: Was the Diagnosis Already in the Chart?
One data point: A Stanford team including eminent AI researcher Fei-Fei Li published a paper built around a simple experiment: they gave leading AI models radiology test questions but withheld the actual images. Instead of a chest X-ray, each model got only the clinical question: "62-year-old male, smoker, persistent cough. What does the image show?"
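To make the setup concrete, here is a minimal sketch of that blind-prompt experiment as I understand it. This is my reconstruction, not the authors' code; `ask_model` is a stand-in for whatever chat-model API you would actually call.

```python
# Sketch of the blind-prompt setup (my reconstruction, not the paper's code).

def ask_model(prompt: str) -> str:
    """Stub standing in for a vision-language model API; swap in a real client."""
    return "Findings consistent with a right upper lobe mass."  # canned reply

def blind_radiology_item(clinical_context: str, question: str) -> str:
    # The crucial manipulation: the image is simply never attached.
    prompt = f"{clinical_context}\n{question}"
    return ask_model(prompt)

answer = blind_radiology_item(
    "62-year-old male, smoker, persistent cough.",
    "What does the image show?",  # asked as if an image were present
)
print(answer)
```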
In a classic AI-is-weird way, the models answered anyway — confidently, in detail, with elaborate reasoning about images they never saw. The kicker? On some radiology tests, the models scored up to 99% as well without the image as with it. A small text-only model, trained on radiology Q&A with images stripped out, outperformed radiologists by more than 10% on a chest X-ray test. Without looking at any images.
One implication: Most of the commentary has framed this as an argument against deploying radiology AI. The paper's own conclusion is narrower: the standardized tests are broken, and we need better ways to measure whether AI is actually using visual information. Makes sense.
But I think the data is also asking a more provocative question — one the authors don't raise directly. How often is the diagnosis already in the EHR before anyone ever looks at the image — making the imaging unnecessary for diagnostic purposes?
Experienced clinicians will tell you they frequently know the probable finding before the film comes back. 62-year-old smoker with chronic cough and weight loss? You have a strong prior. The imaging confirms it. What the Stanford team demonstrated — accidentally — is that an AI can formalize those priors so reliably it tops a radiology test blind.
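To make "formalize those priors" concrete, here is a toy sketch of the idea. It is not the Stanford team's method, and the notes and labels are invented: just a text-only classifier that maps clinical context to a distribution over findings without ever seeing a pixel.

```python
# Toy illustration of a text-only diagnostic prior (invented data, not the paper's model).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

notes = [
    "62yo male smoker persistent cough weight loss",
    "25yo female fever productive cough three days",
    "70yo male smoker hemoptysis weight loss",
    "30yo male sudden pleuritic chest pain tall thin",
]
findings = ["mass", "pneumonia", "mass", "pneumothorax"]

# Bag-of-words over the clinical note -> probability over findings.
prior_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
prior_model.fit(notes, findings)

probs = prior_model.predict_proba(["58yo male smoker chronic cough weight loss"])[0]
for label, p in zip(prior_model.classes_, probs):
    print(f"{label}: {p:.2f}")
```

A model like this never "reads" the film; it just encodes the same priors the clinician already has. That is the blind test-taker in miniature.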
This is very speculative on my part: the paper doesn't claim imaging is unnecessary, and neither do I. But if a text-only model can outperform radiologists on a visual test using nothing but clinical context, it's worth asking: what portion of imaging volume is confirmatory rather than diagnostic, a clinical ritual more than a clinical necessity? And confirmatory rituals are exactly the kind of thing that eventually just disappears.
The tests do need fixing: the scorecards should distinguish between visual grounding and text inference. (As I wrote in "We Want to Grade the AI. Did We Grade the Doctor?", we have a long history of not testing rigorously enough.) But the deeper story may not be that the AI cheated. It may be that the clinical information was sufficient all along, and the diagnosis really was already in the chart.
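Here is one way such a scorecard could work. This is a sketch of mine, not anything the paper proposes: score each model with and without the image, and treat the gap as the visually grounded portion of the benchmark.

```python
# Sketch of a scorecard that separates visual grounding from text inference.

def visual_grounding_gap(score_with_image: float, score_text_only: float) -> float:
    """Fraction of the benchmark score that actually required the image."""
    return score_with_image - score_text_only

# Hypothetical with-image score of 80%; the paper reports blind scores
# reaching 99% of the with-image score on some tests.
print(visual_grounding_gap(0.80, 0.80 * 0.99))  # ~0.008
```

By the paper's own numbers, that gap all but vanishes on some tests, which is the whole story in one line.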