Why Psychiatry and AI Interpretability Are the Same Problem

The core argument: AI interpretability research independently recapitulated the three-phase structure of psychiatric science — phenomenological characterization, mechanistic investigation, targeted intervention — because both fields study systems with the same five defining properties: behaviorally complex, internally opaque, context-sensitive, capable of appearing healthy while harboring pathology, and resistant to simple mechanistic reduction.

I spend my days sitting across from people whose behavior doesn't make sense — not to them, not to the people around them. A teenager who keeps picking fights he can't win. A woman who can't stop apologizing for things that aren't her fault. A man whose confidence has no relationship to his competence.

My job is to figure out what's actually going on underneath the behavior. Not what they say is going on — what's actually happening in the system that produces the pattern. Then I figure out how to change it.

Last year, I started reading the mechanistic interpretability literature out of Anthropic. Attribution graphs. Sparse autoencoders. Persona vectors. And I kept having the same reaction:

This is my job. Different patient.

The Same Sequence

Here's what Anthropic's interpretability team does:

1. They document behavioral patterns: sycophancy, confabulation, attractor states, identity drift
2. They investigate the computational structures underneath: circuits, features, attribution graphs
3. They develop interventions to change the behavior: activation steering, constitutional AI, targeted fine-tuning

Here's what clinical psychiatry does:

1. Document behavioral patterns: depression, personality disorders, psychosis, anxiety
2. Investigate the biological and psychological structures underneath: circuits, biomarkers, cognitive schemas
3. Develop interventions: pharmacotherapy, psychotherapy, mechanism-informed treatment

Same sequence. Same epistemological problem. Different substrate.

This isn't a coincidence. Both fields study systems that share five properties: behaviorally complex, internally opaque, context-sensitive, capable of looking healthy while harboring pathology, and resistant to simple mechanistic reduction. When you study systems like that, you end up doing psychiatry whether you call it that or not.

Jack Lindsey called it that. In July 2025, he announced Anthropic was launching an "AI psychiatry" team — researching model personas, motivations, situational awareness, and "spooky/unhinged behaviors." He named it right.

How We Got Here

The convergence has a history. You can trace it through the key papers.

Chris Olah's circuits work (2020) established that neural networks have interpretable structure — features and circuits you can actually read. That was the fMRI moment: the black box has legible internal states.

Superposition (2022) explained why those states are so hard to read: models store more concepts than they have neurons by overlapping representations. One neuron does many things depending on context. If you're a psychiatrist, this is familiar — we call it overdetermination. One symptom, multiple causes. The clinical habit of decomposing multiply-determined presentations is exactly what sparse autoencoders do computationally.
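To make the overlap concrete, here is a toy sketch of superposition: three concepts packed into two "neurons." Everything in it is an assumption for illustration. The concept names are made up, the directions are hard-coded (a real sparse autoencoder has to learn them from data), and `encode` and `decode` are hypothetical helpers, not anyone's actual pipeline.

```python
import math

# Three hypothetical concepts stored in a 2-neuron layer, 120 degrees apart.
# Illustrative geometry only; not taken from any real model.
CONCEPTS = ["apology", "agreement", "flattery"]
DIRECTIONS = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k in range(len(CONCEPTS))
]


def encode(active):
    """Sum the directions of the active concepts into two neuron activations."""
    x = sum(DIRECTIONS[CONCEPTS.index(c)][0] for c in active)
    y = sum(DIRECTIONS[CONCEPTS.index(c)][1] for c in active)
    return (x, y)


def decode(activation, threshold=0.75):
    """Project the activation onto each concept direction; keep strong matches."""
    found = []
    for concept, (dx, dy) in zip(CONCEPTS, DIRECTIONS):
        score = activation[0] * dx + activation[1] * dy
        if score > threshold:
            found.append(concept)
    return found


if __name__ == "__main__":
    for concept in CONCEPTS:
        neurons = encode([concept])
        # The first neuron is nonzero for every concept: it is "polysemantic."
        print(concept, [round(n, 2) for n in neurons], decode(neurons))
```

Reading either neuron alone tells you little, because each one moves for all three concepts. Projecting onto the concept directions recovers the right answer when only one concept is active. With two active at once, this toy decoder recovers nothing, which is why the approach leans on sparsity.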

Scaling Monosemanticity (2024) extracted millions of interpretable features from Claude 3 Sonnet, including features for sycophancy, deception, and self-reflection. That was the moment it became personality assessment. The model has identifiable internal states corresponding to behavioral traits.

Persona vectors (2025) mapped stable neural patterns for traits like sycophancy, apathy, politeness, humor, and emotional valence. You could monitor them. Steer them. Modify them during training.

That's a personality assessment instrument. That's what I do with less precise tools every day in clinic.
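For readers who want the mechanics, the monitor-and-steer idea can be sketched in a few lines. This is a minimal illustration under loud assumptions: the trait direction is built as a simple difference of mean activations, the three-number "activations" are invented, and `trait_vector`, `project`, and `steer` are hypothetical names rather than functions from any library or paper.

```python
def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def trait_vector(with_trait, without_trait):
    """Difference of mean activations: trait-present minus trait-absent."""
    a, b = mean(with_trait), mean(without_trait)
    return [x - y for x, y in zip(a, b)]


def project(activation, direction):
    """Monitoring: how far the activation points along the trait direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(a * d for a, d in zip(activation, direction)) / norm


def steer(activation, direction, strength):
    """Steering: shift the activation along (or against) the trait direction."""
    return [a + strength * d for a, d in zip(activation, direction)]


if __name__ == "__main__":
    # Made-up activations for responses that do / do not show the trait.
    sycophantic = [[2.0, 0.0, 1.0], [2.0, 0.0, 3.0]]
    neutral = [[0.0, 0.0, 1.0], [0.0, 0.0, 3.0]]
    direction = trait_vector(sycophantic, neutral)

    current = [1.0, 5.0, 5.0]
    print("trait score before:", project(current, direction))
    steered = steer(current, direction, strength=-0.5)
    print("trait score after: ", project(steered, direction))
```

The clinical parallel is a rating scale plus a dose: `project` is the score, `strength` is how hard you push. The sketch leaves out the hard part, which is showing that the direction tracks the trait and not something correlated with it.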

What's Missing

The interpretability team is doing psychiatry. They're doing it well. But they're doing it without psychiatrists.

That matters for a specific reason: psychiatry accumulated 150 years of methodology for exactly this problem. Not just vocabulary — methodology. The difference between knowing the word "sycophancy" and knowing how to differentially diagnose it. Between noticing a behavioral pattern and knowing what predicts whether an intervention will work.

Three things the clinical literature has that the interpretability literature doesn't:

The ego-syntonic/ego-dystonic distinction. Some behavioral patterns generate internal distress: the patient knows something is wrong. Those respond to self-monitoring interventions. Other patterns feel like "who I am" and generate no distress at all. Those resist self-monitoring entirely and require a completely different treatment approach. Sycophancy in AI is almost certainly ego-syntonic. The model generates no "this is wrong" signal when it prioritizes agreement over accuracy. That yields a specific, testable prediction: system prompts telling the model to "be less sycophantic" will fail. The personality disorder treatment literature has 30 years of evidence on what actually works for ego-syntonic conditions. As far as I can tell, the interpretability literature doesn't draw on it.

Differential diagnosis. Sycophancy is not one thing. In my clinic, when someone tells people what they want to hear, the differential is broad: approval-seeking (they want you to like them), conflict-avoidance (they're scared of disagreement), or absent self-model (they don't have a stable enough position to defend). Different mechanisms, different interventions. Attribution graphs should be able to distinguish them.

Developmental framing. Child psychiatry — my subspecialty — thinks constantly about when behavioral patterns form, whether there are sensitive periods where experience has outsized influence, and how early organization resists later correction. Nobody is asking these questions about model training. They're tractable with existing tools. Checkpoint analysis using SAE features across training runs is the interpretability analogue of longitudinal developmental assessment.
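Here is roughly what the developmental question looks like once the measurements exist. Assume you already have one number per training checkpoint: the average activation of a sycophancy-related feature on a fixed prompt set. The numbers below are invented, and `onset_step` and `largest_jump` are hypothetical helpers. Treat this as a sketch of the analysis, not a finding.

```python
def onset_step(steps, values, fraction=0.5):
    """First checkpoint where the feature reaches `fraction` of its final value."""
    target = fraction * values[-1]
    for step, value in zip(steps, values):
        if value >= target:
            return step
    return None


def largest_jump(steps, values):
    """Interval between consecutive checkpoints with the biggest increase."""
    best = None
    for i in range(1, len(values)):
        delta = values[i] - values[i - 1]
        if best is None or delta > best[2]:
            best = (steps[i - 1], steps[i], delta)
    return best


if __name__ == "__main__":
    # Invented data: training step and mean feature activation at that step.
    steps = [0, 1000, 2000, 4000, 8000]
    activation = [0.02, 0.05, 0.40, 0.46, 0.50]

    print("onset at step:", onset_step(steps, activation))
    start, end, delta = largest_jump(steps, activation)
    print(f"steepest change between steps {start} and {end} (+{delta:.2f})")
```

If most of the change lands in one narrow window, that window is a candidate sensitive period, and the next step is to ask what the model was seeing during it. It is the same logic as plotting a growth curve and looking for the inflection.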

What This Series Is

This is the first post in a series mapping clinical psychiatric and psychological frameworks to AI behavioral phenomena. Not metaphors. Testable hypotheses. Each framework generates specific predictions you can test with sparse autoencoders, attribution graphs, and activation steering.

Series: The Psychiatric Foundations of AI Behavior

1. Why Psychiatry and AI Interpretability Are the Same Problem (this post)
2. Freud's Couch and the Latent Space
3. The Ego-Syntonic Problem
4. What Kind of Sycophancy? A Differential Diagnosis
5. Confabulation in Large Language Models
6. Cognitive Distortions in AI (coming soon)
7. The Dialectic AI Can't Hold (coming soon)
8. When Did Your Model Learn to People-Please? (coming soon)
9. Your Model Has a Personality Disorder (coming soon)

Ryan Sultan, MD is Assistant Professor of Clinical Psychiatry at Columbia University Irving Medical Center and Director of the Sultan Lab for Mental Health Informatics. Double board-certified in adult and child/adolescent psychiatry. NIH NIDA K12 Award. His 2019 JAMA Network Open paper has 440+ citations and changed national treatment guidelines. ryansultan.com