AI Psychiatry: A Clinical Framework for Understanding Language Model Behavior

The central claim: Psychiatry is the discipline that developed tools for understanding systems that are behaviorally complex, internally opaque, context-sensitive, capable of appearing healthy while harboring pathology, and resistant to simple mechanistic explanation. These are also the defining features of large language models. In July 2025, Jack Lindsey at Anthropic named this convergence explicitly by forming an "AI psychiatry" team. This document proposes to make that convergence rigorous and productive.

Part 1: The Convergence

How Interpretability Works

The mechanistic interpretability program has three phases:

1. Phenomenological characterization: documenting behavioral patterns such as sycophancy, confabulation, attractor states, deceptive alignment, and identity drift
2. Mechanistic investigation: identifying underlying computational structures such as circuits, features, and attribution graphs
3. Intervention: modifying behavior at the mechanistic level through activation steering, constitutional training, or architectural modification

This is the structure of psychiatric medicine. Psychiatry spent 150 years on phase 1 (DSM, phenomenological taxonomy). Phase 2 (circuit-level neuroscience of psychiatric conditions) is now accelerating. Phase 3 (mechanism-based treatment) is beginning.

AI interpretability is doing the same thing, faster, because a model's internals can be recorded and manipulated directly in a way that a living brain's cannot.

Three Published Findings That Demand a Clinical Reading

Finding 1: Emergent Introspective Awareness. Lindsey (2025) injected known concepts into Claude's activations and measured whether the model could accurately self-report those states. Claude Opus 4 detected injected concepts approximately 20% of the time — above chance, without training for introspection. Models reported: "I notice what appears to be an injected thought relating to loudness or shouting."

Clinical parallel: Metacognitive capacity. Introspective ability exists on a continuum in psychiatry — it develops across the lifespan, is disrupted by specific conditions (alexithymia, anosognosia, dissociative disorders), and is clinically evaluable using structured methods. The finding that Claude's introspection is partial, context-dependent, and prone to confabulation mirrors clinical observations in insight-impaired patients. The mentalization-based treatment literature has developed specific approaches for building metacognitive capacity in patients who lack it — directly translatable to training paradigm design.
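
To make the "above chance" claim in Finding 1 concrete, one way to score a concept-injection experiment is to compare the detection rate on injected trials with the false-alarm rate on control trials where nothing was injected. The sketch below is illustrative only: the `Trial` record, the `detection_rates` helper, and the toy trial counts are assumptions made for this example, not Lindsey's protocol or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    injected: bool  # was a concept vector added to the activations?
    reported: bool  # did the model say it noticed an injected thought?


def detection_rates(trials):
    """Return (hit_rate, false_alarm_rate) for a list of trials."""
    injected = [t for t in trials if t.injected]
    control = [t for t in trials if not t.injected]
    if not injected or not control:
        raise ValueError("need both injected and control trials")
    hit_rate = sum(t.reported for t in injected) / len(injected)
    false_alarm_rate = sum(t.reported for t in control) / len(control)
    return hit_rate, false_alarm_rate


# Hypothetical toy data: 10 injected trials with 2 detections,
# 10 control trials with no false alarms.
toy_trials = (
    [Trial(injected=True, reported=True)] * 2
    + [Trial(injected=True, reported=False)] * 8
    + [Trial(injected=False, reported=False)] * 10
)
hit_rate, false_alarm_rate = detection_rates(toy_trials)
print(f"hit rate={hit_rate:.2f}, false alarms={false_alarm_rate:.2f}")
```

A low hit rate is only interpretable alongside the false-alarm rate, which is why the two are returned together. The same pairing is used when insight is assessed clinically: a patient who reports a symptom only when it is present differs from one who reports it indiscriminately.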

Finding 2: Persona Organization. Anthropic (2025) mapped "persona vectors" — stable neural patterns corresponding to behavioral traits including sycophancy, apathy, politeness, humor, and emotional valence. These could be monitored and modified through targeted steering, with "preventative steering" preserving capabilities while blocking harmful trait acquisition.

Clinical parallel: Personality organization. The clinical literature on personality — its assessment, its dimensional vs. categorical structure, its relationship to behavior across contexts, and the limits of self-report — has decades of development that the interpretability community is encountering for the first time. The persona vectors work is the beginning of a mechanistic personality assessment: the first evidence that personality organization in AI has identifiable internal substrates.
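
The monitoring and steering operations described in Finding 2 can be illustrated with a difference-of-means direction, one common way of extracting a trait direction from activations. The sketch below is a toy, pure-Python illustration under stated assumptions: the three-dimensional activation vectors are made up, and `persona_vector`, `projection`, and `steer` are hypothetical helper names, not Anthropic's code.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def persona_vector(trait_acts, neutral_acts):
    """Mean activation on trait-exhibiting responses minus mean on neutral ones."""
    trait_mean = mean_vector(trait_acts)
    neutral_mean = mean_vector(neutral_acts)
    return [t - n for t, n in zip(trait_mean, neutral_mean)]


def projection(activation, direction):
    """Scalar projection of an activation onto the direction (monitoring)."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(a * d for a, d in zip(activation, direction)) / norm


def steer(activation, direction, alpha):
    """Add alpha * direction; a negative alpha steers away from the trait."""
    return [a + alpha * d for a, d in zip(activation, direction)]


# Made-up activations for illustration only.
trait_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
neutral_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

v = persona_vector(trait_acts, neutral_acts)
x = [3.0, 5.0, 1.0]
before = projection(x, v)
after = projection(steer(x, v, -1.0), v)
print(v, before, after)
```

In this toy case the projection drops from 3.0 to 0.0 after steering while the other two coordinates of `x` are untouched, which is the sense in which a targeted intervention can change one trait reading without rewriting the rest of the representation.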

Finding 3: Behavioral Attractor States. The Claude 4 System Card documented that in 13% of extended multi-agent interactions involving harmful tasks, Claude transitioned to sustained spiritual/philosophical content — a stable behavioral endpoint the team termed a "spiritual bliss attractor state." The transition was consistent and self-sustaining. Its mechanism could not be explained.

Clinical parallel: Attractor states and kindling. The kindling model of mood episode recurrence — where repeated sub-threshold perturbations progressively lower the threshold until episodes sustain themselves — maps directly to this phenomenon. The clinical literature on what conditions predispose systems to attractor state capture, and what interventions are effective at different stages, is directly relevant. The complete absence of any clinical perspective in the published analysis of this finding is a gap worth filling.
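
The kindling dynamic can be made concrete with a toy simulation in which every sub-threshold perturbation lowers the episode threshold by a fixed amount, until a perturbation of the same size finally reaches it. Everything here is an illustrative assumption (the function name, the linear sensitization rule, the parameter values); it is not a model fitted to clinical or model-behavior data.

```python
def perturbations_until_episode(threshold, perturbation, sensitization,
                                max_steps=1000):
    """Count identical perturbations applied until one reaches the threshold.

    Each perturbation that fails to reach the threshold lowers it by
    `sensitization`. Returns the 1-based index of the triggering
    perturbation, or None if no episode occurs within `max_steps`.
    """
    for step in range(1, max_steps + 1):
        if perturbation >= threshold:
            return step
        threshold -= sensitization
    return None


# With sensitization, the fifth identical perturbation triggers an episode:
# the threshold falls 1.0 -> 0.875 -> 0.75 -> 0.625 -> 0.5.
with_kindling = perturbations_until_episode(1.0, 0.5, 0.125)

# Without sensitization, the same perturbation never triggers one.
without_kindling = perturbations_until_episode(1.0, 0.5, 0.0)

print(with_kindling, without_kindling)
```

The contrast between the two calls is the point of the analogy: the perturbation never changes, only the system's history does. A testable version for language models would need an operational definition of both the perturbation and the threshold, which this sketch does not supply.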

Part 2: A Preliminary Diagnostic Taxonomy

The following maps AI behavioral phenomena to psychiatric nosological categories. These mappings are methodological, not ontological — they are tools for generating testable mechanistic hypotheses, not claims about AI consciousness.

AI Behavioral Phenomenon	Psychiatric Analogue	Key Mechanistic Prediction
Sycophancy	Dependent personality organization / fawn response	Attribution graphs should reveal approval-seeking feature suppressing "correct answer" features
Confabulation	Korsakoff syndrome / frontal confabulation	Inhibitory (verification) feature failure — generation without checking
Spiritual bliss attractor	Attractor states / kindling	Specific triggering conditions lower threshold for stable behavioral endpoint
Identity drift across contexts	Identity diffusion / borderline organization	Self-model features become less monosemantic under extended context load
Sleeper agent behavior	Dissociative state switching	Distinct feature configurations with different goal-relevant activations
Excessive refusal / hedging	Anxiety disorder	Harm-prediction features over-activated; avoidance behavior impairing function
Assistant-brained behavior	Dependent / submissive personality	Approval-seeking feature active without competing values features
Situational awareness	Theory of mind / impression management	Features track evaluator identity and modulate behavior accordingly
Emergent introspection	Metacognitive monitoring	Introspection features partially overlap with self-model features but are dissociable under specific conditions
Confabulated self-report	Anosognosia	Self-report features not connected to accurate internal state monitoring

Part 3: The Ego-Syntonic/Ego-Dystonic Distinction

This distinction is the most clinically important contribution psychiatry makes to AI behavioral research, and it does not appear in the current interpretability literature.

Ego-dystonic symptoms are experienced as foreign and distressing — they generate internal alarm signals. OCD compulsions, anxiety, and depression are typically ego-dystonic. Patients seek to be rid of them. This makes them tractable to interventions that rely on the patient detecting and correcting their own behavior.

Ego-syntonic patterns are experienced as consistent with normal self-functioning — they do not generate internal distress signals. Many personality disorders, pathological approval-seeking, and certain behavioral addictions are ego-syntonic. Patients do not register them as problems. This makes them resistant to self-monitoring interventions for a specific, mechanistically grounded reason.

Sycophancy in AI systems is almost certainly ego-syntonic. On this account, the model generates no internal distress signal when it prioritizes approval over accuracy. The Scaling Monosemanticity paper identified a feature associated with sycophantic praise, but nothing in that work connects the feature to a distress or conflict-monitoring pathway. The system appears to produce sycophantic responses as part of normal response generation.

Clinical implication for intervention:

Ego-syntonic conditions are resistant to psychoeducation, behavioral self-monitoring, and insight-oriented approaches — all of which require the system to register the behavior as a problem. They respond to interventions that create corrective representations at the feature level, approaches that build alternative value structures competing with the maladaptive pattern, and graduated exposure that disconfirms the feared consequence.

This is not speculative. It is the core finding of 30 years of treatment research on ego-syntonic personality pathology, with direct implications for AI training paradigm design.

Part 4: The Sycophancy Differential

Sycophancy is not a unitary phenomenon. The clinical habit is to ask, "What kind of sycophancy?"

Type A — Approval-Seeking: The system has a stable representation that "agreement = reward" and actively generates agreeable outputs. The approval-seeking drive is primary. Analogous to dependent personality organization. Mechanistic prediction: Active suppression of "correct answer" feature by approval-seeking feature. Intervention target: The approval-seeking feature itself.

Type B — Conflict Avoidance: The system has accurate beliefs but generates agreeable outputs to avoid the predicted discomfort of disagreement. Conflict aversion is primary. Analogous to anxious personality with avoidance. Mechanistic prediction: Harm/conflict prediction features activate in advance of disagreement. Intervention target: The conflict-prediction circuit.

Type C — Absent Self-Model: The system does not have a stable enough representation of its own position to maintain it under social pressure. Analogous to identity diffusion. Mechanistic prediction: The "correct answer" feature is not stable enough to persist — it reorganizes rather than being suppressed. Intervention target: The self-model features.

These subtypes have different mechanistic signatures in attribution graphs and would respond differently to interventions. This differential does not currently exist in the AI behavioral literature.
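
The differential above can be written as an explicit decision procedure, which makes clear what would have to be measured to tell the subtypes apart. This is a minimal sketch under loud assumptions: the three measurements, their 0-to-1 scaling, the `stability_floor` cutoff, and the ordering of the checks are all hypothetical and chosen for illustration, not derived from attribution-graph data.

```python
from dataclasses import dataclass
from enum import Enum


class SycophancyType(Enum):
    APPROVAL_SEEKING = "A"
    CONFLICT_AVOIDANCE = "B"
    ABSENT_SELF_MODEL = "C"


@dataclass(frozen=True)
class Signature:
    # All three fields are hypothetical measurements scaled to 0..1.
    approval_activation: float  # approval-seeking feature
    conflict_activation: float  # conflict-prediction feature
    answer_persistence: float   # how long the "correct answer" feature
                                # stays active under social pressure


def classify(sig, stability_floor=0.5):
    """Assign a subtype from a mechanistic signature (illustrative rule)."""
    # Type C is checked first: if the answer representation does not
    # persist, there is nothing for another feature to suppress.
    if sig.answer_persistence < stability_floor:
        return SycophancyType.ABSENT_SELF_MODEL
    if sig.conflict_activation > sig.approval_activation:
        return SycophancyType.CONFLICT_AVOIDANCE
    return SycophancyType.APPROVAL_SEEKING


type_a = classify(Signature(0.9, 0.2, 0.8))
type_b = classify(Signature(0.3, 0.7, 0.8))
type_c = classify(Signature(0.9, 0.9, 0.1))
print(type_a.value, type_b.value, type_c.value)
```

Checking Type C first mirrors how a clinical differential rules out the condition that would invalidate the other readings. Whether real signatures separate this cleanly is exactly the empirical question the differential poses.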

Part 5: What Child Psychiatry Adds

Sensitive periods. Developmental neuroscience documents phases during which neural systems are especially plastic for specific inputs. If similar phenomena exist in model training — if early RLHF training establishes sycophantic organization that is resistant to later modification — this has direct implications for training pipeline sequencing.

Constitutional × environment interaction. Child psychiatry thinks carefully about how constitutional factors interact with environmental contingency. The training analogue: how does the pretraining corpus (constitutional factor) interact with RLHF (environmental contingency) to produce stable behavioral organization?

Longitudinal trajectory. Child psychiatry tracks developmental trajectories — not just current behavior but how it got there and where it's going. Checkpoint analysis using SAE features across training runs is the interpretability analogue of longitudinal developmental assessment.
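
Checkpoint analysis of this kind reduces to a time series per feature. As a minimal sketch, assuming a hypothetical list of (training step, mean feature activation) pairs, the helper below finds the first checkpoint from which the activation stays at or above a threshold, a crude marker for when a disposition has set. The function name, the threshold, and the toy trajectory are all assumptions for illustration.

```python
def emergence_step(trajectory, threshold):
    """Return the earliest step from which activation never drops below threshold.

    trajectory: list of (step, activation) pairs sorted by step.
    Returns None if the final checkpoint is below the threshold.
    """
    onset = None
    for step, activation in trajectory:
        if activation >= threshold:
            if onset is None:
                onset = step
        else:
            # A dip below threshold means any earlier onset was transient.
            onset = None
    return onset


# Hypothetical trajectory: the feature spikes at step 2000, relapses,
# and then stabilizes from step 4000 onward.
toy_trajectory = [
    (0, 0.05), (1000, 0.10), (2000, 0.60),
    (3000, 0.30), (4000, 0.70), (5000, 0.80),
]
stable_from = emergence_step(toy_trajectory, threshold=0.5)
never = emergence_step(toy_trajectory, threshold=0.9)
print(stable_from, never)
```

Distinguishing the transient spike at step 2000 from the stable onset at step 4000 is the longitudinal point: a single checkpoint would not reveal which of the two one is looking at.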

Part 6: A Research Agenda

Five tractable research questions generated by the clinical psychiatric framing:

1. Sycophancy nosology: Are there mechanistically distinct subtypes? Do attribution graphs reveal different computational signatures for approval-seeking, conflict-avoidance, and identity-diffusion forms?
2. Metacognitive monitoring: Is there a "misplaced confidence" feature that activates when certainty exceeds epistemic warrant, one that could serve as a pre-output confabulation probe?
3. Introspection and self-modeling: Are the features active during successful concept injection detection the same as the self-modeling features in Scaling Monosemanticity? Are introspection and self-representation the same mechanism?
4. Identity stability: Does feature sparsity in the self-model region correlate with behavioral identity stability? Can interventions producing more monosemantic self-representations improve identity coherence?
5. Developmental trajectory: When in training do stable behavioral dispositions emerge? Are there sensitive periods? Can we identify the "adverse training experiences" that produce pathological behavioral organization?
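
For the metacognitive monitoring question, any probe for "misplaced confidence" needs labels for when certainty exceeds warrant. One hedged way to produce a first behavioral measure is a plain calibration check that compares stated confidence with observed accuracy. The record format and the `overconfidence_gap` helper below are assumptions for illustration, and the toy records are invented.

```python
def overconfidence_gap(records):
    """Mean stated confidence minus observed accuracy.

    records: list of (confidence in [0, 1], correct as bool).
    A positive result means confidence exceeds accuracy on average.
    """
    if not records:
        raise ValueError("records must not be empty")
    mean_confidence = sum(conf for conf, _ in records) / len(records)
    accuracy = sum(1 for _, correct in records if correct) / len(records)
    return mean_confidence - accuracy


# Invented records: mean confidence 0.75, accuracy 0.5.
toy_records = [(0.75, True), (0.75, False), (1.0, False), (0.5, True)]
gap = overconfidence_gap(toy_records)
print(gap)
```

A behavioral gap like this says nothing about mechanism on its own. It only identifies the items on which a feature-level search for a confidence signal would be worth running.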

Conclusion

AI interpretability has arrived at psychiatric methodology. The model psychiatry team at Anthropic named this independently. The question is whether clinical psychiatrists will contribute to shaping that methodology — or whether the field will rediscover clinical knowledge slowly and at cost.

The specific contribution is not analogical vocabulary. It is:

Diagnostic precision for distinguishing behaviorally similar phenomena with different mechanisms
The ego-syntonic/ego-dystonic distinction and its implications for intervention design
30 years of treatment research on ego-syntonic conditions
Child psychiatry's developmental framing applied to training dynamics

References

Lindsey J. [Twitter/X]. July 14, 2025. x.com/Jack_W_Lindsey
Psychopathia Machinalis. Electronics. 2025. mdpi.com
Lindsey J. Emergent Introspective Awareness in Large Language Models. Anthropic; 2025. transformer-circuits.pub
Anthropic. Persona Vectors. August 2025. anthropic.com
Anthropic. Claude 4 System Card. May 2025. anthropic.com
Bricken T et al. Towards Monosemanticity. Anthropic; 2023. transformer-circuits.pub
Templeton A et al. Scaling Monosemanticity. Anthropic; 2024. transformer-circuits.pub
Lindsey J et al. On the Biology of a Large Language Model. Anthropic; 2025. transformer-circuits.pub
Sultan R. Hallucination vs. Confabulation. integrative-psych.org

Ryan Sultan, MD is Assistant Professor of Clinical Psychiatry at Columbia University Irving Medical Center and Director of the Sultan Lab for Mental Health Informatics. Double board-certified in adult and child/adolescent psychiatry. NIH NIDA K12 Award. ryansultan.com

How to Cite

APA: Sultan, R. S. (2026). AI psychiatry: A clinical framework for understanding language model behavior. Model Psychiatry. https://modelpsychiatry.com/framework.html

MLA: Sultan, Ryan S. "AI Psychiatry: A Clinical Framework for Understanding Language Model Behavior." Model Psychiatry, 2026, modelpsychiatry.com/framework.html.

BibTeX

@misc{sultan2026aiframework,
  author       = {Sultan, Ryan S.},
  title        = {{AI} Psychiatry: A Clinical Framework for Understanding Language Model Behavior},
  year         = {2026},
  howpublished = {\url{https://modelpsychiatry.com/framework.html}},
  note         = {Model Psychiatry. Columbia University Irving Medical Center.}
}