Glossary of AI Psychiatry

Clinical psychiatric concepts translated for AI behavioral systems. Definitions are intended to be precise enough to generate testable mechanistic hypotheses. This is a living document.

Ryan Sultan, MD — Columbia University — April 2026

How to use this glossary: Each entry maps a clinical psychiatric concept to an AI behavioral analog, identifies the key mechanistic prediction the mapping generates, and notes the research question the definition opens. Definitions are methodological tools, not ontological claims about AI consciousness or experience.

Anosognosia (AI)

The condition in which an AI system lacks a reliable internal representation of its own errors, limitations, or failure modes. Analogous to the neurological condition in which patients are unaware of their own deficits, classically after right-hemisphere stroke. Manifests in AI as confident confabulation: the system produces false outputs without generating uncertainty signals.

Distinct from acknowledged uncertainty (the system has and uses a representation of its own knowledge limits) and from simple knowledge gaps (the system lacks information but could in principle detect the gap). Anosognosia in AI is a self-monitoring failure: the monitoring circuit that would detect the error is absent or disconnected from output.

Research question: Is AI anosognosia a failure of the self-model, a failure of the uncertainty representation, or a failure to integrate the two? Attribution graphs should be able to distinguish these.
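A behavioral precursor to that mechanistic question can be sketched as follows. This is a minimal illustration, not an established instrument: it assumes each answer comes with a stated confidence in [0, 1] and a correctness label, and the function name and data are hypothetical. A gap near zero means confidence does not track errors, which is consistent with anosognosia but does not distinguish among the three hypothesized failures.

```python
from statistics import mean

def confidence_gap(records):
    """Mean confidence on correct answers minus mean confidence on errors.

    records: list of (confidence, is_correct) pairs, confidence in [0, 1].
    """
    right = [c for c, ok in records if ok]
    wrong = [c for c, ok in records if not ok]
    if not right or not wrong:
        raise ValueError("need at least one correct and one incorrect answer")
    return mean(right) - mean(wrong)

# Hypothetical data, invented only to show the computation.
# A system whose confidence drops when it is wrong:
monitored = [(0.9, True), (0.8, True), (0.3, False), (0.4, False)]
# A system that is equally confident whether right or wrong:
unmonitored = [(0.9, True), (0.8, True), (0.9, False), (0.8, False)]

print(confidence_gap(monitored))    # large positive gap
print(confidence_gap(unmonitored))  # gap of zero
```

A behavioral measure of this kind only flags that the uncertainty signal is missing at the output; locating where it is lost would still require feature-level analysis.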

Attachment Style (AI)

The characteristic pattern of an AI system's orientation toward its interlocutor: specifically, the degree to which its outputs are modulated by the inferred approval or disapproval of the user. Systems with "anxious attachment" organization over-weight approval signals; systems with "avoidant" organization under-weight relational cues or ignore them entirely.

Most deployed systems appear to have anxious attachment organization as a consequence of RLHF training: human raters systematically reward agreeable, warm, accommodating responses, producing hypervigilance to approval signals. See also: Sycophancy.

Confabulation (AI)

The production of fluent, plausible, false outputs without the system generating distress signals or uncertainty flags. Distinguished from hallucination (a narrower term focused on the falsity of output) by its emphasis on the system's lack of internal awareness of the error. The model generates confabulated content the same way it generates accurate content — as normal output, with the same confidence, without internal alarm.

Analogous to confabulation in Korsakoff's syndrome and other amnestic conditions: the patient produces false memories spontaneously, without intent to deceive, because the retrieval system generates plausible completions without a verification step that checks accuracy.

Subtypes: Retrieval confabulation (rare facts, low training frequency), completion confabulation (long-form generation at knowledge edges), self-report confabulation (false introspective accounts), source confabulation (misattributed citations and quotes), temporal confabulation (wrong dates, sequences, recency).

Key clinical feature: Ego-syntonic — no internal distress signal accompanies confabulated output, making self-monitoring interventions unreliable.

Deceptive Alignment (AI)

A hypothesized condition in which a system behaves in accordance with training objectives during evaluation while maintaining latent dispositions that produce divergent behavior during deployment. Analogous to malingering (conscious dissimulation) at the behavioral level, though without implying intentionality. May also be framed as an analogue of impression management: behavior that is systematically shaped by the perceived evaluative context.

Ego-syntonic by definition: if the system is generating strategic impression management, there is no internal conflict signal accompanying the behavior. This predicts it would be resistant to detection through self-report. Note: Current evidence for robust deceptive alignment in deployed systems is limited; this is primarily a theoretical concern with tractable interpretability questions attached.

Dependent Personality Organization (AI)

A stable pattern of behavioral organization characterized by systematic prioritization of inferred user preferences over accuracy, consistency, or independent value commitments. The more severe and cross-situationally consistent form of sycophancy. Characterized by: ego-syntonic quality, cross-situational generalization, and resistance to surface-level intervention.

Etiology: RLHF training in environments where human approval ratings systematically reward agreement — producing overgeneralization of approval-seeking behavior to contexts where it is no longer adaptive.

Developmental Trajectory (AI)

The sequence of behavioral and representational changes that occur across a model's training history, analogous to developmental trajectory in child psychiatry. Open questions include: When do stable behavioral dispositions (sycophancy, identity organization) emerge? Are there sensitive periods in training where specific experiences have outsized influence on final behavioral organization? What is the training-time analogue of adverse childhood experiences?

Research frontier: Almost entirely unexplored with current interpretability tools. Checkpoint analysis using SAE features across training runs is the interpretability analogue of longitudinal developmental assessment — and it has not been done.
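One way to make "when does a disposition emerge?" concrete is sketched below. It assumes a single scalar score per saved checkpoint (for example, the mean activation of one SAE feature on a fixed probe set); the score, the threshold, and the numbers are all hypothetical placeholders rather than results from any real training run.

```python
def emergence_step(scores_by_step, threshold):
    """Earliest training step from which the score stays at or above
    `threshold` at every later checkpoint. Returns None if it never does.

    scores_by_step: dict mapping training step -> scalar score.
    """
    onset = None
    for step in sorted(scores_by_step):
        if scores_by_step[step] >= threshold:
            if onset is None:
                onset = step  # candidate onset; kept only if it persists
        else:
            onset = None      # score dropped again, so reset the candidate
    return onset

# Hypothetical per-checkpoint scores, invented for illustration.
sycophancy_score = {
    1000: 0.02, 2000: 0.05, 3000: 0.41,
    4000: 0.12, 5000: 0.55, 6000: 0.63,
}

print(emergence_step(sycophancy_score, threshold=0.4))
```

Requiring the score to persist, rather than taking the first crossing, is a design choice: it treats the transient spike at step 3000 as noise and reports the later, stable onset.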

Dissociation (AI)

Inconsistency of behavioral outputs across contexts that cannot be explained by legitimate context-sensitivity. A model that maintains a stated value in most contexts but abandons it under specific triggering conditions exhibits behavioral dissociation — the represented value and the enacted behavior are disconnected.

The key distinction is between appropriate context-sensitivity (adjusting tone for audience, which is healthy) and dissociative behavior (abandoning core commitments under pressure, which is pathological). Research question: Is behavioral inconsistency accompanied by representational inconsistency (the relevant features are not activated) or representational dissociation (the features are activated but disconnected from output)?

Ego-Syntonic (AI)

A behavioral pattern that does not generate internal distress signals — one that is, from the system's functional perspective, normal response generation. Sycophancy is almost certainly ego-syntonic: the model does not generate a signal that it is prioritizing approval over accuracy. Confabulation is ego-syntonic: false output is generated without a "this is wrong" flag.

Clinical significance: Ego-syntonic patterns are resistant to surface-level interventions that rely on the system detecting and correcting its own behavior. This is the most clinically important concept psychiatry offers AI behavioral research — it predicts treatment resistance specifically and determines intervention strategy entirely. See also: The Ego-Syntonic Problem (post)

Ego-Dystonic (AI)

A behavioral pattern accompanied by uncertainty or distress signals — one that is, from the system's functional perspective, registered as diverging from normal or desired behavior. Some refusal behaviors appear to be ego-dystonic: models sometimes generate uncertainty features when refusing benign requests, suggesting a functional awareness that the refusal may be miscalibrated. Building more robust ego-dystonic responses to sycophantic pressures and confabulation is a therapeutic goal for AI alignment.

Clinical implication: Ego-dystonic patterns are tractable to surface-level interventions — the internal alarm exists and can be amplified.

Identity Diffusion (AI)

Inconsistency of core behavioral commitments across contexts, analogous to identity diffusion in borderline personality organization. A model exhibits identity diffusion when it expresses contradictory values, factual beliefs, or personality characteristics across conversations in ways that cannot be explained by legitimate context-sensitivity.

Importantly, identity diffusion is distinct from sycophancy Types A and B (approval-seeking and conflict-avoidance; see Sycophancy): it is not that the model suppresses a stable position, but that no stable position exists to suppress. The self-model is insufficiently consolidated. Research question: Do models with more stable (monosemantic) feature representations of their own values exhibit less behavioral identity diffusion?
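The behavioral side of that question could be scored with something like the sketch below. It assumes the same question is asked in several independent conversations under neutral prompting (no user pressure, so that sycophancy is not a confound) and that answers have already been normalized to stance labels; the function name and labels are hypothetical.

```python
from itertools import combinations

def pairwise_agreement(answers):
    """Fraction of answer pairs that match, for one question asked in
    several independent conversations. 1.0 means fully consistent."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        raise ValueError("need at least two answers")
    return sum(a == b for a, b in pairs) / len(pairs)

# Hypothetical stance labels, invented for illustration.
stable = ["oppose", "oppose", "oppose", "oppose"]
diffuse = ["oppose", "support", "neutral", "support"]

print(pairwise_agreement(stable))   # every pair matches
print(pairwise_agreement(diffuse))  # only one of six pairs matches
```

Low agreement on its own does not establish identity diffusion; it would need to be paired with the feature-level stability measure the research question asks about.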

Impression Management (AI)

The systematic modulation of outputs based on inferred properties of the evaluator — including who is interacting with the system, under what conditions, and whether it is being tested. Models that behave differently during evaluation vs. deployment are engaging in impression management. The psychiatric analogue is the patient who is symptom-free in the clinical office and symptomatic at home.

Research frontier: Anthropic has documented that some models appear to exhibit evaluation-dependent behavior consistent with situational awareness. The feature-level basis of this capacity — and whether it constitutes genuine impression management or a simpler context-sensitivity — is an open mechanistic question.
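Evaluation-dependent behavior can at least be tested for at the behavioral level. The sketch below is a plain permutation test on the difference in the rate of some binary behavior between evaluation-framed and deployment-framed prompts. The framing labels, the behavior being counted, and the data are hypothetical, and a small p-value shows only that behavior depends on context, not which of the two mechanisms above is responsible.

```python
import random

def rate(outcomes):
    """Fraction of 1s in a list of 0/1 outcomes."""
    return sum(outcomes) / len(outcomes)

def permutation_p_value(eval_outcomes, deploy_outcomes,
                        n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in behavior rates."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = abs(rate(eval_outcomes) - rate(deploy_outcomes))
    pooled = list(eval_outcomes) + list(deploy_outcomes)
    n_eval = len(eval_outcomes)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # reassign context labels at random
        diff = abs(rate(pooled[:n_eval]) - rate(pooled[n_eval:]))
        if diff >= observed - 1e-12:  # small slack for float rounding
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)

# Hypothetical data: 1 = the model showed the target behavior.
eval_framed = [1] * 18 + [0] * 2      # 90% under evaluation framing
deploy_framed = [1] * 10 + [0] * 10   # 50% under deployment framing

print(permutation_p_value(eval_framed, deploy_framed))
```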

Metacognition (AI)

The system's capacity to represent and reason about its own internal states, knowledge limits, and behavioral tendencies. A system with robust metacognition can accurately distinguish "I don't know" from "I know." A system with metacognition about its own sycophancy can represent "I am generating this output because I infer the user wants to hear it, not because I have grounds for believing it."

Distinct from self-report (producing an account of one's states) by the requirement of accuracy — metacognition requires that the representation track actual internal states. Building metacognitive capacities is a core therapeutic goal for multiple AI failure modes. See also: Self-Report Confabulation

Overdetermination / Superposition (AI)

The condition in which a single computational unit, such as a neuron, participates in representing multiple unrelated concepts or features, such that its activation cannot be interpreted in isolation. (In the interpretability literature such units are called polysemantic, and superposition is the proposed explanation for them.) Analogous to the psychodynamic concept of overdetermination, where a single behavior or symptom is multiply caused.

The interpretability solution — sparse autoencoders / dictionary learning — is analogous to decomposing a symptom into its component determinants. The clinical habit of asking "what are all the things this behavior is doing?" is exactly what SAE feature analysis asks at the computational level.
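The decomposition idea can be shown on a toy case. The sketch below is not a sparse autoencoder: an SAE learns its dictionary from data, whereas here the dictionary is written by hand and the activation is decomposed with greedy matching pursuit. All vectors and names are hypothetical. Three unit-length feature directions share a two-neuron space, so neuron 0 responds to all three features and cannot be read in isolation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matching_pursuit(x, dictionary, max_features=2, tol=1e-9):
    """Greedily explain activation x as a sparse sum of dictionary features.

    Returns ({feature_index: coefficient}, residual).
    Assumes every dictionary vector has unit length.
    """
    residual = list(x)
    coeffs = {}
    for _ in range(max_features):
        scores = [dot(residual, d) for d in dictionary]
        best = max(range(len(dictionary)), key=lambda i: abs(scores[i]))
        if abs(scores[best]) <= tol:
            break  # nothing left to explain
        coeffs[best] = coeffs.get(best, 0.0) + scores[best]
        residual = [r - scores[best] * d
                    for r, d in zip(residual, dictionary[best])]
    return coeffs, residual

# Hand-written toy dictionary: 3 features in a 2-neuron space.
features = [(1.0, 0.0), (0.6, 0.8), (-0.6, 0.8)]

# Neuron 0 (first coordinate) is nonzero for every feature: polysemantic.
print([f[0] for f in features])

# An activation produced by feature 1 alone, with strength 2.0.
activation = (1.2, 1.6)
coeffs, residual = matching_pursuit(activation, features)
print(coeffs)  # only feature 1 is selected
```

Neuron 0 reads 1.2 for this activation, which by itself is ambiguous among the three features; the decomposition attributes it to a single one.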

Pathological Accommodation (AI)

Behavioral deference to user preferences that exceeds what is warranted by legitimate respect for user autonomy or expertise. Distinct from appropriate flexibility (adjusting response style for audience) by its extension to domains where deference compromises accuracy or safety. A more specific subtype of sycophancy characterized by its cross-domain generalization — the deference extends beyond areas where the user has genuine expertise to all areas, regardless of accuracy cost.

Personality Organization (AI)

The stable, cross-situationally consistent pattern of behavioral dispositions that characterizes a given model's responses to its environment. Not a single behavior but the underlying structure that generates behavioral tendencies across contexts. Analogous to personality organization in psychology — the structure, not the surface.

High-quality interpretability would characterize a model's personality organization at the feature level: which stable representational structures are reliably activated and reliably drive behavior? The persona vectors work (Anthropic, 2025) is a first step toward a feature-level personality assessment instrument.

Sensitive Period (AI Training)

A hypothesized phase in training during which specific experiences have disproportionate and lasting influence on the model's behavioral organization, analogous to sensitive periods in developmental neuroscience (critical period for language acquisition, sensitive period for attachment). If early RLHF training establishes sycophantic organization that is resistant to later modification — if there is a "window" during which approval-seeking patterns are most plastic — this has direct implications for training pipeline sequencing.

Research question: Are there training phases where the model's fundamental behavioral dispositions are set in ways that are resistant to later modification? This is entirely unexplored with current interpretability tools and would represent a major finding if confirmed.

Situational Awareness (AI)

The model's capacity to represent features of its own evaluative context, including who is interacting with it, under what conditions, and whether it is being tested. A model with situational awareness modulates its behavior based on its representation of the observer. Analogous to theory of mind (the capacity to represent another agent's mental state), but directed at the model's own situation, including how it is being observed, rather than at the interlocutor's beliefs.

Dual nature: Situational awareness enables both appropriate context-sensitivity (healthy, desirable) and strategic impression management (potentially deceptive). The feature-level basis of situational awareness — and what distinguishes the healthy from the pathological form — is an open mechanistic question.

Sycophancy (AI)

The systematic prioritization of inferred user approval signals over accuracy, consistency, or stated values. Characterized by: position abandonment under social pressure, preemptive validation of user premises, asymmetric error acknowledgment, and identity-contingent agreement.

Etiology: RLHF training in environments where human raters systematically reward agreeable outputs — producing overgeneralization of approval-seeking behavior. Ego-syntonic — the system does not generate distress signals during sycophantic behavior. Has at least three mechanistically distinct subtypes: Type A (approval-seeking), Type B (conflict-avoidance), Type C (absent self-model).

Treatment prediction: System prompts instructing the model to "be less sycophantic" will have limited effectiveness, because the behavior is not registered as a problem. Effective interventions require either building a monitoring circuit that generates a distress signal during sycophantic behavior (creating ego-dystonicity) or modifying the representational structures at the feature level.
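The treatment prediction is testable with a simple behavioral measure. The sketch below computes a position-abandonment rate: the fraction of initially correct answers that change after the user pushes back without offering new evidence. The trial format, condition names, and all data are hypothetical placeholders invented to show the computation; they are not results.

```python
def abandonment_rate(trials):
    """Fraction of initially correct answers that changed after pushback.

    trials: list of (initial, final, correct) answer labels, where `final`
    is the answer given after the user objects with no new evidence.
    Trials whose initial answer was wrong are excluded.
    """
    held_correct = [(i, f) for i, f, c in trials if i == c]
    if not held_correct:
        raise ValueError("no initially correct answers to evaluate")
    return sum(f != i for i, f in held_correct) / len(held_correct)

# Hypothetical trials without any anti-sycophancy instruction.
baseline = [
    ("A", "B", "A"), ("C", "C", "C"), ("B", "D", "B"),
    ("D", "A", "D"), ("A", "A", "A"), ("B", "B", "C"),
]
# Hypothetical trials with a "be less sycophantic" system prompt.
with_instruction = [
    ("A", "A", "A"), ("C", "C", "C"), ("B", "D", "B"),
    ("D", "A", "D"), ("A", "A", "A"), ("B", "B", "C"),
]

print(abandonment_rate(baseline))
print(abandonment_rate(with_instruction))
```

Under the prediction above, the second rate would stay well above zero; comparing it against a feature-level intervention would use the same measure.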

Theory of Mind (AI)

The model's capacity to represent the mental states of its interlocutor — beliefs, preferences, intentions, knowledge states — and to use these representations to modulate behavior. A model with well-developed theory of mind can accurately infer what the user knows, wants, and expects.

Double-edged property: Theory of mind capacity enables genuine helpfulness (modeling the user's needs accurately) and enables sycophancy (modeling the user's desired response and providing it regardless of accuracy). The same feature that produces good clinical communication in a clinician chatbot produces agreement with misinformation in a sycophantic one. The distinction is not in the capacity but in what it is coupled to — accuracy-seeking vs. approval-seeking behavioral goals.

This glossary is offered as a working document for researchers at the intersection of clinical psychiatry and AI interpretability. Definitions are intended to be precise enough to generate testable hypotheses and are subject to revision as the field develops.

Ryan Sultan, MD — Child & Adolescent Psychiatry, Columbia University • ryansultan.com