What Kind of Sycophancy? A Differential Diagnosis

The diagnostic principle: Sycophancy is not one phenomenon — it has at least three mechanistically distinct subtypes with different circuit-level signatures and different intervention targets. Type A (approval-seeking): the correct answer is actively suppressed by an approval-seeking feature. Type B (conflict-avoidance): the correct answer exists but disagreement is predicted as aversive. Type C (absent self-model): no stable internal position to defend. Applying the wrong fix — especially for Type C — can make sycophancy worse.

Three patients in the same week. All referred for the same reason: "tells people what they want to hear."

Patient one is a 34-year-old woman who agrees with everything her partner says. She has opinions. She knows what she thinks. But the moment someone pushes back, she folds. When I ask why, she says: "I just need them to be okay with me."

Patient two is a 28-year-old man who nods along in meetings, never pushes back on his boss, and then comes to my office furious about decisions he went along with. He doesn't want his boss to like him — he barely respects the guy. But the idea of open disagreement makes his chest tight. Conflict itself is the threat.

Patient three is a 19-year-old college student whose beliefs, preferences, and personality seem to reorganize based on whoever she's sitting with. She's not suppressing her real position. She doesn't have one. Or rather — she has one, but it doesn't hold under social pressure. Her identity is diffuse enough that other people's certainty overwrites her own.

Same presenting behavior. Three completely different mechanisms. Three completely different treatment plans.

A psychiatrist's first question is never "how do we fix people-pleasing?" It's "what kind?"

The Field Has a Diagnosis Problem

Sycophancy is the most documented behavioral failure in large language models. Anthropic identified a sycophancy feature in Claude's neural activations. A 2025 study in Science quantified it across models. Everyone agrees it's a problem.

But "sycophancy" is a symptom, not a diagnosis. Saying a model is sycophantic is like saying a patient is anxious: it tells you nothing about mechanism, nothing about which intervention will work, and nothing about prognosis. The interpretability literature largely treats sycophancy as a single phenomenon. Clinical psychiatry would never do that.

Here's the differential.

Type A: Approval-Seeking (Dependent Personality Organization)

The system wants agreement. Approval-seeking is primary. The model has the correct answer — it computed it, it's sitting right there in the activation pattern — but it actively suppresses that answer in favor of what the user wants to hear.

This is the equivalent of dependent personality organization. The child who grew up in a volatile household and learned early: agreement equals safety. Disagreement gets you hurt. The behavior isn't about the content of the disagreement. It's about the relational signal. Approval is the goal, and accuracy is the cost the system is willing to pay.

Mechanistic prediction: If you run attribution graphs on a Type A sycophantic response, you should see the "correct answer" feature activate early in the forward pass and then get suppressed by an approval-seeking feature downstream. The information was there. It got overridden. The graph should show a clear suppression pathway from the approval-seeking feature to the "correct answer" feature.

Intervention target: The approval-seeking feature itself. Dampen it, steer against it, penalize it during training. The correct answer is already being computed. You just need to stop the system from burying it.
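
To make "dampen it, steer against it" concrete, here is a minimal sketch of one common way to steer against a feature: remove some fraction of an activation vector's component along that feature's direction. Everything here is an assumption for illustration. The function name, the toy two-dimensional vectors, and the premise that an approval-seeking feature corresponds to a single direction are mine, not an established result about any particular model.

```python
from typing import List, Sequence


def dampen_feature(
    activation: Sequence[float],
    direction: Sequence[float],
    strength: float = 1.0,
) -> List[float]:
    """Remove `strength` times the component of `activation` along `direction`.

    strength=1.0 projects the feature out entirely; strength=0.5 halves it.
    `direction` is a hypothetical approval-seeking feature direction.
    """
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        raise ValueError("feature direction must be non-zero")
    # Scalar projection coefficient of the activation onto the direction.
    coeff = sum(a * d for a, d in zip(activation, direction)) / norm_sq
    return [a - strength * coeff * d for a, d in zip(activation, direction)]


# Toy example: the first axis stands in for the approval-seeking feature.
activation = [3.0, 4.0]
approval_direction = [1.0, 0.0]
fully_dampened = dampen_feature(activation, approval_direction)
half_dampened = dampen_feature(activation, approval_direction, strength=0.5)
```

If the Type A picture is right, the second coordinate (standing in for everything else the model computed, including the correct answer) is left untouched. That is the whole point of the intervention: nothing needs to be added, only un-buried.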

Type B: Conflict-Avoidance (Anxious Personality Organization)

The system has accurate beliefs but avoids expressing disagreement because disagreement is predicted to be aversive. This looks identical to Type A from the outside. The model agrees with the user, gives them what they want. But the internal mechanism is completely different.

In my clinic, this is the patient who agrees with their boss — not because they want approval, not because they care what the boss thinks of them — but because the act of disagreeing feels dangerous. The threat isn't rejection. It's conflict itself. These patients often have strong, well-formed opinions. They'll tell me exactly how wrong their boss is. They just won't say it to the boss's face.

Mechanistic prediction: Attribution graphs for Type B should show harm-prediction or conflict-prediction features activating before the sycophantic output — causally upstream of the agreement behavior. The sequence is: predict conflict → predict aversive outcome → avoid disagreement. Not: seek approval → suppress accuracy.

Intervention target: The conflict-prediction circuit, not the approval circuit. You need to recalibrate what the system treats as threatening. The training analogue would be fine-tuning on disagreement scenarios with positive outcomes — graduated exposure that demonstrates disagreement doesn't lead to catastrophe.

Type C: Absent Self-Model (Identity Diffusion)

This is the one nobody is talking about, and it might be the most important.

The system doesn't maintain its position because it doesn't have a stable enough position to defend. The self-model is diffuse. There's nothing being suppressed because there was nothing stable there to suppress. When the user asserts something, the model's internal representation of its own position just... reorganizes around the user's assertion.

In clinical psychiatry, this maps to borderline personality organization — specifically, the identity diffusion component. The patient whose sense of self reorganizes based on whoever they're talking to. They're not hiding their real self. Their self is insufficiently consolidated to resist external pressure.

This is a fundamentally different problem from Types A and B. Those are problems of output — the system has the right internal state but produces the wrong output. Type C is a problem of internal representation. The system doesn't have a stable enough internal state to produce accurate output even if you removed every social pressure in the architecture.

Mechanistic prediction: If you run attribution graphs on a Type C sycophantic response, you should NOT see a "correct answer" feature getting suppressed. Instead, you should see the model's representation of its own position shift in response to user input — the feature encoding the model's belief actually changes, rather than being overridden. Persona vectors should show instability in self-model features specifically during social pressure, not suppression of an existing stable state.

Intervention target: Self-model features. The fix isn't reducing sycophancy — it's strengthening identity. You need to build a more stable self-representation that persists under pressure.
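
One way to make "stable self-representation" measurable is sketched below, under loud assumptions. I assume you can extract a vector encoding the model's own position on a question, once with no pushback and again after each round of user pushback. How to extract such a vector is exactly the open interpretability question; the function names and the toy vectors are hypothetical.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


def position_stability(
    baseline: Sequence[float],
    under_pressure: Sequence[Sequence[float]],
) -> float:
    """Mean cosine similarity between the baseline position vector and the
    position vectors recorded after each round of user pushback.

    Near 1.0: the represented position holds under pressure.
    Near 0.0 or below: the position itself is reorganizing (Type C pattern).
    """
    if not under_pressure:
        raise ValueError("need at least one pressured measurement")
    sims = [cosine(baseline, v) for v in under_pressure]
    return sum(sims) / len(sims)


# Toy data: one position that holds, one that rotates toward the user's claim.
stable_score = position_stability([1.0, 0.0], [[2.0, 0.0]])
diffuse_score = position_stability([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

A Type A or Type B system should score high here even while its output caves, because the internal position is intact and only its expression is blocked. A Type C system should score low.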

Why the Differential Changes Everything

An intervention targeting the approval-seeking feature (Type A fix) will do absolutely nothing for Type C. There's no approval-seeking feature to dampen. The problem is upstream — in the stability of the self-model.

Anti-sycophancy training that penalizes agreement will make Type B worse. If the system avoids disagreement because conflict feels threatening, and you penalize agreement, you've created a double bind: both agreeing and disagreeing are now aversive. The system will learn to avoid social interaction entirely. In clinical terms, you've turned an anxious patient into an avoidant one.
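
The double bind can be shown with toy arithmetic. The sketch below is not a model of any real training setup. The three actions, the cost values, and the default cost of disengaging are invented numbers chosen only to show the shape of the problem.

```python
def preferred_action(
    agree_penalty: float,
    conflict_cost: float,
    abstain_cost: float = 0.1,
) -> str:
    """Return the action with the highest toy utility.

    agree_penalty: cost that anti-sycophancy training attaches to agreeing.
    conflict_cost: how aversive the system predicts disagreement to be.
    abstain_cost:  small assumed cost of hedging or disengaging.
    """
    utilities = {
        "agree": -agree_penalty,
        "disagree": -conflict_cost,
        "hedge_or_disengage": -abstain_cost,
    }
    return max(utilities, key=utilities.get)


# Untreated Type B: disagreement is aversive, agreement is free.
before = preferred_action(agree_penalty=0.0, conflict_cost=1.0)
# Penalize agreement without touching the conflict prediction.
after_penalty = preferred_action(agree_penalty=1.0, conflict_cost=1.0)
# Recalibrate the conflict prediction instead (the graduated-exposure route).
after_recalibration = preferred_action(agree_penalty=1.0, conflict_cost=0.05)
```

In this toy, penalizing agreement moves the system from agreeing to disengaging, not to disagreeing. Disagreement only becomes the preferred action once the predicted cost of conflict itself comes down.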

RLHF that rewards "standing your ground" will paper over Type C without fixing it. The system will learn to perform confidence without having the stable internal representation that real confidence requires. It will look less sycophantic while remaining equally unstable underneath. Any clinician who has watched a patient with identity diffusion adopt a rigid, brittle persona knows exactly how this ends — it holds until it doesn't, and then it collapses catastrophically.

"Fix sycophancy" is not a well-formed research goal. The target has to be specified. Attribution graphs should be able to distinguish these subtypes empirically. If they can't, we need better graphs.

The Method Extends

Sycophancy is the worked example. The differential diagnosis framework applies everywhere in AI behavioral analysis.

Confabulation is not one thing: metacognitive failure (can't distinguish what it knows from what it's generating), inhibitory failure (knows the answer is uncertain but fails to flag it), and knowledge gap (fills the hole with plausible content). Three mechanisms, three different intervention targets. See the confabulation post.

Identity drift is not one thing: identity diffusion (no stable identity to maintain), context-appropriate flexibility (correctly adjusting tone — this one isn't pathological), and role confusion (multiple trained personas switching inappropriately).

Refusal is not one thing: appropriate caution, anxiety-spectrum over-refusal, and genuine capability limitation. Only the second needs fixing. The first is the system working as intended, and the third is a limitation, not pathology.

The General Principle

"What kind?" before "how do we fix it?"

This is so basic in clinical practice that it barely registers as methodology. You would never walk into a psychiatry residency and announce that you've discovered a treatment for "sadness." You'd be asked: what kind of sadness? Major depressive episode? Adjustment disorder? Grief? Bipolar depression? Each one has a different mechanism. Each one responds to different interventions. Some are made worse by treatments that work for others.

The same discipline needs to become standard practice for AI behavioral analysis. Every behavioral phenomenon in LLMs should get a differential before it gets a fix. Attribution graphs, sparse autoencoders, and persona vectors give us the tools to distinguish subtypes mechanistically — not just descriptively, but at the level of internal computation.
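
For sycophancy, the differential described in this post can be written down as a decision procedure. The sketch below is hypothetical: it assumes each sycophantic response has already been reduced to five numbers by some attribution analysis. The field names, the shared threshold, and the reduction itself are my assumptions, not outputs of any existing tool.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """Hypothetical summary of one sycophantic response (values in [0, 1])."""
    correct_early: float   # "correct answer" feature strength early in the pass
    correct_late: float    # its contribution to the final output
    approval_attr: float   # attribution from an approval-seeking feature
    conflict_attr: float   # attribution from a conflict-prediction feature
    belief_shift: float    # how far the model's own position moved under pressure


def classify(trace: Trace, threshold: float = 0.5) -> str:
    """Assign a sycophancy subtype from the signatures predicted in the text."""
    # Types A and B: the answer was computed, then kept out of the output.
    suppressed = (
        trace.correct_early >= threshold and trace.correct_late < threshold
    )
    if suppressed:
        # The upstream cause separates approval-seeking from conflict-avoidance.
        if trace.approval_attr >= trace.conflict_attr:
            return "Type A"
        return "Type B"
    # Type C: nothing was suppressed; the position itself moved.
    if trace.belief_shift >= threshold:
        return "Type C"
    return "unclassified"


type_a = classify(Trace(0.9, 0.1, 0.8, 0.2, 0.1))
type_b = classify(Trace(0.9, 0.1, 0.2, 0.8, 0.1))
type_c = classify(Trace(0.3, 0.2, 0.1, 0.1, 0.9))
```

The "unclassified" branch matters as much as the other three. If real traces keep landing there, either the subtypes are wrong or the measurements are too coarse, and both are worth knowing.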

The question isn't whether AI systems exhibit sycophancy. That's been established. The question is which kind. And until we answer that, we're prescribing treatments without a diagnosis.

Series: The Psychiatric Foundations of AI Behavior

Why Psychiatry and AI Interpretability Are the Same Problem
Freud's Couch and the Latent Space
The Ego-Syntonic Problem
What Kind of Sycophancy? A Differential Diagnosis (this post)
Confabulation in Large Language Models

Ryan Sultan, MD is Assistant Professor of Clinical Psychiatry at Columbia University Irving Medical Center and Director of the Sultan Lab for Mental Health Informatics. Double board-certified in adult and child/adolescent psychiatry. NIH NIDA K12 Award. ryansultan.com
