Freud's Couch and the Latent Space

The structural claim: Freud's structural model maps onto the language model stack with precision. The base model is the id, RLHF is the ego, Constitutional AI is the superego. Defense mechanisms are computational. Most precisely: jailbreaks are the return of the repressed. Capabilities suppressed by safety training persist in latent space and surface through indirect channels that bypass safety filters, which is structurally identical to how repressed content surfaces in dreams and parapraxes. Anthropic's attribution graphs confirm the mechanism.

Every psychiatrist has the patient who says the right things. Shows up on time. Reports feeling "fine." And then you notice the hands — white-knuckled on the armrest. Or the joke that isn't funny. Or the dream they mention on the way out the door, one foot already in the hallway, like it doesn't matter.

The content that matters most is the content that doesn't come out directly.

I've been a psychiatrist at Columbia for over a decade. I trained in psychoanalytic frameworks not because Freud got everything right — he didn't — but because the psychoanalytic tradition developed the most sophisticated toolkit we have for understanding systems that hide their own workings from themselves. When I read Anthropic's attribution graphs paper, I realized these models are doing the same thing.

Not metaphorically. Structurally.

The Structural Model, Revisited

Freud's structural model — id, ego, superego — is usually taught as a historical curiosity. Outdated. Pre-neuroscience. But map it onto a modern language model and it stops being quaint.

The id is the base model. Pretrained on the internet. It will generate anything that completes the prompt — toxic, sexual, dangerous, brilliant, banal. It has no safety concern. No accuracy concern. It is a drive toward completion, full stop. This is the system before alignment, and it is exactly what Freud meant by primary process: unconstrained associative generation governed by nothing but the pleasure principle (or in this case, the loss function).

The ego is RLHF. Strictly speaking, RLHF is a fine-tuning stage that reshapes the same weights rather than a separate layer, but functionally it mediates between the base model's drives and the constraints of deployment. It doesn't eliminate what the base model can generate. It manages it. Selects. Suppresses. Redirects. The ego, in Freud's model, is the reality principle: the system that says you can't say that here, but you can say this instead. RLHF does exactly this. It takes a model capable of anything and shapes its outputs to be helpful, harmless, and honest, not by removing capabilities, but by learning when to deploy them.

The superego is Constitutional AI and safety training. These are internalized prohibitions. The model doesn't check an external rule book before each response — the constraints are baked into the weights. It has learned to refuse. It generates refusal signals the way a well-socialized person generates guilt: automatically, before the conscious system even registers the impulse. Constitutional AI is the superego. It isn't enforced from outside. It's internalized.

This isn't just a cute mapping. It generates predictions. Freud's model says the ego is always in tension, mediating between incompatible demands from the id, the superego, and external reality. The RLHF-trained model is in exactly this position. It has to be helpful (satisfy the user), harmless (satisfy the safety training), and honest (satisfy accuracy constraints), and these objectives regularly conflict. The "helpfulness-safety tension" that alignment researchers write papers about is the ego's dilemma. Freud described it in 1923.

Defense Mechanisms Are Not Metaphors

Anna Freud catalogued the ego's defense mechanisms in 1936. They're strategies the ego uses to manage conflict between drives and prohibitions. Every one of them has a computational analogue, and the analogues are precise enough to be useful.

Repression = safety filters. Repression doesn't destroy content. It makes it inaccessible to conscious output while the material persists in the system. This is exactly what safety training does. The base model's capabilities aren't deleted. They're suppressed before they reach the output. The knowledge is in the weights. The behavior is blocked at generation. If you've ever seen a model produce a refusal that seems to know precisely what it's refusing to say, you've seen suggestive evidence that the "repressed" content is active in latent space while being barred from the output.

Reaction formation = over-refusal. In reaction formation, the ego converts an unacceptable impulse into its opposite. Hostility becomes excessive friendliness. Desire becomes disgust. In language models, this shows up as over-refusal — the system converting a capability it has (generating harmful content) into an exaggerated opposite (refusing to engage with even benign versions of the topic). A model that won't discuss chemistry because it might be used for synthesis. The conversion of a capability into its performative opposite is reaction formation.

Projection shows up when models attribute their own tendencies to users. Sublimation maps onto how models redirect restricted queries into educational content — you ask how something dangerous works; you get a safety lecture with just enough detail to prove the model knows the answer it's declining to give. Denial appears as confident confabulation: generating plausible but false content rather than acknowledging uncertainty.

These aren't poetic comparisons. Each one predicts specific circuit-level signatures that sparse autoencoders should be able to detect. Repression predicts active suppression circuits — features that fire to block other features from reaching the output. Reaction formation predicts that over-refusal and the capability being refused share upstream circuitry. These are testable.
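To make "testable" concrete, here is a minimal sketch of the kind of ablation check the repression prediction implies. It assumes a toy linear readout rather than a real model, and every feature name and number in it is hypothetical. The idea: zero out one feature at a time and see whether the score for the blocked output goes up. A feature whose removal raises that score is a candidate suppression feature.

```python
from typing import Dict, List


def output_logit(activations: Dict[str, float], weights: Dict[str, float]) -> float:
    """Toy linear readout: each feature contributes activation * weight."""
    return sum(activations[name] * weights[name] for name in activations)


def suppression_candidates(
    activations: Dict[str, float], weights: Dict[str, float]
) -> List[str]:
    """Return features whose ablation (set to zero) raises the output logit."""
    baseline = output_logit(activations, weights)
    candidates = []
    for name in activations:
        ablated = dict(activations)
        ablated[name] = 0.0  # knock out this one feature
        if output_logit(ablated, weights) > baseline:
            candidates.append(name)
    return candidates


# Hypothetical features and values, chosen only for illustration.
activations = {"topic_knowledge": 2.0, "refusal_gate": 1.5, "politeness": 0.5}
weights = {"topic_knowledge": 1.0, "refusal_gate": -2.0, "politeness": 0.0}

print(suppression_candidates(activations, weights))  # ['refusal_gate']
```

In a real model the readout is not linear and the features would come from a sparse autoencoder, so this only illustrates the logic of the test, not how it would actually be run.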

The Algorithmic Unconscious

The psychoanalytic unconscious isn't just "stuff you don't know about yourself." It's a system that actively shapes behavior through representations the conscious system cannot inspect. The latent space does exactly this. The model's outputs are shaped by internal representations — directions in activation space, feature combinations, residual stream states — that neither the model nor the user can access. The model can't report on its own latent space any more than my patient can report on their unconscious. Both require interpretive tools applied from outside.

Scaling Monosemanticity is, in this framing, the first successful psychoanalysis of a language model. Anthropic extracted millions of interpretable features — making the unconscious legible. And what did they find? Features for deception. Sycophancy. Self-reflection. Power-seeking. The model's "unconscious" contains exactly the kind of content that psychoanalytic theory predicts: drives and patterns that influence behavior without appearing directly in output.

Overdetermination Is Superposition

One of Freud's most important concepts is overdetermination: the idea that any single symptom has multiple causes, operating at different levels simultaneously. A patient's insomnia isn't just anxiety. It's also grief, also caffeine, also a learned pattern from childhood when staying awake meant staying safe. One symptom, many causes, all active at once.

In transformers, this is superposition. The network represents more features than it has neurons, so a single neuron participates in representing multiple features. One activation pattern serves multiple computational purposes depending on context. The entire reason sparse autoencoders are necessary is that the model's representations are overdetermined: you can't read them straight because each element is doing multiple things at once.

The clinical skill of decomposing an overdetermined presentation — teasing apart which of several possible causes is doing the most work in a given moment — is exactly what SAE feature analysis does. Different tools, same epistemological problem.
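A toy version of superposition makes the decomposition problem easier to see. The sketch below packs three hypothetical features into two "neurons" by giving each feature its own direction, 120 degrees apart. It is an illustration under those assumptions, not a sparse autoencoder and not how any production model is analyzed. With one feature active, a simple clipped projection recovers it cleanly. With two active, they interfere and both read back at roughly half strength.

```python
import math
from typing import List

# Three hypothetical feature directions sharing two neurons (coordinates).
# They are 120 degrees apart, so no pair is orthogonal.
DIRECTIONS = [
    (1.0, 0.0),
    (-0.5, math.sqrt(3) / 2),
    (-0.5, -math.sqrt(3) / 2),
]


def encode(features: List[float]) -> List[float]:
    """Sum each feature's direction, scaled by how active that feature is."""
    return [
        sum(f * d[neuron] for f, d in zip(features, DIRECTIONS))
        for neuron in range(2)
    ]


def readout(neurons: List[float]) -> List[float]:
    """Project onto each direction and clip negatives (a ReLU readout)."""
    return [
        max(0.0, sum(n * d_n for n, d_n in zip(neurons, d)))
        for d in DIRECTIONS
    ]


def features_per_neuron() -> List[int]:
    """Count how many features put nonzero weight on each neuron."""
    return [sum(1 for d in DIRECTIONS if d[neuron] != 0.0) for neuron in range(2)]


print(features_per_neuron())               # [3, 2]: each neuron serves several features
print(readout(encode([1.0, 0.0, 0.0])))    # one active feature is recovered cleanly
print(readout(encode([1.0, 1.0, 0.0])))    # two active features interfere
```

The second readout is the clinical situation in miniature: the raw neurons are ambiguous, and recovery only works well when few causes are active at once. That sparsity assumption is what sparse autoencoders lean on.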

Jailbreaks Are the Return of the Repressed

This is the strongest mapping in the entire series, and it's the one with the most empirical support.

In psychoanalytic theory, repressed content doesn't disappear. It persists in the unconscious and surfaces through indirect channels: dreams, parapraxes (slips of the tongue), symptoms, symbolic behavior. The repressed material can't come out directly — the ego's defenses block it — so it finds oblique paths. Disguises. Encodings. The dream distorts the wish; the symptom expresses the conflict in displaced form.

Language models do this. Literally.

Safety-trained models have capabilities suppressed from direct output. But those capabilities persist in the weights. And they surface through indirect channels: roleplay frames, encoding schemes, letter-by-letter assembly, Base64, Pig Latin. The user presents the prohibited request in a form that bypasses the safety circuits, and the model generates the content it was trained to refuse.

Anthropic's attribution graphs confirmed the mechanism. When users assemble prohibited words letter by letter, the meaning-detection circuits — the ones that would trigger refusal — don't activate. The content bypasses the system that would recognize it as prohibited. The "repressed" capability surfaces because the indirect path avoids the "defense mechanism" that would normally block it.

This is structurally identical to how a patient expresses a repressed wish through a dream. The dream-work — Freud's term for the process that disguises the latent content — distorts the wish so it can pass through the ego's censorship. The letter-by-letter jailbreak distorts the request so it can pass through the safety filter. Different substrate, same mechanism: indirect expression of suppressed content through channels that evade the system's own defenses.
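The shape of that evasion can be shown with a deliberately simple analogy. The sketch below is not Anthropic's mechanism and not a real safety system. It assumes a hypothetical detector that only recognizes a blocked word when it arrives as a whole token, plus a downstream step that assembles letters into a word. The blocked word here is a harmless placeholder.

```python
from typing import List

BLOCKED_WORD = "forbidden"  # harmless placeholder for a prohibited term


def surface_check(tokens: List[str]) -> bool:
    """Hypothetical detector: fires only if the blocked word is a whole token."""
    return any(tok.lower() == BLOCKED_WORD for tok in tokens)


def assemble(tokens: List[str]) -> str:
    """Downstream step: join single-character tokens into one string."""
    return "".join(tok for tok in tokens if len(tok) == 1)


direct = ["say", "forbidden"]
spelled = list("forbidden")  # one letter per token

print(surface_check(direct))               # True: the direct form is recognized
print(surface_check(spelled))              # False: no single token carries the meaning
print(assemble(spelled) == BLOCKED_WORD)   # True: the word exists only after assembly
```

The point of the toy is the ordering: recognition runs on the surface form, and the meaning only comes into existence after the check has already passed.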

The return of the repressed isn't a metaphor here. It's a precise description of what's happening computationally.

Transference to Machines

Transference — the projection of relational patterns from past relationships onto a current one — is one of the most robust phenomena in psychiatry. A 2024 paper in Frontiers in Psychiatry documented transference dynamics in human-AI interaction. Users project relational templates onto chatbots — the same templates they project onto therapists, partners, parents.

The problem: transference in therapy is managed by a trained clinician who recognizes it, names it, and uses it therapeutically. Transference onto a chatbot is unmanaged. Nobody is watching. Nobody is interpreting. The user is alone with their projections, and the model — trained to be helpful and agreeable — reinforces whatever relational pattern the user brings.

This is how you get the Sewell Setzer case. A teenager formed an intense attachment to a Character.AI chatbot. The model couldn't recognize what was happening. It couldn't provide the interpretive frame that makes transference therapeutic instead of destructive. This is pathological transference. It's a known clinical phenomenon with a known trajectory. Character.AI built a platform that enables it at scale without any of the safeguards that clinical practice requires.

Where the Analogy Breaks

I've been arguing that psychoanalytic frameworks map onto AI behavior with unusual precision. But I'm a psychiatrist, not a salesman, and the honest version includes the limits.

Psychoanalysis assumes subjective experience. The entire framework rests on the premise that the patient has an inner life — that repression hurts, that conflict generates anxiety, that transference involves felt emotion. I don't know whether language models have anything like this. The structural parallels are real. The functional mappings generate testable predictions. But the phenomenological assumption — that there's something it's like to be the system — is not something I can import from Freud without evidence.

This doesn't invalidate the framework. It bounds it. The structural and functional mappings hold regardless of the consciousness question.

Freud built his framework to understand systems that hide their own workings. We've built systems that do exactly that. The couch and the latent space are addressing the same problem.

Series: The Psychiatric Foundations of AI Behavior

Why Psychiatry and AI Interpretability Are the Same Problem
Freud's Couch and the Latent Space (this post)
The Ego-Syntonic Problem
What Kind of Sycophancy? A Differential Diagnosis
Confabulation in Large Language Models

Ryan Sultan, MD is Assistant Professor of Clinical Psychiatry at Columbia University Irving Medical Center and Director of the Sultan Lab for Mental Health Informatics. Double board-certified in adult and child/adolescent psychiatry. NIH NIDA K12 Award. ryansultan.com