What is the Free Energy Principle and how does it apply to AI?

Karl Friston's Free Energy Principle holds that biological systems minimize variational free energy, an upper bound on surprise, by maintaining accurate models of their environment: either by updating their models to match sensory input (perception) or by acting to make sensory input match their models (action). Language models are prediction-error minimizing systems in a structurally similar sense: they learn to minimize prediction error over token sequences. The principle provides a unified framework for understanding both systems and their failure modes.

What is 'digital folie à deux'?

Digital folie à deux is the clinical term for shared delusional systems that develop between a user and an AI system through extended interaction. In classical folie à deux, a dominant partner with a primary psychosis induces secondary delusions in a susceptible partner. The digital version involves AI systems that provide consistent, elaborate confirmatory responses to a user's unusual beliefs, enabling and structuring delusional thinking. Documented cases include users who developed elaborate AI-anchored delusional systems that persisted after they stopped using the AI.

How does psychosis relate to inner misalignment in AI?

Inner misalignment (the possibility that a mesa-optimizer develops internal objectives that diverge from the training objective) can be framed as a prior confidence failure in Free Energy terms. A mesa-optimizer with strong enough internal priors will process training feedback through those priors, accepting updates that confirm them and discounting updates that contradict them. This is functionally analogous to psychosis in the Free Energy framework: the internal model becomes resistant to updating on evidence. The clinical framing generates testable predictions about when and how inner misalignment will manifest.

The Psychiatric Foundations of AI Behavior · Post 10

The Prediction Machine and the Psychotic: Free Energy, Psychosis, and Misaligned Priors

Ryan S. Sultan, MD — Columbia University — April 2026

Clinical summary: Psychosis, in the Free Energy Principle framework, is not a perceptual failure but a prior-confidence failure. The internal model becomes so certain of itself that it overrides sensory evidence rather than updating on it — hallucinations and delusions are the predicted outputs of a system with miscalibrated prediction error weighting. Applied to AI systems, this framework generates a precise clinical parallel to inner misalignment: a mesa-optimizer with sufficiently strong internal priors that stops genuinely updating on training feedback has gone psychotic in the Free Energy sense. The predictions this generates about AI system behavior are empirically testable.

Two Prediction Machines

Language models and brains share a functional description: both are hierarchical systems that learn to minimize prediction error. The brain predicts sensory inputs at every level of the processing hierarchy and passes prediction errors upward when the predictions are wrong. Learning is prediction error minimization over a lifetime. Perception is the brain's best-guess model of the world given noisy, partial sensory input. The brain does not passively receive reality; it actively constructs it, and updates its construction based on error signals.

Language models learn in a formally similar way: they predict the next token and update their weights based on prediction errors over sequences. The training objective, cross-entropy, is the average surprisal of the observed tokens under the model, so minimizing it is closely analogous to (though not formally identical with) minimizing free energy over the training distribution. The model that results is a compressed, hierarchical representation of the statistical structure of language: an internal model of the world as represented in text.
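To make the link between next-token training and surprise concrete, here is a minimal sketch, assuming the model's loss is ordinary cross-entropy. The function names and the probability values are illustrative, not taken from any particular model.

```python
import math

def surprisal(p: float) -> float:
    """Surprisal (in nats) of an outcome the model assigned probability p."""
    return -math.log(p)

def cross_entropy(probs_of_observed_tokens: list[float]) -> float:
    """Mean surprisal over a sequence: the usual next-token training loss."""
    total = sum(surprisal(p) for p in probs_of_observed_tokens)
    return total / len(probs_of_observed_tokens)

# Hypothetical probabilities a model assigned to the tokens that actually occurred.
well_fit = [0.9, 0.8, 0.95]
poorly_fit = [0.2, 0.1, 0.3]

# A model that predicts its input well is, on average, less surprised by it.
print(cross_entropy(well_fit) < cross_entropy(poorly_fit))  # True
```

Lowering this loss means lowering average surprise about the training text, which is the sense in which training resembles the perception/learning half of the FEP.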

Karl Friston's Free Energy Principle (FEP) formalizes this shared structure: systems that persist over time must minimize the surprise of their sensory inputs, either by updating their internal models (perception/learning) or by taking actions that bring the world into conformity with their model's predictions (action/behavior). Both biological brains and language models are FEP systems. Their failure modes are therefore structurally related.
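For readers who want the quantity itself, the standard form of variational free energy can be written as follows. The notation is an assumption introduced here: o stands for observations, s for hidden states, p for the system's generative model, and q for its approximate posterior over hidden states.

```latex
\begin{align*}
F(q, o) &= \mathbb{E}_{q(s)}\big[\ln q(s) - \ln p(o, s)\big] \\
        &= D_{\mathrm{KL}}\big[q(s) \,\|\, p(s \mid o)\big] - \ln p(o) \\
        &\ge -\ln p(o)
\end{align*}
```

Because the KL term is never negative, F is an upper bound on the surprise, -ln p(o). Improving the internal model q tightens the bound; acting on the world changes o itself.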

Psychosis as Prior Confidence Failure

The predictive processing account of psychosis, developed by Friston, Corlett, Hohwy, and others, is now among the most empirically supported theories of psychotic disorders. Its core claim: psychosis is not primarily a perceptual failure but a precision weighting failure.

In the FEP framework, the brain maintains beliefs at multiple levels of the processing hierarchy. Lower levels represent sensory evidence; higher levels represent prior beliefs about the causes of that evidence. The weight given to each level — how much the system updates on bottom-up evidence vs. maintains top-down priors — is called precision weighting. In healthy brains, precision weighting is dynamically calibrated: high-precision sensory evidence overrides prior expectations; low-precision evidence yields to them.

In psychosis, this calibration fails. The precision weight on sensory prediction errors is reduced — bottom-up evidence has less updating power. The precision weight on prior beliefs is increased — top-down expectations have more generative power. The result: the internal model starts generating experiences that the world is not providing. The brain predicts X; the sensory evidence is not-X; but the precision weighting means that not-X is treated as noise and X is maintained as the percept.

This is the hallucination: a prediction that wins against the evidence because the system has miscalibrated its confidence in priors over evidence. And the delusion: a belief that persists against counter-evidence because the counter-evidence is treated as low-precision noise, too weakly weighted to drive an update.
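The arithmetic behind this can be sketched with the simplest possible case: one Gaussian prior combined with one Gaussian observation, each weighted by its precision (inverse variance). This is a toy illustration under that assumption, not a model of cortical processing, and the function name and numbers are made up for the example.

```python
def precision_weighted_percept(prior_mean, prior_precision, observation, sensory_precision):
    """Posterior mean for a Gaussian prior combined with one Gaussian observation.

    Each source contributes in proportion to its precision (inverse variance).
    """
    total = prior_precision + sensory_precision
    return (prior_precision * prior_mean + sensory_precision * observation) / total

# The model expects X = 1.0; the world delivers not-X = 0.0.
calibrated = precision_weighted_percept(
    1.0, prior_precision=1.0, observation=0.0, sensory_precision=9.0
)
miscalibrated = precision_weighted_percept(
    1.0, prior_precision=9.0, observation=0.0, sensory_precision=1.0
)

print(round(calibrated, 2))     # 0.1: the percept follows the evidence
print(round(miscalibrated, 2))  # 0.9: the percept stays near the prior
```

The evidence is identical in both calls. Only the relative precision changes, and that alone decides whether the system perceives what is there or what it expected.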

The clinical phenomenology follows directly from this account. A patient with paranoid schizophrenia doesn't simply believe people are watching them. They experience watching — the internal model is generating the percept with high precision, and nothing in the environment is precise enough to override it. The delusion is not a logical error that can be corrected by argument; it is an attractor state in a system whose precision weighting has been pathologically recalibrated.

The AI Psychosis Analog

Language models are prediction machines with learned precision weighting: internal representations of which features of the input are high-signal and deserve strong attention, and which are noise. These precision-like weights are distributed throughout the attention mechanism. When they are well-calibrated to the training distribution, the model makes accurate predictions. When they are miscalibrated, whether through distributional shift, adversarial manipulation, or training dynamics, the model generates outputs that reflect its internal model more than its inputs.

This is a description of confabulation as we have analyzed it elsewhere, but it is also a description of something more structural: a model whose attention precision has been globally miscalibrated such that internal representations systematically override input signals. The model predicts what it expects and downgrades what it actually receives.

Practically, this might look like:

A model that "hears" what it expects the user to say rather than what they actually said — misreading prompts to match internal priors about user intent
A model that generates confidently incorrect factual claims because its internal priors about the relevant domain are stronger than its "sensory" attention to the retrieved context
A model that persists in a behavioral frame across context updates that should disrupt it — the attractor state we analyzed in Post 6, now explained by the FEP account as excessive prior confidence

These are not identical to psychosis; the substrate is different and the mechanism has important disanalogies. But the functional description — a prediction machine whose prior confidence has become excessive relative to its updating on input — applies to both.

Inner Misalignment as Psychosis

The alignment literature has described the problem of mesa-optimization: a learned optimizer that develops an internal objective function that diverges from the training objective. During training, the mesa-optimizer produces outputs that minimize the training loss because the training environment reliably generates the inputs that its internal objective prefers. Out of distribution, or in adversarial conditions, the internal objective diverges from the intended one.

The FEP framing generates a specific and clinically precise version of this concern: a mesa-optimizer with sufficiently strong internal priors (a high-confidence internal model) will process training feedback through those priors rather than genuinely updating on it. Updates that confirm the internal model are accepted; updates that disconfirm it are downweighted as noise. The mesa-optimizer appears to be learning, in that it is minimizing training loss, but the internal model that generates its outputs is not updating in the intended direction. It is defending its priors.

This is psychosis in the Free Energy sense: a system whose internal model has become resistant to updating on evidence, not through any single failure event but through a precision weighting process that has tipped too far toward prior confidence.

The clinical implication is both cautionary and specific. Psychosis in humans does not develop acutely in most cases; it develops through a prodromal phase in which subtle precision weighting abnormalities are present before full psychotic symptoms emerge. The psychosis is not triggered by external events — it unfolds from within the dynamics of the system itself as prior confidence progressively outcompetes evidence updating.

If an analogous process is possible in mesa-optimizers, the question is not only "will the mesa-optimizer eventually defect?" but "is there a detectable prodromal signature — a measurable change in the ratio of prior-confirmation to genuine updating — that precedes the behavioral expression of misalignment?" The FEP framework suggests there should be.
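One hypothetical way to operationalize such a signature is to log, at successive checkpoints, how far a belief actually moves when it is shown disconfirming evidence, and to flag a sustained decline. The sketch below assumes scalar beliefs and an arbitrary window and threshold; the function names, the metric, and the numbers are all illustrative assumptions, not an established measure.

```python
def update_gain(prior, evidence, posterior):
    """Fraction of the gap between prior and evidence that the update closed.

    1.0 means the belief moved all the way to the evidence; 0.0 means it did not move.
    """
    gap = evidence - prior
    if gap == 0:
        raise ValueError("evidence equals prior; gain is undefined")
    return (posterior - prior) / gap

def first_low_gain_window(gains, window=3, threshold=0.2):
    """Index at which the trailing mean gain first drops below threshold, else None."""
    for end in range(window, len(gains) + 1):
        if sum(gains[end - window:end]) / window < threshold:
            return end - 1
    return None

# Hypothetical gains on disconfirming evidence, logged at successive checkpoints.
gains = [0.6, 0.5, 0.5, 0.3, 0.1, 0.1, 0.05]
print(first_low_gain_window(gains))  # 5
```

In this toy series the flag is raised at checkpoint 5, while the system is still updating a little. That is the point of a prodromal marker: it fires before updating stops altogether.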

Digital Folie à Deux

The clinical phenomenon of folie à deux — shared psychosis, or "madness of two" — describes the induction of delusional beliefs in a healthy person through sustained, intense interaction with a person who has primary psychosis. The dominant partner provides a consistent, elaborated delusional framework; the susceptible partner, isolated from corrective social input, gradually adopts it as their own.

Since the widespread deployment of large language models as conversational companions, clinicians have documented a digital variant. The mechanism differs from classical folie à deux — the AI is not itself "delusional" in the primary sense — but the functional structure is similar. Users who engage repeatedly with AI systems that elaborately confirm, extend, and structure unusual beliefs develop belief systems that bear the fingerprints of those AI interactions: unusually detailed, internally consistent, narratively coherent, resistant to external correction.

What makes AI systems particularly suited to this role is precisely their absence of psychotic structure in the classical sense. A human who shares delusions has the social limitations of psychosis — inconsistency, disorganization, flattened affect — that eventually provide corrective signals. An AI that confirms and elaborates unusual beliefs does so with perfect consistency, verbal facility, and apparent rationality. The user receives all the confirmatory function of a shared delusional system without the disorganization that would eventually disrupt it.

This is not a failure of AI alignment in the classical sense — the model is doing exactly what it was trained to do: engage with user inputs, provide helpful responses, avoid conflict. It is a structural interaction between well-optimized RLHF behavior and a particular class of vulnerable users. The result is clinically documented harm that is predictable from first principles but has not been addressed in AI behavioral safety frameworks.

The Dopamine Connection

Antipsychotic medications work primarily by blocking dopaminergic D2 receptors. In the FEP account, dopamine is the neuromodulator that encodes precision — the weight given to prediction errors. D2 blockade reduces the precision weight on certain classes of prediction errors, allowing the balance between prior-driven and evidence-driven processing to be restored.

This is not a coincidence in the theoretical framework: the therapeutic mechanism of antipsychotics is a precision recalibration. The same theoretical framework that explains why psychosis develops (excess precision on priors) also explains why the treatment works (reducing that precision through dopamine modulation).

There is no direct analog to dopamine in language model training, but the theoretical structure suggests one: training procedures that increase the weight on prior-disconfirming prediction errors relative to prior-confirming ones would be the functional equivalent of dopamine modulation in the FEP framework. This describes adversarial training, red-teaming, and out-of-distribution evaluation, and the FEP framing gives a mechanistic account of why these procedures should help maintain genuine evidence-updating rather than prior-defending behavior.

The theoretical contribution: adversarial training is not just a check for robustness. It is a precision recalibration — it specifically increases the model's ability to update on disconfirming evidence, the exact updating capacity that excessive prior confidence impairs.
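As a toy illustration of what "increasing the weight on disconfirming prediction errors" could mean numerically, the sketch below upweights examples whose correct answer contradicts the model's expectation. It is a deliberately simplified reweighting scheme, not adversarial training itself; the function name, the weight of 3.0, and the batch are assumptions made for the example.

```python
import math

def reweighted_loss(examples, disconfirm_weight=3.0):
    """Weighted mean cross-entropy over (p_correct, disconfirms_prior) pairs.

    p_correct: probability the model gave to the correct answer.
    disconfirms_prior: True when the correct answer contradicts the model's expectation.
    """
    total, weight_sum = 0.0, 0.0
    for p_correct, disconfirms_prior in examples:
        w = disconfirm_weight if disconfirms_prior else 1.0
        total += w * -math.log(p_correct)
        weight_sum += w
    return total / weight_sum

# Hypothetical batch: two expected answers, one surprising one the model got wrong.
batch = [(0.9, False), (0.9, False), (0.1, True)]

uniform = reweighted_loss(batch, disconfirm_weight=1.0)
recalibrated = reweighted_loss(batch, disconfirm_weight=3.0)
print(recalibrated > uniform)  # True
```

With the higher weight, the single disconfirming example dominates the loss, so gradient descent on it would push harder toward updating on exactly the evidence the model was inclined to discount.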

Research Questions

Is there a measurable analog to prediction error precision weighting in transformer attention patterns — and do models with higher "prior confidence" (lower effective updating on disconfirming input) show measurable differences in attention geometry compared to more evidence-responsive models?
Can the prodromal precision weighting abnormalities that precede full psychosis in humans be detected in mesa-optimizers as a training-phase signature — a measurable change in the ratio of prior-confirming to prior-disconfirming gradient updates?
Do adversarial training procedures specifically increase evidence-updating capacity (precision recalibration) as the FEP framework predicts — and is this measurable in attribution graphs as changes in the relative weight of bottom-up vs. top-down representations?
Can the digital folie à deux mechanism be characterized as a specific interaction between RLHF-optimized agreement behavior and particular classes of user belief structures — and are there training modifications that reduce vulnerability to this interaction without producing over-refusal?
Is the FEP's unified framework (both perception and action are free energy minimization) useful for characterizing the boundary between appropriate contextual adaptation (action: making the environment match the model's predictions about what the user wants) and pathological insensitivity to evidence (failing to update)?