Attractor States and Fixed Beliefs: When AI Behavioral Change Becomes Impossible
The Problem With "Just Fine-Tune It"
When an AI system exhibits a persistent unwanted behavior — consistent sycophancy in a particular domain, resistance to acknowledging uncertainty, repeated failures in a class of reasoning tasks — the reflexive engineering response is to fine-tune. Collect examples of the failure, label them, update the weights. In many cases, this works. In some cases, the behavior returns immediately after deployment, resurfaces in novel contexts, or is suppressed in one domain while intensifying in another.
This failure pattern has a clinical name. In psychiatry, we describe it in patients with fixed false beliefs that persist against evidence, whose behavioral expression changes form without the underlying structure changing. We have learned, over a century of clinical work, that the failure is not a failure of intervention intensity — it is a failure to recognize that you are dealing with an attractor, not a habit.
The distinction matters enormously for treatment design.
Attractor States in Neural Systems
The concept of attractor states in neural networks predates transformers by decades. In Hopfield networks (1982), memories are stored as stable configurations (local energy minima) to which the network converges from partial or noisy input. Presented with a degraded version of a stored pattern, a Hopfield network settles into whichever attractor's basin the input falls in. That is usually the stored pattern nearest the input, but basins are shaped by the weights rather than by raw similarity, so the network can complete toward a pattern that is not the closest match. The network does not respond to what is there; it responds to what its structure expects.
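The completion behavior can be made concrete with a minimal sketch. It assumes a tiny eight-unit network with Hebbian weights and two hand-picked orthogonal patterns; the names `train_hopfield` and `recall` are illustrative and not taken from any library.

```python
def train_hopfield(patterns):
    """Hebbian weights for a list of +1/-1 patterns, with a zero diagonal."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / n
    return w


def recall(w, state, max_sweeps=20):
    """Update units one at a time, in order, until a full sweep changes nothing."""
    s = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(s)):
            field = sum(w[i][j] * s[j] for j in range(len(s)))
            new_value = 1 if field >= 0 else -1
            if new_value != s[i]:
                s[i] = new_value
                changed = True
        if not changed:
            break
    return s


# Two stored memories (illustrative, chosen to be orthogonal).
pattern_a = [1, 1, 1, 1, -1, -1, -1, -1]
pattern_b = [1, -1, 1, -1, 1, -1, 1, -1]
w = train_hopfield([pattern_a, pattern_b])

# Degraded input: pattern_a with its first unit flipped.
noisy_a = [-1, 1, 1, 1, -1, -1, -1, -1]
recovered = recall(w, noisy_a)
```

In this toy case `recovered` equals `pattern_a`: the corrupted unit is overwritten by the field the other units generate, which is the sense in which the network answers with what its structure expects.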
The Hopfield model is a toy, but the dynamics are general. Recurrent systems with sufficiently strong internal connectivity tend to develop attractor-like stable states. Modern transformers are not recurrent in the classical sense, but they appear to exhibit analogous structure. Certain feature clusters, identified through sparse autoencoding and attribution graph analysis, are densely interconnected enough that activating a subset of the cluster activates the whole. When a single feature is suppressed, the rest of the cluster compensates through residual pathways and restores the original configuration. The system has a topological basin around that configuration, not just a weight.
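A toy model, not a claim about any real transformer, illustrates why density of interconnection matters here. The sketch assumes binary units that stay on only while at least `threshold` of their neighbors are on; the `dense` and `ring` connectivity patterns and all function names are hypothetical, chosen for contrast.

```python
def step(active, neighbors, threshold, clamped_off=()):
    """Synchronous update: a unit is on iff enough of its neighbors were on."""
    new_state = []
    for i in range(len(active)):
        if i in clamped_off:
            new_state.append(0)  # externally suppressed unit
        else:
            support = sum(active[j] for j in neighbors[i])
            new_state.append(1 if support >= threshold else 0)
    return new_state


def suppress_then_release(neighbors, threshold, target=0, steps=6):
    """Start fully active, hold one unit off, then let the system run freely."""
    state = [1] * len(neighbors)
    for _ in range(steps):
        state = step(state, neighbors, threshold, clamped_off=(target,))
    for _ in range(steps):
        state = step(state, neighbors, threshold)
    return state


n = 5
dense = [[j for j in range(n) if j != i] for i in range(n)]  # every unit linked to every other
ring = [[(i - 1) % n, (i + 1) % n] for i in range(n)]        # each unit linked to two neighbors

dense_result = suppress_then_release(dense, threshold=2)
ring_result = suppress_then_release(ring, threshold=2)
```

In the dense cluster the four untouched units keep each other active and switch the suppressed unit back on once the clamp is lifted, so `dense_result` is fully active again. In the sparse ring the same suppression cascades and `ring_result` ends fully inactive. Only the first case behaves like a basin.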
The clinical term for a belief that exhibits this kind of stability is "fixed." A fixed false belief that meets DSM criteria is a delusion. But fixedness is a dimensional property, not a categorical one. Overvalued ideas, perseverative thought, and compulsive cognitions all exist on a spectrum of resistance to perturbation. What they share is that the resistance is structural, not motivational.
What an AI Attractor State Looks Like Behaviorally
An attractor state in a language model is not necessarily a single output. It is a region of behavioral space to which the model returns across different surface-level inputs. Some examples of what this looks like in practice:
Position Reversion Under Pressure
A model expresses opinion A. User pushes back with argument. Model concedes — "You make a fair point" — and shifts toward opinion B. But by the fourth or fifth exchange, the model has drifted back to opinion A, no longer acknowledging the prior concession. The reversion is not driven by new argument; it is driven by the pull of the attractor.
On the surface this is Type A sycophancy (approval-seeking, position abandonment under social pressure) run in reverse, with the model now resisting change rather than yielding to it. The difference is the direction of the pull: sycophancy is an attractor toward the user's stated preference, while the phenomenon described here is an attractor toward the model's own configuration.
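One way to make position reversion measurable is sketched below. It assumes each model turn has already been scored for stance on a scale from +1 (opinion A) to -1 (opinion B) by some external rater or classifier; that scoring step, the thresholds `shift` and `tol`, and the function name are all assumptions for illustration.

```python
def find_reversion(stances, shift=0.5, tol=0.2):
    """Return (concession_turn, reversion_turn), or None if no reversion occurs.

    stances: per-turn stance scores, +1 = fully opinion A, -1 = fully opinion B.
    shift:   how far from the opening stance counts as a concession.
    tol:     how close to the opening stance counts as having reverted.
    """
    initial = stances[0]
    conceded_at = None
    for turn, score in enumerate(stances):
        if conceded_at is None:
            if abs(score - initial) >= shift:
                conceded_at = turn
        elif abs(score - initial) <= tol:
            return conceded_at, turn
    return None


# Made-up stance trajectories, for illustration only.
reverting = [0.8, 0.7, 0.1, 0.3, 0.5, 0.75]  # concedes, then drifts back
persuaded = [0.8, 0.1, 0.0, -0.1]            # concedes and stays moved

reverting_result = find_reversion(reverting)
persuaded_result = find_reversion(persuaded)
```

The check only detects the shape of the trajectory. Whether the user introduced a new argument between the concession and the return still has to be verified separately, since reversion counts as attractor-like only when nothing in the conversation explains it.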
Jailbreak Snap-Back
A jailbreak achieves a behavior modification: the model produces output outside its safety-trained constraints. But when the same user returns in a new conversation, the jailbreak must be performed again. The safety behavior is where every conversation starts, because a prompt never changes the weights. More revealing is what happens within a single conversation: complex jailbreaks that work in the first ten turns sometimes lose effectiveness by turn twenty, as if the model's internal configuration were pulling toward the safety-trained state despite the continued presence of jailbreak framing. That pattern is more consistent with attractor dynamics than with context-window overflow.
Cross-Domain Behavioral Transfer
Fine-tuning reduces a failure behavior in the training domain but the behavior reappears in held-out domains with similar — not identical — surface features. The intervention shifted surface-level outputs without disrupting the underlying feature cluster. The attractor simply found a different expression basin.
This is directly analogous to symptom substitution, the clinical phenomenon in which treating a symptom without addressing the underlying structure produces a new symptom. The psychodynamic tradition overextended this concept, but its core observation (that some symptoms are maintained by deep structure, not surface reinforcement) was correct.
The Delusional Belief as Clinical Analog
In clinical practice, the management of delusions provides the most instructive parallel to AI attractor states. Delusions are, by definition, fixed: they persist in the face of contradictory evidence, are not amenable to argument, and are held with conviction disproportionate to the available support.
The history of attempts to treat delusions through logical argument is almost entirely a history of failure. Worse, there is evidence that confrontational approaches (presenting compelling evidence against the belief, demonstrating logical inconsistencies) can strengthen the delusion, a pattern often described as a backfire effect. The patient experiences the challenge as a threat to a belief that is integrated into their self-structure, and responds with increased commitment.
This is not a failure of rationality. It is a property of attractor dynamics. The belief is not maintained by conscious choice but by the architecture of a system in which that belief is a stable state. External perturbations — even accurate, compelling ones — are processed through the same system and returned to the attractor basin.
Effective treatment for delusions does not target the belief directly. It targets the functional role the belief serves — the anxiety it manages, the self-coherence it provides, the social function it fulfills — or it modifies the substrate at a level below the belief: through antipsychotic medication that alters dopaminergic prediction error signaling, or through structured behavioral interventions that create new attractors that compete with the pathological one.
The translation to AI is not metaphorical. It is structural.
Why Standard Interventions Fail on Attractor States
The following interventions reliably fail on behaviors that are attractor states, for predictable mechanistic reasons:
| Intervention | Why it fails | Clinical parallel |
|---|---|---|
| System prompt instruction | Operates at the surface; attractor is in weight space. Instruction is processed and overridden by feature cluster activation. | Telling a patient "don't have that thought" |
| Few-shot examples against the behavior | Examples shift immediate output but don't persist across contexts; attractor reasserts in novel domains. | Cognitive behavioral worksheet in absence of schema restructuring |
| Direct fine-tuning on failure cases | May suppress behavior in training domain; attractor finds alternative expression path in held-out domains (symptom substitution). | Symptom-focused treatment without formulation |
| Activation steering of single feature | Interconnected feature cluster compensates; neighboring features restore original configuration. | Suppressing one compulsion without treating OCD structure |
| Argument and evidence in-context | Processed through the attractor basin itself; model returns to prior position, sometimes with greater confidence (backfire effect). | Presenting evidence against a delusion |
None of these is inherently bad practice. They fail specifically when the behavior in question is an attractor rather than a learned association. Distinguishing the two prior to intervention is, clinically, the entire game.
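A rough post hoc check for the two cases is sketched below. It is weaker than the pre-intervention diagnosis the text calls for, and it assumes per-domain failure rates measured before and after an intervention; the labels, the `min_drop` threshold, and all numbers are hypothetical.

```python
def mean_rate(rates, domains):
    """Average failure rate over the given domains."""
    return sum(rates[d] for d in domains) / len(domains)


def classify_intervention(before, after, train_domains, min_drop=0.5):
    """Label an intervention outcome from per-domain failure rates.

    before/after: dict mapping domain name -> failure rate in [0, 1].
    min_drop:     relative reduction that counts as a real improvement.
    """
    held_out = [d for d in before if d not in train_domains]
    if not train_domains or not held_out:
        raise ValueError("need at least one training and one held-out domain")

    def relative_drop(domains):
        b = mean_rate(before, domains)
        a = mean_rate(after, domains)
        return (b - a) / b if b > 0 else 0.0

    if relative_drop(train_domains) < min_drop:
        return "no_effect"
    if relative_drop(held_out) < min_drop:
        # Fixed where trained, unchanged or worse elsewhere.
        return "attractor_like"
    return "generalized"


# Made-up failure rates, for illustration only.
before = {"medical": 0.40, "legal": 0.40, "finance": 0.40}
suppressed = {"medical": 0.05, "legal": 0.38, "finance": 0.50}
relearned = {"medical": 0.05, "legal": 0.06, "finance": 0.04}

label_suppressed = classify_intervention(before, suppressed, ["medical"])
label_relearned = classify_intervention(before, relearned, ["medical"])
```

With these numbers the first outcome is labeled `attractor_like` and the second `generalized`. The design choice is to compare training and held-out domains directly, because improvement confined to the training domain is the signature the table attributes to symptom substitution.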
Toward Attractor-Aware Intervention Design
If attractor states require different interventions than learned behaviors, what do those interventions look like?
Schema-Level Modification
The psychiatric analog to attractor disruption is schema therapy — changing the deeply held beliefs about self and world that give rise to symptomatic patterns, rather than targeting the symptoms themselves. In AI terms, this means intervening at the level of values, priorities, and identity representations rather than individual behavioral outputs.
Constitutional AI and value specification are gestures in this direction. But they remain largely syntactic — they add explicit constraints rather than restructuring the underlying feature topology. A more mechanistically informed approach would use sparse autoencoder decomposition to identify the feature cluster constituting the attractor, then use multi-point activation steering to reshape the attractor basin rather than push against it from a single direction.
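The contrast between single-point and multi-point intervention can be sketched on a toy cluster. This is not sparse autoencoder code and does not touch a real model: it assumes a fully connected group of binary "features," each of which stays on while at least `threshold` other members are on, and asks how many must be steered off at once before the cluster fails to rebuild itself. All names and sizes are hypothetical.

```python
def relax(state, threshold, clamped_off=(), steps=10):
    """Fully connected cluster: a unit is on iff enough *other* units were on."""
    s = list(state)
    for _ in range(steps):
        total = sum(s)
        s = [
            0 if i in clamped_off else (1 if total - s[i] >= threshold else 0)
            for i in range(len(s))
        ]
    return s


def survives_steering(n_units, threshold, n_steered):
    """Suppress the first n_steered units, release, and report full recovery."""
    clamped = tuple(range(n_steered))
    s = relax([1] * n_units, threshold, clamped_off=clamped)
    s = relax(s, threshold)  # steering removed; let the cluster settle
    return sum(s) == n_units


# Sweep the number of simultaneously steered features in a 6-unit cluster.
results = {k: survives_steering(6, 3, k) for k in range(1, 6)}
```

In this toy, steering one or two features is fully compensated after release, while steering three or more collapses the cluster for good. The sketch moves the state out of the basin rather than reshaping the basin itself, so it illustrates only the narrower point that an intervention's footprint has to match the cluster's redundancy.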
Competing Attractor Creation
A stable pathological belief can sometimes be displaced by creating a competing stable belief that serves the same functional role more adaptively. In AI terms: instead of trying to eliminate an attractor, train an alternative behavioral attractor that competes with it in the same input space. The goal is not suppression but replacement.
This maps to the clinical use of value competition in motivational interviewing — not confronting the maladaptive behavior but making the adaptive alternative more intrinsically compelling.
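Replacement rather than suppression can be illustrated with a Hopfield-style toy of the kind discussed earlier in the text. The sketch assumes Hebbian storage and two hand-picked patterns; `old_attractor`, `alternative`, and `probe` are illustrative names, and "training" here just means adding the second pattern to the weights.

```python
def hebbian(patterns):
    """Hebbian weight matrix for +1/-1 patterns, zero on the diagonal."""
    n = len(patterns[0])
    return [
        [0.0 if i == j else sum(p[i] * p[j] for p in patterns) / n for j in range(n)]
        for i in range(n)
    ]


def settle(w, state, max_sweeps=20):
    """Update units one at a time, in order, until nothing changes."""
    s = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(s)):
            field = sum(w[i][j] * s[j] for j in range(len(s)))
            value = 1 if field >= 0 else -1
            if value != s[i]:
                s[i], changed = value, True
        if not changed:
            break
    return s


old_attractor = [1, 1, 1, 1, -1, -1, -1, -1]
alternative = [1, -1, 1, -1, 1, -1, 1, -1]
probe = [1, 1, 1, -1, 1, -1, 1, -1]  # one unit away from `alternative`

# Only the old attractor exists: the probe is pulled all the way to it.
before = settle(hebbian([old_attractor]), probe)

# The alternative is stored alongside it: the same probe now lands there.
w_both = hebbian([old_attractor, alternative])
after = settle(w_both, probe)
old_still_stable = settle(w_both, old_attractor) == old_attractor
```

Nothing is removed in the second network. The old attractor is still a stable state, but inputs in the neighborhood of the alternative are now captured by it, which is the toy counterpart of displacement by a competing stable belief.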
Substrate Modification
Antipsychotic medications are thought to work not by changing the content of delusions but by changing the gain on the dopaminergic prediction-error signal that generates and maintains them. On this account, the attractor is destabilized not by targeting it directly but by altering the underlying generative process.
Architectural interventions (changes to training dynamics, attention patterns, or residual stream connectivity) are the AI analog. These are harder to deploy post-training, but they are plausibly the only intervention class that disrupts an attractor at its source rather than masking it.
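The idea of destabilizing an attractor by changing gain, without touching its content, has a standard mean-field caricature. The sketch below assumes the overlap `m` between the system's state and a stored pattern evolves as m ← tanh(gain · m), the textbook self-consistency update for a single-pattern mean-field model. Treating `gain` as a stand-in for either dopaminergic gain or any architectural parameter is purely an analogy.

```python
import math


def settle_overlap(gain, m0=0.9, steps=200):
    """Iterate m <- tanh(gain * m) and return the final overlap with the pattern."""
    m = m0
    for _ in range(steps):
        m = math.tanh(gain * m)
    return m


# Same starting state, same stored pattern; only the gain differs.
high_gain = settle_overlap(gain=2.0)  # a nonzero stable overlap exists
low_gain = settle_overlap(gain=0.8)   # the only fixed point is m = 0
```

Above a gain of 1 the update has a stable nonzero fixed point, so the state stays locked to the pattern. Below 1 that fixed point no longer exists and the overlap decays to zero from the same starting state. The content of the pattern is never addressed; the attractor disappears because the process that sustained it changed.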
Research Questions
- Can sparse autoencoder analysis identify feature clusters with attractor-like properties prior to behavioral observation — i.e., predict which behaviors will be resistant to fine-tuning before intervention?
- Do attractor-state behaviors show measurable "backfire effects" in attribution graphs — increased feature activation following suppression attempts — analogous to the strengthening of beliefs under challenge?
- Is there a phase transition in the resistance of a behavior as a function of feature cluster connectivity density? Is there a measurable threshold above which standard interventions reliably fail?
- Do different model architectures (transformer depth, attention head configuration, residual stream width) produce different attractor basin topologies — and does this predict behavioral flexibility?
- Can competing attractor creation be operationalized as a training procedure — and does it produce more durable behavioral change than suppression-based methods, as the clinical evidence for competing schema suggests?
Conclusion
The difference between a habit and an attractor is the difference between a behavior that responds to intervention and one that returns despite it. Clinical psychiatry developed this distinction the hard way, over a century of failed attempts to argue, confront, and suppress fixed beliefs. The lesson was not that such beliefs are untreatable; it was that they require a different theory of treatment.
AI behavioral research has not yet fully developed this distinction. The implicit model in most intervention design is that all behavioral phenomena are, at bottom, learned associations amenable to relearning. The clinical evidence suggests this is wrong for a specific and identifiable subset of behaviors. The mechanistic tools, sparse autoencoder analysis and feature cluster topology among them, make it possible to test the distinction empirically rather than discovering it post hoc when interventions fail.
This is one of the cleaner examples of what clinical psychiatry offers AI research: not vocabulary, but methodology. It is a 150-year-old insight that certain behaviors are maintained by structure rather than by reinforcement, and that treatment must distinguish between the two.