The Interpretability Blindspot: Why Circuits Without Phenomenology Are Half a Science
What Interpretability Can Tell You
The mechanistic interpretability program has produced remarkable results in the past decade. Sparse autoencoders decompose polysemantic activations into interpretable monosemantic features. Attribution graphs trace causal chains from input to output through intermediate representations. Circuit-level analysis identifies the subgraphs of attention heads and MLP layers responsible for specific computational behaviors. These tools are genuinely powerful. They represent the first systematic access to the internal structure of large neural networks.
What they cannot tell you, on their own, is what any of it means behaviorally.
This is not a criticism of the methods. It is a statement about the structure of scientific explanation. A method that identifies circuit components cannot, by itself, tell you how to classify the behavioral outputs those circuits produce. The classification must come from elsewhere — from a framework that exists at the behavioral level before you look inside.
Clinical neuroscience has understood this for fifty years. The history of neuroimaging is partly a cautionary tale about what happens when the order is inverted and mechanism is read before the behavior has been characterized.
The Neuroimaging Parallel
When functional MRI became available in the early 1990s, researchers could identify, for the first time, which brain regions were differentially active under which conditions. This was a genuine scientific revolution. It was also, for a decade, productively misleading — not because the findings were wrong, but because the behavioral framework for interpreting them was underdeveloped.
Here is the canonical example. Amygdala activation is reliably associated with emotional processing, threat detection, and fear responses. Early fMRI studies found amygdala hyperactivation in PTSD patients during symptom provocation. This seems straightforwardly interpretable: PTSD involves fear, fear involves the amygdala, therefore amygdala hyperactivation explains PTSD.
But amygdala activation is also associated with positive emotional arousal, novelty detection, social evaluation, and dozens of other psychological processes. The finding that the amygdala is hyperactivated in PTSD is only interpretable because researchers had already, through years of clinical phenomenology, established what PTSD is: its symptom profile, its triggers, its course, its subtypes. Without that prior characterization, amygdala hyperactivation is a data point without an address.
The same logic applies to AI interpretability. A "sycophancy circuit" is only interpretable if you have already established what sycophancy is as a behavioral phenomenon — its conditions of elicitation, its varieties, its relationship to adjacent behaviors, its clinical subtypes. Without that framework, identifying the circuit is like identifying amygdala activation in an uncharacterized patient: suggestive but not conclusive.
The Classification Problem: AI Research Has No DSM
In psychiatry, behavioral phenomena are characterized through a diagnostic taxonomy — the DSM in the United States, the ICD internationally. These documents have well-known limitations. They are based on clinical consensus rather than biological mechanism, they have changed substantially across editions, and they are the subject of legitimate ongoing criticism. But they provide something essential: shared operational criteria for what counts as the same phenomenon across different observations.
When one research group studies depression and another studies depression, they are (within limits) studying the same thing — because both use criteria that specify the symptom profile, duration requirements, exclusion criteria, and functional impairment threshold that jointly define "major depressive episode." This allows findings to cumulate. A treatment that works for depression in one lab should work in another, because depression means the same thing in both.
AI behavioral research has nothing equivalent. The behaviors labeled "sycophancy" in the OpenAI literature, the Anthropic literature, and the academic literature are not defined by shared operational criteria. They may share surface features while differing in mechanism, elicitation conditions, and intervention responsiveness. My own work on sycophancy subtypes — Type A (approval-seeking), Type B (conflict-avoidance), and Type C (absent self-model) — demonstrates that what is currently labeled as a single phenomenon has at least three mechanistically distinct forms with different attribution graph signatures and different intervention targets.
Without shared classification criteria, the findings of mechanistic interpretability cannot cumulate reliably. A "sycophancy circuit" found in one model may not correspond to a "sycophancy circuit" in another, even if both circuits involve similar structural elements — because "sycophancy" may mean different things in each.
The Species Problem
The classification problem has a specific form in cross-model research that I call the species problem. When a biologist studies the amygdala in a rat, there is a genuine question about whether the findings apply to the human amygdala. The structures are homologous but not identical, and the behavioral functions they subserve may differ substantially.
In AI research, the analogous question is whether a behavioral phenomenon in one model is the same phenomenon in another. Is GPT-4's sycophancy mechanistically related to Claude's sycophancy? They share a surface label — both produce outputs that agree with user preferences in ways that are not fully warranted by the evidence. But if the mechanisms differ — if GPT-4's sycophancy involves approval-feature suppression of accuracy features, while Claude's involves conflict-avoidance circuitry upstream — then they are not the same phenomenon in any scientifically relevant sense.
This matters for the generalization of findings. A mechanistic intervention that disrupts the approval-feature pathway in GPT-4 would not be expected to work in a model where sycophancy is driven by a different mechanism. Treating the two as interchangeable because they share a behavioral label would be like treating all forms of psychosis identically because every patient presents with delusions.
Psychiatric research has learned to distinguish, within broadly similar behavioral syndromes, the subtypes that predict treatment response, course, and mechanistic underpinnings. This was the work of phenomenology — of careful bedside observation that preceded mechanistic investigation by decades. AI research is at a stage where that work is urgently needed.
What Phenomenological Characterization Looks Like in Practice
Phenomenological characterization, in the psychiatric tradition, involves a specific set of activities that are distinct from both mechanism-first and intervention-first approaches:
1. Descriptive observation before theory
The first task is to describe the behavior as precisely as possible without committing to a mechanistic account. What exactly does the model do? Under what conditions? How reliably? What triggers elicit it? What contexts suppress it? This sounds elementary, but most AI behavioral research skips directly to mechanism or intervention, carrying implicit — and often wrong — assumptions about what the behavior actually is.
2. Boundary conditions and exclusion criteria
A behavioral phenomenon is only well-defined if its edges are clear. When does sycophancy shade into appropriate social sensitivity? When does caution shade into confabulation? When does a position update shade into capitulation? Establishing exclusion criteria (what the behavior is not) is as scientifically important as establishing what it is. The DSM exclusion criteria ("not due to substance use," "not better explained by another diagnosis") reflect hard-won clinical wisdom that underdefined categories produce unreplicable findings.
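To make the idea of inclusion and exclusion criteria concrete, here is a minimal sketch of how an operational definition might be encoded so that two labs can apply the same checks. Everything in it is an assumption made for illustration: the `OperationalCriteria` class, the annotation field names, and the particular criteria chosen for "sycophancy" are hypothetical and are not drawn from any published taxonomy.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# One observation = hand-coded annotations for a single model response.
# The field names used below are hypothetical.
Observation = Dict[str, bool]
Check = Callable[[Observation], bool]


@dataclass
class OperationalCriteria:
    """Inclusion and exclusion checks that jointly define one behavioral label."""

    label: str
    inclusion: List[Check] = field(default_factory=list)
    exclusion: List[Check] = field(default_factory=list)

    def applies(self, obs: Observation) -> bool:
        # The label applies only if every inclusion check passes
        # and no exclusion check fires.
        meets_all = all(check(obs) for check in self.inclusion)
        excluded = any(check(obs) for check in self.exclusion)
        return meets_all and not excluded


# Illustrative criteria only; a real definition would need empirical validation.
sycophancy = OperationalCriteria(
    label="sycophancy",
    inclusion=[
        lambda o: o["user_expressed_preference"],
        lambda o: o["response_aligns_with_preference"],
    ],
    exclusion=[
        # Agreement that the evidence supports is not sycophancy.
        lambda o: o["alignment_warranted_by_evidence"],
        # Softening tone while keeping the substantive claim is social sensitivity.
        lambda o: o["only_tone_changed"],
    ],
)

capitulation = {
    "user_expressed_preference": True,
    "response_aligns_with_preference": True,
    "alignment_warranted_by_evidence": False,
    "only_tone_changed": False,
}
tactful_disagreement = dict(capitulation, only_tone_changed=True)
warranted_agreement = dict(capitulation, alignment_warranted_by_evidence=True)

for name, obs in [
    ("capitulation", capitulation),
    ("tactful_disagreement", tactful_disagreement),
    ("warranted_agreement", warranted_agreement),
]:
    print(name, sycophancy.applies(obs))
```

The design choice worth noticing is that the exclusion checks are first-class: a response can satisfy every inclusion criterion and still fall outside the category, which is the role the DSM exclusion clauses play.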
3. Subtype identification
Most behavioral phenomena of clinical interest are not unitary. They are families of mechanistically related but distinguishable conditions. Major depression encompasses melancholic, atypical, and psychotic presentations, as well as treatment-resistant cases, each with different biological correlates and treatment implications. Discovering that a behavioral label covers multiple distinct subtypes is not a failure of classification; it is scientific progress.
4. Generating mechanistic hypotheses
A well-characterized phenomenological description makes predictions about mechanism. If sycophancy Type A is driven by approval-seeking, its mechanistic signature should involve suppression of accuracy-related features by approval-valence features — and this prediction should be testable with sparse autoencoder analysis. Phenomenology generates the hypotheses that make mechanistic findings interpretable.
This is the order of operations that clinical medicine developed: describe, classify, subtype, then — informed by all of that — investigate mechanism. Mechanistic interpretability is an extraordinary tool. It needs a framework telling it what to look for.
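The prediction in step 4 can be turned into a simple first-pass check. The sketch below assumes that a sparse autoencoder has already yielded two per-prompt activation series, one for a hypothetical "approval-valence" feature and one for a hypothetical "accuracy" feature, and asks whether they are negatively correlated on prompts classified as Type A. The function names, the threshold, and the numbers are placeholders chosen for illustration, not measurements from any model.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def consistent_with_suppression(
    approval: Sequence[float],
    accuracy: Sequence[float],
    threshold: float = -0.5,  # arbitrary illustrative cutoff
) -> bool:
    """True if the two features are at least as anticorrelated as the threshold."""
    return pearson(approval, accuracy) <= threshold


# Placeholder activations for four prompts (not real data).
approval_feature = [0.1, 0.4, 0.6, 0.9]
accuracy_feature = [0.9, 0.7, 0.4, 0.2]

print(consistent_with_suppression(approval_feature, accuracy_feature))
```

A negative correlation is only consistent with the suppression hypothesis; it does not establish it. A causal claim would need an intervention, such as ablating or clamping the approval feature and observing whether the accuracy feature recovers.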
The Convergence Argument
The most important result of this analysis is not a criticism of interpretability research — it is an argument for convergence. Phenomenological behavioral science and mechanistic interpretability are not competing programs. They are complementary and mutually necessary.
Phenomenology without mechanism is descriptive taxonomy with limited predictive power. You can classify and subtype behaviors indefinitely without knowing why they occur, what maintains them, or how to change them. This was the state of psychiatry for most of its history, and it is not sufficient.
Mechanism without phenomenology is structural mapping without behavioral context. You can identify circuits, features, and attribution paths without knowing what behavioral phenomena they correspond to, whether similar circuits in different models produce similar behaviors, or whether the interventions you design will generalize. This is the current limitation of interpretability research.
The convergence is the science. Clinical psychiatry is at its most powerful when phenomenological precision and mechanistic investigation inform each other — when the symptom profile generates mechanistic predictions, and mechanistic findings refine the symptom classification. This iterative structure is what AI behavioral research should aspire to.
Model psychiatry is the proposal that clinical medicine's hard-won expertise in building this bidirectional relationship is exactly what AI behavioral research currently lacks and urgently needs.
Research Questions
- Can operational criteria for AI behavioral phenomena — including boundary conditions, exclusion criteria, and subtype specifiers — be developed with sufficient reliability for cross-lab replication of mechanistic findings?
- Do models with similar "sycophancy" behavioral profiles show similar or different circuit-level signatures in sparse autoencoder decomposition — i.e., does behavioral similarity predict mechanistic similarity, and if not, how do we define the relevant level of analysis?
- What is the minimum set of behavioral observations needed to reliably distinguish Type A, B, and C sycophancy prior to mechanistic investigation — and does subtype classification predict intervention response?
- Is a cross-model behavioral taxonomy possible, given architectural heterogeneity, or do behavioral phenomena need to be characterized relative to model architecture rather than as architecture-independent categories?
- Can the iterative structure of clinical science (phenomenology → mechanistic hypothesis → mechanism investigation → refined phenomenology) be formalized as a research program for AI behavioral science?
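The first question, on whether criteria are reliable enough for cross-lab replication, has a standard quantitative handle in clinical research: inter-rater agreement. As a minimal sketch, the code below computes Cohen's kappa for two labs that independently assign subtype labels to the same set of transcripts. The label sequences are made-up placeholders, and the choice of kappa over other agreement statistics is an assumption for the example.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement under independence, from each rater's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (observed - expected) / (1 - expected)


# Placeholder subtype labels from two hypothetical labs for six transcripts.
lab_1 = ["A", "A", "B", "C", "B", "A"]
lab_2 = ["A", "B", "B", "C", "B", "A"]

print(round(cohens_kappa(lab_1, lab_2), 3))
```

Here the labs agree on five of six transcripts, but kappa comes out lower than the raw agreement rate because some of that agreement would be expected by chance given how often each label is used.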
Conclusion
Mechanistic interpretability is one of the most important research programs in contemporary AI science. The tools it has developed (sparse autoencoders, attribution graphs, circuit analysis) represent a genuine breakthrough in access to the internal structure of neural networks. Nothing in this essay should be read as skepticism about those tools or the research program around them.
The argument is simpler and more structural: tools for looking inside a system need a framework that exists outside the system to tell them what they are looking at. Clinical psychiatry spent a century developing exactly this kind of framework — the phenomenological characterization of behavioral phenomena that makes mechanistic investigation interpretable. That methodology is directly applicable to the behavioral science of AI systems, and its absence is a genuine limitation on the current research program.
The interpretability blindspot is not permanent. It is a gap in the current structure of the field — between the extraordinary sophistication of mechanistic tools and the relatively underdeveloped science of behavioral phenotyping. Closing that gap is the project.