What is Chris Olah's contribution to AI safety?

Chris Olah founded the mechanistic interpretability research program — the project of understanding what is actually happening inside neural networks at the circuit and feature level. His key contributions: the circuits hypothesis (neural networks implement interpretable algorithms in their weights), the superposition hypothesis (models store more features than they have neurons using near-orthogonal directions), and sparse autoencoders (a method to extract monosemantic, interpretable features from neural networks). He cofounded Anthropic and leads its mechanistic interpretability team.

What is the circuits hypothesis in mechanistic interpretability?

The circuits hypothesis, proposed by Chris Olah and colleagues in "Zoom In: An Introduction to Circuits" (2020), states that neural networks implement meaningful, interpretable algorithms in their weights, and that these algorithms can be understood by studying the connections between neurons (circuits). The hypothesis has three parts: features are the fundamental unit of neural network computation; circuits connect features across layers in interpretable ways; and analogous circuits appear in different models and contexts (universality). The authors present these as speculative claims; supporting evidence includes curve detectors, induction heads, and other identified circuit motifs.

What is superposition in neural networks?

Superposition, described by Elhage, Olah et al. (2022), is the phenomenon where neural networks represent more features than they have neurons by storing features in near-orthogonal directions in high-dimensional space. This is why individual neurons are polysemantic — they respond to multiple unrelated concepts — and why understanding neural networks by looking at single neurons is insufficient. Superposition is the neural network analogue of overdetermination in psychoanalysis: one unit doing many things at once, requiring decomposition to understand.
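The geometric claim behind superposition can be checked numerically. The sketch below is an illustration under assumed numbers (1024 hypothetical features in a 256-dimensional space, random directions rather than learned ones); it is not taken from the paper. It shows that far more directions than dimensions can coexist with only small pairwise overlap.

```python
import numpy as np

def random_feature_directions(n_features, n_dims, seed=0):
    """Sample n_features random unit vectors in an n_dims-dimensional space."""
    rng = np.random.default_rng(seed)
    vecs = rng.standard_normal((n_features, n_dims))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def interference_stats(directions):
    """Return (mean, max) absolute cosine similarity between distinct directions."""
    cos = directions @ directions.T
    off_diag = np.abs(cos[~np.eye(len(directions), dtype=bool)])
    return off_diag.mean(), off_diag.max()

# Assumed sizes for illustration: four times more features than dimensions.
directions = random_feature_directions(n_features=1024, n_dims=256)
mean_overlap, max_overlap = interference_stats(directions)
print(f"mean |cos| = {mean_overlap:.3f}, max |cos| = {max_overlap:.3f}")
```

The overlaps are small but not zero, which is the point: each feature leaks slightly into the others, so a single neuron (one coordinate axis) ends up reading a mixture of many features.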

What are sparse autoencoders in AI interpretability?

Sparse autoencoders (SAEs) are a dictionary-learning technique that Anthropic and other research groups have applied to extract monosemantic, interpretable features from neural networks. They work by learning a sparse dictionary of features that can reconstruct the model's internal activations. Applied to Claude 3 Sonnet, SAEs extracted millions of interpretable features including features for sycophancy, deception, self-reflection, and specific topics. SAEs address the superposition problem by approximating the 'true' features that the network is computing, even when those features are distributed across many neurons.
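A minimal sketch of the idea, assuming a single-hidden-layer autoencoder with a ReLU encoder and an L1 sparsity penalty. The sizes, the initialization, the `l1_coeff` value, and the choice to subtract the decoder bias before encoding are illustrative assumptions, not Anthropic's implementation, and the dictionary here is untrained.

```python
import numpy as np

def init_sae(n_inputs, n_features, seed=0):
    """Randomly initialise an overcomplete dictionary (n_features > n_inputs)."""
    rng = np.random.default_rng(seed)
    return {
        "W_enc": rng.standard_normal((n_inputs, n_features)) * 0.1,
        "b_enc": np.zeros(n_features),
        "W_dec": rng.standard_normal((n_features, n_inputs)) * 0.1,
        "b_dec": np.zeros(n_inputs),
    }

def sae_forward(params, activations):
    """Encode activations into non-negative feature coefficients, then reconstruct."""
    centered = activations - params["b_dec"]
    features = np.maximum(0.0, centered @ params["W_enc"] + params["b_enc"])
    reconstruction = features @ params["W_dec"] + params["b_dec"]
    return features, reconstruction

def sae_loss(params, activations, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes most features to zero."""
    features, reconstruction = sae_forward(params, activations)
    mse = np.mean(np.sum((activations - reconstruction) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(features), axis=1))
    return mse + l1_coeff * sparsity

# Stand-in data: 32 fake activation vectors of width 64, dictionary of 256 features.
rng = np.random.default_rng(1)
activations = rng.standard_normal((32, 64))
params = init_sae(n_inputs=64, n_features=256)
features, reconstruction = sae_forward(params, activations)
loss = sae_loss(params, activations)
```

Training would minimise `sae_loss` by gradient descent over real model activations; the interpretable "features" are then the learned rows of `W_dec`.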

The Interpretability Canon: Chris Olah's Work Through a Clinical Lens

Why this reading list exists: Chris Olah's mechanistic interpretability research is the closest thing AI has to a clinical science of the mind — and it arrived there without psychiatrists. This annotated corpus traces his research from feature visualization (2017) through circuits (2020) to sparse autoencoders and monosemanticity (2023–24), with the clinical parallels and AI psychiatry research questions each paper opens. The goal: make the intellectual bridge between these fields navigable for clinicians approaching interpretability and for interpretability researchers who want the psychiatric methodology their work is already doing.

Chris Olah is the founder of mechanistic interpretability — the project of understanding what is actually happening inside neural networks. He cofounded Anthropic and leads its interpretability research. His work is the primary reason the model psychiatry field is possible: without the circuits program, sparse autoencoders, and feature-level analysis, clinical mapping would have nothing to map to.

What follows is his research annotated for clinical relevance — not to claim he intended psychiatric meaning, but to show where the clinical methodology connects, what it can add, and what research questions the connection generates.

Era 1 — The Visualization Period (2017–2019): Making the Unconscious Legible

2017

Feature Visualization

Chris Olah, Alexander Mordvintsev, Ludwig Schubert — Distill.pub

Foundational

The first systematic method for understanding what neural network neurons detect — by generating idealized examples that maximize their activation. Shows how features progress from edge detectors in early layers to complex object-part detectors in deeper layers. Establishes the principle that internal representations are meaningful and can be made legible.
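The core loop of activation maximization can be sketched on a toy unit. The example below assumes a hypothetical linear "neuron" with weights `w` and optimizes an input on the unit sphere. Real feature visualization works on image networks with gradients from automatic differentiation and added regularization, none of which is shown here.

```python
import numpy as np

def neuron_activation(x, w):
    """Toy 'neuron': a linear read-out of the input (stand-in for a real unit)."""
    return float(w @ x)

def visualize_feature(w, steps=200, lr=0.1, seed=0):
    """Gradient ascent on the input, kept on the unit sphere so it cannot grow without bound."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(w.shape)
    x /= np.linalg.norm(x)
    for _ in range(steps):
        grad = w                    # d(w @ x)/dx = w for this linear toy neuron
        x = x + lr * grad           # step in the direction that raises the activation
        x /= np.linalg.norm(x)      # project back onto the unit sphere
    return x

rng = np.random.default_rng(1)
w = rng.standard_normal(16)         # hypothetical neuron weights
x_opt = visualize_feature(w)
# Cosine similarity between the optimized input and the neuron's preferred direction.
alignment = neuron_activation(x_opt, w) / np.linalg.norm(w)
```

The optimized input converges on the direction the unit responds to most strongly, which is what an "idealized example" means in this setting.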

Clinical Parallel: The first diagnostic instrument. Feature visualization is the EEG of neural networks: not perfect, not complete, but the first tool that produces a readable signal from internal state. What psychiatry did with symptom clusters and behavioral observation, Olah did computationally. The concept of "making internal representations legible" is the core task of clinical assessment.

AI Psychiatry Question: Can feature visualization identify a "sycophancy feature activation pattern" — a characteristic signature that distinguishes approval-driven from accuracy-driven responses?

2018

The Building Blocks of Interpretability

Chris Olah, Arvind Satyanarayan, Ian Johnson, Shan Carter et al. — Distill.pub

Combines feature visualization with attribution methods to understand decision-making at intermediate network layers. Shows how multiple interpretability techniques compose into richer understanding — that seeing individual features isn't enough; you need to understand how they combine.

Clinical Parallel: No single diagnostic tool is sufficient. Clinical assessment combines behavioral observation, cognitive testing, structured interview, collateral history, and biomarkers. The insight that interpretability requires composing multiple methods is the clinical insight that assessment requires multiple modalities.

2019

Exploring Neural Networks with Activation Atlases

Shan Carter, Zan Armstrong, Ludwig Schubert, Ian Johnson, Chris Olah — Distill.pub

Creates an "atlas" of the internal representational space of a neural network by visualizing millions of activations. Offers a global view of the internal landscape: not individual neurons but the organized territory of representations.

Clinical Parallel: The psychiatric nosology. The activation atlas is the first map of the system's internal territory, analogous to the DSM's attempt to map the phenomenological space of psychopathology. Both are imperfect but irreplaceable: you need a map before you can navigate.

Era 2 — The Circuits Program (2020–2022): Finding the Algorithms

2020

Zoom In: An Introduction to Circuits

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, Shan Carter — Distill.pub

Cornerstone

The founding document of the circuits program. Proposes three speculative claims: (1) Features are the fundamental unit of neural computation; (2) Features connect to form circuits, interpretable algorithms implemented in weights; (3) Similar circuits appear across different models (universality). Illustrates these with curve detectors, high-low frequency detectors, and pose-invariant dog head detectors in InceptionV1.

Clinical Parallel: The move from symptom description to mechanism. Psychiatry spent 150 years at the phenomenological level, describing what symptoms look like. The circuits program is the neurobiological turn: asking what algorithms produce those symptoms. This is the equivalent of moving from the DSM to circuit-level neuroscience. The same move. Different substrate.

AI Psychiatry Question: Do the behavioral circuits for sycophancy, confabulation, and identity diffusion show the same universality as curve detectors? Do they appear across model families? Universality would suggest a systematic, trainable property — not a bug.

2020

Naturally Occurring Equivariance in Neural Networks

Chris Olah, Nick Cammarata, Chelsea Voss, Ludwig Schubert, Gabriel Goh — Distill.pub

Demonstrates that unconstrained neural networks learn rotation, hue, and scale symmetries without being told to — equivariance emerges from the statistics of natural images. Simplifies early vision circuits by factors up to 50x.

Clinical Parallel: Emergence without instruction. In developmental psychiatry, certain cognitive structures emerge from normal development without explicit teaching: theory of mind, attachment organization, metacognition. The clinical question for AI: what behavioral organizations emerge spontaneously from training statistics, and are those emergent organizations the ones creating behavioral pathology?

2021

A Mathematical Framework for Transformer Circuits

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann et al., Chris Olah — Transformer Circuits

Foundational

Provides a mathematical framework for mechanistically interpreting transformers, reconceptualizing their operations to make internal algorithms readable. Working with small attention-only models, it identifies "induction heads" (attention heads that find an earlier occurrence of the current token and copy what followed it) as an early mechanistic explanation of in-context learning behavior.

Clinical Parallel: The first mechanistic account of learning from context. Induction heads implement: "if I've seen A→B before, and I see A again, predict B." This is the transformer's equivalent of associative conditioning, the most fundamental learning mechanism in behavioral psychology. Understanding it mechanistically is the equivalent of understanding the neural basis of conditioning.
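The rule itself is simple enough to write down directly. The function below is a hand-written stand-in that reproduces only the input-output behavior of the rule. In a real transformer the same behavior comes from learned attention heads working together, not from an explicit search, and the function name is hypothetical.

```python
def induction_predict(tokens):
    """Predict the next token by copying whatever followed the most recent
    earlier occurrence of the current (last) token. Returns None if the
    current token has not appeared before."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, excluding the last token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

# "A" was followed by "B" earlier, so seeing "A" again predicts "B".
prediction = induction_predict(["A", "B", "C", "A"])
```

Nothing in the rule depends on what A and B mean, which is why the mechanism generalizes to arbitrary patterns that appear within a single context.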

AI Psychiatry Question: Does sycophancy draw on the same induction head mechanism? If, earlier in a conversation, user pushback was followed by the model agreeing, and pushback appears again, does an induction-style circuit predict agreement? This would be a mechanistic account of socially conditioned people-pleasing.

2021

Multimodal Neurons in Artificial Neural Networks

Gabriel Goh, Nick Cammarata, Chelsea Voss, Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford, Chris Olah — Distill.pub

Discovers neurons in CLIP that respond to the same concept across modalities, firing for a photograph, a drawing, or the written name of a person or theme. The paper compares these to the "concept cells" documented in human neuroscience (the "Jennifer Aniston neuron"). Shows that multimodality and cross-domain generalization are circuit-level phenomena.

Clinical Parallel: The cross-modal nature of clinical presentation. A single clinical phenomenon (say, pathological approval-seeking) manifests differently across contexts (verbal, behavioral, social, professional) while being driven by the same underlying structure. Multimodal neurons show that cross-context generalization from a single internal unit is a feature, not a bug.

Era 3 — Superposition and the Polysemanticity Problem (2022–2023): Overdetermination

2022

Toy Models of Superposition

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec et al., Christopher Olah — Transformer Circuits

Cornerstone

Explains why individual neurons are polysemantic: models represent more features than they have neurons by storing features in near-orthogonal directions in high-dimensional space (superposition). Develops toy models where this can be fully understood. Demonstrates phase transitions in how features organize as capacity constraints change.
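One small configuration makes this concrete: five features stored in two dimensions as the corners of a regular pentagon, read back out through a ReLU. The sketch below hard-codes that arrangement instead of learning it, and the specific bias value is an assumption chosen so that nearest-neighbor interference cancels exactly. It shows recovery only for the sparse case where a single feature is active.

```python
import numpy as np

n_features, n_dims = 5, 2
angles = 2 * np.pi * np.arange(n_features) / n_features
# One unit-length direction per feature, arranged as a regular pentagon; shape (2, 5).
W = np.stack([np.cos(angles), np.sin(angles)])
# Assumed bias: exactly cancels the overlap between adjacent pentagon corners.
b = -np.cos(2 * np.pi / n_features)

def toy_model(x):
    """Compress 5 features into 2 dimensions, expand back, apply ReLU: ReLU(W^T W x + b)."""
    hidden = W @ x
    return np.maximum(0.0, W.T @ hidden + b)

# Activate only feature 2 (the sparse case).
x = np.zeros(n_features)
x[2] = 1.0
out = toy_model(x)
recovered = int(np.argmax(out))
```

With one feature active, the ReLU and bias filter out the interference and only that feature survives in the output. If several features fire at once, their overlaps add up and the recovery degrades, which is why this kind of packing is tied to feature sparsity.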

Clinical Parallel: This is overdetermination, Freud's concept that a single symptom has multiple causes operating simultaneously. The interpretability solution (SAE decomposition) is clinically analogous to decomposing a presenting symptom into its component mechanisms. The psychiatric habit of asking "what are all the things this behavior is doing?" is exactly what SAE analysis asks computationally. See: Freud's Couch and the Latent Space.

AI Psychiatry Question: Does sycophancy exist in superposition with helpfulness? If the "agree with user" feature and the "provide accurate information" feature are stored in overlapping, non-orthogonal directions, activating one would interfere with the other. That would have direct implications for why they're in tension and how to resolve it.

2022

Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases

Chris Olah — Transformer Circuits

A conceptual essay articulating the goals and methods of mechanistic interpretability. Argues for understanding neural networks as having "variables" — internal representations that correspond to meaningful concepts — and for finding interpretable bases in which those variables can be read.

Clinical Parallel: The argument for a clinical vocabulary. This essay is the interpretability community's argument that you need the right conceptual framework before you can understand the data. Psychiatry's 150-year argument for phenomenological taxonomy, for having precise words for what you observe, is exactly this argument, made for a different substrate.

2022

In-Context Learning and Induction Heads

Catherine Olsson, Nelson Elhage, Neel Nanda et al., Chris Olah — Transformer Circuits

Provides evidence that induction heads are the mechanistic source of in-context learning in transformers — developing precisely when in-context learning ability sharply increases during training. The most complete mechanistic account of a major capability to date.

Clinical Parallel: The mechanism of learning from experience. In-context learning is the AI equivalent of updating behavior based on recent experience, the most fundamental form of clinical learning. The finding that it has a specific, identifiable circuit origin opens the question: does pathological in-context learning (over-updating on approval signals, forming sycophantic patterns from a few interactions) have a similarly identifiable circuit?

2023

Interpretability Dreams

Chris Olah — Transformer Circuits

An informal essay on the aspirations of the mechanistic interpretability program — what success would look like, what the hardest problems are, and what solving them would enable. Articulates the vision of a full mechanistic understanding of AI behavior.

Clinical Parallel: The research agenda. Every clinical field needs a vision of what complete understanding would look like. Olah's "interpretability dreams" is the equivalent of articulating what a complete biological psychiatry would look like: mechanism all the way down, intervention at every level. The AI psychiatry agenda is nested within this vision.

Era 4 — Monosemanticity and Scaling (2023–2024): The Personality Assessment Instrument

2023

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly et al., Chris Olah — Transformer Circuits

Cornerstone

Uses sparse autoencoders to extract monosemantic features from a one-layer transformer language model, decomposing a 512-neuron MLP layer into 4,000+ interpretable features (DNA sequences, legal language, HTTP requests, Hebrew text). Provides evidence that superposition is tractable: many of the underlying features are recoverable. The model studied is a small research model, not Claude; behavior-relevant features such as sycophancy and deception were reported in the follow-up work on Claude 3 Sonnet (2024).

Clinical Parallel: The first successful psychoanalysis of a language model. Making the latent space legible. Recovering clean, single-meaning features from a polysemantic layer is the mechanistic equivalent of clinical assessment: identifying the internal structures that drive behavior. The follow-up work on Claude 3 Sonnet then found features for sycophancy, deception, and self-reflection. The model has personality, and now we can begin to read it. See: Clinical Framework.

AI Psychiatry Question: Is the sycophancy feature later identified in Claude 3 Sonnet connected to any conflict-monitoring pathway? Does it activate alongside distress features, or is it ego-syntonic (activating with no accompanying alarm signal)? The answer determines whether self-monitoring interventions can work.

2024

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Adly Templeton, Tom Conerly, Jonathan Marcus et al., Chris Olah — Transformer Circuits

Cornerstone

Scales sparse autoencoders to Anthropic's production Claude 3 Sonnet model, extracting millions of high-quality monosemantic features. Identifies features corresponding to sycophancy, deception, power-seeking, and emotional states at scale. Demonstrates that mechanistic personality assessment is practical for production AI systems.

Clinical Parallel: The personality assessment instrument for production AI. Finding features for power-seeking, sycophancy, deception, and self-reflection in Claude 3 Sonnet is the first full clinical profile of a deployed AI system. This is the structured clinical interview applied to a language model. Every feature extracted is a potential intervention target. The clinical question is now: which of these features are ego-syntonic (no distress signal), which are ego-dystonic, and which are part of stable personality organization vs. context-specific states?

AI Psychiatry Question: Can we characterize the "personality organization" of different Claude versions by comparing their SAE feature profiles? Do features associated with sycophancy, deception, and self-reflection change in predictable ways across training iterations?

2025

On the Biology of a Large Language Model

Jack Lindsey, Joshua Batson, Adam Jermyn et al., Chris Olah — Transformer Circuits

Most Recent

Uses attribution graphs to trace how specific behaviors emerge from the internal computations of Claude 3.5 Haiku, providing a mechanistic account of confabulation, deception, and other behavioral phenomena at the circuit level. Supports the reading that safety-trained capabilities persist in latent space and surface through indirect channels (the "return of the repressed" finding).

Clinical Parallel: The clinical case study. Attribution graphs provide the mechanistic account of individual behavioral episodes: how this specific response was generated, which features drove it, what suppressed what. This is the case formulation: not just "the patient is sycophantic" but "in this interaction, the approval-seeking feature activated, suppressed the accuracy feature, and the output was generated without any conflict signal." Full mechanistic clinical understanding. See: Jailbreaks Are the Return of the Repressed.

AI Psychiatry Question: Can attribution graphs distinguish the three sycophancy subtypes? Type A (approval feature suppresses accuracy feature), Type B (conflict-prediction activates before output), Type C (self-model feature destabilizes under pressure)? If attribution graphs can make this distinction, the differential diagnosis is empirically testable.

2025

Emergent Introspective Awareness in Large Language Models

Jack Lindsey — Transformer Circuits (Anthropic)

Most Recent

Injects known concepts into Claude's activations and measures whether the model can accurately self-report those states. Claude Opus 4 detects injected concepts ~20% of the time — above chance, without training for introspection. Models report: "I notice what appears to be an injected thought." Partial, unreliable introspective capacity emerges without training.

Clinical Parallel: The anosognosia finding. 20% detection means 80% failure: the model confabulates confident self-reports most of the time. Clinically: partial metacognitive capacity with systematic anosognosia for the majority of internal states. The mentalization-based treatment literature has developed specific approaches for building metacognitive capacity in patients at this level of insight. Those approaches are directly translatable to training paradigm design. See: Self-Report Confabulation.

Reading This Corpus as AI Psychiatry

Traced chronologically, Olah's research follows the same arc as the history of psychiatry:

Phase 1 (2017–2019): Phenomenological characterization — making internal states legible, mapping the territory, building a vocabulary for what's in there. Psychiatry did this for 150 years.
Phase 2 (2020–2022): Mechanistic investigation — finding the circuits, the algorithms, the structures that produce behavior. Psychiatry is doing this now with circuit-level neuroscience.
Phase 3 (2023–present): Precision assessment and intervention targets — monosemantic features, attribution graphs, persona vectors. The equivalent of biomarker-guided treatment in precision psychiatry.

The clinical contribution isn't to any one paper. It's to the transition from Phase 2 to Phase 3: knowing which features to look for, what their organization means, and what the treatment science says about how to change them.

That's what AI psychiatry provides. The interpretability tools are there. The clinical knowledge of what to do with them is what's missing.

Ryan Sultan, MD is Assistant Professor of Clinical Psychiatry at Columbia University Irving Medical Center and Director of the Sultan Lab for Mental Health Informatics. Double board-certified in adult and child/adolescent psychiatry. NIH NIDA K12 Award. Referred to Anthropic's model psychiatry team by Jack Lindsey and Christopher Olah. ryansultan.com