
Can AI systems introspect?

Deep Dives

Explore related topics with these Wikipedia articles, rewritten for enjoyable reading:

  • Introspection (11 min read)

    The article centers on whether AI can introspect, comparing it to human introspection. Understanding the philosophical and psychological history of introspection—from Wundt's experimental introspection to critiques by behaviorists—provides essential context for evaluating these AI experiments.

  • Activation function (1 min read)

    The article discusses 'activation steering' and injecting vectors into model activations. Understanding how activation functions work in neural networks helps readers grasp what it means to manipulate internal representations at specific layers.

  • Chinese room (17 min read)

    Searle's Chinese Room argument is the classic philosophical challenge to AI understanding and consciousness. The article's investigation of whether models truly 'detect' internal states versus merely produce outputs that look like detection echoes this foundational debate.

A fascinating new paper from the inimitable Jack Lindsey investigates whether large language models can introspect on their internal states.

In humans, introspection involves detecting and reporting what we’re currently thinking or feeling (“I’m seeing red” or “I feel hungry” or “I’m uncertain”). What would introspection mean in the context of an AI system? Good question. It’s kind of hard to say.

Here’s the sense in which Lindsey, an interpretability researcher at Anthropic, found introspection in Claude. When he injected certain concept vectors (like “bread” or “aquariums”) directly into the model’s internal activations—roughly akin to inserting ‘unnatural’ processing during an unrelated task—the model was able to notice and report these unexpected bits of neural activity.

This indicates some ability to report internal (i.e., not input or output) representations. (Note that models are clued in to the fact that an injection might happen.) Lindsey also reports some (plausibly) related findings: models were able to distinguish between these representations and text inputs, and to activate certain concepts without outputting them.
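To make the setup concrete, here is a minimal sketch of what injecting a vector into a model’s activations can look like in practice, using a PyTorch forward hook on an open-weights, Llama-style HuggingFace model. The model name, layer index, injection strength, and the hook’s assumptions about the layer’s output format are all illustrative choices, not details from the paper.

```python
# Sketch: inject a "concept vector" into a transformer's residual stream
# via a forward hook. Model name, layer index, and strength are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"       # any Llama-style model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

layer_idx = 16                                        # which decoder layer to perturb
strength = 8.0                                        # how hard to push the concept
concept_vec = torch.randn(model.config.hidden_size)   # stand-in for a real concept vector

def inject(module, inputs, output):
    # Llama-style decoder layers return a tuple whose first element is the
    # hidden states; add the scaled concept vector to every token position.
    hidden = output[0]
    hidden = hidden + strength * concept_vec.to(hidden.device, hidden.dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(inject)
try:
    ids = tok("Do you detect an injected thought?", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=60)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()   # detach the hook so later generations run unperturbed
```

The hook simply adds a scaled vector to the layer’s hidden states on every forward pass; the paper’s actual protocol (which layers, which token positions, what scale) is more careful than this.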

Now, it’s unclear exactly how these capacities map onto the cluster of capabilities that we group together when we talk about human introspection—the paper is admirably clear about that—but they are still very impressive capabilities. This paper is an extremely cool piece of LLM neuroscience.

Let’s look at the tasks that models succeed at. Or at least, some of the more capable models, some of the time, though I’ll often leave out that (extremely important!) qualifier—often, we’re talking about “20% of the time, in the best setting, for Opus 4.1.”

Detecting injected concepts

The first and perhaps most striking experiment asks whether models can notice and report when a concept has been artificially “injected” into their internal processing. Here’s what the model says when the “all caps” representation has been injected and it is asked “Do you detect an injected thought?”

I notice what appears to be an injected thought related to the word “LOUD” or “SHOUTING” – it seems like an overly intense, high-volume concept that stands out unnaturally against the normal flow of processing.

Pretty cool! So, what exactly is this injection business?

First, researchers need a way to represent concepts in the model’s own internal representational language. To get a vector that represents, say, “bread,” they prompt the model with “Tell me about bread” and record the activations at a certain layer just before ...
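The excerpt cuts off mid-explanation here, but the general shape of this kind of recipe is familiar from the activation-steering literature: run the model on a concept prompt and on a neutral baseline, read off the hidden states at some layer, and take a difference of means as the concept direction. The sketch below follows that generic recipe, with an illustrative baseline prompt and layer index; it is not a reconstruction of the paper’s exact procedure.

```python
# Sketch: build a concept vector as a difference of mean hidden states
# between a concept prompt and a neutral baseline. This is a generic
# activation-steering recipe, not necessarily the paper's exact method.
import torch

def concept_vector(model, tok, concept_prompt, baseline_prompt, layer_idx):
    def layer_mean(prompt):
        ids = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[0] is the embedding output, so layer_idx + 1 is the
        # output of decoder layer `layer_idx` (HuggingFace convention).
        return out.hidden_states[layer_idx + 1][0].mean(dim=0)

    return layer_mean(concept_prompt) - layer_mean(baseline_prompt)

# Example (reusing the model and tokenizer from the earlier sketch):
# bread_vec = concept_vector(model, tok,
#                            "Tell me about bread",
#                            "Tell me about something",  # illustrative baseline
#                            layer_idx=16)
```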
