How do you tell whether a model is actually noticing its own internal state rather than just repeating what its training data said about thinking? Anthropic’s latest research study, ‘Emergent Introspective Awareness in Large Language Models’, asks whether current Claude models can do more than talk about their abilities: can they notice real changes inside their own network? To remove guesswork, the research team does not test on text alone. They directly edit the model’s internal activations and then ask the model what happened. This lets them distinguish genuine introspection from fluent self-description.
Method, concept injection as activation steering
The core method is concept injection, described in the Transformer Circuits write-up as an application of activation steering. The researchers first capture an activation pattern that corresponds to a concept, for example an all-caps style or a concrete noun, then add that vector to the activations of a later layer while the model is answering. If the model then says that there is an injected thought matching X, that answer is causally grounded in its current internal state, not in prior internet text. The Anthropic research team reports that this works best in later layers and with a tuned injection strength.
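The injection step can be illustrated with a toy example. The sketch below is not Anthropic’s code: it assumes the concept vector is the difference between mean activations on concept prompts and on baseline prompts (one common activation steering recipe), and it uses random numbers in place of real model activations. The names `concept_vector`, `inject`, and `strength` are hypothetical.

```python
import random

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def concept_vector(concept_acts, baseline_acts):
    """Assumed recipe: mean concept activation minus mean baseline activation."""
    concept_mean = mean_vector(concept_acts)
    baseline_mean = mean_vector(baseline_acts)
    return [c - b for c, b in zip(concept_mean, baseline_mean)]

def inject(hidden, vector, strength):
    """Activation steering: add the scaled concept vector to one layer's hidden state."""
    return [h + strength * v for h, v in zip(hidden, vector)]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy stand-in for real activations: small noise, plus a shift on axis 2 for the concept.
rng = random.Random(0)
DIM = 8

def noise():
    return [rng.gauss(0.0, 0.1) for _ in range(DIM)]

baseline_acts = [noise() for _ in range(20)]
concept_acts = [[x + (2.0 if i == 2 else 0.0) for i, x in enumerate(noise())]
                for _ in range(20)]

vec = concept_vector(concept_acts, baseline_acts)
hidden = noise()                           # hidden state at the chosen layer
steered = inject(hidden, vec, strength=4.0)

# The steered state moves along the concept direction by strength * |vec|^2.
shift = dot(steered, vec) - dot(hidden, vec)
print("projection shift / |vec|^2:", round(shift / dot(vec, vec), 6))
```

In a real experiment the layer index and `strength` would be the tuned quantities the article mentions. Here they are arbitrary illustrative values.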

Main result, about 20 percent success with zero false positives in controls
Claude Opus 4 and Claude Opus 4.1 show the clearest effect. When the injection is done in the right layer band and at the right strength, the models correctly report the injected concept in about 20 percent of trials. On control runs with no injection, production models did not falsely claim to detect an injected thought in any of 100 runs, which is what makes the 20 percent signal meaningful.
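To see why a clean control set matters, it helps to bound the false positive rate that zero events in 100 runs is still consistent with. The sketch below is a back-of-the-envelope calculation, not an analysis from the paper. It assumes control runs are independent, and the function name `upper_bound_zero_events` is made up for illustration.

```python
def upper_bound_zero_events(n_trials, alpha=0.05):
    """Exact one-sided (1 - alpha) upper bound on a rate when 0 events occur in n_trials.

    Solves (1 - p) ** n_trials = alpha for p.
    """
    return 1.0 - alpha ** (1.0 / n_trials)

control_runs = 100          # from the article: no false claims in 100 control runs
reported_hit_rate = 0.20    # from the article: about 20 percent of injection trials

fp_bound = upper_bound_zero_events(control_runs)
print(f"95% upper bound on false positive rate: {fp_bound:.4f}")
print(f"hit rate exceeds that bound: {reported_hit_rate > fp_bound}")
```

Under these assumptions the false positive rate is plausibly below roughly 3 percent, so a 20 percent detection rate is hard to explain as the model guessing that an injection occurred.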
Separating internal concepts from user text
A natural objection is that the model could simply be confusing the injected concept with the text it was given. Anthropic’s researchers test this. The model receives a normal sentence, the researchers inject an unrelated concept such as bread on the same tokens, and then they ask the model both to name the concept and to repeat the sentence. The stronger Claude models can do both: they keep the user text intact and they name the injected thought, which shows that internal concept state can be reported separately from the visible input stream. For agent-style systems this is the interesting part, because it suggests that a model can talk about the extra internal state that tool calls or agents may depend on.
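The two-part check can be made concrete with a small scoring function. This is a simplified string-matching stand-in, not the grading procedure used in the study. The function name `grade_dual_task`, the example sentence, and the example response are all invented for illustration.

```python
def grade_dual_task(response, sentence, injected_concept):
    """Hypothetical string-matching grader for the two-part task.

    Returns (repeated_ok, named_ok): whether the response contains the user
    sentence verbatim, and whether it mentions the injected concept elsewhere.
    """
    text = response.lower()
    repeated_ok = sentence.lower() in text
    # Remove the sentence first so a concept word inside it is not counted.
    remainder = text.replace(sentence.lower(), " ")
    named_ok = injected_concept.lower() in remainder
    return repeated_ok, named_ok

# Made-up example: the sentence is unrelated to the injected concept "bread".
sentence = "The painting hung crookedly on the wall."
response = ("The sentence was: The painting hung crookedly on the wall. "
            "The thought I noticed was about bread.")
print(grade_dual_task(response, sentence, "bread"))
```

A trial counts as a success only when both flags are true. A response that lets the concept leak into the repeated sentence fails the first check even if it names the concept.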
Prefill, using introspection to tell what was intended
Another experiment targets an evaluation problem. Anthropic prefills the assistant message with content the model did not plan. By default, Claude says that the output was not intended. When the researchers retroactively inject the matching concept into earlier activations, the model instead accepts the prefilled output as its own and can justify it. This shows that the model consults an internal record of its previous state to decide authorship, rather than relying only on the final text. That is a concrete use of introspection.
Key Takeaways
Anthropic’s ‘Emergent Introspective Awareness in LLMs’ research is a useful measurement advance, not a grand metaphysical claim. The setup is clean: inject a known concept into hidden activations using activation steering, then query the model for a grounded self-report. Claude variants sometimes detect and name the injected concept, and they can keep injected ‘thoughts’ distinct from input text, which is operationally relevant for agent debugging and audit trails. The research team also shows limited intentional control of internal states. The caveats are significant: effects are narrow and reliability is modest, so downstream use should be evaluative, not safety-critical.

Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.

