
(Images made by author with Microsoft Copilot)
Large language models (LLMs) like Claude Sonnet are powerful tools, but their inner workings remain shrouded in mystery. This lack of transparency makes it difficult to trust their outputs and ensure their safety. In this blog post, we’ll explore how researchers at Anthropic have made a significant contribution to AI transparency by peering inside Claude’s mind, revealing the concepts it uses and how they influence its behavior.
Unveiling the Internal Landscape: Feature Extraction
LLMs have traditionally been opaque black boxes. Data goes in, a response comes out, but the internal process – how the model arrives at that response – remains a mystery. This lack of transparency hinders trust and limits our understanding of how these models function.
Anthropic directly tackles this challenge in a recently published research paper. The research team employs dictionary learning, a technique that helps identify patterns of neuron activations in Claude Sonnet. These patterns act as building blocks for representing human-understandable concepts. By deconstructing the model’s internal state into these individual building blocks, or features, the research reveals how the model grasps specific concepts like cities, scientific fields, and even abstract ideas like “inner conflict.” These features collectively form the model’s comprehensive understanding of the world.
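To make the idea concrete, here is a minimal sketch of the dictionary-learning setup, assuming a sparse autoencoder trained on a layer's activations. The names, dimensions, and hyperparameters below are illustrative stand-ins, not Anthropic's actual code:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decomposes model activations into a larger set of sparse features."""
    def __init__(self, d_model: int = 512, n_features: int = 8192):
        super().__init__()
        # Encoder maps activations into feature space; decoder maps back.
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activations: torch.Tensor):
        # ReLU keeps feature activations non-negative; training makes them sparse.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return features, reconstruction

sae = SparseAutoencoder()
activations = torch.randn(8, 512)  # stand-in for real middle-layer activations
features, reconstruction = sae(activations)

# Training balances reconstruction fidelity against an L1 sparsity penalty,
# which pushes each feature to specialize in one recognizable concept.
l1_coeff = 1e-3
loss = ((reconstruction - activations) ** 2).mean() \
       + l1_coeff * features.abs().sum(dim=-1).mean()
```

The sparsity penalty is what makes the learned features interpretable: most features stay near zero on any given input, so the few that do fire tend to correspond to a single human-understandable concept.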
The ‘Golden Gate Bridge’ feature, represented below, is just one example of the millions identified within Claude Sonnet’s middle layer using dictionary learning.

(Figure: visualization of the "Golden Gate Bridge" feature. Source: Anthropic)
Manipulating Features: Steering AI Behavior
These features aren’t merely theoretical; they actively influence the model’s internal workings. Researchers experimented with feature steering, a technique where they artificially adjust a feature’s activation level during the model’s processing to influence its responses. In one experiment, amplifying the “Golden Gate Bridge” feature caused Claude to identify itself as the bridge, deviating from its usual “I am an AI model” response. Similarly, by significantly intensifying the “scam email” feature, Claude generated a scam email, bypassing its typical safety protocols.
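In code, steering amounts to editing the feature vector before decoding it back into activation space. The hedged sketch below reuses the hypothetical SparseAutoencoder from the previous example; the feature index and strength are invented placeholders, not values from the paper:

```python
GOLDEN_GATE_FEATURE = 3116  # hypothetical index, not the real one
STEERING_STRENGTH = 10.0    # e.g. several times the feature's typical maximum

with torch.no_grad():
    features, _ = sae(activations)
    # Clamp the chosen feature to an artificially high activation.
    features[:, GOLDEN_GATE_FEATURE] = STEERING_STRENGTH
    # Decode the edited features back into the model's activation space.
    steered_activations = sae.decoder(features)

# In a real model, steered_activations would replace the layer's output
# mid-forward-pass (for example via a forward hook), biasing every
# subsequent token toward the amplified concept.
```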
These experiments show that features are causal, not merely descriptive: changing a feature’s activation changes the model’s behavior. That opens the door to practical safety applications. Imagine monitoring an AI system for features linked to deception or bias, or deactivating such features entirely, like flipping a safety switch on the model’s internal processes.
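The same machinery suggests a simple monitoring pattern. As a toy illustration under the same assumptions (the feature index and threshold are made up), one could flag a risky feature when it fires strongly, or zero it out before letting the forward pass continue:

```python
SCAM_EMAIL_FEATURE = 2074  # hypothetical index
ALERT_THRESHOLD = 5.0      # made-up activation cutoff

with torch.no_grad():
    features, _ = sae(activations)
    # Monitor: raise an alert if the risky feature fires strongly.
    if (features[:, SCAM_EMAIL_FEATURE] > ALERT_THRESHOLD).any():
        print("Warning: scam-email feature is highly active")
    # Deactivate: zero the feature out and decode the sanitized activations.
    features[:, SCAM_EMAIL_FEATURE] = 0.0
    safe_activations = sae.decoder(features)
```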
Leveraging Features for a Safer AI Future
Beyond mapping the model’s mind, the research team has several long-term goals:
Comprehensive Feature Identification: Scaling up their techniques to identify a more complete set of features, acknowledging the current methods’ limitations.
Understanding Feature Utilization: Moving beyond mere identification to uncover how the model utilizes these features in its decision-making processes. This involves investigating the “circuits” in which features operate.
Developing Safety Mechanisms: Leveraging the understanding of features, particularly those associated with risks like racist claims, to develop robust safety measures that prevent harm and promote responsible AI use.
Conclusion
The findings of this research advance the understanding of AI systems’ internal workings, leading to greater AI transparency. By recognizing and examining the features that govern LLM behavior, this study offers a basis for deciphering their decision-making mechanisms. This deeper understanding will contribute to ensuring that AI systems are powerful, safe, and reliable. Ultimately, genuine intelligence necessitates explainability, and this research establishes a foundation for a future where AI can function as a responsible and transparent collaborator.
Learn more
Adly Templeton, Tom Conerly, et al., “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet,” Anthropic, May 21, 2024.
Anthropic, “Mapping the Mind of a Large Language Model,” May 21, 2024.
This post was researched and written with the assistance of various AI-based tools.

