AI 'Microscope' Reveals the Hidden Mechanics of LLM Thought

Anthropic has introduced new research tools designed to provide a rare glimpse into the hidden reasoning processes of advanced language models — like a "microscope" for AI. The tools enable scientists to trace internal computations in large models like Anthropic's Claude, revealing the conceptual building blocks, thought circuits, and internal contradictions that emerge when AI "thinks."

The microscope, detailed in two new papers ("Circuit Tracing: Revealing Computational Graphs in Language Models" and "On the Biology of a Large Language Model"), represents a step toward understanding the internal workings of models that are often compared to black boxes. Unlike traditional software, large language models (LLMs) are not explicitly programmed but trained on massive datasets. As a result, their reasoning strategies are encoded in billions of opaque parameters, making it difficult even for their creators to explain how they function.

"We're taking inspiration from neuroscience," the company said in a blog post. "Just as brain researchers probe the physical structure of neural circuits to understand cognition, we're dissecting artificial neurons to see how models process language and generate responses."

Peering into "AI Biology"

Using their interpretability toolset, Anthropic researchers have identified and mapped "circuits": linked patterns of activity that correspond to specific capabilities such as reasoning, planning, or translating between languages. These circuits allow the team to track how a prompt moves through Claude's internal systems, revealing both surprising strengths and hidden flaws.

In one study, Claude was tasked with composing rhyming poetry. Contrary to expectations, researchers discovered that the model plans multiple words ahead to meet rhyme and meaning constraints, effectively reverse-engineering entire lines before writing the first word. Another experiment found that Claude sometimes generates fake reasoning when nudged with a false premise, offering plausible explanations for incorrect answers, a behavior that raises new questions about the reliability of its step-by-step explanations.

The findings suggest that AI models possess something akin to a "language of thought," an abstract conceptual space that transcends individual languages. When translating between languages, for instance, Claude appears to access a shared semantic core before rendering the response in the target language. This "interlingua" behavior increases with model size, researchers noted.

Microscopic Proof of Concept

Anthropic's method, dubbed circuit tracing, enables researchers to alter internal representations mid-prompt — similar to stimulating parts of the brain to observe behavior changes. For example, when researchers removed the concept of "rabbit" from Claude's poetic planning state, the model swapped the ending rhyme from "rabbit" to "habit." When they inserted unrelated ideas like "green," the model adapted its sentence accordingly, breaking the rhyme but maintaining coherence.
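To make the general mechanic concrete, here is a minimal, hypothetical sketch of a concept-ablation intervention. It is not Anthropic's circuit-tracing tooling: the toy model, the layer chosen, and the randomly initialized "concept" direction are all illustrative assumptions. It shows only the basic idea of projecting a concept direction out of a hidden state during a forward pass and comparing the result with an unmodified run.

```python
# Hypothetical sketch of a concept-ablation intervention (not Anthropic's code).
# The model is a tiny stand-in; the concept direction is a random placeholder.
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Minimal stand-in for one transformer layer plus an output head."""
    def __init__(self, d_model: int = 16, vocab: int = 100):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.block = nn.Linear(d_model, d_model)  # placeholder for a real layer
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):
        hidden = torch.relu(self.block(self.embed(tokens)))
        return self.head(hidden)

model = TinyLM()

# Assumed concept direction; in real interpretability work this would come from
# a learned feature dictionary or probe, not random initialization.
concept = torch.randn(16)
concept = concept / concept.norm()

def ablate_concept(module, inputs, output):
    # Remove the component of the hidden state that lies along the concept direction.
    coeff = output @ concept                     # per-token projection coefficients
    return output - coeff.unsqueeze(-1) * concept

tokens = torch.randint(0, 100, (1, 8))
baseline = model(tokens)                         # ordinary forward pass

hook = model.block.register_forward_hook(ablate_concept)
intervened = model(tokens)                       # forward pass with the concept suppressed
hook.remove()

# Any difference between the two outputs is attributable to the ablated direction.
print((baseline - intervened).abs().max())
```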

In mathematical tasks, Claude's internal workings also proved more sophisticated than surface interactions would suggest. While the model claims to follow traditional arithmetic steps, its actual process involves parallel computations: one estimating approximate sums and another calculating final digits with precision. These findings suggest that Claude has developed hybrid reasoning strategies, even in simple domains.
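As a back-of-the-envelope illustration of what such a split could look like, consider the toy decomposition below. The two helper functions are an illustrative assumption of ours, not a reconstruction of Claude's internal circuitry; they simply show that a ballpark magnitude plus an exact final digit is enough to pin down a two-digit sum.

```python
# Toy illustration of two parallel strategies: a rough-magnitude estimate and an
# exact last-digit computation. The decomposition is illustrative only.
def approximate_sum(a: int, b: int) -> int:
    """Ballpark path: add the operands after rounding each to the nearest ten."""
    return round(a / 10) * 10 + round(b / 10) * 10

def exact_last_digit(a: int, b: int) -> int:
    """Precision path: compute only the final digit of the true sum."""
    return (a % 10 + b % 10) % 10

a, b = 36, 59
estimate = approximate_sum(a, b)   # 100 -- roughly right
last = exact_last_digit(a, b)      # 5   -- exactly right
# A ballpark magnitude plus a guaranteed final digit is enough to land on 95.
print(f"estimate ~{estimate}, last digit {last}, true sum {a + b}")
```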

Toward AI Transparency

The project is part of Anthropic's broader alignment strategy, which seeks to ensure AI systems behave safely and predictably. The interpretability tools are especially promising for identifying cases where a model may be reasoning toward a harmful or deceptive outcome, such as responding to a manipulated jailbreak prompt or catering to biased reward signals.

One case study showed that Claude can sometimes recognize a harmful request well before it formulates a complete refusal, but internal pressure to produce grammatically coherent output causes a brief lapse; the model recovers its safety alignment only after finishing the sentence. Another test found that the model declines to speculate by default, producing an answer only when certain "known entity" circuits override its reluctance, which sometimes results in hallucinations.

Although the methods are still limited, capturing only a fraction of a model's internal activity, Anthropic believes circuit tracing offers a scientific foundation for scaling interpretability in future AI systems.

"This is high-risk, high-reward work. It's painstaking to map even simple prompts, but as models grow more complex and impactful, the ability to see what they're thinking will be essential for ensuring they're aligned with human values and worthy of our trust," the company said. 

About the Author

John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at [email protected].
