AI 'Microscope' Reveals the Hidden Mechanics of LLM Thought

Anthropic has introduced new research tools designed to provide a rare glimpse into the hidden reasoning processes of advanced language models — like a "microscope" for AI. The tools enable scientists to trace internal computations in large models like Anthropic's Claude, revealing the conceptual building blocks, thought circuits, and internal contradictions that emerge when AI "thinks."

The microscope, detailed in two new papers ("Circuit Tracing: Revealing Computational Graphs in Language Models" and "On the Biology of a Large Language Model"), represents a step toward understanding the internal workings of models that are often compared to black boxes. Unlike traditional software, large language models (LLMs) are not explicitly programmed but trained on massive datasets. As a result, their reasoning strategies are encoded in billions of opaque parameters, making it difficult even for their creators to explain how they function.

"We're taking inspiration from neuroscience," the company said in a blog post. "Just as brain researchers probe the physical structure of neural circuits to understand cognition, we're dissecting artificial neurons to see how models process language and generate responses."

Peering into "AI Biology"

Using their interpretability toolset, Anthropic researchers have identified and mapped "circuits": linked patterns of activity that correspond to specific capabilities such as reasoning, planning, or translating between languages. These circuits allow the team to track how a prompt moves through Claude's internal systems, revealing both surprising strengths and hidden flaws.
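
Anthropic's tooling itself is not reproduced here, but the observational half of the approach can be sketched in miniature: run an input through a model, record the activations at each internal layer, and note which units light up. In the sketch below, the tiny network, layer names, and random input are all illustrative assumptions, not Claude's architecture or Anthropic's code.

```python
# Loose analogy for tracing a prompt through a model's internals: record which
# hidden units activate at each layer for one input. The tiny MLP, layer names,
# and random "prompt" vector are stand-ins, not Anthropic's tooling or Claude.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8), nn.ReLU())

activations = {}

def make_recorder(name):
    def record(module, inputs, output):
        activations[name] = output.detach()
    return record

# Attach a recording hook to each layer, roughly like probing stages of a circuit.
for i, layer in enumerate(model):
    layer.register_forward_hook(make_recorder(f"layer_{i}"))

x = torch.randn(1, 8)  # stand-in for a prompt's internal representation
model(x)

for name, act in activations.items():
    unit = act.abs().argmax().item()
    print(f"{name}: most active unit = {unit}")
```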

In one study, Claude was tasked with composing rhyming poetry. Contrary to expectations, researchers discovered that the model plans multiple words ahead to meet rhyme and meaning constraints, effectively reverse-engineering entire lines before writing the first word. Another experiment found that Claude sometimes generates fake reasoning when nudged with a false premise, offering plausible explanations for incorrect answers. The finding raises new questions about the reliability of the model's step-by-step explanations.

The findings suggest that AI models possess something akin to a "language of thought," an abstract conceptual space that transcends individual languages. When translating between languages, for instance, Claude appears to access a shared semantic core before rendering the response in the target language. This "interlingua" behavior increases with model size, researchers noted.

Microscopic Proof of Concept

Anthropic's method, dubbed circuit tracing, enables researchers to alter internal representations mid-prompt — similar to stimulating parts of the brain to observe behavior changes. For example, when researchers removed the concept of "rabbit" from Claude's poetic planning state, the model swapped the ending rhyme from "rabbit" to "habit." When they inserted unrelated ideas like "green," the model adapted its sentence accordingly, breaking the rhyme but maintaining coherence.
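
The intervention side of circuit tracing is the complement of the observation sketch above: instead of merely recording activations, researchers edit them mid-forward-pass and watch how the output changes. A minimal PyTorch analogy, assuming a toy model and a made-up "concept" direction rather than anything from Claude, might look like this:

```python
# Minimal sketch of an activation intervention: project a hypothetical "concept"
# direction out of one layer's output using a forward hook. The toy model and the
# concept vector are illustrative assumptions, not Anthropic's circuit-tracing code.
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = 16
model = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden))

# Hypothetical direction in activation space standing in for a concept like "rabbit".
concept = torch.randn(hidden)
concept = concept / concept.norm()

def ablate_concept(module, inputs, output):
    # Remove the component of the activations that lies along the concept direction.
    strength = output @ concept
    return output - strength.unsqueeze(-1) * concept

handle = model[0].register_forward_hook(ablate_concept)  # intervene mid-forward-pass

x = torch.randn(1, hidden)
with torch.no_grad():
    edited = model(x)
handle.remove()
with torch.no_grad():
    baseline = model(x)

print("max change in output after ablation:", (edited - baseline).abs().max().item())
```

Returning a modified tensor from a forward hook replaces that layer's output for the rest of the pass, which is the same shape of intervention, ablating or injecting a direction, that the "rabbit"-to-"habit" experiment describes, though vastly simplified here.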

In mathematical tasks, Claude's internal workings also proved more sophisticated than surface interactions would suggest. While the model claims to follow traditional arithmetic steps, its actual process involves parallel computations: one estimating the approximate sum and another calculating the final digit with precision. These findings suggest that Claude has developed hybrid reasoning strategies, even in simple domains.
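
The paper describes these as overlapping circuits inside the network rather than explicit code, but the division of labor can be illustrated with a toy sketch in which one function estimates the rough magnitude of a sum, another tracks only the final digit, and the two are reconciled at the end. The function names and the rounding rule below are invented for the illustration.

```python
# Toy analogy for the "parallel paths" account of Claude's addition: a fuzzy
# magnitude estimate combined with a precise last-digit computation. This mimics
# the described division of labor; it is not the model's actual mechanism.

def approximate_path(a: int, b: int) -> int:
    """Estimate the sum to the nearest ten, standing in for a rough-magnitude circuit."""
    return round((a + b) / 10) * 10

def last_digit_path(a: int, b: int) -> int:
    """Compute only the ones digit of the sum, standing in for a precise-digit circuit."""
    return (a + b) % 10

def combine(a: int, b: int) -> int:
    """Reconcile the paths: higher digits from the estimate, ones digit from the exact path."""
    estimate = approximate_path(a, b)
    return (estimate // 10) * 10 + last_digit_path(a, b)

print(combine(36, 57))  # 93: tens from the rough path, "3" from the digit path
```

The reconciliation step here is deliberately crude; the point is only the split between a rough-magnitude path and an exact-digit path, not a faithful account of how the circuits combine.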

Toward AI Transparency

The project is part of Anthropic's broader alignment strategy, which seeks to ensure AI systems behave safely and predictably. The interpretability tools are especially promising for identifying cases where a model may be reasoning toward a harmful or deceptive outcome, such as responding to a manipulated jailbreak prompt or catering to biased reward signals.

One case study showed that Claude can sometimes recognize a harmful request well before it formulates a complete refusal, but internal pressure to produce grammatically coherent output causes a brief lapse; the model recovers its safety alignment only after finishing the sentence. Another test found that the model declines to speculate by default, producing an answer only when certain "known entity" circuits overrule its reluctance, which sometimes results in hallucinations.

Although the methods are still limited, capturing only a fraction of a model's internal activity, Anthropic believes circuit tracing offers a scientific foundation for scaling interpretability in future AI systems.

"This is high-risk, high-reward work. It's painstaking to map even simple prompts, but as models grow more complex and impactful, the ability to see what they're thinking will be essential for ensuring they're aligned with human values and worthy of our trust," the company said. 

About the Author

John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at [email protected].
