The standard story is that LLMs work in words. They predict the next token, so surely their internal reasoning is mostly linguistic too. Language-agnostic representations are the strongest evidence yet that this story is incomplete.
I started this piece expecting a mild result: multilingual models probably compress similar translations into nearby vectors because they have to. The data says something stronger. Across multiple papers and independent experiments, semantically equivalent inputs tend to converge in the middle layers even when the surface form changes from English to Hindi, from prose to Python, or from text to equations.
That does not prove models “think” like humans. It does suggest a better mental model: language is the I/O; the middle of the model is closer to a shared semantic workspace.
If you haven’t been following this line of work, here’s the plain-English version. A transformer processes text layer by layer. Early layers stay closer to the input form, middle layers often capture higher-level features, and later layers prepare the output. The current question is whether those middle layers contain concept representations that are partly separate from any one language. The answer now looks like yes, probably, with some important caveats.
How LLMs Encode Concepts Across Languages

The clearest confirmed result is simple: in intermediate layers, meaning clusters more strongly than language.
That claim is well-supported but not settled law. In The Semantic Hub Hypothesis, researchers report that semantically equivalent inputs in different languages become similar in intermediate layers, and that this pattern extends beyond language to arithmetic, code, and even visual or audio inputs. Their core claim is not just “embeddings are close.” It is that the model appears to use a shared semantic space during processing.
David Noel Ng’s practitioner analysis found the same pattern across five models (Qwen3.5-27B, MiniMax M2.5, GLM-4.7, GPT-OSS-120B, and Gemma-4 31B) and eight languages: English, Chinese, Arabic, Russian, Japanese, Korean, Hindi, and French. In his tests, a sentence about photosynthesis in Hindi was closer, in middle-layer representation space, to photosynthesis in Japanese than to cooking in Hindi. That’s a useful “wait, really?” moment. If models were mostly organizing internal states by surface language, you’d expect same-language sentences to stay closer.
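The comparison behind these results can be sketched as a small measurement harness. Everything below uses synthetic stand-in activations (a real experiment would extract per-layer hidden states from an actual model and mean-pool them per sentence); only the per-layer cosine comparison is meant literally, and the "meaning plus language noise" construction is an illustrative assumption, not a finding.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def layerwise_similarity(hidden_a: np.ndarray, hidden_b: np.ndarray) -> np.ndarray:
    """Per-layer cosine similarity between two (num_layers, dim) activation stacks."""
    return np.array([cosine(a, b) for a, b in zip(hidden_a, hidden_b)])

rng = np.random.default_rng(0)
num_layers, dim = 12, 64

# Synthetic stand-ins: a shared "meaning" direction plus language-specific
# noise, with the meaning component weighted most heavily mid-stack. This is
# a toy model of the reported pattern, not data from a real model.
meaning = rng.normal(size=dim)

def fake_stack(language_seed: int) -> np.ndarray:
    noise_rng = np.random.default_rng(language_seed)
    weights = np.sin(np.linspace(0, np.pi, num_layers))  # peaks mid-stack
    return np.stack([w * meaning + 0.5 * noise_rng.normal(size=dim)
                     for w in weights])

hindi_photosynthesis = fake_stack(1)
japanese_photosynthesis = fake_stack(2)

sims = layerwise_similarity(hindi_photosynthesis, japanese_photosynthesis)
print("most similar layer:", int(sims.argmax()))
```

In this toy setup, the similarity curve peaks in the middle layers by construction; the empirical papers report the same curve shape measured on real hidden states.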
According to Ng, the pattern also survives a nastier test: code and equations. Expressions like `0.5 * m * v ** 2`, the equation ½mv², and a natural-language description of kinetic energy converged toward similar internal regions. That result is one researcher’s analysis, not an independently benchmarked consensus, but it lines up with the semantic hub paper’s broader claim.
This is why “universal internal language” is the wrong phrase. A language would imply some stable symbolic syntax. The evidence points instead to cross-lingual representations in a latent geometric space, more like shared coordinates for concepts than hidden English.
What the New Evidence Actually Shows

Similarity alone is suggestive. Activation patching is the stronger test.
Activation patching means you take internal activations from one run of a model and splice them into another run at a chosen layer, then watch what changes. If two prompts differ in both language and meaning, patching can show which layers carry which kind of information.
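The mechanics of activation patching can be shown on a toy two-layer network (random weights, not a transformer; the "concept in language X" labels are purely illustrative): capture a hidden activation from a source run, splice it into a second run at the same layer, and compare outputs.

```python
import numpy as np

rng = np.random.default_rng(42)
DIM = 8

# A toy two-layer network standing in for a transformer's layer stack.
W1 = rng.normal(size=(DIM, DIM))
W2 = rng.normal(size=(DIM, DIM))

def forward(x, patch_hidden=None):
    """Run the toy model; optionally overwrite the layer-1 activation.

    Returns (output, hidden) so the hidden state can be captured for patching.
    """
    hidden = np.tanh(W1 @ x)
    if patch_hidden is not None:
        hidden = patch_hidden  # the intervention: splice in foreign activations
    return W2 @ hidden, hidden

source_input = rng.normal(size=DIM)  # stands in for "concept A in language X"
target_input = rng.normal(size=DIM)  # stands in for "concept B in language Y"

_, source_hidden = forward(source_input)   # 1. capture from the source run
clean_out, _ = forward(target_input)       # 2. unpatched baseline run
patched_out, _ = forward(target_input,     # 3. patched run
                         patch_hidden=source_hidden)

# Patching the whole layer makes the output track the source run exactly.
# Real experiments patch selectively, which is what lets them tell apart
# the layers carrying language from the layers carrying the concept.
source_out, _ = forward(source_input)
print(np.allclose(patched_out, source_out))  # True
```

The interesting part in the papers is step 3 done surgically: patch some layers and the output language flips; patch others and the concept flips.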
In Separating Tongue from Thought, researchers used a translation task and found that output language is encoded earlier than the concept to be translated. That is a specific mechanistic claim, and the paper’s intervention results back it up directly: they could change the concept without changing the language, and change the language without changing the concept, using patching alone.
That’s more than “these vectors look similar.” It means the model’s behavior can be causally manipulated by intervening on internal states that appear to separate concept from language.
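The separability claim can be caricatured with a factored toy state. This is pure illustration under a loud assumption: real hidden states are not cleanly block-structured into a concept half and a language half, and the papers find only *partial* separability. The toy just makes the two-way intervention concrete.

```python
import numpy as np

rng = np.random.default_rng(3)
DIM = 4

# Toy hidden state: first half "encodes" the concept, second half the output
# language. Real models are nothing like this clean; the point is only to
# show what "change one without the other" means as an intervention.
def make_state(concept_vec, language_vec):
    return np.concatenate([concept_vec, language_vec])

cat, dog = rng.normal(size=DIM), rng.normal(size=DIM)
french, german = rng.normal(size=DIM), rng.normal(size=DIM)

state = make_state(cat, french)  # "cat, to be said in French"

# Intervention 1: patch the concept slot; the language slot is untouched.
patched_concept = state.copy()
patched_concept[:DIM] = dog      # now "dog, still in French"

# Intervention 2: patch the language slot; the concept slot is untouched.
patched_language = state.copy()
patched_language[DIM:] = german  # now "cat, but in German"

print(np.allclose(patched_concept[DIM:], french))  # True
print(np.allclose(patched_language[:DIM], cat))    # True
```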
The paper reports another result that is easy to miss and harder to explain away: patching in the mean representation of a concept across different languages did not hurt translation performance. It improved it. If the shared representation were just a messy averaging artifact, you’d expect degradation. Instead, the averaged concept latent was still useful.
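Why would averaging help rather than hurt? One intuition, sketched below with synthetic stand-in latents (the "concept plus per-language noise" decomposition is an assumption for illustration, not the paper's method): if each language's representation is a shared concept direction plus independent language-specific noise, averaging cancels the noise and leaves a cleaner concept latent.

```python
import numpy as np

rng = np.random.default_rng(7)
DIM = 16

# Synthetic stand-ins for one concept's middle-layer activation in several
# languages: a shared concept direction plus language-specific noise.
concept = rng.normal(size=DIM)
languages = ["en", "hi", "ja", "ar"]
per_language = {lang: concept + 0.4 * rng.normal(size=DIM)
                for lang in languages}

# The paper's intervention: average the concept's representation across
# languages, then patch that mean in place of a single-language latent.
mean_latent = np.mean(list(per_language.values()), axis=0)

def distance_to_concept(v):
    return float(np.linalg.norm(v - concept))

# By the triangle inequality the mean can never sit farther from the shared
# concept than the average single-language latent, and independent noise
# largely cancels, so in practice it sits much closer.
single_dists = [distance_to_concept(v) for v in per_language.values()]
mean_dist = distance_to_concept(mean_latent)
print(mean_dist, np.mean(single_dists))
```

Under this (assumed) decomposition, the paper's result is what you would predict; under a "messy averaging artifact" story, it is not.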
There is still a reasonable objection from the Reddit discussion: maybe this is just what any bottlenecked system does. Fair. Compression pressure should push models toward shared latent factors. But activation patching weakens the “mere bottleneck” dismissal. A bottleneck story predicts compression. It does not automatically predict that swapping or averaging cross-language concept activations will preserve or improve downstream behavior in such a targeted way.
The right update is not “we discovered souls inside vectors.” It is narrower: multilingual LLM interpretability now has causal evidence that concept and language are at least partly separable inside transformer layers.
Why Language-Agnostic Representations Matter in Practice
This sounds abstract until you look at what it explains.
First, it explains why multilingual prompting often works better than surface-level intuitions suggest. If the model maps different phrasings into a partially shared semantic workspace, then a translation can preserve more of the “reasoning state” than you’d expect from word-by-word substitution. That’s also why odd prompt workflows (draft in one language, refine in another) sometimes behave surprisingly consistently.
Second, it matters for interpretability. If you inspect only tokens, you may miss where the actual action is. The interesting thing may not be the sentence form but the internal concept being activated. That’s the backdrop for work on Gemma 4 Native Thinking: the question is less “what language is the model using?” than “what internal abstraction did it settle on?”
Third, it matters for reliability. If factual checking or correction operates at the level of shared concepts, then interventions there could generalize across paraphrases and languages better than token-level fixes. That connects to practical efforts to reduce LLM hallucinations. A model that stores a wrong concept in a shared latent form can be wrong in many languages at once. The upside is that a good correction might travel just as well.
Fourth, it matters for tool use and multimodal systems. If code, equations, and natural language partially meet in the same internal neighborhood, the model has a cleaner route to move between them. That helps explain why systems can jump from prose spec to code sketch to symbolic manipulation without acting like three unrelated tools taped together. The more ambitious agent story, see Karpathy Autoresearch, depends on this kind of abstraction layer working better than it has any obvious right to.

What Still Isn’t Proven
This is where a lot of people overreach.
The evidence for language-agnostic representations is strong enough to change your mental model. It is not strong enough to justify “the model thinks in concepts exactly like humans do.” The papers show shared latent structure and causal separability in tested settings. They do not show conscious reasoning, human-like semantics, or a single universal internal code.
Some limits are obvious.
| Claim | Status | Why |
|---|---|---|
| Intermediate layers cluster by meaning across languages | Plausible, with repeated evidence | Reported in multiple experiments and models |
| Concept and language can be partly separated causally | Confirmed in tested setups | Activation patching changes one without fully changing the other |
| The same shared space spans code and equations | Plausible | Reported in papers and practitioner tests, but narrower and less standardized |
| LLMs think in a universal language | Unverified | Evidence points to latent geometry, not symbolic language |
| This disproves Sapir-Whorf for humans | False leap | The findings are about transformer internals, not people |
There is also a model-scope issue. Most of this evidence comes from multilingual transformers and translation-like tasks. As one commenter noted, a multilingual model may mask language-specific weaknesses by learning from many languages at once. A truly hard test would compare strongly monolingual models, or probe whether the shared semantic workspace stays stable in domains with weak parallel data.
And then there is the familiar interpretability problem: seeing a neat pattern in representation space is easier than proving the model relies on it broadly. Activation patching helps because it is causal. But even there, the interventions are local and task-specific. We have evidence of a shared workspace, not a full map of it.
Key Takeaways
- Language-agnostic representations are now supported by more than cosine-similarity charts; activation patching provides causal evidence.
- The best current model is language as I/O, concepts in the middle, not “the model reasons in English.”
- The evidence extends beyond translation to code and equations, which is the part that makes this more than a multilingual curiosity.
- This matters for prompting, interpretability, and hallucination work because token-level analysis can miss concept-level behavior.
- None of this proves human-style thought or a single universal internal language.
Further Reading
- The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities. The main paper arguing for a shared semantic hub across languages, code, arithmetic, and other modalities.
- Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers. The strongest causal evidence here, using activation patching to separate concept from language.
- LLM Neuroanatomy III: Do LLMs Break the Sapir-Whorf Hypothesis? Independent practitioner analysis across five models and eight languages, with clear experimental intuition.
- llm-lang-agnostic. Reproducibility code for the activation-patching paper.
The old intuition was that models manipulate words and somehow meaning falls out. The newer evidence says the reverse is closer to true: words are the interface, and the interesting part happens underneath. The next interpretability wins will probably come from tracing concepts, not sentences.
