Multi-agent systems built from LLMs have a dirty secret: the agents talk to each other in text. That sounds natural - after all, text is what LLMs do - but it’s catastrophically wasteful. Every time Agent A finishes reasoning and passes its output to Agent B, the system decodes hidden states into tokens, ships those tokens over, and re-encodes them back into hidden states. Information gets destroyed. Gradients die at the text boundary. And you’re paying for a full vocabulary projection at every handoff.

The paper “Recursive Multi-Agent Systems” (Yang, Zou, Pan et al., UIUC/Stanford/NVIDIA/MIT, April 2026) asks: what if we just… didn’t do that? What if the agents shared their thoughts directly, in continuous latent space, and the entire system looped like a single recursive neural network?

The result is RecursiveMAS - a framework that adds only 0.31% trainable parameters (13.12M) while delivering +8.3% average accuracy, 2.4x inference speedup, and 75.6% token reduction.


The Problem With Text-Based Multi-Agent Systems

The current dominant paradigm for multi-agent LLM systems looks like this: Agent 1 generates text, Agent 2 reads that text, generates more text, passes it to Agent 3, and so on. Frameworks like Mixture-of-Agents and TextGrad operate this way. It works, but it suffers from two fundamental bottlenecks.

The Information Bottleneck

When an LLM converts its internal hidden states to text, it performs a vocabulary projection: a linear map onto the vocabulary followed by a softmax over all $|V|$ entries, where $|V|$ is typically 100K+ tokens. This is a lossy compression. The rich, high-dimensional latent representation gets squashed into a discrete sequence of token IDs. The receiving agent then re-embeds those tokens, but whatever the discretization threw away cannot be recovered.

Think of it this way: imagine a software development team where every communication must happen through lengthy written emails. Each person has a rich mental model of the problem, but they can only share it by writing prose, and the recipient can only understand it by reading that prose. Now imagine the same team sharing a collaborative whiteboard where everyone sketches abstract diagrams directly. That’s the difference between text-based and latent-space communication.

The Optimization Bottleneck

Text creates a non-differentiable boundary between agents. You cannot backpropagate gradients through the argmax operation that selects discrete tokens. This means you cannot co-optimize the agents end-to-end - each agent is trained in isolation, and the system’s overall performance is left to hope and prompt engineering.
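To make the non-differentiable boundary concrete, here is a minimal sketch in plain Python. It is not code from the paper: the three-token vocabulary, the logit values, and the helper names `text_handoff`, `latent_handoff`, and `finite_difference` are all made up for illustration. It nudges one logit and measures how much each kind of handoff changes.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def text_handoff(logits):
    """Discrete handoff: only the index of the most likely token survives."""
    return max(range(len(logits)), key=lambda i: logits[i])

def latent_handoff(logits):
    """Continuous handoff: the whole probability vector is passed on."""
    return softmax(logits)

def finite_difference(fn, logits, index, eps=1e-4):
    """Numerical sensitivity of fn's output to a small bump in logits[index]."""
    bumped = list(logits)
    bumped[index] += eps
    before, after = fn(logits), fn(bumped)
    if isinstance(before, int):
        return (after - before) / eps
    return [(a - b) / eps for a, b in zip(after, before)]

logits = [2.0, 0.5, -1.0]   # hypothetical scores over a 3-token vocabulary
text_grad = finite_difference(text_handoff, logits, 1)
latent_grad = finite_difference(latent_handoff, logits, 1)
print(text_grad)     # 0.0: the chosen token id does not move
print(latent_grad)   # every coordinate responds to the bump
```

The discrete path reports zero sensitivity almost everywhere, which is why no learning signal can cross it, while the continuous path reacts in every coordinate.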

The core thesis of RecursiveMAS: treat the entire multi-agent system as a single recursive computation where each agent is a layer in a looped neural network, communicating through continuous hidden states instead of text.


From Single-Model Recursion to Multi-Agent Recursion

To understand RecursiveMAS, you first need to understand recursive (looped) computation in a single LLM.

Auto-Regressive Latent Generation

A standard LLM generates hidden states auto-regressively:

$$h_{t+1} = f_\theta([E_{\leq t}; h_t]) \tag{1}$$

where:

  • $h_{t+1}$ - hidden state at position $t+1$
  • $f_\theta$ - the transformer with parameters $\theta$
  • $E_{\leq t}$ - embeddings of all tokens up to position $t$
  • $h_t$ - the previous hidden state

Interpretation: Each new hidden state is computed by feeding the transformer all prior context plus the current hidden state - standard auto-regressive generation.

Recursive Computation (Looping)

The key innovation from prior work like LoopLM is to loop the transformer over its own outputs multiple times:

$$H^{(0)} = E, \quad H^{(r)} = f_\theta(H^{(r-1)}), \quad r = 1, \ldots, n \tag{2}$$

where:

  • $H^{(0)}$ - initial hidden states (token embeddings)
  • $H^{(r)}$ - hidden states after the $r$-th recursion
  • $f_\theta$ - the same transformer applied repeatedly
  • $n$ - total number of recursion rounds

Interpretation: Instead of generating text and feeding it back in, you feed the hidden states directly back into the transformer. Each loop refines the representations without ever converting to discrete tokens. It’s like “thinking again” but in continuous space.
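The loop in equation (2) can be sketched in a few lines. This is only an illustration under stated assumptions: `toy_layer` is a hypothetical stand-in for $f_\theta$ (a real transformer is far more involved), and the embedding values are invented.

```python
import math

def toy_layer(hidden):
    """Hypothetical stand-in for f_theta: mixes each position with the sequence mean."""
    dim = len(hidden[0])
    mean = [sum(row[j] for row in hidden) / len(hidden) for j in range(dim)]
    return [[math.tanh(0.5 * row[j] + 0.5 * mean[j]) for j in range(dim)]
            for row in hidden]

def recurse(f, embeddings, n):
    """Equation (2): H^(0) = E, then H^(r) = f(H^(r-1)) for r = 1..n."""
    states = [embeddings]
    for _ in range(n):
        states.append(f(states[-1]))   # feed hidden states straight back in
    return states

E = [[1.0, -2.0], [0.0, 0.5], [-1.0, 1.5]]   # 3 positions, hidden size 2 (made up)
states = recurse(toy_layer, E, n=3)
print(len(states))   # n + 1 snapshots: H^(0) through H^(3)
```

Nothing in the loop ever touches a vocabulary or a token ID; the same function is simply re-applied to its own output.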

The problem is that LoopLM only works for a single model. RecursiveMAS extends this to systems of multiple heterogeneous agents.


RecursiveLink

The central technical contribution is RecursiveLink - a lightweight adapter module that translates hidden states between agents. Different LLMs have different hidden dimensions, different learned representations, and different internal conventions. You can’t just pass Agent A’s hidden states directly to Agent B. RecursiveLink bridges this gap.

There are two variants, corresponding to the two training phases.

Inner RecursiveLink

Used during per-agent alignment (inner loop):

$$\mathcal{R}_{\text{in}}(h) = h + W_2 \sigma(W_1 h) \tag{3}$$

where:

  • $h$ - input hidden state from the source agent
  • $W_1 \in \mathbb{R}^{d' \times d_h}$ - down-projection matrix (compresses dimension)
  • $W_2 \in \mathbb{R}^{d_h \times d'}$ - up-projection matrix (restores dimension)
  • $\sigma$ - activation function (GELU)
  • The sum $h + \ldots$ is a residual connection

Interpretation: This is a bottleneck MLP with a skip connection - it learns small corrections to the hidden states so that one agent’s outputs become compatible with another agent’s input space. The residual connection ensures that most information passes through untouched.
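A minimal sketch of equation (3) in plain Python follows. The sizes, the random weights, and the name `d_b` (standing in for $d'$) are assumptions for illustration. The zero up-projection is included only to show the residual path in isolation; it is not the paper's stated initialization.

```python
import math
import random

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def inner_link(h, W1, W2):
    """Equation (3): h + W2 * gelu(W1 * h)."""
    squeezed = [gelu(z) for z in matvec(W1, h)]     # down-project to d', then GELU
    correction = matvec(W2, squeezed)               # up-project back to d_h
    return [a + b for a, b in zip(h, correction)]   # residual connection

d_h, d_b = 8, 2                      # toy sizes; d_b stands in for d'
rng = random.Random(0)
h = [rng.gauss(0.0, 1.0) for _ in range(d_h)]
W1 = [[rng.gauss(0.0, 0.1) for _ in range(d_h)] for _ in range(d_b)]          # d' x d_h
W2_zero = [[0.0] * d_b for _ in range(d_h)]                                   # d_h x d'
W2_small = [[rng.gauss(0.0, 0.01) for _ in range(d_b)] for _ in range(d_h)]

print(inner_link(h, W1, W2_zero) == h)   # True: with a zero up-projection h passes through
drift = max(abs(a - b) for a, b in zip(inner_link(h, W1, W2_small), h))
print(drift)                             # small: the link only nudges h
```

With small weights the output stays close to the input, which matches the reading that the link learns corrections rather than a full re-encoding.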

Outer RecursiveLink

Used during full-system optimization (outer loop):

$$\mathcal{R}_{\text{out}}(h) = W_3 h + W_2 \sigma(W_1 h) \tag{4}$$

where:

  • $W_3 \in \mathbb{R}^{d_h \times d_h}$ - linear transformation of the input
  • $W_1, W_2$ - same bottleneck structure as inner RecursiveLink
  • The $W_3 h$ term replaces the identity skip connection with a learned linear map

Interpretation: The outer variant has more expressive capacity - it can perform both linear remapping (via $W_3$) and nonlinear adjustment (via the bottleneck). This is needed because the outer loop optimizes across the entire agent chain, requiring more flexibility to propagate useful gradients through multiple agents.
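Equation (4) can be sketched the same way. The weights below are invented, and the two choices of $W_3$ (an identity and a coordinate swap) are hypothetical examples picked only to show what the learned linear term adds over a plain skip connection.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def bottleneck(h, W1, W2):
    """The shared nonlinear branch: W2 * gelu(W1 * h)."""
    return matvec(W2, [gelu(z) for z in matvec(W1, h)])

def outer_link(h, W1, W2, W3):
    """Equation (4): W3 * h + W2 * gelu(W1 * h)."""
    return [a + b for a, b in zip(matvec(W3, h), bottleneck(h, W1, W2))]

h = [1.0, -0.5, 2.0]
W1 = [[0.2, -0.1, 0.4]]                # d' x d_h with d' = 1 (toy values)
W2 = [[0.5], [-0.3], [0.1]]            # d_h x d'
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
swap = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]   # exchanges coords 0 and 1

with_identity = outer_link(h, W1, W2, identity)   # reduces to the residual form of eq. (3)
with_swap = outer_link(h, W1, W2, swap)           # linear remap a skip connection cannot do
print(with_identity)
print(with_swap)
```

Setting $W_3$ to the identity recovers the inner form exactly, so the outer variant is a strict generalization of it.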

The total number of trainable parameters across all RecursiveLink modules is just 13.12M - roughly 0.31% of the frozen LLM parameters. The LLMs themselves are never fine-tuned.
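As a back-of-the-envelope check, the two quoted figures imply the size of the frozen backbone. The sketch below also counts parameters per link from the matrix shapes given above, assuming no bias terms. The sizes 1024 and 128 are hypothetical and are not the paper's configuration.

```python
def inner_link_params(d_h, d_b):
    """W1 (d' x d_h) plus W2 (d_h x d'), assuming no biases."""
    return 2 * d_h * d_b

def outer_link_params(d_h, d_b):
    """Adds the d_h x d_h matrix W3 to the bottleneck."""
    return d_h * d_h + 2 * d_h * d_b

print(inner_link_params(1024, 128))   # 262144
print(outer_link_params(1024, 128))   # 1310720

trainable = 13.12e6          # quoted trainable parameter count
share = 0.31 / 100           # quoted share of the frozen parameters
implied_frozen = trainable / share
print(round(implied_frozen / 1e9, 2))   # 4.23 (billions of frozen parameters)
```

Because 0.31% is itself rounded, the implied total of roughly 4.2B frozen parameters is only approximate.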


Training: Inner and Outer Loops

RecursiveMAS uses a two-phase training procedure. Both phases keep the LLM weights frozen - only the RecursiveLink parameters are trained.

Inner Loop: Per-Agent Alignment

The inner loop trains each $\mathcal{R}_{\text{in}}$ independently. For each agent pair (source → target), minimize:

$$\mathcal{L}_{\text{in}} = 1 - \cos\left(\mathcal{R}_{\text{in}}(H), \text{Emb}_{\theta_i}(y)\right) \tag{5}$$

where:

  • $\mathcal{R}_{\text{in}}(H)$ - the transformed hidden states from the source agent
  • $\text{Emb}_{\theta_i}(y)$ - the target agent’s own embeddings of the ground-truth tokens $y$
  • $\cos(\cdot, \cdot)$ - cosine similarity

Interpretation: Push the transformed hidden states to be close (in direction) to what the target agent would have produced if it had seen the ground-truth tokens directly. This is a representation alignment objective - make the translator faithful.
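The objective in equation (5) is easy to compute by hand. In this sketch, averaging the per-position losses over the sequence is an assumption (the formula above does not say how positions are reduced), and the vectors are toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_loss(transformed, target_embeddings):
    """Equation (5) per position, averaged over positions (averaging is assumed)."""
    losses = [1.0 - cosine(h, e) for h, e in zip(transformed, target_embeddings)]
    return sum(losses) / len(losses)

same_direction = alignment_loss([[1.0, 2.0]], [[2.0, 4.0]])   # scale differs, direction matches
orthogonal = alignment_loss([[1.0, 0.0]], [[0.0, 1.0]])
opposite = alignment_loss([[1.0, 0.0]], [[-1.0, 0.0]])
print(same_direction, orthogonal, opposite)   # about 0, exactly 1.0 and 2.0
```

The loss ignores vector length entirely, so the link is pushed to match direction, not magnitude.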

The beauty of this phase is that each agent pair can be trained in parallel. There are no cross-dependencies.

Outer Loop: End-to-End System Optimization

The outer loop unrolls the entire multi-agent recursion and optimizes all $\mathcal{R}_{\text{out}}$ modules jointly:

$$\mathcal{L}_{\text{out}} = \text{CE}\left(S^{(n)}\left(S^{(n-1)}\left(\cdots S^{(1)}(x)\right)\right), y\right) \tag{6}$$

where:

  • $S^{(i)}$ - the $i$-th agent (frozen LLM + RecursiveLink)
  • $x$ - input problem
  • $y$ - ground-truth answer
  • $n$ - number of recursion rounds
  • $\text{CE}$ - cross-entropy loss

Interpretation: Unroll the full recursive chain - input goes through Agent 1, then Agent 2, then back to Agent 1 (or whichever pattern), for $n$ rounds - and optimize the final output against the correct answer. This is true end-to-end backpropagation through the entire multi-agent system.

This is what text-based systems fundamentally cannot do. The discrete text boundary kills gradients. But because RecursiveMAS communicates in continuous latent space, gradients flow cleanly from the final loss all the way back through every agent and every RecursiveLink module.
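The forward pass behind equation (6) can be sketched as function composition followed by a cross-entropy. The sketch makes several assumptions: $S^{(i)}$ is read as the $i$-th stage of the unrolled chain, the two agents are hypothetical elementwise functions rather than frozen LLMs, and the final hidden state is read directly as logits. It shows only the loss computation, not the backward pass.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of softmax(logits) against a single target index."""
    top = max(logits)
    log_partition = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_partition - logits[target]

def unroll(stages, x):
    """Apply the stages left to right: S_last(... S_1(x))."""
    h = x
    for stage in stages:
        h = stage(h)
    return h

# Hypothetical stand-ins for "frozen LLM + RecursiveLink" stages.
def agent_a(h):
    return [math.tanh(v) for v in h]

def agent_b(h):
    return [1.5 * v + 0.1 for v in h]

rounds = 3
stages = [agent_a, agent_b] * rounds   # A -> B -> A -> B -> A -> B
x = [0.8, -0.3, 0.1]                   # toy latent input
y = 0                                  # index of the correct class
loss = cross_entropy(unroll(stages, x), y)
print(loss)
```

Every stage here is a smooth function of its input, so the loss is a differentiable function of `x` and of any parameters inside the stages.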


Why This Is Faster: The Complexity Argument

The paper provides a formal complexity analysis (Proposition 3.1) comparing text-based and latent-space multi-agent communication.

Text-Based MAS Complexity

$$\Theta\left(N\left(m|V|d_h + (t+m)d_h^2 + (t+m)^2 d_h\right)\right) \tag{7a}$$

RecursiveMAS Complexity

$$\Theta\left(N\left(md_h^2 + (t+m)d_h^2 + (t+m)^2 d_h\right)\right) \tag{7b}$$

where:

  • $N$ - number of agents
  • $m$ - number of generated tokens (or latent thought length)
  • $|V|$ - vocabulary size (typically 100K+)
  • $d_h$ - hidden dimension
  • $t$ - input sequence length

Interpretation: The critical difference is in the first term. Text-based systems pay $m|V|d_h$ in total, that is $|V|d_h$ for every token generation step, because each step needs the full vocabulary projection. RecursiveMAS pays only $md_h^2$, replacing the $|V|$ factor with $d_h$. Since $|V| \gg d_h$ in practice (100K+ vs. 4096), this is a massive saving.

This is where the 2.4x speedup and 75.6% token reduction come from. No vocabulary projection, no tokenization, no detokenization - just matrix multiplies in hidden-state space.
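Plugging numbers into (7a) and (7b) shows how the first term dominates. The values $|V| = 100{,}000$ and $d_h = 4096$ follow the rough figures quoted above, and $m = 80$ matches the latent thought length discussed in the ablations. $N = 3$ and $t = 512$ are hypothetical. These are asymptotic operation counts with constants dropped, so the ratios are not wall-clock predictions.

```python
def text_mas_cost(N, m, t, V, d_h):
    """Equation (7a) with the hidden constant set to 1."""
    return N * (m * V * d_h + (t + m) * d_h**2 + (t + m)**2 * d_h)

def recursive_mas_cost(N, m, t, d_h):
    """Equation (7b) with the hidden constant set to 1."""
    return N * (m * d_h**2 + (t + m) * d_h**2 + (t + m)**2 * d_h)

N, m, t, V, d_h = 3, 80, 512, 100_000, 4096

first_term_ratio = (m * V * d_h) / (m * d_h**2)   # simplifies to V / d_h
total_ratio = text_mas_cost(N, m, t, V, d_h) / recursive_mas_cost(N, m, t, d_h)

print(round(first_term_ratio, 1))   # 24.4
print(total_ratio)                  # smaller, since the shared terms dilute the gap
```

The attention and feed-forward terms are identical in both expressions, so the overall ratio is always smaller than the first-term ratio $|V|/d_h$.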


Why Gradients Don’t Vanish: The Stability Theorem

Beyond speed, there’s a theoretical reason why latent-space communication produces better-optimized systems. Theorem 4.1 in the paper provides gradient norm bounds.

Text-Based Gradient Bound

$$\left\|\frac{\partial R_{\text{text}}}{\partial h}\right\|_2 \leq O(\epsilon) \ll 1 \tag{8a}$$

RecursiveLink Gradient Bound

$$\left\|\frac{\partial R}{\partial h}\right\|_2 \geq \Omega\left(1 - \sqrt{\frac{\log(1/\delta)}{d_h}}\right) \tag{8b}$$

where:

  • $R_{\text{text}}$ - the text-based communication function (encode → decode → re-encode)
  • $R$ - the RecursiveLink communication function
  • $h$ - input hidden state
  • $\epsilon$ - a small constant reflecting the information loss through discretization
  • $\delta$ - failure probability (confidence parameter)
  • $d_h$ - hidden dimension

Interpretation: Text-based gradients are bounded above by a small $\epsilon$ - they effectively vanish. RecursiveLink gradients are bounded below by something close to 1 (the $\sqrt{\log(1/\delta)/d_h}$ term is small for large $d_h$). This means the outer-loop optimization actually receives useful gradient signal through the entire agent chain. The system can meaningfully learn to coordinate, not just hope that independently trained agents happen to work well together.
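To see how close to 1 the expression inside (8b) gets, it can be evaluated directly. The choices $d_h = 4096$ and $\delta = 0.01$ are illustrative assumptions, the logarithm is taken as natural, and the $\Omega(\cdot)$ hides a constant factor, so the result is the shape of the bound rather than a literal gradient norm.

```python
import math

def latent_bound_term(d_h, delta):
    """The expression inside Omega(.) in (8b): 1 - sqrt(log(1/delta) / d_h)."""
    return 1.0 - math.sqrt(math.log(1.0 / delta) / d_h)

for d_h in (256, 1024, 4096):
    print(d_h, round(latent_bound_term(d_h, delta=0.01), 3))
```

The term rises toward 1 as the hidden dimension grows, which is the sense in which wider models keep more gradient signal across a latent handoff.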


Four Collaboration Patterns

RecursiveMAS supports four distinct topologies for how agents collaborate during the recursive loop:

  • Sequential - Agent 1 → Agent 2 → Agent 3 → repeat. A pipeline where each agent refines the previous agent’s hidden states.
  • Mixture - All agents process the input in parallel, then their hidden states are combined (averaged/weighted) before the next round.
  • Distillation - A “teacher” agent’s hidden states are distilled into a “student” agent across recursion rounds.
  • Deliberation - Agents take turns processing, but with cross-attention-style interaction - each agent can attend to other agents’ hidden states from the previous round.

The paper evaluates all four, and the results depend on the task. Sequential and deliberation tend to work best for reasoning-heavy benchmarks, while mixture excels when diversity of perspective matters.
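Two of the four patterns, sequential and mixture, are simple enough to sketch as control flow. The agents below are hypothetical elementwise functions, and uniform averaging is one assumed way to combine hidden states. Distillation and deliberation are left out because their exact mechanics are not spelled out here.

```python
def sequential_round(agents, h):
    """Sequential: each agent refines the previous agent's hidden state."""
    for agent in agents:
        h = agent(h)
    return h

def mixture_round(agents, h, weights=None):
    """Mixture: all agents see the same input, outputs are averaged."""
    outputs = [agent(h) for agent in agents]
    if weights is None:
        weights = [1.0 / len(agents)] * len(agents)
    return [sum(w * out[j] for w, out in zip(weights, outputs))
            for j in range(len(h))]

def run(pattern, agents, h, rounds):
    """Repeat one collaboration pattern for a number of recursion rounds."""
    for _ in range(rounds):
        h = pattern(agents, h)
    return h

def double(h):   # toy agent 1
    return [2.0 * v for v in h]

def shift(h):    # toy agent 2
    return [v + 1.0 for v in h]

agents = [double, shift]
print(sequential_round(agents, [1.0, 2.0]))              # [3.0, 5.0]
print(mixture_round(agents, [1.0, 2.0]))                 # [2.0, 3.5]
print(run(sequential_round, agents, [1.0, 2.0], rounds=2))   # [7.0, 11.0]
```

The only difference between the two is whether agents consume each other's output within a round or only across rounds.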


Results: The Numbers Speak

Here’s where things get concrete. RecursiveMAS is compared against text-based MAS baselines and single-model LoopLM across six benchmarks.

Accuracy Comparison

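| Benchmark | LoopLM | Text MAS | RecursiveMAS | Gain vs LoopLM (points) |
| --- | --- | --- | --- | --- |
| MATH500 | 83.2% | 84.1% | 88.0% | +4.8 |
| AIME2025 | 66.7% | 70.0% | 86.7% | +20.0 |
| AIME2026 | 63.3% | 66.7% | 86.7% | +23.4 |
| GPQA-Diamond | 48.1% | 50.3% | 66.2% | +18.1 |
| LiveCodeBench-v6 | 24.9% | 28.5% | 42.9% | +18.0 |
| MedQA | 56.4% | 60.1% | 79.3% | +22.9 |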

The gains are not incremental. +23.4 points on AIME2026. +22.9 on MedQA. +18.1 on GPQA-Diamond. These are the kinds of jumps you see when you remove a fundamental architectural bottleneck, not when you tweak hyperparameters.
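The gain column can be recomputed from the accuracy figures above, and the same subtraction gives the gain over the text-based baseline, which the table does not list. The numbers are copied from the table; only the arithmetic is new. These are percentage-point differences over the six rows shown, a different quantity from the +8.3% headline average, whose exact basis is not spelled out in this summary.

```python
# (LoopLM, Text MAS, RecursiveMAS) accuracies in percent, copied from the table.
results = {
    "MATH500": (83.2, 84.1, 88.0),
    "AIME2025": (66.7, 70.0, 86.7),
    "AIME2026": (63.3, 66.7, 86.7),
    "GPQA-Diamond": (48.1, 50.3, 66.2),
    "LiveCodeBench-v6": (24.9, 28.5, 42.9),
    "MedQA": (56.4, 60.1, 79.3),
}

gain_vs_looplm = {name: round(rec - loop, 1) for name, (loop, _, rec) in results.items()}
gain_vs_text = {name: round(rec - text, 1) for name, (_, text, rec) in results.items()}

mean_vs_looplm = sum(gain_vs_looplm.values()) / len(results)
mean_vs_text = sum(gain_vs_text.values()) / len(results)

for name in results:
    print(f"{name:18s} {gain_vs_looplm[name]:+5.1f} {gain_vs_text[name]:+5.1f}")
print(round(mean_vs_looplm, 1), round(mean_vs_text, 1))
```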

Efficiency Comparison

| Metric | Text-Based MAS | RecursiveMAS |
| --- | --- | --- |
| Inference speed | 1.0x (baseline) | 2.4x |
| Token usage | 100% | 24.4% (75.6% reduction) |
| Training memory | 41.40 GB | 15.29 GB |
| Training cost | \$9.67 | \$4.27 |

The efficiency story is just as compelling. You get better results while using a quarter of the tokens and training at less than half the memory/cost. The 2.4x speedup is measured at recursion depth $r=3$.
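The "quarter" and "less than half" claims follow directly from the efficiency figures; a quick check using only the numbers quoted above:

```python
memory_ratio = 15.29 / 41.40     # RecursiveMAS training memory relative to text-based MAS
cost_ratio = 4.27 / 9.67         # RecursiveMAS training cost relative to text-based MAS
token_ratio = 1.0 - 0.756        # tokens remaining after a 75.6% reduction

print(round(memory_ratio, 3))    # 0.369
print(round(cost_ratio, 3))      # 0.442
print(round(token_ratio, 3))     # 0.244
```

So memory drops to roughly 37% and cost to roughly 44% of the text-based baseline, both under half, and token usage to just under a quarter.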


Ablations: What Actually Matters

The paper includes thorough ablation studies that reveal what drives performance.

RecursiveLink Architecture

The 2-layer bottleneck MLP with residual connection (the architecture described above) consistently outperforms alternatives:

  • Simple linear projection - too limited, can’t capture nonlinear representation mismatches
  • 3-layer MLP - overfits with so few trainable parameters
  • Without residual connection - training becomes unstable

Latent Thought Length

The parameter $m$ (how many latent “tokens” are communicated between agents) shows a clear pattern: performance improves up to $m \approx 80$, then plateaus. This suggests there’s a natural bandwidth for inter-agent communication - enough to capture the essential information, but not so much that you’re transmitting noise.

Recursion Depth Scaling

A particularly interesting finding: training recursion depth and inference recursion depth are complementary. A model trained with $r=2$ recursions performs better at $r=3$ inference than a model trained with $r=3$ at $r=3$ inference. This suggests that moderate training depth builds more robust recursive representations that generalize to deeper inference-time recursion.


Why It Works: The Deeper Story

The surface explanation is simple - latent space is richer than text, gradients flow, end-to-end optimization works. But there’s a more interesting story underneath.

Recursive refinement in continuous space is qualitatively different from iterative refinement in text. When agents communicate in text, each round of refinement starts from a re-encoded version of the previous round’s output. The re-encoding introduces noise and loses nuance. When agents communicate in latent space, each round of refinement starts from a continuous transformation of the previous round’s hidden states. The refinement is smooth and cumulative.

This is analogous to the difference between iteratively sharpening an image by re-scanning a printed copy vs. iteratively sharpening the raw pixel data. The first approach degrades with each iteration. The second can converge to something genuinely better.

The inner-outer loop training design also matters more than it might seem. The inner loop establishes a common “language” between agent pairs - ensuring that Agent A’s hidden states are interpretable by Agent B. The outer loop then optimizes what gets said in that language - finding the communication patterns that maximize end-to-end task performance. Separating these concerns makes the optimization tractable.


Limitations

The paper is honest about its boundaries, and they’re worth noting:

  • Model scale: All experiments use models under 10B parameters. Whether RecursiveLink scales to 70B+ or frontier-scale models is unknown.
  • White-box requirement: You need access to model internals (hidden states, gradients). This rules out API-only models like GPT-4 or Claude, which is a significant practical limitation.
  • Labeled data dependency: The training procedure requires supervised data with ground-truth answers. Extending to open-ended generation or RLHF-style optimization isn’t addressed.
  • Hyperparameter sensitivity: The optimal latent thought length $m$, recursion depth $r$, and collaboration pattern vary by task. There’s no automatic selection mechanism.
  • No open-ended evaluation: All benchmarks are closed-form (math, QA, code). How RecursiveMAS performs on creative writing, open-ended reasoning, or multi-turn dialogue is unexplored.

Takeaways

1. Text is the bottleneck. The biggest performance limitation of current multi-agent LLM systems isn’t the models - it’s the communication medium. Replacing text with latent-space communication removes both the information bottleneck and the optimization bottleneck in one move.

2. 0.31% parameters, massive gains. RecursiveLink adds just 13.12M trainable parameters on top of frozen LLMs. The gains - +8.3% average accuracy, 2.4x speed, 75.6% token reduction - are disproportionate to the parameter budget. This is the hallmark of fixing an architectural bottleneck rather than throwing more compute at the problem.

3. End-to-end optimization of multi-agent systems is now possible. The inner-outer loop training procedure with continuous latent communication enables, for the first time, true gradient-based co-optimization of heterogeneous LLM agents. This opens a door that text-based systems keep firmly shut.

4. Recursive depth is a new scaling axis. Instead of scaling model size or training data, you can scale recursion depth at inference time. The complementary scaling between training and inference depth suggests this axis hasn’t been fully explored yet.

5. The white-box constraint is the real limitation. RecursiveMAS requires access to model internals, which means it can’t be applied to proprietary API models. As open-weight models continue to improve, this limitation matters less - but for now, it defines the framework’s deployment boundary.


Sources and materials:

📄 Recursive Multi-Agent Systems - arXiv 2604.25917

📄 Project page - recursivemas.github.io