Recursive Language Models
Scaling LLM context windows by treating prompts as variables in an external environment
Despite rapid advances in reasoning and tool use, modern language models still struggle with long contexts. Even frontier models like GPT-5 exhibit “context rot” — performance degradation as prompts get longer. Researchers Alex L. Zhang, Tim Kraska, and Omar Khattab from MIT CSAIL propose a solution: Recursive Language Models (RLMs).
The core insight is deceptively simple: don’t feed long prompts directly into the neural network. Instead, treat them as part of an external environment the LLM can programmatically interact with.
Paper: https://arxiv.org/abs/2512.24601
How RLMs Work
An RLM exposes the same interface as a standard LLM — it accepts a string prompt and produces a string response. But internally:
- The prompt is loaded as a variable inside a Python REPL environment
- The LLM writes code to peek into, decompose, and analyze the prompt
- The LLM can recursively call itself (or smaller sub-models) over snippets
- Results are aggregated to produce the final answer
# Conceptual example of what the RLM does internally.
# load_prompt_as_variable and llm_query are illustrative names.
context = load_prompt_as_variable()  # 10M+ tokens stored outside the context window

# The LLM then writes code like this:
chunk = context[0:100000]  # peek at a slice of the prompt
answer = llm_query(f"Find the magic number in: {chunk}")  # recursive sub-call
print(answer)  # output is fed back to the root model
This decouples input length from the model’s physical context window limit.
Key Results
The researchers evaluated RLMs on four benchmarks of varying complexity:
| Benchmark | Task Type | Input Size | RLM Performance |
|---|---|---|---|
| S-NIAH | Needle-in-haystack | Up to 2^18 (~262K) tokens | Maintains accuracy at scale |
| BrowseComp-Plus | Multi-hop QA | 6-11M tokens | 91.33% (vs 70.47% summary agent) |
| OOLONG | Linear aggregation | 131K tokens | 56.50% (vs 44% base model) |
| OOLONG-Pairs | Quadratic reasoning | 32K tokens | 58% F1 (vs ~0% base model) |
The performance gains are substantial: RLMs handle inputs up to 100x longer than the model's context window while keeping costs comparable or lower.
Emergent Behaviors
Without explicit training, RLMs develop interesting strategies:
Filtering via code execution: the model uses regex and keyword searches, guided by its own priors, to narrow down relevant sections before reading them.
# Example adapted from an actual RLM trajectory
import re

# finditer (not findall) yields match objects with character offsets
matches = re.finditer(r'festival|La Union', context)
relevant_chunks = [context[max(0, m.start() - 500):m.end() + 500] for m in matches]
Recursive chunking: the model breaks the context into pieces, calls a sub-LLM on each chunk, then aggregates the results.
# RLM chunking strategy
chunk_size = 100_000
sections = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]
answers = []
for section in sections:
    answers.append(llm_query(f"Extract relevant info from: {section}"))
final = llm_query(f"Synthesize these findings: {answers}")
Variable stitching: for long-output tasks, the model stores sub-LM outputs in variables and combines them programmatically rather than generating everything at once.
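A minimal sketch of what this can look like, reusing the illustrative llm_query helper from above (the specific prompts and split point are invented):

# Variable stitching: sub-LM outputs live in REPL variables and are
# concatenated in code, so no single call has to generate the whole output.
mid = len(context) // 2
part_a = llm_query(f"Summarize the first half:\n{context[:mid]}")
part_b = llm_query(f"Summarize the second half:\n{context[mid:]}")
final_report = part_a + "\n\n" + part_b  # stitched programmatically
print(final_report)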
Cost Analysis
A key finding: RLMs can be cheaper than alternatives despite their complexity.
- The median RLM run often costs less than a comparable base-model run
- On BrowseComp-Plus, the theoretical cost of GPT-5-mini ingesting 6-11M tokens directly would be $1.50-$2.75 (see the check below)
- RLM(GPT-5) averages $0.99 while outperforming summarization baselines by 29%
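As a back-of-envelope check, the quoted range is consistent with an input rate of roughly $0.25 per million tokens; that rate is an inference from the figures above, not a number stated in the paper.

# Assumed rate of ~$0.25 per 1M input tokens, inferred from the quoted range
rate_per_million = 0.25
for millions in (6, 11):
    print(f"{millions}M tokens -> ${millions * rate_per_million:.2f}")  # $1.50, $2.75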
The tradeoff: high variance in trajectory lengths means some runs are significantly more expensive.
Limitations
The authors note several areas for improvement:
- Models not explicitly trained as RLMs make suboptimal decisions
- Different models exhibit different behaviors (Qwen3-Coder makes many more sub-calls than GPT-5)
- Synchronous sub-calls create latency; async implementations could help (see the sketch after this list)
- A max recursion depth of 1 may limit performance on deeply nested tasks
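On the latency point, here is a hedged sketch of what an async implementation could look like, assuming a hypothetical awaitable llm_query_async wrapper (none of this is from the paper):

import asyncio

# Fan sub-calls out concurrently instead of waiting on each one in turn.
# llm_query_async is a hypothetical awaitable wrapper around a chat API.
async def map_chunks(chunks, llm_query_async):
    tasks = [llm_query_async(f"Extract relevant info from: {c}") for c in chunks]
    return await asyncio.gather(*tasks)  # sub-calls run concurrently

# Usage: results = asyncio.run(map_chunks(chunks, llm_query_async))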
Why This Matters
As LLMs get deployed for long-horizon tasks processing tens or hundreds of millions of tokens, context management becomes critical. RLMs offer a general-purpose solution that:
- Works with existing models (no retraining required)
- Scales to arbitrary input lengths
- Maintains cost efficiency
- Enables the model to actively manage its own context
The researchers believe training models explicitly as RLMs — teaching them to reason about context management through reinforcement learning — could be the next major breakthrough for long-horizon agents.