Recursive Language Models
Scaling LLM context windows by treating prompts as variables in an external environment
Despite rapid advances in reasoning and tool use, modern language models still struggle with long contexts. Even frontier models like GPT-5 exhibit “context rot” — performance degradation as prompts get longer. Researchers Alex L. Zhang, Tim Kraska, and Omar Khattab from MIT CSAIL propose a solution: Recursive Language Models (RLMs).
The core insight is deceptively simple: don’t feed long prompts directly into the neural network. Instead, treat them as part of an external environment the LLM can programmatically interact with.
Paper: https://arxiv.org/abs/2512.24601
How RLMs Work
An RLM exposes the same interface as a standard LLM — it accepts a string prompt and produces a string response. But internally:
- The prompt is loaded as a variable inside a Python REPL environment
- The LLM writes code to peek into, decompose, and analyze the prompt
- The LLM can recursively call itself (or smaller sub-models) over snippets
- Results are aggregated to produce the final answer
# Conceptual example of what the RLM does internally.
# load_prompt_as_variable and llm_query are illustrative names.
context = load_prompt_as_variable()  # 10M+ tokens stored outside the context window

# The LLM then writes code like this:
chunk = context[0:100000]  # peek at a slice of the prompt
answer = llm_query(f"Find the magic number in: {chunk}")  # recursive sub-call
print(answer)  # output is fed back to the root model
This decouples input length from the model’s physical context window limit.
Key Results
The researchers evaluated RLMs on four benchmarks of varying complexity:
| Benchmark | Task Type | Input Size | RLM Performance |
|---|---|---|---|
| S-NIAH | Needle-in-haystack | Up to 2^18 (~262K) tokens | Maintains accuracy at scale |
| BrowseComp-Plus | Multi-hop QA | 6-11M tokens | 91.33% (vs 70.47% summary agent) |
| OOLONG | Linear aggregation | 131K tokens | 56.50% (vs 44% base model) |
| OOLONG-Pairs | Quadratic reasoning | 32K tokens | 58% F1 (vs ~0% base model) |
The performance gains are substantial: RLMs handle inputs up to 100x longer than the model's context window while keeping costs comparable or lower.
Emergent Behaviors
Without explicit training, RLMs develop interesting strategies:
Filtering via code execution: the model uses regex and keyword searches, guided by its own priors, to narrow down relevant sections before reading them.
# Example adapted from an actual RLM trajectory
import re

# finditer (not findall) yields match objects with character offsets
matches = re.finditer(r'festival|La Union', context)
relevant_chunks = [context[max(0, m.start() - 500):m.end() + 500] for m in matches]
Recursive chunking: the model breaks the context into pieces, calls a sub-LLM on each chunk, then aggregates the results.
# RLM chunking strategy
chunk_size = 100_000
sections = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]
answers = []
for section in sections:
    answers.append(llm_query(f"Extract relevant info from: {section}"))
final = llm_query(f"Synthesize these findings: {answers}")
Variable stitching: for long-output tasks, the model stores sub-LM outputs in variables and combines them programmatically rather than generating everything at once.
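A minimal sketch of what this can look like, reusing the illustrative llm_query helper from above (the specific prompts and split point are invented):

# Variable stitching: sub-LM outputs live in REPL variables and are
# concatenated in code, so no single call has to generate the whole output.
mid = len(context) // 2
part_a = llm_query(f"Summarize the first half:\n{context[:mid]}")
part_b = llm_query(f"Summarize the second half:\n{context[mid:]}")
final_report = part_a + "\n\n" + part_b  # stitched programmatically
print(final_report)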
Cost Analysis
A key finding: RLMs can be cheaper than alternatives despite their complexity.
- The median RLM run often costs less than a comparable base-model run
- On BrowseComp-Plus, the theoretical cost of GPT-5-mini ingesting 6-11M tokens directly would be $1.50-$2.75 (see the check below)
- RLM(GPT-5) averages $0.99 while outperforming summarization baselines by 29%
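As a back-of-envelope check, the quoted range is consistent with an input rate of roughly $0.25 per million tokens; that rate is an inference from the figures above, not a number stated in the paper.

# Assumed rate of ~$0.25 per 1M input tokens, inferred from the quoted range
rate_per_million = 0.25
for millions in (6, 11):
    print(f"{millions}M tokens -> ${millions * rate_per_million:.2f}")  # $1.50, $2.75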
The tradeoff: high variance in trajectory lengths means some runs are significantly more expensive.
Limitations
The authors note several areas for improvement:
- Models not explicitly trained as RLMs make suboptimal decisions
- Different models exhibit different behaviors (Qwen3-Coder makes many more sub-calls than GPT-5)
- Synchronous sub-calls create latency; async implementations could help (see the sketch after this list)
- A max recursion depth of 1 may limit performance on deeply nested tasks
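On the latency point, here is a hedged sketch of what an async implementation could look like, assuming a hypothetical awaitable llm_query_async wrapper (none of this is from the paper):

import asyncio

# Fan sub-calls out concurrently instead of waiting on each one in turn.
# llm_query_async is a hypothetical awaitable wrapper around a chat API.
async def map_chunks(chunks, llm_query_async):
    tasks = [llm_query_async(f"Extract relevant info from: {c}") for c in chunks]
    return await asyncio.gather(*tasks)  # sub-calls run concurrently

# Usage: results = asyncio.run(map_chunks(chunks, llm_query_async))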
Why This Matters
As LLMs get deployed for long-horizon tasks processing tens or hundreds of millions of tokens, context management becomes critical. RLMs offer a general-purpose solution that:
- Works with existing models (no retraining required)
- Scales to arbitrary input lengths
- Maintains cost efficiency
- Enables the model to actively manage its own context
The researchers believe training models explicitly as RLMs — teaching them to reason about context management through reinforcement learning — could be the next major breakthrough for long-horizon agents.