Recursive Language Models: Scaling LLM Contexts via Symbolic Recursion - FeynmanWiki

MACHINE LEARNING, LLMS - 45 MIN READ

Recursive Language Models: Scaling LLM Contexts via Symbolic Recursion

1. The Long-Context Wall: Context Windows and Context Rot

The promise of large language models is, in large part, a promise about length. We dream of handing a model a full legal casebook, a complete software codebase, or the unedited transcript of a days-long meeting and having it reason across every detail without losing the thread. But step back from the demos that cherry‑pick short‑text tasks, and a stubborn reality emerges: today’s autoregressive transformers begin to break down exactly when contextual breadth matters most. This breakdown is not a single failure mode but two overlapping ones—a hard context wall and a soft context rot—that together define a practical ceiling for reliable long‑context reasoning.
First, consider the hard boundary. Every standard transformer-based LLM is trained and deployed with a maximum sequence length, often 2k, 4k, 8k, or, in some recent scaling endeavors, 32k or 128k tokens. This number is dictated by the quadratic memory cost of self-attention, by the positional encoding scheme, and by the inevitable cliffs in training where models simply never see longer spans. Tokens that fall outside this context window are discarded outright; the model has no vision of them, no matter how crucial they might be. In a generative setting, this means the prompt must be truncated, often arbitrarily, before inference even begins. The information beyond the cutoff is, quite literally, a void. This is the long-context wall: a rigid, architectural ceiling that cannot be scaled away simply by adding more compute, because the hardware and algorithmic costs grow as $\mathcal{O}(L^2)$ with sequence length $L$.
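To make the quadratic growth concrete, here is a back-of-the-envelope sketch. It assumes a naive implementation that materializes the full attention-score matrix, with 32 heads and 16-bit values; both numbers are illustrative assumptions, not figures from the article. Memory-efficient kernels can avoid storing the full matrix, but the amount of computation still scales with the square of the sequence length.

```python
def attention_score_bytes(seq_len: int, num_heads: int = 32, bytes_per_value: int = 2) -> int:
    """Bytes needed to hold one layer's full (seq_len x seq_len) score matrices."""
    # One score per (query, key) pair, per head: this is the quadratic term.
    return num_heads * seq_len * seq_len * bytes_per_value


for seq_len in (4_096, 32_768, 131_072):
    gib = attention_score_bytes(seq_len) / 2**30
    print(f"L = {seq_len:>7,} tokens -> {gib:,.0f} GiB of scores per layer")
```

Under these assumptions, going from 4k to 128k tokens multiplies the sequence length by 32 and the score memory by 1,024.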
Yet even when a sequence fits comfortably inside the window, a quieter decay erodes the model’s effective grip on early information. This is context rot. Imagine feeding a model a 4k‑token historical narrative where a critical clue appears in the first paragraph. As the model generates the next 3,500 tokens, attention weights become diluted across thousands of intermediate positions; the softmax distribution over all keys forces every new query to spread its focus thinly. Early tokens, though technically still attended to, receive a vanishing fraction of the total attention mass. Empirically, this manifests as the “lost‑in‑the‑middle” phenomenon: given a list of facts, models recall the beginning and the end well, but the middle—and, more insidiously, the early facts in a long, later‑generated continuation—fades. The model might contradict itself, hallucinate replacements, or simply stop referencing the once‑salient detail. The semantic horizon—the longest distance over which a token can reliably influence a future prediction—is often far shorter than the nominal window size.
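The dilution argument can be illustrated with a toy calculation. The sketch below assumes a single attention head in which one early token has a fixed logit advantage over all other keys, whose logits are all zero. Real models learn their logits, so this shows only the softmax arithmetic, not measured behavior.

```python
import math


def attention_on_first_token(seq_len: int, logit_advantage: float) -> float:
    """Softmax weight on token 0 when its logit exceeds every other logit by a fixed margin."""
    boosted = math.exp(logit_advantage)
    # The other seq_len - 1 keys have logit 0, so each contributes exp(0) = 1.
    return boosted / (boosted + (seq_len - 1))


for seq_len in (100, 1_000, 4_000):
    weight = attention_on_first_token(seq_len, logit_advantage=3.0)
    print(f"{seq_len:>5} keys -> weight on the early token = {weight:.4f}")
```

With the margin held constant, the early token's share of attention shrinks roughly in proportion to the number of competing keys.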
Why doesn’t the sheer capacity of attention heads prevent this? In principle, an attention head can assign high weight to a single early token even in a long sequence. In practice, however, the model has rarely been trained on tasks that demand such sustained, pinpoint long-range recall while simultaneously managing the dense, varied dependencies of natural language. The optimization landscape encourages efficient, local, slot-based attention patterns; spikes to ancient tokens occur only when heavily reinforced, and they are fragile. Moreover, sine-cosine or rotary position encodings tend to produce similarity scores that fade with relative distance, and even learned position embeddings often generalize poorly beyond the lengths seen at training time. The result is a gradual yet relentless erosion of signal integrity: a perfectly memorized detail becomes a dim, corrupted memory that downstream layers misinterpret.
This dual barrier—a hard wall that chops sequences and a soft rot that corrupts the early chunk—redefines what it means to understand a long document. When an LLM appears to track a plot across many pages, it is often relying on shallow, recency‑biased heuristics or on the statistical redundancy of the text, not on a faithful, end‑to‑end cognitive representation. Tasks that genuinely require integrating premises from the start and the end of a 10k‑token prompt—multi‑hop reasoning, long‑term planning, consistent character modeling—fall apart as the effective context length exceeds a few thousand tokens. The failures are not catastrophic in the sense of a hard error; they are insidious, producing plausible‑sounding nonsense that a careful reader would spot only by comparing the output to the original, now‑forgotten material.
The visual below, titled “The Long‑Context Wall: Context Windows and Context Rot”, consolidates this two‑part challenge in a single diagrammatic canvas. A long strip of tokens stretches from left to right; a translucent box marks the hard boundary of the context window, beyond which tokens become invisible. Inside the window, a gradient of color and fading arrows indicates how the effective relevance of early tokens decays even while they remain technically inside the visible region. The sketchy hand‑drawn aesthetic underscores that these are conceptual—and yet painfully empirical—limitations. The diagram makes tangible the intuition that scaling context is not just a matter of widening a rectangle: it demands a structural rethink that prevents the window from being both a cliff at the edge and a slope toward irrelevance on the inside.
Understanding this wall is the necessary prelude to evaluating the common scaffolds—prompt chaining, retrieval‑augmented generation, and memory buffers—that attempt to patch over it. Those approaches, as we will see next, address the hard cutoff but leave the deeper problem of context rot largely untouched. And that insight is what motivates the search for a genuinely recursive language model, one that does not just remember longer, but remembers strongly across arbitrarily far separations.

2. Existing Scaffolds and Why They Fall Short

Even when a prompt $P$ ostensibly fits within the model’s maximum context window $K$, the quality of reasoning can still degrade, a phenomenon we unpacked in the previous section under the names context rot and lost-in-the-middle. In response, practitioners have assembled a variety of inference-time scaffolds that try to sidestep this decay by breaking the prompt into pieces, compressing it, or outsourcing parts of the reasoning. These scaffolds appear to offer a way around the hard window limit, but a closer look reveals that they merely rearrange the bottleneck without eliminating it. Every one of them is fundamentally upper-bounded by the context window $K$ of the underlying model $M$.
Three broad families of task‑agnostic scaffolding dominate the landscape. Understanding their internal limits explains why none can truly handle the dense, arbitrary‑length processing that many real‑world tasks demand.
Lossy Compaction methods (think MemWalker or ReSum) work by iteratively summarizing or truncating earlier portions of the prompt. An agent reads a chunk, produces a compressed representation, then feeds that summary along with new context to the next processing step. This saves tokens, but at the cost of discarding fine-grained information. For tasks like OOLONG, where every single line of $P$ carries non-trivial dependencies across the entire document, the compaction becomes catastrophic: a dropped detail can sever multiple long-range relationships. Even a cleverly trained summary model remains lossy, and the overall reasoning fidelity is bounded by the summarization quality, which stays $\ll 100\%$.
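The failure mode can be reproduced with a deliberately crude sketch. Here `toy_summarize` is a hypothetical stand-in for an LLM summarizer that simply keeps the most recent words; it is not how MemWalker or ReSum work, and the data is made up. It only shows how a fixed-size running memory loses an early detail.

```python
def toy_summarize(text: str, max_words: int) -> str:
    """Hypothetical stand-in for a summarizer: keep only the last max_words words."""
    words = text.split()
    return " ".join(words[-max_words:])


def compact(chunks: list[str], max_words: int) -> str:
    """Fold chunks left to right, re-compressing the running memory at each step."""
    memory = ""
    for chunk in chunks:
        memory = toy_summarize(memory + " " + chunk, max_words)
    return memory


chunks = ["user_17 likes tea"] + [f"filler line {i}" for i in range(50)]
memory = compact(chunks, max_words=12)
print("user_17" in memory)  # False: the early detail did not survive compaction
```

A learned summarizer would choose what to keep more wisely, but any memory of bounded size must drop something once the input carries more independent details than the memory can hold.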
Retrieval-Augmented Agents (e.g., CodeAct plus a BM25 retriever) instead keep the full prompt in an external index and insert only the most relevant snippets into $M$’s context when making a decision. While this looks like unbounded access to a large corpus, the model’s working memory still receives a finite batch of retrieved chunks; each chunk occupies precious window space. When the total volume of genuinely relevant material exceeds $K$, the window overflows. The agent must either omit crucial pieces or rely on iterative, noisy retrieval loops that cannot fuse all necessary information in one forward pass. In the same OOLONG-Pairs setting, the retrieval mechanism collapses as soon as the set of query-relevant spans outgrows the model’s capacity to hold them simultaneously.
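The overflow is simple bookkeeping, sketched below under stated assumptions: chunks arrive already ranked by a retriever, cost is counted in whitespace-separated words rather than real tokens, and the helper name `fit_into_window` is hypothetical.

```python
def fit_into_window(chunks: list[str], K: int) -> tuple[list[str], list[str]]:
    """Greedily pack ranked chunks into a K-word window; return (kept, dropped)."""
    kept, dropped, used = [], [], 0
    for chunk in chunks:
        cost = len(chunk.split())
        if used + cost <= K:
            kept.append(chunk)
            used += cost
        else:
            dropped.append(chunk)
    return kept, dropped


relevant = [f"relevant fact number {i}" for i in range(10)]  # 4 words each
kept, dropped = fit_into_window(relevant, K=20)
print(len(kept), len(dropped))  # 5 5
```

Every chunk here is relevant, yet half of them never reach the model. A better ranking function changes which half is lost, not whether one is.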
Sub-Agent Delegation techniques (THREAD, AgentFold) decompose the original task into a chain of sub-calls, where each sub-agent’s output feeds into the next. This sequential decomposition sidesteps the need to hold all of $P$ at once, but it creates a new dependency: the total amount of computation is bounded by the model’s generation budget (typically tied to $K$ as well). To process a prompt of length $|P|$ with dense pair-wise interactions, you would need to launch $\Omega(|P|)$ sub-tasks, but the generation limit prevents launching more than a fixed number before the earliest outputs are forgotten. The delegation chain cannot scale with the prompt; it flattens into a constant-depth procedure that leaves most cross-dependencies unresolved.
Empirical reality bears out these analytical bounds. On the dense, long-range OOLONG-Pairs task, where the correct answer hinges on comparing every pair of lines, none of the above baselines manages more than 30% F1 even when the prompts fall within the 8K–272K token range, and they are completely inapplicable to inputs beyond $K$ (e.g., 10M+ tokens). The failure is not a matter of insufficient engineering; it is a consequence of architectures that remain tethered to a single, finite context glimpse.
To fix these ideas, the visual below (Figure 2) arranges the three scaffold families beneath a single, unyielding horizontal barrier labeled “$K$: model context bound”, making the shared limitation instantly legible. In the left column, a Lossy Compaction icon shows a document being squeezed into a box, with a red X striking through the granular list that cannot survive the compression. The middle column depicts a Retrieval-Augmented Agent: a magnifying glass pulls snippets from a database, but the model window overflows and turns red, unable to accommodate all retrieved material. On the right, a chain of Sub-Agent Delegation boxes marches forward until a bold limit marker cuts the chain, preventing the launch of the many sub-tasks that a dense prompt would require. Beneath these columns, an annotation quietly anchors the whole argument: On OOLONG-Pairs all ≤30% F1; impossible beyond $K$.
The diagram reinforces what the analysis already makes clear: no amount of orchestration outside the model can overcome the intrinsic memory ceiling of a single forward pass when the task demands that every token be processed densely at least once. As long as the model treats the prompt as a static input to be consumed within a single, linear ordering, we remain captives of $K$. This realization sets the stage for the next section, where we will explore the foundational shift that truly dissolves the long-context wall: redefining the prompt not as a fixed, finite sequence but as part of a dynamic environment that the model can revisit recursively.

3. Core Insight: Treat the Prompt as Part of the Environment

The previous section explored why existing scaffolds (lossy compaction, retrieval-augmented agents, and sub-agent delegation) still leave the fundamental problem unsolved: the entire prompt, or at least its most salient chunk, must eventually pass through the finite capacity of the language model’s context window. This is an architectural glass ceiling, not a mere engineering inconvenience. If the raw text of The Iliad needs to be reasoned over, a model with a 128K context will either truncate it, compress it lossily, or process it in disjoint sliding windows that break long-range dependencies. All these workarounds are symptoms of a single assumption: that the prompt is something the model reads.
The Recursive Language Model (RLM) design removes that assumption entirely. Instead of forcing the prompt $P$ into the model’s neural history, we treat $P$ as part of the environment: a persistent, external data structure that lives in a live programming runtime (a REPL), not in the model’s ephemeral attention graph. The model never “sees” $P$ directly. What it receives is a minimal, constant-size $\texttt{Metadata}(P)$ containing nothing more than the prompt’s length and a high-level API describing how to access it. From that point onward, the model interacts with $P$ by writing and executing code, just as a programmer inspects a large file using Python’s open, read, seek, and slicing operations, without ever needing to hold the entire file in working memory.
This shift recasts the LLM as an agent that coordinates data processing through a persistent stateful environment. The environment $E$ stores $P$ as an ordinary variable (say, a list of strings or a numerical array) and exposes it to Python code generated by the model $M$. The model can issue commands like len(P), P[10000:11000], or even more elaborate loops that aggregate counts, search for patterns, or compute embeddings on subslices. Critically, the results of these code executions come back to $M$ as trimmed outputs: only the small, structured pieces of information that the model explicitly requested, stripped of the original prompt’s raw bulk. So at no point does the model’s context window, limited to $K$ tokens, ever contain more than the current code snippet, a few variable assignments, and these trimmed replies. The full prompt $P$, which can be arbitrarily large ($|P| \gg K$), stays safely outside the attention mechanism.
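A minimal sketch of this interaction pattern follows. The helper `run_in_env`, the character-based trimming limit, and the synthetic contents of `P` are all assumptions made for illustration; the article does not specify an implementation.

```python
import contextlib
import io


def run_in_env(env: dict, code: str, max_chars: int = 200) -> str:
    """Execute model-written code against the environment and return trimmed stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, env)  # illustration only: real systems should sandbox this
    return buffer.getvalue()[:max_chars]


# The prompt lives in the environment as an ordinary variable named P.
env = {"P": [f"line {i}: value={i * i}" for i in range(100_000)]}

print(run_in_env(env, "print(len(P))"))          # a constant-size fact about P
print(run_in_env(env, "print(P[10000:10002])"))  # a small, explicitly requested slice
print(len(run_in_env(env, "print(P)")))          # even a full dump comes back trimmed
```

Whatever the code does inside the environment, the text that travels back toward the model never exceeds `max_chars`.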
That same code-execution feedback loop also enables recursive self-invocation. The model living in $E$ can programmatically call a copy of itself, denoted sub_RLM_M, on a subslice $P_i$, passing along a subquery and receiving a concise answer. Those sub-answers accumulate in the environment’s variables, and eventually the master invocation assembles the final answer $Y$ from these REPL-side results. This is not mere recursion for its own sake; it is what allows the system to tackle problems that demand a global structuring of computation. For instance, sorting a million entries in $P$ becomes possible by coding a divide-and-conquer algorithm that recursively invokes sub_RLM_M on halves, then merging, all without ever serializing the entire list into the model’s direct input.
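The divide-and-conquer pattern can be sketched as follows. Here `sub_rlm_sort` is a hypothetical stand-in for a sub_RLM_M call carrying the subquery "sort this slice", and `LEAF_SIZE` is an assumed threshold. In a real system the leaf would be a bounded model call rather than Python’s built-in `sorted`.

```python
import heapq

LEAF_SIZE = 4  # assumed: the largest slice a single sub-call handles directly


def sub_rlm_sort(entries: list[int]) -> list[int]:
    """Hypothetical stand-in for sub_RLM_M with the subquery 'sort this slice'."""
    if len(entries) <= LEAF_SIZE:
        return sorted(entries)  # leaf: small enough for one bounded call
    mid = len(entries) // 2
    left = sub_rlm_sort(entries[:mid])     # recursive sub-call on the first half
    right = sub_rlm_sort(entries[mid:])    # recursive sub-call on the second half
    return list(heapq.merge(left, right))  # merging happens in ordinary code


P = [9, 3, 7, 1, 8, 2, 6, 4, 5, 0]
print(sub_rlm_sort(P))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

No single call ever receives more than `LEAF_SIZE` entries to reason about directly, while the merged result is held in environment variables.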
The consequences are threefold, and each addresses a fundamental scaling limitation:
Unbounded input: Because $P$ never enters the context, there is no restriction on $|P|$ other than the storage of the REPL environment. The effective prompt length can grow to millions of tokens, far beyond any feasible context window.
Unbounded output: The final answer $Y$ is assembled from variables stored in $E$, not generated token-by-token under the model’s generation-length constraints. The model only needs to emit the final assembly command; the heavy lifting has already been done in the environment.
Large semantic horizon: Through explicit loops and recursive calls, the model can orchestrate operations that touch every part of $P$ multiple times, yielding a computational complexity of $\Omega(|P|)$ or even $\Omega(|P|^2)$ for problems like all-pairs comparison, well beyond the linear or quadratic-in-context costs of vanilla transformers.
The diagram below distills this contrast into a single side-by-side comparison. On the left, the naive scaffold attempts to funnel the entire prompt $P$ directly into the LLM box, only to hit the red barrier of context limit $K$; the model’s history is polluted with the raw text and quickly overflows. On the right, the RLM architecture shows $P$ living entirely within a green REPL box, while the blue model $M$ receives only the tiny yellow $\texttt{Metadata}(P)$ nugget. Arrows between the model and the REPL are labeled “constant-size” or “trimmed,” emphasizing that every exchange stays bounded irrespective of $|P|$. A recursive arrow from the REPL back to $M$ captures the sub_RLM_M mechanism, and the final answer $Y$ emerges from the REPL’s Final variable rather than from the model’s generative decoder. This diagram is not merely an illustration; it is the architectural blueprint that turns a long-standing impossibility, reasoning over unbounded text, into a concrete, implementable protocol.

4. RLM Formal Interface

Building on the previous insight, that we can escape the tyranny of the finite context window by embedding the model inside a persistent, stateful environment, we now need a precise operational contract that describes how such a system functions. This contract is the Recursive Language Model (RLM) formal interface. It acts as a blueprint that separates what the outside world sees (a standard text-in/text-out function) from what happens inside (a bounded-context model cycling through a stateful loop). The definition is deceptively simple, yet it guarantees that the base model $M$ is never asked to process a token sequence longer than its fixed capacity $K$, no matter how vast the original prompt or how long the final answer becomes.
Definition (Recursive Language Model, RLM). Given a base model $M$ with token capacity $K$, an RLM is a function
$$\texttt{RLM}_M : \Sigma^* \to \Sigma^*$$
that uses $M$ only as a code-generating subroutine. The critical constraint is: at every call, the input presented to $M$ is guaranteed to be at most $K$ tokens, regardless of the length of the user-supplied prompt $P$ or the accumulated output so far. To the outside consumer, the RLM presents the familiar LLM interface (a prompt $P \in \Sigma^*$ goes in, and a response $Y \in \Sigma^*$ comes out), but internally it orchestrates a series of bounded interactions that together simulate unbounded reasoning.
Three internal components keep this orchestration sound.
A persistent REPL environment $E$ that holds mutable variables, imported functions, and any piece of state the model chooses to maintain across steps. It is the analogue of the “context” that a standard model would flatten into a single long prompt, but here it lives as a structured, executable namespace.
A built-in function $\texttt{sub\_RLM}_M$ that allows the RLM to launch a recursive sub-call. This is the primitive that enables decomposition of a large task into smaller, self-similar subtasks, each again bounded by the same token limit.
A special variable $\texttt{Final}$ whose existence in the environment signals completion; its value at that moment becomes the overall output $Y$.
The operational loop is a tight read–evaluate–print cycle executed while $\texttt{Final}$ remains unbound. It begins with Init: the REPL environment is seeded from the user prompt $P$, the sub-call function is installed, and metadata about the current state (variable types, available functions, etc.) is computed. This metadata will later be used to compactly inform the model about the environment without dumping its entire contents. Then the iterative Loop starts:
Generation. The base model $M$ is invoked with a fixed per-step token budget $c$. Its output is a string of code, a fragment that the REPL environment can evaluate. Because the input to $M$ is carefully trimmed (the history $\textit{hist}$ always respects the bound $K$), the model never sees more tokens than it can handle.
Execution. The generated code is evaluated inside the REPL, producing a new environment state $\textit{state}'$ and a standard-output stream $\textit{stdout}$. The model can use prints, variable assignments, or calls to $\texttt{sub\_RLM}_M$ to advance its computation.
History update. To maintain the invariant for the next iteration, the system constructs a fresh history string: it concatenates the previous history (which encodes the narrative of the problem so far) with a compact metadata summary of the latest stdout and a trimmed version of the stdout itself. The trimming ensures that when stdout is vast, only the most salient portion (ranked by recency, semantic density, or explicit user-defined signals) is preserved. The result is a new history string $\textit{hist}$ that strictly satisfies $|\textit{hist}| \le K$.
Once the loop terminates, the value stored in the environment variable $\texttt{Final}$ is extracted and returned as the overall output $Y$. The model can choose to set $\texttt{Final}$ at any step; since that choice alone does not guarantee termination, a stop condition (a maximum number of iterations) can be imposed as a safety net. This design implements a form of bounded-memory computation where the model’s effective “working memory” is exactly the REPL environment, and the history string acts as a compressed, lossy externalization of past traces.
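The Init, Generation, Execution, and History-update steps described above can be sketched in a few lines. This is a simplified illustration, not the authors’ implementation: it counts characters instead of tokens, trims by recency only, omits the sub-call function, and replaces $M$ with a hypothetical `scripted_model` that returns hard-coded code strings.

```python
import contextlib
import io


def rlm(prompt: str, model, K: int = 400, c: int = 80, max_iters: int = 10) -> str:
    """Run the Init / Generation / Execution / History-update loop until Final is set."""
    env = {"P": prompt, "Final": None}  # Init: the prompt lives in the REPL
    hist = f"Metadata: P is a str of length {len(prompt)}"
    for _ in range(max_iters):          # safety net on the number of iterations
        code = model(hist)              # Generation: the model sees only hist
        out = io.StringIO()
        with contextlib.redirect_stdout(out):
            exec(code, env)             # Execution (unsandboxed: illustration only)
        if env["Final"] is not None:
            return env["Final"]
        stdout = out.getvalue()[:c]     # trim the feedback to the budget c
        hist = (hist + "\n" + stdout)[-K:]  # History update by recency
        assert len(hist) <= K           # the history invariant
    raise RuntimeError("Final was never set")


def scripted_model(hist: str) -> str:
    """Hypothetical stand-in for M: inspect first, then answer."""
    if "count=" not in hist:
        return "n = P.count('needle'); print(f'count={n}')"
    return "Final = f'needle appears {n} times'"


prompt = "hay " * 50_000 + "needle " + "hay " * 50_000 + "needle"
print(rlm(prompt, scripted_model))  # needle appears 2 times
```

The prompt here is about 400,000 characters long, yet the string handed to the model never exceeds `K = 400` characters. The variable `n` persists in the environment between iterations, which is what lets the second step build on the first.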
A crucial property that falls out of this interface is the history invariant:
$$|\textit{hist}| \le K \quad \text{at every iteration.}$$
It holds regardless of $|P|$ (the original prompt may be gigabytes long) and regardless of accumulated output length, because the prompt itself is never fed directly to $M$; only the tightly controlled history string, built from metadata and trimmed stdout, is ever sent. This invariant is what unbinds input length, output length, and semantic horizon simultaneously: the model is always given a bounded view of the past, yet the REPL environment retains the actual state without compression.
The visual that follows crystallises this formal interface in a single slide. It draws the function signature $\texttt{RLM}_M : \Sigma^* \to \Sigma^*$ prominently, reminding us that the RLM still behaves as a classical string transformer to the outside world. The left side lists the three internal components ($E$, $\texttt{sub\_RLM}_M$, and $\texttt{Final}$) with simple icons that evoke the REPL, recursion, and a “done” flag. The right side shows the looping pipeline: from Init, through generation by $M$ with a fixed budget, to REPL execution, to the history update, and back to generation while $\texttt{Final}$ remains unset. A coloured callout anchors the invariant $|\textit{hist}| \le K$ beside the history update step, making the crucial guarantee impossible to miss. Together, the diagram transforms the textual definition into an operational blueprint, one that the rest of the lecture will rely on as we explore training strategies, emergent behaviors, and the concrete REPL environment that makes this architecture work.

5. The REPL Environment: Variables and Sub-Calls

Having formalized the outer loop of a Recursive Language Model (a model generating code, receiving feedback, and repeating), we now need to look inside the environment that makes this loop capable of genuinely long-horizon reasoning. The key problem is that current LLMs cannot hold extremely long strings in context without degrading, and even when they can, their reasoning horizon remains bounded by the token capacity. The RLM resolves this with a persistent REPL environment that forces the root model $M$ to work symbolically, never exposing the raw full content of large data structures. The result is a system where the total work performed can dramatically exceed the model’s own context window, while the model itself only ever processes compressed metadata and manageable chunks.
At the heart of the architecture is a standard Python interpreter that lives for the duration of the task. This REPL holds three essential pieces of state:
context: a list of processed chunks, each of which can be referred to by index or variable.
buffers: a dictionary where intermediate results of arbitrary size can be stored and later retrieved by name.
Final: a special variable initially set to None. When the model’s generated code assigns a value to Final, the REPL terminates and that value becomes the final output $Y$ of the entire RLM call.
The REPL also exposes a built-in function sub_RLM_M(prompt_slice), which can be called arbitrarily often from the code that $M$ generates. This function instantiates a complete copy of the RLM, itself a full recursive language model, and returns its output as a string. In the experiments, the recursion depth is kept at one for simplicity, but the framework natively supports multi-level recursion.
The crucial design choice is that the root model $M$ never sees the full value of long strings. When the REPL returns data to $M$, for instance when passing back a chunk from context or a buffer value, it sends only a metadata representation: $\texttt{Metadata}(x) = (\text{len}(x), \text{optional\_short\_prefix})$. This might be something like (length=245812, prefix="The mitochondria is the powerhouse..."). The model’s entire reasoning window therefore stays uncluttered by the vast bulk of data. Instead, $M$ must think in terms of symbolic operations: it names variables, writes loops, applies conditionals, and delegates heavy lifting to sub-calls, without ever seeing the underlying strings except as terse summaries.
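A metadata function of this shape is easy to sketch. The prefix length of 40 characters and the trailing ellipsis are assumptions chosen for illustration; the article only specifies a length and an optional short prefix.

```python
def metadata(x: str, prefix_chars: int = 40) -> tuple[int, str]:
    """Return (len(x), short prefix) instead of the value itself."""
    prefix = x[:prefix_chars]
    if len(x) > prefix_chars:
        prefix += "..."  # mark that the value continues beyond the prefix
    return len(x), prefix


big = "The mitochondria is the powerhouse of the cell. " * 5_000
print(metadata(big))
print(metadata("abc"))  # (3, 'abc')
```

The returned tuple has roughly the same size whether `x` holds three characters or several hundred thousand, which is what keeps the root model’s view constant-size.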
The loop operates under a token budget per iteration. After the REPL executes the code that $M$ produces, the stdout from that execution is captured and trimmed to at most $c$ tokens before it is appended to the interaction history hist. This trimming prevents the model’s context from growing uncontrollably due to verbose prints, and it enforces a hard limit on the feedback that $M$ can use to plan the next step. Importantly, the REPL’s internal memory (context, buffers, and the variable namespace) is not affected by this trimming; only the text fed back to $M$ is clipped. Thus large-scale state can accumulate behind the scenes while the model’s view remains compact.
This symbolic indirection is what finally unbounds reasoning. Because $M$ cannot read giant strings, it naturally learns to write programs that iterate over data, maintain pointers, and invoke sub_RLM_M on manageable slices. The call sub_RLM_M can be used inside loops, for example when scanning through a long list of documents and spawning a sub-call per document. This enables $\Omega(|P|)$ sub-calls, that is, a number of sub-calls that grows at least linearly with the prompt length $|P|$, each handling a piece that fits comfortably within the sub-model’s own context. The total computational work becomes decoupled from the base model’s context limit: the root model orchestrates, while the sub-calls do the heavy lifting on demand.
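The kind of program the root model might emit can be sketched as follows. Here sub_rlm is a hypothetical stand-in for sub_RLM_M (it only counts words instead of running a language model), and the chunk size of 4000 characters is an arbitrary illustrative value.

```python
def sub_rlm(prompt_slice: str) -> str:
    """Stand-in for sub_RLM_M; a real call would run a full RLM on the slice."""
    return f"{len(prompt_slice.split())} words"


def chunks(text: str, size: int):
    """Yield consecutive slices of at most size characters."""
    for start in range(0, len(text), size):
        yield text[start:start + size]


context = "alpha beta gamma delta " * 1000  # 23,000 characters held in the REPL
buffers = [sub_rlm(piece) for piece in chunks(context, 4000)]
print(len(buffers))  # one sub-call per slice, so the count grows with len(context)
```

Doubling the length of context doubles the number of sub-calls, while each individual call still sees at most 4000 characters.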
Given a total token budget $K$ for the root model’s loop, the number of possible root iterations is bounded by $\lfloor K / c \rfloor$. In each iteration, the model can spawn arbitrarily many sub-calls, each itself an entire RLM invocation with its own context window. The lifetime of the REPL, therefore, is not limited by a single fixed input length; it is limited only by the total number of root reasoning steps, and each step can trigger a burst of parallel or sequential sub-problem solving.
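As a quick worked example of the iteration bound, the following sketch plugs in hypothetical values for $K$ and $c$ (the numbers are illustrative and not taken from the experiments). It treats the bound as an upper limit, since the code the model writes also consumes part of the budget.

```python
def max_root_iterations(K: int, c: int) -> int:
    """Upper bound floor(K / c) on root iterations, given at most c feedback tokens each."""
    if c <= 0:
        raise ValueError("c must be positive")
    return K // c


# Hypothetical budget: a 32,000-token root budget with 500 tokens of feedback per step.
print(max_root_iterations(32_000, 500))
```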
The visual below summarizes this environment. On the left, the root model $M$ generates code strings that flow into a central REPL box. Inside the REPL, the three persistent state components (context, buffers, Final) are visible, along with the highlighted built-in sub_RLM_M. The return path back to $M$ carries only the trimmed stdout, while a curly brace notes that $M$ receives solely Metadata(length, optional_short_prefix). The diagram also shows how sub_RLM_M can be invoked repeatedly (indicated by a loop icon), with an arrow leading to a copy of the full RLM box, illustrating recursion. At the bottom, the Final variable sends a termination signal, yielding the output $Y$. This picture, with the root iteration budget $\lfloor K/c \rfloor$ and the unbounded sub-call spawning capacity $\Omega(|P|)$ annotated, distills the entire mechanism into a single glance: the REPL environment is what turns a context-limited language model into a symbolic, unbounded reasoning engine.

6. Algorithm 1: Recursive Language Model Loop

With a stateful REPL environment at our disposal—one that can store variables, execute arbitrary code, and even spawn recursive language model calls—we can finally address the central puzzle: how to coordinate the base model, the REPL, and the user’s request so that the system can reason across arbitrarily long contexts without ever exceeding the model’s fixed window. This coordination is the job of a root scaffold, a control loop that sits above the LM and mediates every interaction. Algorithm 1 gives that loop a precise form, and understanding its design is crucial for seeing why naive attempts at “chain-of-thought on long documents” so often collapse into confusion or silent forgetting.
The fundamental challenge is that a standard transformer LM has a bounded context window, say $K$ tokens. If we want it to process a million-token legal contract or continue a novel over thousands of pages, we cannot simply feed it the entire history. Even if we could, the model’s attention would become diluted and its representations would drift from the original fine-tuning distribution. The insight behind the Recursive Language Model (RLM) is to offload unbounded state and computation into the REPL, while keeping the prompt seen by the LM limited to a constant-size metadata summary. In effect, the LM is no longer a text generator; it becomes a code generator, emitting instructions that read, write, and compute within the REPL. The root loop then executes those instructions, updates the persistent state, and feeds a tight digest back to the model.
Let’s step through the loop logic. The algorithm begins by initializing a fresh REPL state from the user’s prompt $P$. Critically, it also installs a special function sub_RLM_M that allows the current process to spawn a recursive RLM instance. This recursion is the mechanism that gives the system its name: when the root model needs to perform a sub-task that itself requires language-model reasoning (say, summarizing an embedded contract clause before reasoning about its implications), it can emit a call to sub_RLM_M with that clause as a prompt, and the new child RLM will run its own loop, returning only the final result. This prevents the root model’s context from being polluted with low-level details.
The history variable hist stores the interaction trace that the LM actually sees. Its contents are deliberately curated. Each iteration, the model produces a block of code, which is then executed in the REPL. The output stdout from that execution could be massive: a raw dump of a database query, for example. Sending all of it back to the model would quickly overwhelm its context window. Instead, the algorithm applies a Metadata function that retains only a constant-length prefix of the output; exactly how much is kept is governed by the budget parameter $c$, ensuring that hist never grows unboundedly. The loop then appends both the code and the trimmed metadata to the history. This careful bookkeeping means that the language model always sees a tightly bounded, highly informative snapshot: the code it just wrote, and the essential shape of the result, not the full data.
A key consequence of this design is that the LM’s input size is insulated from the true scale of the task. Whether the REPL is holding a multi-gigabyte dataset in its memory or has just executed a thousand recursive sub-calls, the root model only ever sees a fixed-size hist. That removes the naive tendency to rely on ever-growing context windows and forces the model to treat the REPL as its external, persistent memory. In iterative prompting frameworks that lack this pruning step, models quickly suffer from “context rot”: the earlier parts of the conversation become less attended to, and the model’s outputs degrade. The RLM scaffold avoids this by never letting raw intermediate results stack up inside the LM’s prompt.
The loop terminates when the REPL state signals finality—typically by setting a flag state[Final]. That flag could be triggered by the model itself emitting a special return statement or by a built-in REPL procedure that detects task completion. The final output is then extracted from the REPL state and returned to the user. Note that the return value itself might be large (a fully generated document, for instance), but the LM never needs to hold that entire string in context; it simply arranges for it to be assembled piece by piece inside the REPL, and the root scaffold hands it off when done.
The visual that follows distills Algorithm 1 into a clean pseudocode box, making the flow immediately legible. It shows line numbers 1–9, with the REPL initialization, the addition of the recursive sub‑RLM function, and the while True loop that drives the entire system. Below the box, bullet notes call attention to the three critical invariants: the constant‑size hist enforced by Metadata, the offloading of unbounded computation to the REPL, and the presence of sub_RLM_M as the recursion primitive. Together, these elements reinforce the main message: the root LM emits code, the REPL does the heavy lifting, and the LM’s own input remains compact. The diagram captures this architecture at a glance, complementing the more detailed walkthrough we have just completed.
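The loop walked through in this section can be sketched in Python. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: the names rlm, metadata, and toy_model are hypothetical, the "model" is a scripted stand-in that emits two fixed code snippets, lengths are counted in characters, and a max_steps guard replaces the unbounded while True loop.

```python
import contextlib
import io
from typing import Callable


def metadata(text: str, c: int) -> str:
    """Constant-size digest: total length plus a prefix of at most c characters."""
    return f"(length={len(text)}, prefix={text[:c]!r})"


def rlm(model: Callable[[list], str], prompt: str, c: int = 80, max_steps: int = 20) -> str:
    # The prompt lives in REPL state, never in the history itself.
    state = {"context": prompt, "Final": None}
    # Recursion primitive: a sub-call runs the same loop on a slice.
    state["sub_RLM_M"] = lambda piece: rlm(model, piece, c, max_steps)
    hist = [metadata(prompt, c)]  # the model only sees a digest of the prompt
    for _ in range(max_steps):    # guard added for this sketch
        code = model(hist)
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, state)     # code reads and writes persistent REPL state
        hist += [code, metadata(buf.getvalue(), c)]
        if state["Final"] is not None:
            return state["Final"]
    raise RuntimeError("no Final value within max_steps")


def toy_model(hist: list) -> str:
    """Scripted stand-in for M: first count, then finish (no real LM involved)."""
    if len(hist) == 1:
        return "n = context.count('needle')\nprint(n)"
    return "Final = f'needle appears {n} times'"


long_prompt = "hay " * 100_000 + "needle " + "hay " * 100_000
print(rlm(toy_model, long_prompt))
```

The toy model never receives long_prompt itself; it works only from the digest in hist and the variables that persist in state between iterations.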

7. Algorithm 2: An Ineffective Scaffold (Deliberately Flawed)

Having walked through a correct recursive language model loop that systematically unbounds context by nesting sub‑calls, it is just as important to study what happens when we attempt to patch an LLM with a naive scaffold that ignores the very constraints we wish to circumvent. By deliberately constructing an ineffective algorithm, we can isolate the precise design deficiencies that prevent a scaffold from truly scaling to long‑context reasoning. The flawed scaffold appears to mimic a recursive agent — it maintains a history, it asks the model to choose among actions like Finish, Exec, Search, and even sub_LLM_M, and it compacts the history when it grows too large. Yet underneath those surface similarities, three architectural decisions conspire to lock the system back into the same bounded regime that standard autoregressive generation suffers from.
The first and most immediate flaw is that the full prompt $P$ is placed directly into the hist buffer before the loop begins. Since every forward pass of the base model $M$ processes hist as an increasingly long prefix, the effective information that can influence the next token is capped by the model’s fixed context window of $K$ tokens. Any part of $P$ that falls outside that window becomes invisible to the model, no matter how cleverly the scaffold tries to compact earlier turns. Compaction via truncation or summarization is inherently lossy: it discards fine-grained detail, subtle dependencies, and precise numeric values that the original prompt may have contained. In practice, this means that for long documents, codebases, or multi-step reasoning chains, the scaffold will silently forget critical information after a few cycles. The very act of placing the prompt in the history thus inherits the core limitation we set out to overcome: the model’s input horizon remains bounded by $K$.
The second flaw concerns the mechanism for producing the final answer. When the model outputs the Finish action with an accompanying val, that val is generated directly as an autoregressive continuation of the history. There is no architectural separation between the iterative deliberation phase and the final output generation. As a result, the total length of the answer — and the complexity of reasoning that can be expressed within it — is constrained by the same generation‑length limits that ordinary LLM inference imposes. Long‑form analyses, hierarchical summaries, and outputs that should grow with the size of the input are all strangled by this bottleneck. The scaffold gives the illusion of extended computation, but in the end the model must pour everything into a single, bounded generation step.
The deepest flaw, however, is the absence of genuine symbolic recursion. The action sub_LLM_M is treated as a flat, opaque primitive: the scaffold invokes the model $M$ again with some sub-query, receives a single output, and stuffs that result back into the history. At no point does the scaffold instantiate a full sub-recursive language model, meaning a separately managed process with its own independent recursion depth, context management, and ability to spawn further recursive calls. Consequently, the scaffold cannot launch a number of sub-calls proportional to the input size, nor can it loop over slices of the input or compose multi-level reasoning trees. It remains a shallow system that offloads a few isolated sub-tasks to the same bounded model, without the divide-and-conquer structure that makes recursion truly scale. In essence, the scaffold only provides a flat bag of tools, not a programmable recursive machinery that can meaningfully extend the semantic horizon.
These three flaws — placing the prompt inside the bounded history, relying on a single autoregressive answer, and refusing to give sub‑calls the full recursive treatment — together trap the scaffold in the same performance envelope as the underlying base model. No amount of clever compacting or action routing can compensate for the fact that the system never steps outside the model’s original input length, output length, or depth constraints. The takeaway is crisp: a scaffold that merely wraps an LLM with additional non‑recursive actions cannot transcend the context limit; to do so, it must re‑organize how information flows across recursive boundaries, rather than merely papering over a fixed window.
The pseudocode diagram below distills this critique into a compact visual summary. A single code block titled Algorithm 2: Ineffective Scaffold (Deliberately Flawed) presents the exact loop with syntax highlighting, while three circled annotations (①, ②, and ③) point to the lines responsible for each flaw, accompanied by terse margin notes. The annotations echo our analysis: ① marks where the full prompt enters the history, locking the system to the $K$-token context; ② highlights the direct return val that caps output length; and ③ singles out the flat sub_LLM_M action that precludes true recursion. Below the code block, the same three flaws are repeated as bullet points, creating a quick reference that aligns the visual with the conceptual argument. By seeing the flawed scaffold laid bare, we gain a sharper appreciation for why the three missing design choices, the topic of the next section, are not just desirable but essential for any scaffold that hopes to extend an LLM beyond its native limits.
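The three flaws can be made concrete with a small Python reconstruction. It is pieced together from the prose in this section and is not the original pseudocode: the names flawed_scaffold, compact, and toy_model are hypothetical, the window $K$ is measured in characters, compaction is modeled as plain truncation to the most recent $K$ characters, and the Exec and Search actions are omitted.

```python
def compact(hist: list[str], K: int) -> list[str]:
    """Lossy compaction: keep only the last K characters of the joined history."""
    joined = "\n".join(hist)
    return [joined[-K:]]


def flawed_scaffold(model, prompt: str, K: int, max_steps: int = 10) -> str:
    hist = [prompt]                      # flaw 1: the full prompt sits in the history
    for _ in range(max_steps):
        hist = compact(hist, K)          # anything beyond K characters is dropped
        action, val = model(hist)
        if action == "Finish":
            return val                   # flaw 2: the answer is one bounded generation
        if action == "sub_LLM":
            _, reply = model([val])      # flaw 3: flat one-shot call, no nested loop
            hist.append(reply)
        else:
            hist.append(f"[{action} omitted in this sketch]")
    return ""


def toy_model(hist: list[str]):
    """Scripted stand-in for M: reports whether 'needle' is visible in its input."""
    seen = "needle" in "\n".join(hist)
    return ("Finish", "found" if seen else "not found")


# The needle is at the start of a long prompt, so truncation discards it.
print(flawed_scaffold(toy_model, "needle " + "hay " * 10_000, K=1000))
# With a prompt shorter than K, nothing is lost.
print(flawed_scaffold(toy_model, "needle hay", K=1000))
```

Under these assumptions the scaffold answers correctly only while the prompt fits inside the window, which is the bounded behavior the section describes.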

8. Three Missing Design Choices

Algorithm 2 showed us exactly what happens when a scaffold tries to cram a massive prompt into a single LLM call: the system quickly hits the context window wall $K$, resorts to lossy compaction, and loses the ability to aggregate reasoning across pieces of the prompt. That failure isn’t an accident; it’s the inevitable result of missing three deliberate design choices that native recursive language models (RLMs) build in from the start. The RLM paradigm doesn’t just wrap an LLM in a clever loop; it systematically unbinds the three dimensions that constrain ordinary inference: input length, reasoning depth, and output length. Each dimension is addressed by a specific architectural decision, and together they form the backbone that lifts every bound a fixed window imposes.
The first missing ingredient is treating the prompt as a variable rather than a literal string pushed into the LM history. In the RLM setting, the prompt $P$ lives inside the interpreter’s REPL state, stored in a context variable that the Python program can read, slice, and pass to sub-models entirely outside the LLM’s own token window. The language model never sees the full prompt at once; it never has to pay the token cost or suffer the compaction loss of a raw concatenation. By comparison, the flawed scaffold placed $P$ directly into hist, making it immediately bounded by $K$ and forcing aggressive truncation whenever $|P|$ grew large. The variable-based approach is not merely a storage trick; it changes the semantics: the prompt becomes a structured data resource that the program can manipulate arbitrarily, just like any other variable.
The second essential design choice is symbolic recursion: genuine loops that call sub-models on program-synthesised slices and then aggregate the intermediate results back into the REPL state. In Algorithm 2, the scaffold offered only a discrete sub_LLM action, a one-shot call with no loop syntax and no aggregation step. Even if the programmer wanted to process many segments, they were forced to manually unroll a fixed number of calls, each inheriting all the context-bound problems. RLMs break that ceiling by embedding recursive calls directly into the program flow: a while-loop or recursive function can invoke sub_RLM_M on each slice, store the partial findings in a list variable, and later combine them. This loop structure scales naturally with the size of the prompt, $|P|$, because the loop body only ever holds a small, focused chunk of information in the LLM’s context at any moment.
The third dimension is the output mechanism. Standard scaffolds rely on the LLM’s own autoregressive generation: a Finish action that must produce the final answer token by token, subject to the same $K$-limit and a generation cap. If the answer is long or requires multi-turn synthesis, you inevitably run out of context or hit the maximum generation length. RLMs resolve this by never forcing the LLM to autoregress a long output. Instead, the answer accumulates in a Final variable, assembled incrementally from the results of sub-calls, with each piece potentially being computed under a fresh, clean context. The LLM’s role shifts to computing small, self-contained conclusions that the surrounding program orchestrates, not producing a single monolithic reply. This separation means the output can be arbitrarily long: each sub-result is just another value in a Python variable, free from any language-model window.
Putting these three choices together yields effectively unbounded contexts for input, reasoning, and output. The prompt variable eliminates input‑length constraints; symbolic recursion lets the model reason over the entire prompt depth by processing it in manageable pieces; and the final‑variable output mechanism sidesteps the autoregressive bottleneck. Importantly, these dimensions reinforce each other: without the prompt as a REPL variable, the recursion loops would have nothing precise to slice; without the loops, the variable‑stored prompt would be inert; and without the variable‑based output, the recursive aggregations would still be forced through a narrow generation tube. The flawed scaffold lacked all three, and the result was a brittle, length‑capped system.
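How the three choices combine can be illustrated with a short sketch: the prompt sits in a variable, a loop issues one sub-call per slice, and the answer is assembled in Final. The per-call output cap GENERATION_CAP and the stand-in sub_rlm are hypothetical; they only mimic the fact that any single model reply has a bounded length.

```python
GENERATION_CAP = 50  # hypothetical per-call output limit, in characters


def sub_rlm(piece: str) -> str:
    """Stand-in for sub_RLM_M; every reply is padded or cut to exactly the cap."""
    reply = f"summary of {len(piece)} chars"
    return reply.ljust(GENERATION_CAP)[:GENERATION_CAP]


context = "x" * 100_000           # choice 1: the prompt is a REPL variable
parts = []
for start in range(0, len(context), 1_000):
    parts.append(sub_rlm(context[start:start + 1_000]))  # choice 2: loop of sub-calls

Final = "\n".join(parts)           # choice 3: output assembled in a variable
print(len(parts), len(Final))
```

No single call returns more than GENERATION_CAP characters, yet Final ends up roughly a hundred times longer, because its length is set by the program and not by any one generation.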
The visual below captures this contrast at a glance. It puts the three design choices side by side in a comparison table, with Algorithm 1 (RLM) on the left and the flawed Algorithm 2 on the right. Each row isolates one axis—prompt handling, recursion, and output—and highlights the RLM’s deliberate decisions using green‑tinted cells and check marks, while the broken scaffold’s limitations appear in red with crosses. Beneath the table, a centered banner reminds us that these three choices collectively deliver unbounded input, reasoning, and output—turning a standard LLM scaffold into a system that can scale to truly long‑context tasks without ever hitting the fixed window wall.

9. Evaluation Tasks: From Simple Retrieval to Quadratic Reasoning

With the three missing design choices clearly identified (unbounded input, stateful intermediate computation, and unbounded output), we must now ask: does a Recursive Language Model actually deliver on these fronts? Put differently, can a single architecture, armed with symbolic recursion, handle prompts that would hopelessly overwhelm any standard transformer, even those augmented with naive length-extension tricks? To answer that, we need evaluation tasks that are designed to break. They must systematically stress each axis, push far beyond the base model’s context window $K$, and require the kind of algorithmic depth that ordinary attention-based scaffolds cannot sustain.
The fundamental strategy is to vary the relationship between prompt length $N$ and the intrinsic computational complexity of the task. A trivial needle-in-a-haystack instance might have a prompt of a million tokens, but the number of relevant “needles” stays constant, so retrieval complexity is $\mathcal{O}(1)$ with respect to $N$. At the other extreme, tasks that demand aggregating information from all pairs of input elements explode at $\mathcal{O}(N^2)$. In between sits linear reasoning, where every part of the prompt must be processed exactly once and combined into a coherent answer, scaling as $\mathcal{O}(N)$. A model that truly unbounds context must be able to chew through all three regimes without degradation, not merely survive a single stress-test.
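To make the three regimes concrete, the sketch below counts how many units of work each one implies for $N$ items. The function name work_units is hypothetical, and "1" for the constant regime simply stands for a fixed number of needles; the pair count uses the standard formula $N(N-1)/2$ for unordered pairs.

```python
from itertools import combinations


def work_units(n_items: int) -> dict:
    """Units of work per regime for n_items input elements (illustrative only)."""
    return {
        "constant": 1,                                  # fixed number of needles
        "linear": n_items,                              # one pass over every item
        "quadratic": n_items * (n_items - 1) // 2,      # every unordered pair
    }


# Sanity check of the pair formula against an explicit enumeration.
assert work_units(10)["quadratic"] == len(list(combinations(range(10), 2)))

for n in (10, 100, 1000):
    print(n, work_units(n))
```

Going from 100 to 1000 items multiplies the linear work by 10 but the pairwise work by roughly 100, which is why a quadratic task can be hard even at a modest prompt length.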
These complexity classes also map cleanly onto the three design choices. A constant-complexity retrieval, like a very long needle-in-a-haystack, primarily tests unbounded input: can the model locate a tiny signal hidden inside an arbitrarily long prompt, without being distracted by the length? Linear aggregation tasks (e.g., transforming every line of a dataset) additionally demand stateful intermediate computation: the model can no longer just attend to a few tokens; it must build up a result incrementally, possibly through many recursive sub-calls. Quadratic reasoning pushes the envelope even further, requiring unbounded output of the right shape, because the final answer may need to describe pairwise relationships for an arbitrarily large input set. Standard models and naive chaining scaffolds inevitably collapse under these demands: either the prompt exceeds $K$ and cannot be processed at all, or the model’s internal window decays and produces “context rot”, or the scaffold fails to coordinate state across segments.
With this taxonomy in mind, the evaluation suite for the Recursive Language Model was curated as a progression from simple retrieval to full quadratic reasoning. The tasks are deliberately chosen so that the prompt length $|P|$ is infeasible for the base model $M$, meaning $|P| > K$. Take S-NIAH, a needle-in-a-haystack benchmark where the number of needles is $\mathcal{O}(1)$ but the prompt can stretch from 8K to 1 million tokens. This is the purest test of unbounded input: retrieval fidelity must not decay even as the haystack grows by roughly two orders of magnitude. Next, BrowseComp-Plus confronts the model with 1,000 documents and multi-hop reasoning, packed into a prompt of 6–11 million tokens. Although the document count is fixed, the sheer scale forces the model to maintain a stable internal representation of the entire document set while hopping between facts, a severe test of stateful computation when the alternative is to forget earlier documents as soon as they scroll out of view.
For code understanding, LongBench-v2 CodeQA uses real repository-scale prompts ranging from 23K to 4.2 million tokens. The model must reason about file-wide dependencies, which is impossible for a base model that cannot even see the full project. Meanwhile, the OOLONG family introduces algorithmic rigour. The plain OOLONG task (trec_coarse, 131K tokens) asks for an $\mathcal{O}(N)$ transformation and aggregation over all lines: a linear sweep and combine operation that cannot be shortcut. Even more demanding, OOLONG-Pairs restricts the prompt to 32K tokens but demands $\mathcal{O}(N^2)$ pairwise aggregation. The model must reason about all pairs of items, generate a structured long-form output, and do so without leaking quadratic attention costs throughout the entire prompt: exactly the sort of composite stress test that exposes the brittleness of any scaffold lacking true recursion.
The visual that accompanies this section distills these five tasks into a clear, at-a-glance taxonomy. Each row names the task, its scaling property, the range of prompt lengths, and the core capability being probed. The scaling properties are color-coded: $\mathcal{O}(1)$ appears in a muted green (retrieval cost is independent of prompt size); $\mathcal{O}(N)$ shines in amber (linear work over the input); $\mathcal{O}(N^2)$ glows in red (quadratic reasoning, the steepest climb). Prompt lengths are shown in monospaced ranges, making it immediately obvious that every task deliberately exceeds the typical context window $K$ and many do so by orders of magnitude. A short introductory line sets up the rationale, and the concluding note underscores why these tasks are infeasible for a base model. The table is not just a laundry list; it is a compact proof of coverage: the three design choices are exercised by distinct columns and rows, transforming an abstract set of desiderata into a concrete, falsifiable experimental plan.

10. Main Results: RLMs vs Baselines

The evaluation tasks from the previous section span a deliberate difficulty gradient: from simple long‑range needle retrieval to compositional reasoning where every pair of facts must be cross‑checked. This design lets us measure not just whether a method can read a long document, but whether it can think across it. We now turn to the main experimental question: given a frozen base language model with a bounded context window—GPT‑5 or Qwen3‑Coder, each limited to 32K tokens at inference time—how much capability can a recursive scaffolding unlock compared to strong non‑recursive baselines?
The baselines we test are carefully chosen to represent the best existing practices for extending LLM context. The base model is called directly, without any scaffolding; it sees only a truncated prefix, so its performance sets a floor. CodeAct+BMS combines the base model with a best‑of‑N memory‑selection strategy and a code‑acting interface, a representative “Retrieve‑and‑Read” agent. CodeAct+sub adds the ability to make nested tool calls, mimicking a simple recursive structure but without a true symbolic context stack. The Summary agent is a Map‑Reduce‑style approach: the model first summarizes chunks, then reasons over the summary chain. This is the strongest prior method on many long‑context benchmarks. The full RLM (Recursive Language Model) implements our proposed design: symbolic recursion with a persistent context stack, sub‑RLM calls, and a principled halt/resolve mechanism. To isolate the contribution of sub‑calls, we also report an ablated RLM (no sub) that uses the same stack but flattens recursive invocations into a single linear reasoning trace.
The results, collected across three task families, are remarkably consistent. On S‑NIAH (1M tokens) all base models score 0%, confirming that a 32K‑token window cannot even locate the needle. The Summary agent reaches 76–79%, while CodeAct variants hover around 80–84%. Yet the full RLM achieves 93.5% for GPT‑5 and 91.7% for Qwen3‑Coder—nearly perfect retrieval a million tokens away. On BrowseComp‑Plus (10M tokens) the gulf is even wider: no baseline exceeds 0% accuracy! The task demands integrating facts scattered across multiple long documents, a challenge that completely breaks retrieve‑then‑read or summarise‑then‑reason pipelines. In contrast, the RLM obtains 91.3% (GPT‑5) and 88.9% (Qwen3‑Coder), demonstrating that the recursive scaffold can systematically decompose and recompose evidence far beyond the model’s native horizon.
The most illuminating picture comes from OOLONG-Pairs (32K tokens), the quadratic reasoning task. Here every fact must be pairwise compared, and the number of required comparisons grows quadratically with the number of items. The base model again fails (0.2–0.4%). CodeAct+sub and the Summary agent, despite their sophistication, reach only 18–28% accuracy: they can process longer input but cannot reliably perform the dense cross-referencing. The full RLM with GPT-5 hits 58.0%, more than double the best baseline. Interestingly, Qwen3-Coder does not benefit from full recursion on this task (23.1% vs 28.5% with no sub), hinting that the base model’s innate reasoning strength modulates how much the scaffolding can amplify. For GPT-5, the drop when ablating sub-calls is dramatic on OOLONG-Pairs (58.0% → 36.0%) and clear on BrowseComp-Plus (91.3% → 85.6%). This indicates that, for a sufficiently strong base model, sub-RLM invocations are not a minor optimisation; they are the mechanism by which the model temporarily isolates a sub-problem, solves it, and lifts the result back into the parent context without polluting the reasoning chain. For information-dense problems, that isolation is what makes recursive synthesis possible.
Critically, this large performance gain does not come with a proportional increase in computational cost. The median query cost for the full RLM was $0.11, compared to $0.14 for CodeAct+sub and $0.18 for the Summary agent. The symbolic stack avoids the quadratic token expansion of Map‑Reduce summarisation and the trial‑and‑error overhead of best‑of‑N retrieval, striking a balance that makes recursion a practical inference‑time strategy.
The visual below distills these comparisons into a dense but legible table. Table 1 arranges tasks and models in rows, with columns for each baseline and the two RLM variants. The best accuracy in nearly every row appears in the full RLM column, highlighted with a light green fill; the exception is Qwen3-Coder on OOLONG-Pairs, where the no-sub variant scores higher. A quick scan reveals three clear patterns: the base model’s zero scores on long-context tasks give way to near-perfect RLM accuracy; the best baselines (CodeAct variants and Summary) make respectable progress but plateau far below 90% on the hardest retrieval tasks and below 30% on the quadratic reasoning task; and the RLM (no sub) column, while still strong, generally underperforms the full RLM, most visibly for GPT-5 on OOLONG-Pairs and BrowseComp-Plus. Beneath the table, a compact horizontal bar chart compares the median per-query cost, with the RLM bar slightly shorter than those of the strongest baselines. Together, these two visuals capture the central empirical claim of this work: recursive symbolic scaffolding transforms a bounded LLM into a system that reasons accurately across millions of tokens, at a lower cost than the best alternatives, and the recursive sub-call is a key ingredient of that result.

11. Scaling Behavior: Degradation of Base Models vs RLMs

Having established that Recursive Language Models (RLMs) outperform strong baselines on long‑context reasoning tasks, the next natural question is how that advantage evolves as we stretch the input length to extreme values. It is one thing to report aggregate metrics; it is quite another to watch an ordinary LLM abruptly lose its ability to reason while the recursive variant holds steady—or degrades gracefully. The scaling behavior exposes the precise points at which standard inference breaks down and reveals the design properties that keep RLMs resilient.
The experiment uses three synthetic tasks that differ in the computational complexity of the reasoning required across the input. S-NIAH (single needle-in-a-haystack) asks the model to locate and reproduce a single fact buried in a long text; the intrinsic difficulty is constant, since once the fact is found no further integration is needed. OOLONG demands a linear scan, where the model must accumulate information while traversing the sequence, as in tracking a running sum or identifying the $k$-th occurrence of a pattern. OOLONG-Pairs upgrades the challenge to a quadratic dependency: every piece of evidence must be compared against every other, for instance when verifying pairwise constraints among a set of extracted entities. The tasks thus form a natural ladder from trivial lookup to simple sequential reasoning to all-to-all cross-referencing.
A standard autoregressive transformer, even one as capable as GPT-5, struggles to maintain coherent reasoning as the context window fills. This is the familiar phenomenon of context rot: attention scores become diffuse, early tokens receive a shrinking share of the attention mass, and the model effectively forgets the beginning of the prompt well before the architectural limit is reached. Positional encoding limitations and the inherent soft bottleneck of a fixed-length context window accelerate the collapse. For a constant-complexity task like S-NIAH, the effect may be mild: the needle can often still be retrieved if it appears early. But for OOLONG-Pairs, which requires linking tokens from opposite ends of the document, the base model’s accuracy crashes to 0% already at 16K tokens, far below the advertised context length. That sharp drop is not a gradual decline; it signals a fundamental loss of semantic connectivity once the context exceeds a critical threshold.
The Recursive Language Model sidesteps this degradation entirely by re‑architecting how information is ingested. Instead of feeding the whole document in one monolithic forward pass, the RLM breaks it into overlapping chunks, processes each chunk with a frozen language model to extract a symbolic summary (a “context symbol”), and then recursively combines these symbols in a tree over multiple rounds. This design unbinds the input length from the model’s effective context: the per‑chunk computation never exceeds a modest token budget, and the recursive merging ensures that evidence from distant regions can be compared in a logically nested manner. Consequently, semantic horizon—the distance over which two tokens can influence one another—is no longer bounded by the raw context window.
The scaling curves, shown in the top half of the accompanying figure, make this contrast stark. The panel for S‑NIAH shows the base model (dashed red) dipping only slightly, while the RLM (solid blue) stays above 95% across the entire span from 8K to 1M tokens. On the linear OOLONG task, the base curve drops below 20% by 128K; the RLM remains above 80% all the way to 1M. The most dramatic difference appears in the OOLONG‑Pairs panel: the base model collapses at 16K (a vertical dashed line marks the 0% point), but the RLM plateaus above 50% until around 272K and then declines gradually, still holding at roughly 50% at 1M. This is a qualitative difference—RLMs do not merely postpone the degradation; they change the failure mode from catastrophic forgetting to a controlled, capacity‑limited decay.
The second part of the visual (Figure 11, bottom half) addresses the inevitable concern: what is the cost of this resilience? Using the same logarithmic x‑axis, the plot displays the actual dollar cost of inference for the RLM pipeline (solid blue) alongside an extrapolated cost for the base model (dotted gray), assuming one could hypothetically scale it linearly. For an input of 8K tokens, the RLM costs about $0.12; at 1M tokens, the cost rises to $1.30, an increase of only about 11× while the token count grows 125‑fold. Moreover, the RLM cost curve stays within the same order of magnitude as the projected base cost, meaning that the recursive architecture keeps cost growth at or below linear without sacrificing the accuracy that the base model would lose. In other words, for tasks where a standard LM falls off a cliff, the RLM delivers accuracy at a price comparable to what a naïve model would charge if it could handle the length.
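The ratios quoted above can be checked directly from the two endpoint costs. The sketch below assumes 8K means 8,000 tokens and 1M means 1,000,000 tokens (which is what the 125-fold figure implies); the fitted exponent is only a two-point estimate, not a property reported by the authors.

```python
import math

cost_8k, cost_1m = 0.12, 1.30             # dollars, as quoted in the text
tokens_8k, tokens_1m = 8_000, 1_000_000   # assumed token counts

cost_ratio = cost_1m / cost_8k            # how much the cost grew
token_ratio = tokens_1m / tokens_8k       # how much the input grew

# If cost ~ tokens**alpha between these two points, alpha is the log-log slope.
alpha = math.log(cost_ratio) / math.log(token_ratio)

print(f"cost grows {cost_ratio:.1f}x while tokens grow {token_ratio:.0f}x")
print(f"two-point scaling exponent: {alpha:.2f}")
```

An exponent below 1 on these two points means the quoted costs grow more slowly than the input length over this range.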
Taken together, the two panels encapsulate the core empirical thesis: Recursive Language Models break the traditional accuracy–length trade‑off on reasoning‑intensive tasks, and they do so while maintaining runtime costs that are practically linear in the input size. The visual neatly summarizes the failure modes of single‑pass inference and the sustained performance of symbolic recursion, making it clear why RLMs represent a fundamental advance for long‑form reasoning.

12. Ablations and Cost Analysis

The ability of recursive language models to sustain coherent reasoning across vastly extended contexts is striking, but it invites an immediate follow‑up question: what exactly inside the architecture is producing these gains, and at what practical cost? The scaling comparisons from the previous section established that RLMs degrade far more gracefully than standard long‑context baselines, yet the black‑box nature of the system still leaves open whether the benefit comes from the recursive scaffolding as a whole, from the ability to spawn sub‑calls, or from other subtle properties of the iterative refinement loop. To isolate the contribution of the sub‑call mechanism — the capability that allows an RLM to dispatch a sub‑task, collect its result, and reintegrate it into the parent context — the authors ran a decisive ablation: they stripped away sub‑calls while keeping the rest of the recursive control flow intact. The result is a family of models that still scale beyond naïve baselines, but whose performance on information‑dense benchmarks tells a nuanced story.
On tasks where the primary challenge is understanding a single, long document and extracting a straightforward answer, the sub‑call mechanism does not necessarily help and can even introduce unnecessary overhead. For example, on the CodeQA benchmark with a Qwen3‑Coder backbone, the ablation actually shows better accuracy without sub‑calls than with them (66% vs 56%), indicating that on tasks that are not inherently multi‑hop or comparative, the recursion itself already provides enough context management, and the extra decomposition only adds noise or misrouting. The story flips, however, when we examine information‑dense multi‑document tasks like OOLONG‑Pairs. Here the test demands that the model correlate claims across two independent, lengthy, and deliberately obfuscated narratives, a scenario where no single linear reading can hold all the evidence in working memory. Removing sub‑calls caused a sharp performance drop: GPT‑5 RLM fell from 58.0% to 36.0% (22 percentage points, or about 38% relative), and Qwen3‑Coder RLM fell from 23.1% to 16.7% (6.4 percentage points, or about 28% relative). Read in the other direction, enabling sub‑calls on OOLONG‑Pairs yields a ~61% relative improvement for GPT‑5, underscoring their role as the critical ingredient that turns a recursive scaffold into a genuine multi‑step reasoner for dense information landscapes.
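Because percentage points and relative change are easy to conflate, it helps to compute both from the quoted OOLONG‑Pairs scores. The helper below is a hypothetical convenience function, not part of any released evaluation code.

```python
def ablation_stats(with_sub: float, without_sub: float) -> dict:
    """Absolute and relative effect of removing (or adding) sub-calls."""
    abs_drop = with_sub - without_sub
    return {
        "abs_drop_points": abs_drop,              # percentage points lost
        "rel_drop": abs_drop / with_sub,          # loss relative to full RLM
        "rel_gain": abs_drop / without_sub,       # gain relative to ablation
    }


gpt5 = ablation_stats(58.0, 36.0)
qwen = ablation_stats(23.1, 16.7)

for name, s in (("GPT-5", gpt5), ("Qwen3-Coder", qwen)):
    print(f"{name}: -{s['abs_drop_points']:.1f} points, "
          f"-{s['rel_drop']:.0%} relative drop, "
          f"+{s['rel_gain']:.0%} relative gain from sub-calls")
```

The same pair of numbers gives a different percentage depending on which score is used as the denominator, which is why the direction of the comparison should always be stated.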
These ablations cement the intuition that sub‑calls are not a universal accelerator; they are a specialised tool for tasks that require iterative retrieval, cross‑verification, or hierarchical decomposition. When a problem is essentially linear but long, a well‑tuned RLM can already maintain a coherent internal belief state without delegating sub‑problems. When the evidence is scattered and must be assembled hierarchically, sub‑calls become the difference between functional and failed reasoning. This finding has direct practical import: practitioners should expect that the same RLM design will behave differently across task families, and that blindly enabling sub‑calls on every long input may degrade simple cases without careful gating.
Beyond accuracy, any production‑oriented reader cares about cost, and here the evaluation reveals a distributional property that is as important as the mean. Median API costs for RLM‑based inference are often lower than those of the base model running on the same task, particularly on OOLONG. This counterintuitive result arises because the RLM processes smaller, focused chunks per call instead of repeatedly attending over the entire enormous prompt, leading to fewer total tokens despite the overhead of recursion. However, the median story hides a long tail of expensive trajectories. The 95th‑percentile cost for RLM runs can be dramatically higher than for baselines, driven entirely by runs where the model enters deep iteration loops. In particular, Qwen3‑Coder exhibits a pronounced tendency to over‑call — generating nested sub‑calls far beyond what is necessary — which inflates the tail. Conservative call‑chain designs, by contrast, maintain medians of only tens of calls, suggesting that careful termination policies and iteration caps are essential for keeping worst‑case costs under control.
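One way to picture the "termination policies and iteration caps" mentioned above is a shared budget object that every sub-call must charge before it runs. This is a minimal sketch under assumptions: `CallBudget`, `CallBudgetExceeded`, and `over_caller` are hypothetical names, and the recursive function is a toy that mimics an over-calling trajectory rather than a real model.

```python
class CallBudgetExceeded(RuntimeError):
    """Raised when a trajectory tries to exceed its call or depth cap."""


class CallBudget:
    def __init__(self, max_calls: int, max_depth: int) -> None:
        self.max_calls = max_calls
        self.max_depth = max_depth
        self.calls = 0

    def charge(self, depth: int) -> None:
        """Account for one sub-call at the given nesting depth."""
        if depth > self.max_depth:
            raise CallBudgetExceeded(f"depth {depth} > cap {self.max_depth}")
        if self.calls >= self.max_calls:
            raise CallBudgetExceeded(f"call cap {self.max_calls} reached")
        self.calls += 1


def over_caller(n: int, budget: CallBudget, depth: int = 0) -> int:
    """Toy stand-in for a model that keeps splitting work into two sub-calls."""
    budget.charge(depth)
    if n <= 1:
        return 1
    half = n // 2
    return (over_caller(half, budget, depth + 1)
            + over_caller(n - half, budget, depth + 1))


small = CallBudget(max_calls=50, max_depth=20)
leaves = over_caller(8, small)      # 15 calls in total, well inside the cap

try:
    over_caller(1000, CallBudget(max_calls=50, max_depth=20))
    capped = False
except CallBudgetExceeded:
    capped = True                   # the runaway trajectory is cut off
```

A hard cap bounds the worst-case cost at the price of occasionally aborting a trajectory that would have succeeded, so in practice the cap would need to be tuned per task family.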
These intertwined accuracy‑cost trade‑offs are exactly what the accompanying visual captures in one compact glance. It arranges the ablation highlights on the left — a few crisp numbers contrasting the OOLONG‑Pairs drop and the CodeQA inversion — and a box‑plot panel on the right that renders the cost distribution for four representative tasks (S‑NIAH, CodeQA, OOLONG, OOLONG‑Pairs). Across all tasks, the RLM boxes (coloured in blue) push the median below the grey baselines, confirming the typical efficiency gain. Yet the whiskers stretching upward to the dollar range betray the long tail: Qwen3‑Coder RLM boxes, especially on OOLONG‑Pairs, extend far beyond their peer methods, with outliers brushing the log‑scale $10¹ mark. An annotation points directly to that box, reminding the viewer that median < baseline, but tail risk exists. This single image ties the ablation logic to the operational budget: sub‑calls are the engine of deep reasoning, and their cost variance is the shadow that engineers must light with iteration‑bound controls.

13. Emergent Trajectories and the First Natively Recursive LM

The ablations and cost analyses in the previous section confirmed that recursive language models trade a modest increase in computational overhead for an unbounded reasoning window—an attractive bargain at scale. But scaling is not the whole story. The more provocative question is whether an architecture that natively supports symbolic recursion learns to reason in ways that are qualitatively different from a standard transformer. When recursion is woven into the model’s core, rather than bolted on as a post‑hoc scaffold, do new behaviors emerge that cannot be replicated by feeding a longer flat context to a traditional LLM? This section examines the emergent trajectories we observed when training the first truly natively recursive language model, and why those trajectories mark a departure from the predictable scaling curves of conventional systems.
An emergent trajectory refers to an internal reasoning path through the model’s recursive state that was never explicitly programmed or demonstrated in the training data. It is not merely a longer sequence of tokens; it is a pattern of symbolic operation—branching, depth selection, early termination, self‑correction—that the model discovered during end‑to‑end training on tasks that reward correct final outputs but do not prescribe how the recursion should unfold. In a standard decoder‑only transformer, the computational graph is a single chain of token‑level predictions, and any form of internal deliberation must be squeezed into the residual stream of a fixed‑depth forward pass. Even with chain‑of‑thought prompting, the model writes out intermediate reasoning steps as a linear text, which may still suffer from context rot and remains bounded by the maximum context length. The recursive LM, by contrast, can dynamically allocate additional recursion depth whenever the symbolic state indicates that more work is needed, and it can produce output tokens while continuing to refine its internal representation. This decoupling of reasoning depth from output length is the structural precondition for emergent behavior, but it does not guarantee it—the interesting question is what the model actually learns to do with that freedom.
When we trained a recursive LM from scratch on a mix of algorithmic reasoning, mathematical derivation, and multi‑document synthesis tasks, we observed a series of capabilities that the model was not directly taught. First, the model spontaneously developed a depth‑modulating strategy: simple, factual queries triggered only a single recursive step (essentially a direct read‑out), while harder compositional problems caused the recursion to deepen, sometimes peaking at ten or more symbolic cycles. There was no reward for depth; the model discovered that extra rounds of symbolic refinement improved accuracy on the hard instances it encountered. Second, the model learned to re‑enter earlier recursive states, effectively backtracking when a certain line of reasoning led to an inconsistency. Standard transformers cannot easily undo a mistaken token without an explicit correction token, but the recursive LM’s symbolic registry allowed it to overwrite its own conclusions mid‑trajectory, producing a final answer that contradicted an earlier intermediate thought. Third, and most strikingly, we saw autonomous decomposition on problems the model had never seen formatted as sub‑tasks: given a complex instruction, the model would recursively instantiate smaller sub‑problems, solve them in parallel branches, and merge the results—all without any hand‑crafted decomposition prompt. These three phenomena—depth modulation, backtracking, and autonomous decomposition—constitute the core of what we mean by an emergent recursive trajectory.
It is important to distinguish these behaviors from the superficially similar patterns that arise in scaffolded systems like Tree-of-Thoughts or ReAct. In those scaffolds, the recursion depth and branching structure are externally programmed by a fixed algorithm; the LLM merely provides the content for each step, but it does not control the shape of the computation. The natively recursive LM, however, learns to output both the symbolic commands that govern the recursion (push, pop, expand, merge) and the linguistic content, all through a single differentiable process. The training loss must simultaneously optimize for answer correctness and for the efficiency of the recursive schedule, the latter only indirectly, through the reward of reaching the right answer within a finite computational budget. As a result, the emergent trajectories we see are not a designer’s speculation; they are the result of an end‑to‑end optimization that balances accuracy and cost in a way no hand‑crafted scaffold can replicate. This is the first language model where the recursion is learned rather than prescribed, and therefore it deserves the label “natively recursive.”
The visual below captures the essence of this transition. It contrasts the flat, bounded trajectory of a standard LLM with the branching, depth‑variable trajectory of the natively recursive model. On the left, a standard model produces tokens in a straight line, its reasoning depth fixed by the number of serial decoding steps. On the right, the recursive LM’s path splits, loops back, and adapts to the complexity of the query—simple problems receive shallow recursion, while intricate problems trigger deeper symbolic refinement. The diagram also highlights the three emergent phenomena we just described: depth modulation (the varying number of recursive cycles), backtracking (loops that revisit earlier symbolic states), and autonomous decomposition (parallel branches that merge). By showing these trajectories as hand‑drawn, sketchy lines, the figure emphasizes that these behaviors were not engineered into the model’s blueprint; they emerged organically from the interplay of architecture and training objective, and they point toward a new class of language models whose reasoning processes are as unbounded as the problems they are asked to solve.

14. Limitations and Future Work

Even though Recursive Language Models (RLMs) finally break free from the fixed context window and output-length ceilings that constrain standard autoregressive decoders, their current designs introduce a set of sharp practical bottlenecks. These hurdles do not negate the paradigm’s promise; rather, they spotlight the engineering and algorithmic gaps that must be filled to turn a compelling concept into a deployable system. Four interconnected limitations dominate today’s RLM implementations, each with a clear root cause and a direct path toward mitigation.
The first and most tangible bottleneck is synchronous blocking sub‑calls. Every time an RLM emits a symbolic sub_RLM_M command to delegate a sub‑problem to a fresh LLM instance, the root model halts—waiting for the sub‑call to finish its entire execution (including its own recursive call tree) before processing the REPL output and moving to the next token. This serialized dependency causes high wall‑clock latency that multiplies with the number of sequential sub‑calls, even if the underlying hardware could handle many of them in parallel. For a three‑step algebraic derivation that calls two independent sub‑derivations, an ideal system would dispatch them concurrently and combine the results; a synchronous RLM instead forces them into a linear wait, squandering the inherent parallelism of the problem graph. Asynchronous execution would dramatically improve throughput, but it demands a careful orchestration layer that captures dependencies and merges REPL states correctly—an open system‑design challenge.
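The latency gap between blocking and concurrent dispatch can be illustrated with Python's `asyncio`. In this sketch, `sub_call` is a hypothetical stand-in that merely sleeps to simulate an independent sub-derivation; a real orchestration layer would also have to track dependencies and merge REPL state, which this example does not attempt.

```python
import asyncio
import time
from typing import List, Tuple


async def sub_call(name: str, seconds: float) -> str:
    """Simulated sub-call: waits, then returns a short result string."""
    await asyncio.sleep(seconds)
    return f"{name}: done"


async def run_sequential(tasks: List[Tuple[str, float]]) -> List[str]:
    """Blocking style: each sub-call finishes before the next starts."""
    return [await sub_call(name, secs) for name, secs in tasks]


async def run_concurrent(tasks: List[Tuple[str, float]]) -> List[str]:
    """Independent sub-calls are dispatched together and awaited as a group."""
    results = await asyncio.gather(*(sub_call(n, s) for n, s in tasks))
    return list(results)


def timed(runner, tasks):
    start = time.perf_counter()
    results = asyncio.run(runner(tasks))
    return results, time.perf_counter() - start


tasks = [("derive_lhs", 0.1), ("derive_rhs", 0.1)]
seq_results, seq_time = timed(run_sequential, tasks)
con_results, con_time = timed(run_concurrent, tasks)
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

`asyncio.gather` returns results in the order the awaitables were passed, so the combined output is identical in both modes; only the wall-clock time differs.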
A second, subtler limitation lies in the fixed system prompt that governs the meta‑level decision of when to invoke recursion. An RLM relies on a carefully engineered prompt that includes the function signature and usage examples to decide whether to call sub_RLM_M or to answer directly. In practice, this prompt is not a one‑size‑fits‑all artifact; model‑specific behavior diverges sharply. For example, Qwen3‑Coder displays an over‑triggering problem: it emits sub_RLM_M for trivial steps that the model could easily solve in a single forward pass, inflating recursion depth and latency without improving accuracy. Tuning the prompt per model family restores balance but breaks the universality that the RLM abstraction aims for. A more robust solution would either embed the recursive decision‑making into the model’s own weights (a natively recursive LM) or learn a lightweight gating mechanism that suppresses unnecessary calls based on internal confidence signals.
The third limitation is the maximum recursion depth of one enforced in all current RLM experiments. While a single hop of delegation can already expand context and offload memory, many real‑world long‑horizon tasks—theorem proving, multi‑step planning, code refactoring—naturally decompose into hierarchies deeper than two levels. Restricting to depth‑1 sub‑calls means the system cannot recursively divide a problem until atomic chunks are reached; instead, every sub‑task must be solved by a sub‑model with a full context, which may itself benefit from further decomposition. Extending to deeper recursion introduces new difficulties: maintaining global coherence across nested scopes, preventing infinite regress, and managing the accumulation of REPL state without bloating the root context. The fact that deeper recursion is currently unexplored signals a frontier where theoretical guarantees about correctness and termination will become essential.
The fourth and perhaps most insidious issue is the root model’s tendency to discard REPL state. Even when a sub‑call correctly computes a result and returns it inside a <final>...</final> tag, the root model sometimes ignores that output and falls back to a hallucinated autoregressive continuation. In Qwen3‑Coder trajectories, the model might properly receive a verified factorization via a sub‑call, yet still emit an answer flavored by its own prior distribution, effectively overriding the symbolic ground truth. This behavior exposes a deep friction between the symbolic, external‑memory style of REPL state and the model’s internal next‑token probability. Bridging this gap will require mechanisms that give the REPL output higher salience in the attention mechanism, or training schemes that reward the model for conditioning on retrieved facts rather than substituting its own guesses.
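A harness-side guard is a simpler, complementary mitigation to the model-side fixes proposed above: parse the tagged result out of the sub-call output and prefer it over whatever the root model free-generates. The sketch below assumes the tag format quoted in the text; `extract_final` and `resolve_answer` are hypothetical helper names.

```python
import re
from typing import Optional

# Non-greedy match so multiple tags in one output are captured separately.
_FINAL_RE = re.compile(r"<final>(.*?)</final>", re.DOTALL)


def extract_final(sub_output: str) -> Optional[str]:
    """Return the content of the last <final>...</final> tag, if any."""
    matches = _FINAL_RE.findall(sub_output)
    return matches[-1].strip() if matches else None


def resolve_answer(sub_output: str, root_guess: str) -> str:
    """Prefer the tagged sub-call result; fall back to the root model's text."""
    final = extract_final(sub_output)
    return final if final is not None else root_guess


sub_output = "Trying small primes...\n<final> 91 = 7 * 13 </final>"
answer = resolve_answer(sub_output, root_guess="91 is prime")
print(answer)
```

This only helps when the final answer is exactly the sub-call result; when the root model must reason further over that result, the guard cannot force it to condition on the retrieved value.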
These limitations are not dead ends; they point directly toward a vibrant research agenda. Training native RLMs with reinforcement learning on recursive rollouts could embed the decision to call sub‑models into the policy itself, eliminating the need for brittle prompt engineering and enabling emergent strategies like adaptive depth. Hybrid symbolic‑neural attention could directly incorporate REPL state into the attention keys and values, making the retrieved information as influential as any other token in the context, thereby preventing state discard. And extending beyond the original long‑context motivation would apply RLMs to general long‑horizon reasoning: planning in physical environments, interactive theorem proving, and large‑scale code generation, where the ability to decompose and recombine knowledge sources is the primary cognitive lever.
The accompanying diagram captures this dual landscape concisely. On the left, a column of descending red boxes enumerates the four limitations, each paired with a crisp label and an arrow indicating their dampening effect on RLM effectiveness: synchronous blocking, prompt sensitivity, shallow recursion, and discarded REPL state. On the right, three green arrows surge upward, converging into a central circle labeled “Native RLM.” Each arrow represents a future direction—RL‑based rollouts, hybrid attention, and long‑horizon generalization—that can transform today’s bottlenecks into solved design dimensions. The visual contrast between the red constraints and the green pathways makes the “bottlenecks and opportunities” framing immediately legible, while the sketchy Excalidraw style signals that these are active research concerns, not closed chapters. It serves as both a summary of the section and a roadmap for the open‑source and academic communities that are now building the next generation of recursive language models.

15. Key Takeaways: The RLM Paradigm

Having surveyed the limitations of current long-context inference (context rot, rigid bounded windows, and the steep $O(N^2)$ attention cost), it becomes clear that simply extending the transformer’s input buffer is not enough. The core bottleneck is not so much the number of tokens a model can see at once as the semantic depth it can sustain over those tokens without losing coherence. Recursive Language Models (RLMs) dissolve these constraints by rethinking what it means to process a prompt. They treat language model calls as programmable components inside an iterative, symbolic execution loop, turning a single forward pass into a persistent, structured computation. The result is a paradigm where input length, output length, and reasoning depth become unbounded, not by enlarging the model, but by changing how we interact with it.
The key insight is to treat the prompt $P$ as a mutable variable inside a persistent execution environment $E$, rather than as a static input to a neural network. In standard inference, the prompt is fixed before the first token is generated; the model has no way to revisit, revise, or expand its own premises. An RLM instead places the prompt inside a REPL-like wrapper where it can be updated, appended, or replaced after each recursive sub-call. This gives the system the ability to ingest truly massive documents (10 million tokens and beyond) because the prompt can be grown incrementally, chunked, and recombined across recursive steps without hitting a context-window ceiling. The BrowseComp+ benchmark, which demands synthesizing information from very long pages, saw a leap to 91.3% accuracy with GPT‑5 operating under this paradigm. Unbounded input emerges not from a larger attention window, but from the symbolic flexibility to treat the prompt as an extensible memory.
The second design choice endows the system with unbounded semantic horizon through symbolic recursion: explicit loops that invoke the language model repeatedly, using sub-routines like $\texttt{sub\_RLM\_M}$. A single forward pass of even a large transformer can perform at most $O(N)$ work relative to the input length, since its reasoning depth is limited by the number of layers. Recursive chaining, by contrast, can execute $\Omega(N^2)$ total work across recursive steps, because each sub-call can re-process the entire (expanded) prompt, and the number of recursive calls can grow with problem complexity. This quadratic scaling of work is what enables the model to connect pieces of information that are arbitrarily far apart in the original text, without the intermediate memory decay that plagues long-context attention. The OOLONG‑Pairs task, which requires matching pairs of sentences separated by thousands of tokens, is a striking demonstration: the base model achieves near-zero F1 (0.1), while the RLM-guided version reaches 58.0 F1 by recursively refining its reasoning over multiple calls. Symbolic recursion effectively turns a shallow, bounded inference into a deep, unbounded reasoning process.
The third design choice addresses output length. In a standard generation, the model produces a single sequential stream, and the only way to get a very long answer is to ask the model to keep generating, risking repetition and drift. An RLM stores the final answer in a variable $\texttt{Final}$ that lives inside the execution environment and can be built up incrementally. Each recursive step can add to the answer, stitch together pieces from earlier sub-outputs, or conditionally branch, without relying on a single autoregressive stream. This yields unbounded output length in practice: the OOLONG‑Pairs trajectories produced long, correctly stitched pair lists that would be difficult to generate in one pass under a fixed context. The answer variable becomes a composable data structure, freeing output from the constraints of single-pass generation.
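The quadratic amount of work that a pairwise task demands is easy to make concrete. The sketch below loops over every pair of lines and counts how many comparison calls are issued; `related` is a hypothetical stand-in for a sub-call that judges one pair, and the user and colour data are invented purely for illustration.

```python
from itertools import combinations
from typing import Callable, List, Tuple


def pairwise_matches(lines: List[str],
                     related: Callable[[str, str], bool]
                     ) -> Tuple[List[Tuple[int, int]], int]:
    """Compare every pair of lines once; return matching index pairs and call count."""
    calls = 0
    pairs: List[Tuple[int, int]] = []
    for (i, a), (j, b) in combinations(enumerate(lines), 2):
        calls += 1                      # one (simulated) sub-call per pair
        if related(a, b):
            pairs.append((i, j))
    return pairs, calls


def same_colour(a: str, b: str) -> bool:
    """Stand-in judgement: two lines match if their last word is equal."""
    return a.split()[-1] == b.split()[-1]


lines = ["u1 red", "u2 blue", "u3 red", "u4 red"]
pairs, calls = pairwise_matches(lines, same_colour)
print(pairs, calls)
```

For $N$ lines the loop issues $N(N-1)/2$ comparisons, so the number of calls grows quadratically even though each individual call only ever sees two lines.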
Importantly, none of this requires a fundamentally new model architecture. The RLM paradigm is model-agnostic: it can wrap any language model that can be called with a prompt and produce a continuation. An 8-billion-parameter model fine-tuned as a native RLM—learning to use the symbolic scaffolding during its own training—shows a +28% average improvement across a suite of long-context tasks. This signals that even moderate-sized models, when taught to think recursively, can dramatically exceed the performance of much larger models that lack the symbolic outer loop. The paradigm decouples reasoning depth from parameter count, and it can be bootstrapped by fine-tuning on a relatively small set of recursive-trajectory data.
Underneath all three design choices is a deeper conceptual shift, captured by the motto: “Treat the prompt as an external object, not as input to the neural network.” Once the prompt is treated as a stateful variable that can be read, written, and modified by a surrounding program, the language model becomes a callable reasoning primitive inside a general computational framework. The neural network still does the heavy lifting of language understanding, but the scale of its application is now governed by symbolic recursion, not by the raw size of the model or the width of an attention window. In this light, achieving true long-context reasoning is not about making the model bigger, but about giving it a proper memory architecture that lives outside the tensor graph.
The visual below condenses these takeaways into a single, readable table. It lists the three design choices (treating $P$ as a mutable variable, symbolic recursion via loops and $\texttt{sub\_RLM\_M}$, and storing the final answer in $\texttt{Final}$) alongside the exact property each choice unlocks and the empirical evidence that confirms it. The table also highlights the model-agnostic nature of the paradigm, with the +28% gain of a fine-tuned 8B model, and grounds the entire summary with the core quote. It serves not as a new explanation, but as a compact memory aid for the essential architectural insights that turn a standard language model into a recursive, unbounded reasoning system.

2. Existing Scaffolds and Why They Fall Short

Even when a prompt $P$ ostensibly fits within the model’s maximum context window $K$, the quality of reasoning can still degrade, a phenomenon we unpacked in the previous section under the names context rot and lost-in-the-middle. In response, practitioners have assembled a variety of inference‑time scaffolds that try to sidestep this decay by breaking the prompt into pieces, compressing it, or outsourcing parts of the reasoning. These scaffolds appear to offer a way around the hard window limit, but a closer look reveals that they merely rearrange the bottleneck without eliminating it. Every one of them is fundamentally upper‑bounded by $M$’s own context window $K$.
Three broad families of task‑agnostic scaffolding dominate the landscape. Understanding their internal limits explains why none can truly handle the dense, arbitrary‑length processing that many real‑world tasks demand.
Lossy Compaction methods (think MemWalker or ReSum) work by iteratively summarizing or truncating earlier portions of the prompt. An agent reads a chunk, produces a compressed representation, then feeds that summary along with new context to the next processing step. This saves tokens, but at the cost of discarding fine‑grained information. For tasks like OOLONG, where every single line of $P$ carries non‑trivial dependencies across the entire document, the compaction becomes catastrophic: a dropped detail can sever multiple long‑range relationships. Even a cleverly trained summary model remains lossy, and the overall reasoning fidelity is bounded by the summarization quality, which is $\ll 100\%$.
Retrieval‑Augmented Agents (e.g., CodeAct plus a BM25 retriever) instead keep the full prompt in an external index and insert only the most relevant snippets into $M$’s context when making a decision. While this looks like unbounded access to a large corpus, the model’s working memory still receives a finite batch of retrieved chunks; each chunk occupies precious window space. When the total volume of genuinely relevant material exceeds $K$, the window overflows. The agent must either omit crucial pieces or rely on iterative, noisy retrieval loops that cannot fuse all necessary information in one forward pass. In the same OOLONG‑Pairs setting, the retrieval mechanism collapses as soon as the set of query‑relevant spans outgrows the model’s capacity to hold them simultaneously.
Sub‑Agent Delegation techniques (THREAD, AgentFold) decompose the original task into a chain of sub‑calls, where each sub‑agent’s output feeds into the next. This sequential decomposition sidesteps the need to hold all of $P$ at once, but it creates a new dependency: the total amount of computation is bounded by the model’s generation budget (typically tied to $K$ as well). To process a prompt of length $|P|$ with dense pair‑wise interactions, you would need to launch $\Omega(|P|)$ sub‑tasks, but the generation limit prevents launching more than a fixed number before the earliest outputs are forgotten. The delegation chain cannot scale with the prompt; it flattens into a constant‑depth procedure that leaves most cross‑dependencies unresolved.
Empirical reality bears out these analytical bounds. On the dense, long‑range OOLONG‑Pairs task, where the correct answer hinges on comparing every pair of lines, none of the above baselines manage more than 30% F1 even when the prompts fall within the 8K–272K token range, and they are completely inapplicable for inputs beyond $K$ (e.g., 10M+ tokens). The failure is not a matter of insufficient engineering; it is a consequence of architectures that remain tethered to a single, finite context glimpse.
To fix these ideas, the visual below (Figure 2) arranges the three scaffold families beneath a single, unyielding horizontal barrier labeled “$K$ – model context bound”, making the shared limitation instantly legible. In the left column, a Lossy Compaction icon shows a document being squeezed into a box, with a red X striking through the granular list that cannot survive the compression. The middle column depicts a Retrieval‑Augmented Agent: a magnifying glass pulls snippets from a database, but the model window overflows and turns red, unable to accommodate all retrieved material. On the right, a chain of Sub‑Agent Delegation boxes marches forward until a bold limit marker cuts the chain, preventing the launch of the many sub‑tasks that a dense prompt would require. Beneath these columns, an annotation quietly anchors the whole argument: “On OOLONG‑Pairs all ≤30% F1; impossible beyond $K$.”
The diagram reinforces what the analysis already makes clear: no amount of orchestration outside the model can overcome the intrinsic memory ceiling of a single forward pass when the task demands that every token be processed densely at least once. As long as the model treats the prompt as a static input to be consumed within a single, linear ordering, we remain captives of $K$. This realization sets the stage for the next section, where we will explore the foundational shift that truly dissolves the long‑context wall: redefining the prompt not as a fixed, finite sequence but as part of a dynamic environment that the model can revisit recursively.
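The overflow argument can be made concrete with a greedy packer that fills a fixed token budget from a ranked list of retrieved chunks. This is a sketch under assumptions: `pack_context` is a hypothetical helper, tokens are approximated by whitespace-separated words, and the chunks are assumed to arrive already ranked by a retriever.

```python
from typing import List, Tuple


def pack_context(ranked_chunks: List[str], k_tokens: int) -> Tuple[List[str], List[str]]:
    """Greedily pack ranked chunks into a budget of k_tokens; return (kept, omitted)."""
    kept: List[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = len(chunk.split())       # crude token count (assumption)
        if used + cost > k_tokens:
            break                       # window is full; the rest is dropped
        kept.append(chunk)
        used += cost
    return kept, ranked_chunks[len(kept):]


# Five chunks that are ALL relevant, four "tokens" each, but only ten fit.
relevant = [f"fact {i} matters here" for i in range(5)]
kept, omitted = pack_context(relevant, k_tokens=10)
print(len(kept), "kept;", len(omitted), "omitted")
```

However good the ranking is, once every candidate chunk is genuinely needed, the omitted list is pure information loss.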

3. Core Insight: Treat the Prompt as Part of the Environment

The previous section explored why existing scaffolds (lossy compaction, retrieval-augmented agents, and sub-agent delegation) still leave the fundamental problem unsolved: the entire prompt, or at least its most salient chunk, must eventually pass through the finite capacity of the language model’s context window. This is an architectural glass ceiling, not a mere engineering inconvenience. If the raw text of The Iliad needs to be reasoned over, a model with a 128K context will either truncate it, compress it lossily, or process it in disjoint sliding windows that break long-range dependencies. All these workarounds are symptoms of a single assumption: that the prompt is something the model reads.
The Recursive Language Model (RLM) design removes that assumption entirely. Instead of forcing the prompt $P$ into the model’s neural history, we treat $P$ as part of the environment: a persistent, external data structure that lives in a live programming runtime (a REPL), not in the model’s ephemeral attention graph. The model never “sees” $P$ directly. What it receives is a minimal, constant-size Metadata(P) containing nothing more than the prompt’s length and a high-level API describing how to access it. From that point onward, the model interacts with $P$ by writing and executing code, just as a programmer inspects a large file using Python’s open, read, seek, and slicing operations, without ever needing to hold the entire file in working memory.
This shift recasts the LLM as an agent that coordinates data processing through a persistent stateful environment. The environment $E$ stores $P$ as an ordinary variable (say, a list of strings or a numerical array) and exposes it to Python code generated by the model $M$. The model can issue commands like len(P), P[10000:11000], or even more elaborate loops that aggregate counts, search for patterns, or compute embeddings on subslices. Critically, the results of these code executions come back to $M$ as trimmed outputs: only the small, structured pieces of information that the model explicitly requested, stripped of the original prompt’s raw bulk. So at no point does the model’s context window, limited to $K$ tokens, ever contain more than the current code snippet, a few variable assignments, and these trimmed replies. The full prompt $P$, which can be arbitrarily large ($|P| \gg K$), stays safely outside the attention mechanism.
That same code-execution feedback loop also enables recursive self-invocation. The model living in $E$ can programmatically call a copy of itself, denoted sub_RLM_M, on a subslice $P_i$, passing along a subquery and receiving a concise answer. Those sub-answers accumulate in the environment’s variables, and eventually the master invocation assembles the final answer $Y$ from these REPL-side results. This is not mere recursion for its own sake; it is what allows the system to tackle problems that demand a global structuring of computation. For instance, sorting a million entries in $P$ becomes possible by coding a divide-and-conquer algorithm that recursively invokes sub_RLM_M on halves, then merging, all without ever serializing the entire list into the model’s direct input.
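The divide-and-conquer pattern described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the source's implementation: `sub_rlm` is a hypothetical stand-in for sub_RLM_M that simply sorts a small slice locally (a real call would prompt the model), and `LEAF_SIZE` is an assumed limit on how much one sub-call may receive.

```python
import heapq

LEAF_SIZE = 4  # assumed: largest slice a single sub-call may receive


def sub_rlm(query, items):
    """Hypothetical stand-in for sub_RLM_M; a real call would prompt the model.

    It sorts a small slice locally so that the sketch runs end to end.
    """
    assert len(items) <= LEAF_SIZE, "slice too large for one sub-call"
    return sorted(items)


def recursive_sort(items):
    """Sort a list of any size while each sub-call sees at most LEAF_SIZE items."""
    if len(items) <= LEAF_SIZE:
        return sub_rlm("sort these entries", items)
    mid = len(items) // 2
    left = recursive_sort(items[:mid])
    right = recursive_sort(items[mid:])
    # The merge runs in the environment; no model call sees the full list.
    return list(heapq.merge(left, right))


P = [9, 3, 7, 1, 8, 2, 6, 4, 5, 0]
result = recursive_sort(P)
```

The point of the sketch is the shape of the computation: the full list only ever exists as a variable in the environment, while each simulated model call handles a bounded slice.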
The consequences are threefold, and each addresses a fundamental scaling limitation:
Unbounded input: Because $P$ never enters the context, there is no restriction on $|P|$ other than the storage of the REPL environment. The effective prompt length can grow to millions of tokens, far beyond any feasible context window.
Unbounded output: The final answer $Y$ is assembled from variables stored in $E$, not generated token-by-token under the model’s generation-length constraints. The model only needs to emit the final assembly command; the heavy lifting has already been done in the environment.
Large semantic horizon: Through explicit loops and recursive calls, the model can orchestrate operations that touch every part of $P$, possibly multiple times, performing $\Omega(|P|)$ or even $\Omega(|P|^2)$ units of semantic work for problems like all-pairs comparison. That is far more than a single forward pass over at most $K$ tokens can cover.
The diagram below distills this contrast into a single side-by-side comparison. On the left, the naive scaffold attempts to funnel the entire prompt $P$ directly into the LLM box, only to hit the red barrier of context limit $K$; the model’s history is polluted with the raw text and quickly overflows. On the right, the RLM architecture shows $P$ living entirely within a green REPL box, while the blue model $M$ receives only the tiny yellow Metadata(P) nugget. Arrows between the model and the REPL are labeled “constant-size” or “trimmed,” emphasizing that every exchange stays bounded irrespective of $|P|$. A recursive arrow from the REPL back to $M$ captures the sub_RLM_M mechanism, and the final answer $Y$ emerges from the REPL’s Final variable rather than from the model’s generative decoder. This diagram is not merely an illustration; it is the architectural blueprint that turns a long-standing impossibility, reasoning over unbounded text, into a concrete, implementable protocol.

4. RLM Formal Interface

Building on the previous insight (that we can escape the tyranny of the finite context window by embedding the model inside a persistent, stateful environment), we now need a precise operational contract that describes how such a system functions. This contract is the Recursive Language Model (RLM) formal interface. It acts as a blueprint that separates what the outside world sees (a standard text-in/text-out function) from what happens inside (a bounded-context model cycling through a stateful loop). The definition is deceptively simple, yet it elegantly guarantees that the base model $M$ is never asked to process a token sequence longer than its fixed capacity $K$, no matter how vast the original prompt or how long the final answer becomes.
Definition (Recursive Language Model, RLM). Given a base model $M$ with token capacity $K$, an RLM is a function
$$\texttt{RLM}_M : \Sigma^* \to \Sigma^*$$
that uses $M$ only as a code-generating subroutine. The critical constraint is: at every call, the input presented to $M$ is guaranteed to be at most $K$ tokens, regardless of the length of the user-supplied prompt $P$ or the accumulated output so far. To the outside consumer, the RLM presents the familiar LLM interface (a prompt $P \in \Sigma^*$ goes in, and a response $Y \in \Sigma^*$ comes out), but internally it orchestrates a series of bounded interactions that together simulate unbounded reasoning.
Three internal components keep this orchestration sound:
A persistent REPL environment $E$ that holds mutable variables, imported functions, and any piece of state the model chooses to maintain across steps. It is the analogue of the “context” that a standard model would flatten into a single long prompt, but here it lives as a structured, executable namespace.
A built-in function sub_RLM_M that allows the RLM to launch a recursive sub-call. This is the primitive that enables decomposition of a large task into smaller, self-similar subtasks, each again bounded by the same token limit.
A special variable Final whose existence in the environment signals completion; its value at that moment becomes the overall output $Y$.
The operational loop is a tight read-evaluate-print cycle executed while Final remains unbound. It begins with Init: the REPL environment is seeded from the user prompt $P$, the sub-call function is installed, and metadata about the current state (variable types, available functions, etc.) is computed. This metadata will later be used to compactly inform the model about the environment without dumping its entire contents. Then the iterative Loop starts:
Generation. The base model $M$ is invoked with a fixed per-step token budget $c$. Its output is a string of code, a fragment that the REPL environment can evaluate. Because the input to $M$ is carefully trimmed (the history $\textit{hist}$ always respects the bound $K$), the model never sees more tokens than it can handle.
Execution. The generated code is evaluated inside the REPL, producing a new environment state $\textit{state}'$ and a standard-output stream $\textit{stdout}$. The model can use prints, variable assignments, or calls to sub_RLM_M to advance its computation.
History update. To maintain the invariant for the next iteration, the system constructs a fresh history string: it concatenates the previous history (which encodes the narrative of the problem so far) with a compact metadata summary of the latest stdout and a trimmed version of the stdout itself. The trimming ensures that when stdout is vast, only the most salient portion (ranked by recency, semantic density, or explicit user-defined signals) is preserved. The result is a new history string $\textit{hist}$ that strictly satisfies $|\textit{hist}| \le K$.
Once the loop terminates, the value stored in the environment variable Final is extracted and returned as the overall output $Y$. The model can end the loop at any step by setting Final; because nothing forces it to do so, a stop condition (a maximum number of iterations) can be imposed as a safety net that guarantees termination. This design implements a form of bounded-memory computation where the model’s effective “working memory” is exactly the REPL environment, and the history string acts as a compressed, lossy externalization of past traces.
A crucial property that falls out of this interface is the history invariant:
$$|\textit{hist}| \le K \quad \text{at every iteration.}$$
It holds regardless of $|P|$ (the original prompt may be gigabytes long) and regardless of accumulated output length, because the prompt itself is never fed directly to $M$; only the tightly controlled history string, built from metadata and trimmed stdout, is ever sent. This invariant is what unbounds input length, output length, and semantic horizon simultaneously: the model is always given a bounded view of the past, yet the REPL environment retains the actual state without compression.
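The history update and its invariant can be sketched as follows. This is a simplified illustration under assumptions the source does not fix: lengths are measured in characters rather than tokens, the trimming keeps a plain prefix (the source also mentions ranking by recency or semantic density), and the helper names `metadata`, `trim`, and `update_history` are hypothetical.

```python
K = 200  # assumed capacity of the base model, in characters for this sketch
C = 40   # assumed per-step budget for the trimmed stdout


def metadata(text):
    """Constant-size description of a possibly huge string."""
    return f"<len={len(text)}>"


def trim(text, budget):
    """Keep only a bounded prefix of the output."""
    return text[:budget]


def update_history(hist, stdout):
    """Append a bounded record of the latest step; refuse to exceed K."""
    entry = metadata(stdout) + trim(stdout, C) + "\n"
    new_hist = hist + entry
    if len(new_hist) > K:
        raise RuntimeError("history budget exhausted; stop the loop")
    return new_hist


hist = ""
huge_stdout = "x" * 1_000_000  # the environment may print far more than K
steps = 0
try:
    while True:
        hist = update_history(hist, huge_stdout)
        steps += 1
except RuntimeError:
    pass  # the loop stops before the invariant could be violated
```

Each step adds a bounded entry no matter how large the raw output is, so the number of steps is limited by the history budget rather than by the size of the data in the environment.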
The visual that follows crystallises this formal interface in a single slide. It draws the function signature $\texttt{RLM}_M : \Sigma^* \to \Sigma^*$ prominently, reminding us that the RLM still behaves as a classical string transformer to the outside world. The left side lists the three internal components ($E$, sub_RLM_M, and Final) with simple icons that evoke the REPL, recursion, and a “done” flag. The right side shows the looping pipeline: from Init, through generation by $M$ with a fixed budget, to REPL execution, to the history update, and back to generation while Final remains unset. A coloured callout anchors the invariant $|\textit{hist}| \le K$ beside the history update step, making the crucial guarantee impossible to miss. Together, the diagram transforms the textual definition into an operational blueprint, one that the rest of the lecture will rely on as we explore training strategies, emergent behaviors, and the concrete REPL environment that makes this architecture work.

5. The REPL Environment: Variables and Sub-Calls

Having formalized the outer loop of a Recursive Language Model (a model generating code, receiving feedback, and repeating), we now need to look inside the environment that makes this loop capable of genuinely long-horizon reasoning. The key problem is that current LLMs cannot hold extremely long strings in context without degrading, and even when they can, their reasoning horizon remains bounded by the token capacity. The RLM resolves this with a persistent REPL environment that forces the root model $M$ to work symbolically, never exposing the raw full content of large data structures. The result is a system where the total work performed can dramatically exceed the model’s own context window, while the model itself only ever processes compressed metadata and manageable chunks.
At the heart of the architecture is a standard Python interpreter that lives for the duration of the task. This REPL holds three essential pieces of state:
context: a list of processed chunks, each of which can be referred to by index or variable.
buffers: a dictionary where intermediate results of arbitrary size can be stored and later retrieved by name.
Final: a special variable initially set to None. When the model’s generated code assigns a value to Final, the REPL terminates and that value becomes the final output $Y$ of the entire RLM call.
The REPL also exposes a built-in function sub_RLM_M(prompt_slice), which can be called arbitrarily often from the code that $M$ generates. This function instantiates a complete copy of the RLM, itself a full recursive language model, and returns its output as a string. In the experiments, the recursion depth is kept at one for simplicity, but the framework natively supports multi-level recursion.
The crucial design choice is that the root model $M$ never sees the full value of long strings. When the REPL returns data to $M$ (for instance, when passing back a chunk from context or a buffer value), it sends only a metadata representation: $\texttt{Metadata}(x) = (\text{len}(x), \text{optional\_short\_prefix})$. This might be something like (length=245812, prefix="The mitochondria is the powerhouse..."). The model’s entire reasoning window therefore stays uncluttered by the vast bulk of data. Instead, $M$ must think in terms of symbolic operations: it names variables, writes loops, applies conditionals, and delegates heavy lifting to sub-calls, without ever seeing the underlying strings except as terse summaries.
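A metadata representation of this shape is straightforward to write down. The sketch below is only illustrative: the `Metadata` tuple type, the `metadata` helper, and the `prefix_chars` knob are hypothetical names chosen here, and the prefix is cut by characters rather than tokens.

```python
from typing import NamedTuple, Optional


class Metadata(NamedTuple):
    length: int
    prefix: Optional[str]


def metadata(x: str, prefix_chars: int = 32) -> Metadata:
    """Summarize a string by its length and a short prefix.

    prefix_chars is an assumed knob; pass 0 to omit the prefix entirely.
    """
    prefix = x[:prefix_chars] if prefix_chars > 0 else None
    return Metadata(length=len(x), prefix=prefix)


document = "The mitochondria is the powerhouse of the cell. " * 5000
summary = metadata(document)
```

Whatever the size of `document`, the summary handed back to the model has the same small footprint: one integer and at most `prefix_chars` characters.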
The loop operates under a token budget per iteration. After the REPL executes the code that $M$ produces, the stdout from that execution is captured and trimmed to at most $c$ tokens before it is appended to the interaction history hist. This trimming prevents the model’s context from growing uncontrollably due to verbose prints, and it enforces a hard limit on the feedback that $M$ can use to plan the next step. Importantly, the REPL’s internal memory (context, buffers, and the variable namespace) is not affected by this trimming; only the text fed back to $M$ is clipped. Thus large-scale state can accumulate behind the scenes while the model’s view remains compact.
This symbolic indirection is what finally unbounds reasoning. Because $M$ cannot read giant strings, it naturally learns to write programs that iterate over data, maintain pointers, and invoke sub_RLM_M on manageable slices. The call sub_RLM_M can be used inside loops, for example scanning through a long list of documents and spawning a sub-call per document. This enables $\Omega(|P|)$ sub-calls, where $|P|$ is the length of the total input or the problem size, each handling a piece that fits comfortably within the sub-model’s own context. The total computational work becomes decoupled from the base model’s context limit: the root model orchestrates, while the sub-calls do the heavy lifting on demand.
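The "one sub-call per document" pattern might look like the following inside the REPL. The sketch assumes a hypothetical `sub_rlm` stub in place of sub_RLM_M (a real call would run a full RLM on the slice); here it answers with a trivial substring check so that the loop is runnable, and the `calls` list exists only to count invocations.

```python
calls = []  # records each sub-call so the count can be inspected


def sub_rlm(query, prompt_slice):
    """Hypothetical stand-in for sub_RLM_M; a real call would run a full RLM."""
    calls.append(len(prompt_slice))
    return "yes" if "needle" in prompt_slice else "no"


# The REPL holds the documents; the root model only writes the loop below.
context = [
    f"doc {i}: " + ("needle " if i % 3 == 0 else "hay ") * 5
    for i in range(12)
]
buffers = {}

# One sub-call per document: the number of calls grows linearly with the input.
for i, doc in enumerate(context):
    buffers[i] = sub_rlm("does this document mention a needle?", doc)

hits = [i for i, answer in buffers.items() if answer == "yes"]
```

The root model's contribution is a handful of lines of code; the per-document answers land in `buffers`, where later steps can aggregate them without re-reading the documents.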
Given a total token budget $K$ for the root model’s loop and a per-iteration budget $c$, the number of possible root iterations is bounded by $\lfloor K / c \rfloor$. In each iteration, the model can spawn arbitrarily many sub-calls, each itself an entire RLM invocation with its own context window. The lifetime of the REPL, therefore, is not limited by a single fixed input length; it is limited only by the total number of root reasoning steps, and each step can trigger a burst of parallel or sequential sub-problem solving.
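As a worked instance of this bound, take purely illustrative numbers that do not come from the source: a window of $K = 128{,}000$ tokens and a per-iteration budget of $c = 2{,}000$ tokens. Then

$$\left\lfloor \frac{K}{c} \right\rfloor = \left\lfloor \frac{128{,}000}{2{,}000} \right\rfloor = 64.$$

Under these assumed values the root loop gets at most 64 iterations, however long the prompt is. Any work that scales with $|P|$ therefore has to happen inside those iterations, through loops and sub-calls executed in the REPL.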
The visual below summarizes this environment. On the left, the root model $M$ generates code strings that flow into a central REPL box. Inside the REPL, the three persistent state components (context, buffers, Final) are visible, along with the highlighted built-in sub_RLM_M. The return path back to $M$ carries only the trimmed stdout, while a curly brace notes that $M$ receives solely Metadata(length, optional_short_prefix). The diagram also shows how sub_RLM_M can be invoked repeatedly, indicated by a loop icon, with an arrow leading to a copy of the full RLM box, illustrating recursion. At the bottom, the Final variable sends a termination signal, yielding the output $Y$. This picture, with the root iteration budget $\lfloor K/c \rfloor$ and the unbounded sub-call spawning capacity $\Omega(|P|)$ annotated, distills the entire mechanism into a single glance: the REPL environment is what turns a context-limited language model into a symbolic, unbounded reasoning engine.

6. Algorithm 1: Recursive Language Model Loop

With a stateful REPL environment at our disposal—one that can store variables, execute arbitrary code, and even spawn recursive language model calls—we can finally address the central puzzle: how to coordinate the base model, the REPL, and the user’s request so that the system can reason across arbitrarily long contexts without ever exceeding the model’s fixed window. This coordination is the job of a root scaffold, a control loop that sits above the LM and mediates every interaction. Algorithm 1 gives that loop a precise form, and understanding its design is crucial for seeing why naive attempts at “chain-of-thought on long documents” so often collapse into confusion or silent forgetting.
The fundamental challenge is that a standard transformer LM has a bounded context budget, say $K$ tokens. If we want it to process a million-token legal contract or continue a novel over thousands of pages, we cannot simply feed it the entire history. Even if we could, the model’s attention would become diluted and its representations would drift from the original fine-tuning distribution. The insight behind the Recursive Language Model (RLM) is to offload unbounded state and computation into the REPL, while keeping the prompt seen by the LM limited to a constant-size metadata summary. In effect, the LM is no longer a text generator; it becomes a code generator, emitting instructions that read, write, and compute within the REPL. The root loop then executes those instructions, updates the persistent state, and feeds a tight digest back to the model.
Let’s step through the loop logic. The algorithm begins by initializing a fresh REPL state from the user’s prompt $P$. Critically, it also installs a special function sub_RLM_M that allows the current process to spawn a recursive RLM instance. This recursion is the mechanism that gives the system its name: when the root model needs to perform a sub-task that itself requires language-model reasoning (say, summarizing an embedded contract clause before reasoning about its implications), it can emit a call to sub_RLM_M with that clause as a prompt, and the new child RLM will run its own loop, returning only the final result. This prevents the root model’s context from being polluted with low-level details.
The history variable hist stores the interaction trace that the LM actually sees. Its contents are deliberately curated. Each iteration, the model produces a block of code, which is then executed in the REPL. The output stdout from that execution could be massive (a raw dump of a database query, for example). Sending all of it back to the model would quickly overwhelm its context window. Instead, the algorithm applies a Metadata function that retains only a constant-length prefix of the output; exactly how much is kept is governed by the budget parameter $c$, ensuring that hist never grows unboundedly. The loop then appends both the code and the trimmed metadata to the history. This careful bookkeeping means that the language model always sees a tightly bounded, highly informative snapshot: the code it just wrote, and the essential shape of the result, not the full data.
A key consequence of this design is that the LM’s input size is insulated from the true scale of the task. Whether the REPL is holding a multi-gigabyte dataset in its memory or has just executed a thousand recursive sub-calls, the root model only ever sees a bounded hist. That removes the naive tendency to rely on ever-growing context windows and forces the model to treat the REPL as its external, persistent memory. In iterative prompting frameworks that lack this pruning step, models quickly suffer from “context rot”: the earlier parts of the conversation become less attended to, and the model’s outputs degrade. The RLM scaffold avoids this by never letting raw intermediate results stack up inside the LM’s prompt.
The loop terminates when the REPL state signals finality—typically by setting a flag state[Final]. That flag could be triggered by the model itself emitting a special return statement or by a built-in REPL procedure that detects task completion. The final output is then extracted from the REPL state and returned to the user. Note that the return value itself might be large (a fully generated document, for instance), but the LM never needs to hold that entire string in context; it simply arranges for it to be assembled piece by piece inside the REPL, and the root scaffold hands it off when done.
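To make the walkthrough concrete, here is a minimal Python sketch of such a root loop. It is an illustration under assumptions, not the source's Algorithm 1: the model is replaced by a hypothetical scripted stub (`make_scripted_model`) that replays a fixed plan instead of reading hist, lengths are measured in characters rather than tokens, and the constants `K` and `C` are invented for the demo.

```python
import contextlib
import io

K = 2000  # assumed bound on the history shown to the model (characters, not tokens)
C = 80    # assumed cap on how much of each output is echoed back


def metadata(text):
    """Bounded description of a string: its length plus a short prefix."""
    return f"[len={len(text)}] {text[:C]!r}"


def make_scripted_model():
    """Hypothetical stand-in for M. A real model would read hist and write code;
    this stub ignores hist and replays a fixed three-step plan."""
    plan = iter([
        "print(len(context))",
        "buffers['n'] = context.count('needle'); print(buffers['n'])",
        "Final = 'found ' + str(buffers['n']) + ' needles'",
    ])
    return lambda hist: next(plan)


def rlm(prompt, make_model, max_steps=10):
    model = make_model()
    # The prompt lives in the REPL state, never in hist.
    state = {"context": prompt, "buffers": {}, "Final": None}
    # Recursion primitive: a fresh RLM run on whatever slice the code passes in.
    state["sub_RLM_M"] = lambda piece: rlm(piece, make_model, max_steps)
    hist = metadata(prompt)
    for _ in range(max_steps):
        code = model(hist)
        out = io.StringIO()
        with contextlib.redirect_stdout(out):
            exec(code, state)  # a real system would sandbox this
        hist += "\n" + code + "\n" + metadata(out.getvalue())
        if len(hist) > K:
            raise RuntimeError("history exceeded K")
        if state["Final"] is not None:
            return state["Final"]
    raise RuntimeError("Final was never set")


P = "hay " * 100_000 + "needle " + "hay " * 100_000 + "needle"
answer = rlm(P, make_scripted_model)
```

The prompt in this demo is several hundred thousand characters long, yet the history passed to the stub stays within a couple of hundred characters, because only the code and the bounded metadata of each output are appended.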
The visual that follows distills Algorithm 1 into a clean pseudocode box, making the flow immediately legible. It shows line numbers 1–9, with the REPL initialization, the addition of the recursive sub‑RLM function, and the while True loop that drives the entire system. Below the box, bullet notes call attention to the three critical invariants: the constant‑size hist enforced by Metadata, the offloading of unbounded computation to the REPL, and the presence of sub_RLM_M as the recursion primitive. Together, these elements reinforce the main message: the root LM emits code, the REPL does the heavy lifting, and the LM’s own input remains compact. The diagram captures this architecture at a glance, complementing the more detailed walkthrough we have just completed.

7. Algorithm 2: An Ineffective Scaffold (Deliberately Flawed)

Having walked through a correct recursive language model loop that systematically unbounds context by nesting sub‑calls, it is just as important to study what happens when we attempt to patch an LLM with a naive scaffold that ignores the very constraints we wish to circumvent. By deliberately constructing an ineffective algorithm, we can isolate the precise design deficiencies that prevent a scaffold from truly scaling to long‑context reasoning. The flawed scaffold appears to mimic a recursive agent — it maintains a history, it asks the model to choose among actions like Finish, Exec, Search, and even sub_LLM_M, and it compacts the history when it grows too large. Yet underneath those surface similarities, three architectural decisions conspire to lock the system back into the same bounded regime that standard autoregressive generation suffers from.
The first and most immediate flaw is that the full prompt $P$ is placed directly into the hist buffer before the loop begins. Since every forward pass of the base model $M$ processes hist as an increasingly long prefix, the effective information that can influence the next token is capped by the model’s fixed context window of $K$ tokens. Any part of $P$ that falls outside that window becomes invisible to the model, no matter how cleverly the scaffold tries to compact earlier turns. Compaction via truncation or summarization is inherently lossy: it discards fine-grained detail, subtle dependencies, and precise numeric values that the original prompt may have contained. In practice, this means that for long documents, codebases, or multi-step reasoning chains, the scaffold will silently forget critical information after a few cycles. The very act of placing the prompt in the history thus inherits the core limitation we set out to overcome: the model’s input horizon remains bounded by $K$.
The second flaw concerns the mechanism for producing the final answer. When the model outputs the Finish action with an accompanying val, that val is generated directly as an autoregressive continuation of the history. There is no architectural separation between the iterative deliberation phase and the final output generation. As a result, the total length of the answer — and the complexity of reasoning that can be expressed within it — is constrained by the same generation‑length limits that ordinary LLM inference imposes. Long‑form analyses, hierarchical summaries, and outputs that should grow with the size of the input are all strangled by this bottleneck. The scaffold gives the illusion of extended computation, but in the end the model must pour everything into a single, bounded generation step.
The deepest flaw, however, is the absence of genuine symbolic recursion. The action sub_LLM_M is treated as a flat, opaque primitive: the scaffold invokes the model $M$ again with some sub-query, receives a single output, and stuffs that result back into the history. At no point does the scaffold instantiate a full sub-recursive language model, that is, a separately managed process with its own independent recursion depth, context management, and ability to spawn further recursive calls. Consequently, the scaffold cannot launch a number of sub-calls proportional to the input size, nor can it loop over slices of the input or compose multi-level reasoning trees. It remains a shallow system that offloads a few isolated sub-tasks to the same bounded model, without the divide-and-conquer structure that makes recursion truly scale. In essence, the scaffold only provides a flat bag of tools, not a programmable recursive machinery that can meaningfully extend the semantic horizon.
These three flaws — placing the prompt inside the bounded history, relying on a single autoregressive answer, and refusing to give sub‑calls the full recursive treatment — together trap the scaffold in the same performance envelope as the underlying base model. No amount of clever compacting or action routing can compensate for the fact that the system never steps outside the model’s original input length, output length, or depth constraints. The takeaway is crisp: a scaffold that merely wraps an LLM with additional non‑recursive actions cannot transcend the context limit; to do so, it must re‑organize how information flows across recursive boundaries, rather than merely papering over a fixed window.
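The three flaws can be seen in a toy version of such a scaffold. The sketch below is a deliberately simplified illustration, not the source's Algorithm 2: `base_model` is a hypothetical stub that can only read the last `K` characters of its input, `compact` stands in for lossy compaction by plain truncation, and the stub always answers with Finish, so the sub_LLM branch is shown for shape only and is never taken.

```python
K = 120  # assumed context window, measured in characters for this sketch


def base_model(hist):
    """Hypothetical stand-in for M: it can only read the last K characters."""
    visible = hist[-K:]
    answer = "yes" if "ZEBRA" in visible else "no"
    return ("Finish", answer)


def compact(hist):
    """Lossy compaction: keep only the tail that fits the window."""
    return hist[-K:]


def flawed_scaffold(prompt, query):
    hist = prompt + "\n" + query  # flaw 1: the full prompt enters hist
    while True:
        if len(hist) > K:
            hist = compact(hist)  # detail outside the window is lost
        action, val = base_model(hist)
        if action == "Finish":
            return val  # flaw 2: the answer is one bounded generation
        if action == "sub_LLM":
            # flaw 3: one flat call whose reply is pasted back into hist;
            # this stub never takes the branch, it is shown for shape only.
            _, reply = base_model(val)
            hist += "\n" + reply


query = "Is the code word present?"
short_prompt = "ZEBRA " + "filler " * 5
long_prompt = "ZEBRA " + "filler " * 100
```

With the short prompt the code word is still inside the window and the scaffold answers "yes"; with the long prompt compaction has already discarded it, so the same scaffold answers "no" without any signal that information was lost.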
The pseudocode diagram below distills this critique into a compact visual summary. A single code block titled Algorithm 2: Ineffective Scaffold (Deliberately Flawed) presents the exact loop with syntax highlighting, while three circled annotations (①, ②, and ③) point to the lines responsible for each flaw, accompanied by terse margin notes. The annotations echo our analysis: ① marks where the full prompt enters the history, locking the system to the $K$-token context; ② highlights the direct return val that caps output length; and ③ singles out the flat sub_LLM_M action that precludes true recursion. Below the code block, the same three flaws are repeated as bullet points, creating a quick reference that aligns the visual with the conceptual argument. By seeing the flawed scaffold laid bare, we gain a sharper appreciation for why the three missing design choices, the topic of the next section, are not just desirable but essential for any scaffold that hopes to extend an LLM beyond its native limits.

8. Three Missing Design Choices

Algorithm 2 showed us exactly what happens when a scaffold tries to cram a massive prompt into a single LLM call: the system quickly hits the context window wall $K$, resorts to lossy compaction, and loses the ability to aggregate reasoning across pieces of the prompt. That failure isn’t an accident; it’s the inevitable result of missing three deliberate design choices that native recursive language models (RLMs) build in from the start. The RLM paradigm doesn’t just wrap an LLM in a clever loop; it systematically unbinds the three dimensions that constrain ordinary inference: input length, reasoning depth, and output length. Each dimension is addressed by a specific architectural decision, and together they form the backbone that lifts every bound a fixed window imposes.
The first missing ingredient is treating the prompt as a variable rather than a literal string pushed into the LM history. In the RLM setting, the prompt $P$ lives inside the interpreter’s REPL state, stored in a context variable that the Python program can read, slice, and pass to sub-models entirely outside the LLM’s own token window. The language model never sees the full prompt at once; it never has to pay the token cost or suffer the compaction loss of a raw concatenation. By comparison, the flawed scaffold placed $P$ directly into hist, making it immediately bounded by $K$ and forcing aggressive truncation whenever $|P|$ grew large. The variable-based approach is not merely a storage trick. It changes the semantics: the prompt becomes a structured data resource that the program can manipulate arbitrarily, just like any other variable.
The second essential design choice is symbolic recursion: genuine loops that call sub-models on program-synthesised slices and then aggregate the intermediate results back into the REPL state. In Algorithm 2, the scaffold offered only a discrete sub_LLM action, a one-shot call with no loop syntax and no aggregation step. Even if the programmer wanted to process many segments, they were forced to manually unroll a fixed number of calls, each inheriting all the context-bound problems. RLMs break that ceiling by embedding recursive calls directly into the program flow: a while-loop or recursive function can invoke sub_RLM_M on each slice, store the partial findings in a list variable, and later combine them. This loop structure scales naturally with the size of the prompt, $|P|$, because the loop body only ever holds a small, focused chunk of information in the LLM’s context at any moment.
The third dimension is the output mechanism. Standard scaffolds rely on the LLM’s own autoregressive generation: a Finish action that must produce the final answer token by token, subject to the same $K$-limit and a generation cap. If the answer is long or requires multi-turn synthesis, you inevitably run out of context or hit the maximum generation length. RLMs resolve this by never forcing the LLM to autoregress a long output. Instead, the answer accumulates in a Final variable, assembled incrementally from the results of sub-calls, with each piece potentially being computed under a fresh, clean context. The LLM’s role shifts to computing small, self-contained conclusions that the surrounding program orchestrates, not producing a single monolithic reply. This separation means the output can be arbitrarily long: each sub-result is just another value in a Python variable, free from any language-model window.
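A small sketch shows how an answer assembled in the environment can exceed any single generation. Everything here is illustrative: `MAX_GEN` is an assumed per-generation cap measured in characters, and `sub_rlm` is a hypothetical stub for sub_RLM_M that returns a short, capped reply instead of calling a model.

```python
MAX_GEN = 60  # assumed cap on characters the model may emit in one generation


def sub_rlm(query, prompt_slice):
    """Hypothetical stand-in for sub_RLM_M returning a short, bounded answer."""
    reply = f"{prompt_slice.strip()} -> {len(prompt_slice.split())} words"
    return reply[:MAX_GEN]


context = [f"section {i} " + "lorem " * (i + 1) for i in range(50)]

# Each piece is produced under the cap, but the pieces are joined in the
# environment, so the assembled answer is not limited by MAX_GEN.
parts = [sub_rlm("summarize", section) for section in context]
Final = "\n".join(parts)
```

No single reply is longer than `MAX_GEN`, yet `Final` grows with the number of sections, which is the property the Final variable is meant to provide.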
Putting these three choices together yields effectively unbounded contexts for input, reasoning, and output. The prompt variable eliminates input‑length constraints; symbolic recursion lets the model reason over the entire prompt depth by processing it in manageable pieces; and the final‑variable output mechanism sidesteps the autoregressive bottleneck. Importantly, these dimensions reinforce each other: without the prompt as a REPL variable, the recursion loops would have nothing precise to slice; without the loops, the variable‑stored prompt would be inert; and without the variable‑based output, the recursive aggregations would still be forced through a narrow generation tube. The flawed scaffold lacked all three, and the result was a brittle, length‑capped system.
The visual below captures this contrast at a glance. It puts the three design choices side by side in a comparison table, with Algorithm 1 (RLM) on the left and the flawed Algorithm 2 on the right. Each row isolates one axis—prompt handling, recursion, and output—and highlights the RLM’s deliberate decisions using green‑tinted cells and check marks, while the broken scaffold’s limitations appear in red with crosses. Beneath the table, a centered banner reminds us that these three choices collectively deliver unbounded input, reasoning, and output—turning a standard LLM scaffold into a system that can scale to truly long‑context tasks without ever hitting the fixed window wall.

9. Evaluation Tasks: From Simple Retrieval to Quadratic Reasoning

With the three missing design choices clearly identified (unbounded input, stateful intermediate computation, and unbounded output), we must now ask: does a Recursive Language Model actually deliver on these fronts? Put differently, can a single architecture, armed with symbolic recursion, handle prompts that would hopelessly overwhelm any standard transformer, even those augmented with naive length-extension tricks? To answer that, we need evaluation tasks that are designed to break. They must systematically stress each axis, push far beyond the base model’s context window $K$, and require the kind of algorithmic depth that ordinary attention-based scaffolds cannot sustain.
The fundamental strategy is to vary the relationship between prompt length \(N\) and the intrinsic computational complexity of the task. A trivial needle‑in‑a‑haystack instance might have a prompt of a million tokens, but the number of relevant “needles” stays constant, so retrieval complexity is \(\mathcal{O}(1)\) with respect to \(N\). At the other extreme, tasks that demand aggregating information from all pairs of input elements explode at \(\mathcal{O}(N^2)\). In between sits linear reasoning, where every part of the prompt must be processed exactly once and combined into a coherent answer, scaling as \(\mathcal{O}(N)\). A model that truly unbounds context must be able to chew through all three regimes without degradation, not merely survive a single stress‑test.
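To make the three regimes concrete, the sketch below counts how many items (or item pairs) a task has to examine as the number of input items grows. The function name `work_units` and the regime labels are assumptions made for this illustration; they are not taken from any of the benchmarks. Note that the constant regime counts the information needed for the answer, not the effort of searching the haystack.

```python
from math import comb

def work_units(n_items: int, regime: str, n_needles: int = 1) -> int:
    """Count the items (or item pairs) a task must examine.

    Illustrative only: the regime names are labels chosen for this sketch.
    """
    if regime == "constant":   # needle retrieval: a fixed number of relevant items
        return n_needles
    if regime == "linear":     # every item is processed exactly once
        return n_items
    if regime == "quadratic":  # every unordered pair of items is compared
        return comb(n_items, 2)
    raise ValueError(f"unknown regime: {regime}")

for n in (10, 100, 1000):
    print(n, [work_units(n, r) for r in ("constant", "linear", "quadratic")])
# 10 [1, 10, 45]
# 100 [1, 100, 4950]
# 1000 [1, 1000, 499500]
```

Growing the input tenfold leaves the constant regime untouched, multiplies the linear regime by ten, and multiplies the quadratic regime by roughly one hundred.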
These complexity classes also map cleanly onto the three design choices. A constant‑complexity retrieval, like a very long needle‑in‑a‑haystack, primarily tests unbounded input: can the model locate a tiny signal hidden inside an arbitrarily long prompt, without being distracted by the length? Linear aggregation tasks (e.g., transforming every line of a dataset) additionally demand stateful intermediate computation: the model can no longer just attend to a few tokens; it must build up a result incrementally, possibly through many recursive sub‑calls. Quadratic reasoning pushes the envelope even further, requiring unbounded output of the right shape, because the final answer may need to describe pairwise relationships for an arbitrarily large input set. Standard models and naive chaining scaffolds inevitably collapse under these demands: either the prompt exceeds \(K\) and cannot be processed at all, or the model’s grip on earlier tokens decays and produces “context rot”, or the scaffold fails to coordinate state across segments.
With this taxonomy in mind, the evaluation suite for the Recursive Language Model was curated as a progression from simple retrieval to full quadratic reasoning. The tasks are deliberately chosen so that the prompt length \(|P|\) is infeasible for the base model \(M\), meaning \(|P| > K\). Take S‑NIAH, a needle‑in‑a‑haystack benchmark where the number of needles is \(\mathcal{O}(1)\) but the prompt can stretch from 8K to 1 million tokens. This is the purest test of unbounded input: retrieval fidelity must not decay even as the haystack grows about 125‑fold, far beyond the base context window. Next, BrowseComp‑Plus confronts the model with 1,000 documents and multi‑hop reasoning, packed into a prompt of 6–11 million tokens. Although the document count is fixed, the sheer scale forces the model to maintain a stable internal representation of the entire document set while hopping between facts, a severe test of stateful computation when the alternative is to forget earlier documents as soon as they scroll out of view.
For code understanding, LongBench‑v2 CodeQA uses real repository‑scale prompts ranging from 23K to 4.2 million tokens. The model must reason about file‑wide dependencies, which is impossible for a base model that cannot even see the full project. Meanwhile, the OOLONG family introduces algorithmic rigour. The plain OOLONG task (trec_coarse, 131K tokens) asks for an \(\mathcal{O}(N)\) transformation and aggregation over all lines: a linear sweep and combine operation that cannot be shortcut. Even more demanding, OOLONG‑Pairs restricts the prompt to 32K tokens but demands \(\mathcal{O}(N^2)\) pairwise aggregation. The model must reason about all pairs of items, generate a structured long‑form output, and do so without leaking quadratic attention costs throughout the entire prompt: exactly the sort of composite stress test that exposes the brittleness of any scaffold lacking true recursion.
The visual that accompanies this section distills these five tasks into a clear, at‑a‑glance taxonomy. Each row names the task, its scaling property, the range of prompt lengths, and the core capability being probed. The scaling properties are color‑coded: \(\mathcal{O}(1)\) appears in a muted green (retrieval cost is independent of prompt size); \(\mathcal{O}(N)\) in amber (linear work over the input); and \(\mathcal{O}(N^2)\) in red (quadratic reasoning, the steepest climb). Prompt lengths are shown in monospaced ranges, making it immediately obvious that every task deliberately exceeds the typical context window \(K\) and many do so by orders of magnitude. A short introductory line sets up the rationale, and the concluding note underscores why these tasks are infeasible for a base model. The table is not just a laundry list; it is a compact proof of coverage: the three design choices are exercised by distinct columns and rows, transforming an abstract set of desiderata into a concrete, falsifiable experimental plan.

10. Main Results: RLMs vs Baselines

The evaluation tasks from the previous section span a deliberate difficulty gradient: from simple long‑range needle retrieval to compositional reasoning where every pair of facts must be cross‑checked. This design lets us measure not just whether a method can read a long document, but whether it can think across it. We now turn to the main experimental question: given a frozen base language model with a bounded context window—GPT‑5 or Qwen3‑Coder, each limited to 32K tokens at inference time—how much capability can a recursive scaffolding unlock compared to strong non‑recursive baselines?
The baselines we test are carefully chosen to represent the best existing practices for extending LLM context. The base model is called directly, without any scaffolding; it sees only a truncated prefix, so its performance sets a floor. CodeAct+BMS combines the base model with a best‑of‑N memory‑selection strategy and a code‑acting interface, a representative “Retrieve‑and‑Read” agent. CodeAct+sub adds the ability to make nested tool calls, mimicking a simple recursive structure but without a true symbolic context stack. The Summary agent is a Map‑Reduce‑style approach: the model first summarizes chunks, then reasons over the summary chain. This is the strongest prior method on many long‑context benchmarks. The full RLM (Recursive Language Model) implements our proposed design: symbolic recursion with a persistent context stack, sub‑RLM calls, and a principled halt/resolve mechanism. To isolate the contribution of sub‑calls, we also report an ablated RLM (no sub) that uses the same stack but flattens recursive invocations into a single linear reasoning trace.
The results, collected across three task families, are remarkably consistent. On S‑NIAH (1M tokens) all base models score 0%, confirming that a 32K‑token window cannot even locate the needle. The Summary agent reaches 76–79%, while CodeAct variants hover around 80–84%. Yet the full RLM achieves 93.5% for GPT‑5 and 91.7% for Qwen3‑Coder—nearly perfect retrieval a million tokens away. On BrowseComp‑Plus (10M tokens) the gulf is even wider: no baseline exceeds 0% accuracy! The task demands integrating facts scattered across multiple long documents, a challenge that completely breaks retrieve‑then‑read or summarise‑then‑reason pipelines. In contrast, the RLM obtains 91.3% (GPT‑5) and 88.9% (Qwen3‑Coder), demonstrating that the recursive scaffold can systematically decompose and recompose evidence far beyond the model’s native horizon.
The most illuminating picture comes from OOLONG‑Pairs (32K tokens), the quadratic reasoning task. Here every fact must be pairwise compared, and the number of required comparisons grows quadratically with the number of items. The base model again fails (0.2–0.4%). CodeAct+sub and the Summary agent, despite their sophistication, reach only 18–28% accuracy—they can process longer input but cannot reliably perform the dense cross‑referencing. The full RLM with GPT‑5 hits 58.0%, more than double the best baseline. Interestingly, Qwen3‑Coder benefits less from full recursion (23.1% vs 28.5% with no sub), hinting that the base model’s innate reasoning strength modulates how much the scaffolding can amplify. The drop when ablating sub‑calls is dramatic: on OOLONG‑Pairs, 58.0% → 36.0% for GPT‑5, and on BrowseComp‑Plus, 91.3% → 85.6%. This confirms that sub‑RLM invocations are not a minor optimisation; they are the mechanism by which the model temporarily isolates a sub‑problem, solves it, and lifts the result back into the parent context without polluting the reasoning chain. For information‑dense problems, that isolation is what makes recursive synthesis possible.
Critically, this large performance gain does not come with a proportional increase in computational cost. The median query cost for the full RLM was $0.11, compared to $0.14 for CodeAct+sub and $0.18 for the Summary agent. The symbolic stack avoids the quadratic token expansion of Map‑Reduce summarisation and the trial‑and‑error overhead of best‑of‑N retrieval, striking a balance that makes recursion a practical inference‑time strategy.
The visual below distills these comparisons into a dense but legible table. Table 1 arranges tasks and models in rows, with columns for each baseline and the two RLM variants. The best accuracy in every row—without exception—appears in the full RLM column, highlighted with a light green fill. A quick scan reveals three clear patterns: the base model’s zero scores on long‑context tasks give way to near‑perfect RLM accuracy; the best baselines (CodeAct variants and Summary) make respectable progress but plateau far below 90% on the hardest retrieval tasks and below 30% on the quadratic reasoning task; and the RLM (no sub) column, while still strong, consistently underperforms the full RLM, especially on OOLONG‑Pairs and BrowseComp‑Plus. Beneath the table, a compact horizontal bar chart compares the median per‑query cost, with the RLM bar slightly shorter than those of the strongest baselines. Together, these two visuals capture the central empirical claim of this work: recursive symbolic scaffolding transforms a bounded LLM into a system that reasons accurately across millions of tokens, at a lower cost than the best alternatives—and the recursive sub‑call is not optional.

11. Scaling Behavior: Degradation of Base Models vs RLMs

Having established that Recursive Language Models (RLMs) outperform strong baselines on long‑context reasoning tasks, the next natural question is how that advantage evolves as we stretch the input length to extreme values. It is one thing to report aggregate metrics; it is quite another to watch an ordinary LLM abruptly lose its ability to reason while the recursive variant holds steady—or degrades gracefully. The scaling behavior exposes the precise points at which standard inference breaks down and reveals the design properties that keep RLMs resilient.
The experiment uses three synthetic tasks that differ in the computational complexity of the reasoning required across the input. S‑NIAH (single needle‑in‑a‑haystack) asks the model to locate and reproduce a single fact buried in a long text; the intrinsic difficulty is constant, because once the fact is found, no further integration is needed. OOLONG demands a linear scan, where the model must accumulate information while traversing the sequence, as in tracking a running sum or identifying the \(k\)-th occurrence of a pattern. OOLONG‑Pairs upgrades the challenge to a quadratic dependency: every piece of evidence must be compared against every other, for instance when verifying pairwise constraints among a set of extracted entities. The tasks thus form a natural ladder from trivial lookup to simple sequential reasoning to all‑to‑all cross‑referencing.
A standard autoregressive transformer, even one as capable as GPT‑5, struggles to maintain coherent reasoning as the context window fills. This is the familiar phenomenon of context rot: attention scores become diffuse, early tokens receive a shrinking share of the attention mass, and the model effectively forgets the beginning of the prompt well before the architectural limit is reached. Positional encoding limitations and the inherent soft bottleneck of a fixed‑length context window accelerate the collapse. For a constant‑complexity task like S‑NIAH, the effect may be mild: the needle can often still be retrieved if it appears early. But for OOLONG‑Pairs, which requires linking tokens from opposite ends of the document, the base model’s accuracy crashes to 0% as early as 16K tokens, far below the advertised context length. That sharp drop is not a gradual decline; it signals a fundamental loss of semantic connectivity once the context exceeds a critical threshold.
The Recursive Language Model sidesteps this degradation entirely by re‑architecting how information is ingested. Instead of feeding the whole document in one monolithic forward pass, the RLM breaks it into overlapping chunks, processes each chunk with a frozen language model to extract a symbolic summary (a “context symbol”), and then recursively combines these symbols in a tree over multiple rounds. This design unbinds the input length from the model’s effective context: the per‑chunk computation never exceeds a modest token budget, and the recursive merging ensures that evidence from distant regions can be compared in a logically nested manner. Consequently, the semantic horizon (the distance over which two tokens can influence one another) is no longer bounded by the raw context window.
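The chunk‑then‑merge flow described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `extract_symbol` and `merge_symbols` are hypothetical stand‑ins for frozen‑LM calls (here simple string operations), and the chunk size, overlap, and "magic number" fact format are all assumptions chosen for the example.

```python
import re
from typing import Callable, List

def split_overlapping(text: str, size: int, overlap: int) -> List[str]:
    """Split text into chunks of at most `size` characters that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("need 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def extract_symbol(chunk: str) -> str:
    """Stand-in for a per-chunk model call: keep only complete 'magic number' facts."""
    # The lookahead rejects a number that was cut off at the end of a chunk.
    return ";".join(re.findall(r"magic number is \d+(?=\s)", chunk))

def merge_symbols(a: str, b: str) -> str:
    """Stand-in for a merge call: union the facts, dropping duplicates from overlaps."""
    facts = [f for f in a.split(";") + b.split(";") if f]
    return ";".join(dict.fromkeys(facts))

def recursive_reduce(symbols: List[str], merge: Callable[[str, str], str]) -> str:
    """Merge symbols pairwise, level by level, until one remains (a binary tree)."""
    if not symbols:
        raise ValueError("no symbols to merge")
    while len(symbols) > 1:
        merged = [merge(symbols[i], symbols[i + 1]) for i in range(0, len(symbols) - 1, 2)]
        if len(symbols) % 2:  # carry an unpaired symbol up to the next level
            merged.append(symbols[-1])
        symbols = merged
    return symbols[0]

def answer(document: str, size: int = 200, overlap: int = 40) -> str:
    # The overlap must be at least as long as one fact, so that every fact
    # appears whole in at least one chunk.
    symbols = [extract_symbol(c) for c in split_overlapping(document, size, overlap)]
    return recursive_reduce(symbols, merge_symbols)

haystack = "filler " * 500 + "the magic number is 42 " + "filler " * 500
print(answer(haystack))  # magic number is 42
```

No single call ever sees more than `size` characters, yet the fact survives to the root of the merge tree regardless of where it sits in the document.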
The scaling curves, shown in the top half of the accompanying figure, make this contrast stark. The panel for S‑NIAH shows the base model (dashed red) dipping only slightly, while the RLM (solid blue) stays above 95% across the entire span from 8K to 1M tokens. On the linear OOLONG task, the base curve drops below 20% by 128K; the RLM remains above 80% all the way to 1M. The most dramatic difference appears in the OOLONG‑Pairs panel: the base model collapses at 16K (a vertical dashed line marks the 0% point), but the RLM plateaus above 50% until around 272K and then declines gradually, still holding at roughly 50% at 1M. This is a qualitative difference—RLMs do not merely postpone the degradation; they change the failure mode from catastrophic forgetting to a controlled, capacity‑limited decay.
The second part of the visual (Figure 11, bottom half) addresses the inevitable concern: what is the cost of this resilience? Using the same logarithmic x‑axis, the plot displays the measured dollar cost of the RLM pipeline (solid blue) alongside an extrapolated cost for the base model (dotted gray), assuming one could hypothetically scale it linearly. For an input of 8K tokens, the RLM costs about $0.12; at 1M tokens, the cost rises to $1.30, an increase of only about 11× while the token count grows 125‑fold. Moreover, the RLM cost curve stays within the same order of magnitude as the projected base cost, meaning that the recursive architecture keeps cost growth sub‑linear between these two endpoints without sacrificing the accuracy that the base model would lose. In other words, for tasks where a standard LM falls off a cliff, RLM delivers accuracy at a price comparable to what you would wish a naïve model could charge if it could handle the length.
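The arithmetic behind these two endpoints can be checked directly. The sketch below assumes 8K means 8,000 tokens and 1M means 1,000,000 tokens, and it fits a single power law through just the two quoted points, which says nothing about the shape of the curve in between.

```python
import math

tokens_small, tokens_large = 8_000, 1_000_000  # input sizes quoted in the text (assumed exact)
cost_small, cost_large = 0.12, 1.30            # dollar costs quoted in the text

token_growth = tokens_large / tokens_small     # 125.0
cost_growth = cost_large / cost_small          # about 10.8

# Slope of the straight line joining the two points on a log-log plot:
# cost ~ tokens ** exponent. An exponent of 1 would be exactly linear.
exponent = math.log(cost_growth) / math.log(token_growth)

print(f"tokens grow {token_growth:.0f}x, cost grows {cost_growth:.1f}x")
print(f"implied scaling exponent: {exponent:.2f}")
# tokens grow 125x, cost grows 10.8x
# implied scaling exponent: 0.49
```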
Taken together, the two panels encapsulate the core empirical thesis: Recursive Language Models break the traditional accuracy–length trade‑off on reasoning‑intensive tasks, and they do so while maintaining runtime costs that are practically linear in the input size. The visual neatly summarizes the failure modes of single‑pass inference and the sustained performance of symbolic recursion, making it clear why RLMs represent a fundamental advance for long‑form reasoning.

12. Ablations and Cost Analysis

The ability of recursive language models to sustain coherent reasoning across vastly extended contexts is striking, but it invites an immediate follow‑up question: what exactly inside the architecture is producing these gains, and at what practical cost? The scaling comparisons from the previous section established that RLMs degrade far more gracefully than standard long‑context baselines, yet the black‑box nature of the system still leaves open whether the benefit comes from the recursive scaffolding as a whole, from the ability to spawn sub‑calls, or from other subtle properties of the iterative refinement loop. To isolate the contribution of the sub‑call mechanism — the capability that allows an RLM to dispatch a sub‑task, collect its result, and reintegrate it into the parent context — the authors ran a decisive ablation: they stripped away sub‑calls while keeping the rest of the recursive control flow intact. The result is a family of models that still scale beyond naïve baselines, but whose performance on information‑dense benchmarks tells a nuanced story.
On tasks where the primary challenge is understanding a single, long document and extracting a straightforward answer, the sub‑call mechanism does not necessarily help and can even introduce unnecessary overhead. For example, on the CodeQA benchmark with a Qwen3‑Coder backbone, the ablation actually shows better accuracy without sub‑calls than with them (66% vs 56%), indicating that on tasks that are not inherently multi‑hop or comparative, the recursion itself already provides enough context management, and the extra decomposition only adds noise or misrouting. The real story flips, however, when we examine information‑dense multi‑document tasks like OOLONG‑Pairs. Here the test demands that the model correlate claims across two independent, lengthy, and deliberately obfuscated narratives, a scenario where no single linear reading can hold all the evidence in working memory. Removing sub‑calls caused a sharp performance drop: GPT‑5 RLM fell from 58.0% to 36.0% (22 percentage points, a relative drop of about 38%), and Qwen3‑Coder RLM fell from 23.1% to 16.7% (6.4 percentage points, a relative drop of about 28%). Read the other way, sub‑calls yield a relative improvement of roughly 61% for GPT‑5 on OOLONG‑Pairs, underscoring their role as the critical ingredient that turns a recursive scaffold into a genuine multi‑step reasoner for dense information landscapes.
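Because percentage points and relative percentages are easy to confuse, it helps to compute both from the accuracies quoted above. The helper name `compare` is an assumption for this sketch; the four input numbers are the ones reported in the text.

```python
def compare(with_sub: float, without_sub: float) -> dict:
    """Return absolute and relative differences between two accuracies given in percent."""
    points = with_sub - without_sub
    return {
        "points_lost": round(points, 1),                            # percentage points
        "relative_drop_pct": round(100 * points / with_sub, 1),     # loss relative to the full RLM
        "relative_gain_pct": round(100 * points / without_sub, 1),  # gain relative to the ablation
    }

gpt5 = compare(58.0, 36.0)
qwen = compare(23.1, 16.7)
print(gpt5)  # {'points_lost': 22.0, 'relative_drop_pct': 37.9, 'relative_gain_pct': 61.1}
print(qwen)  # {'points_lost': 6.4, 'relative_drop_pct': 27.7, 'relative_gain_pct': 38.3}
```

The same gap therefore reads as 22 points, a 37.9% relative drop, or a 61.1% relative gain, depending on which accuracy is taken as the reference.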
These ablations cement the intuition that sub‑calls are not a universal accelerator; they are a specialised tool for tasks that require iterative retrieval, cross‑verification, or hierarchical decomposition. When a problem is essentially linear but long, a well‑tuned RLM can already maintain a coherent internal belief state without delegating sub‑problems. When the evidence is scattered and must be assembled hierarchically, sub‑calls become the difference between functional and failed reasoning. This finding has direct practical import: practitioners should expect that the same RLM design will behave differently across task families, and that blindly enabling sub‑calls on every long input may degrade simple cases without careful gating.
Beyond accuracy, any production‑oriented reader cares about cost, and here the evaluation reveals a distributional property that is as important as the mean. Median API costs for RLM‑based inference are often lower than those of the base model running on the same task, particularly on OOLONG. This counterintuitive result arises because the RLM processes smaller, focused chunks per call instead of repeatedly attending over the entire enormous prompt, leading to fewer total tokens despite the overhead of recursion. However, the median story hides a long tail of expensive trajectories. The 95th‑percentile cost for RLM runs can be dramatically higher than for baselines, driven entirely by runs where the model enters deep iteration loops. In particular, Qwen3‑Coder exhibits a pronounced tendency to over‑call — generating nested sub‑calls far beyond what is necessary — which inflates the tail. Conservative call‑chain designs, by contrast, maintain medians of only tens of calls, suggesting that careful termination policies and iteration caps are essential for keeping worst‑case costs under control.
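An iteration cap of the kind mentioned above can be as simple as a shared counter that every sub‑call must draw from. The following is a minimal sketch under stated assumptions: `CallBudget`, `run_with_cap`, and the fallback behaviour (stop delegating and return partial results) are hypothetical names and policies, not something the evaluated systems are known to implement.

```python
from typing import Callable, Iterable, List

class CallBudgetExceeded(RuntimeError):
    """Raised when a trajectory tries to make more sub-calls than allowed."""

class CallBudget:
    def __init__(self, max_calls: int) -> None:
        self.max_calls = max_calls
        self.used = 0

    def spend(self) -> None:
        """Reserve one sub-call, or raise if the cap has been reached."""
        if self.used >= self.max_calls:
            raise CallBudgetExceeded(f"cap of {self.max_calls} sub-calls reached")
        self.used += 1

def run_with_cap(tasks: Iterable[int], sub_call: Callable[[int], int],
                 budget: CallBudget) -> List[int]:
    """Run sub-calls until the tasks or the budget run out, whichever comes first."""
    results = []
    for task in tasks:
        try:
            budget.spend()
        except CallBudgetExceeded:
            break  # stop delegating; the caller works with partial results
        results.append(sub_call(task))
    return results

budget = CallBudget(max_calls=10)
partial = run_with_cap(range(100), lambda t: t * 2, budget)
print(len(partial), budget.used)  # 10 10
```

A hard cap bounds the worst‑case cost of a trajectory at the price of possibly truncating a legitimate deep decomposition, so the limit would need to be tuned per task family.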
These intertwined accuracy‑cost trade‑offs are exactly what the accompanying visual captures in one compact glance. It arranges the ablation highlights on the left (a few crisp numbers contrasting the OOLONG‑Pairs drop and the CodeQA inversion) and a box‑plot panel on the right that renders the cost distribution for four representative tasks (S‑NIAH, CodeQA, OOLONG, OOLONG‑Pairs). Across all tasks, the RLM boxes (coloured in blue) push the median below the grey baselines, confirming the typical efficiency gain. Yet the whiskers stretching upward to the dollar range betray the long tail: Qwen3‑Coder RLM boxes, especially on OOLONG‑Pairs, extend far beyond their peer methods, with outliers brushing the $10 mark on the log‑scale axis. An annotation points directly to that box, reminding the viewer that the median is below the baseline but tail risk exists. This single image ties the ablation logic to the operational budget: sub‑calls are the engine of deep reasoning, and their cost variance is the risk that engineers must contain with iteration‑bound controls.

13. Emergent Trajectories and the First Natively Recursive LM

The ablations and cost analyses in the previous section confirmed that recursive language models trade a modest increase in computational overhead for an unbounded reasoning window—an attractive bargain at scale. But scaling is not the whole story. The more provocative question is whether an architecture that natively supports symbolic recursion learns to reason in ways that are qualitatively different from a standard transformer. When recursion is woven into the model’s core, rather than bolted on as a post‑hoc scaffold, do new behaviors emerge that cannot be replicated by feeding a longer flat context to a traditional LLM? This section examines the emergent trajectories we observed when training the first truly natively recursive language model, and why those trajectories mark a departure from the predictable scaling curves of conventional systems.
An emergent trajectory refers to an internal reasoning path through the model’s recursive state that was never explicitly programmed or demonstrated in the training data. It is not merely a longer sequence of tokens; it is a pattern of symbolic operation—branching, depth selection, early termination, self‑correction—that the model discovered during end‑to‑end training on tasks that reward correct final outputs but do not prescribe how the recursion should unfold. In a standard decoder‑only transformer, the computational graph is a single chain of token‑level predictions, and any form of internal deliberation must be squeezed into the residual stream of a fixed‑depth forward pass. Even with chain‑of‑thought prompting, the model writes out intermediate reasoning steps as a linear text, which may still suffer from context rot and remains bounded by the maximum context length. The recursive LM, by contrast, can dynamically allocate additional recursion depth whenever the symbolic state indicates that more work is needed, and it can produce output tokens while continuing to refine its internal representation. This decoupling of reasoning depth from output length is the structural precondition for emergent behavior, but it does not guarantee it—the interesting question is what the model actually learns to do with that freedom.
When we trained a recursive LM from scratch on a mix of algorithmic reasoning, mathematical derivation, and multi‑document synthesis tasks, we observed a series of capabilities that the model was not directly taught. First, the model spontaneously developed a depth‑modulating strategy: simple, factual queries triggered only a single recursive step (essentially a direct read‑out), while harder compositional problems caused the recursion to deepen, sometimes peaking at ten or more symbolic cycles. There was no reward for depth; the model discovered that extra rounds of symbolic refinement improved accuracy on the hard instances it encountered. Second, the model learned to re‑enter earlier recursive states, effectively backtracking when a certain line of reasoning led to an inconsistency. Standard transformers cannot easily undo a mistaken token without an explicit correction token, but the recursive LM’s symbolic registry allowed it to overwrite its own conclusions mid‑trajectory, producing a final answer that contradicted an earlier intermediate thought. Third, and most strikingly, we saw autonomous decomposition on problems the model had never seen formatted as sub‑tasks: given a complex instruction, the model would recursively instantiate smaller sub‑problems, solve them in parallel branches, and merge the results—all without any hand‑crafted decomposition prompt. These three phenomena—depth modulation, backtracking, and autonomous decomposition—constitute the core of what we mean by an emergent recursive trajectory.
It is important to distinguish these behaviors from the superficially similar patterns that arise in scaffolded systems like Tree-of-Thoughts or ReAct. In those scaffolds, the recursion depth and branching structure are externally programmed by a fixed algorithm; the LLM merely provides the content for each step, but it does not control the shape of the computation. The natively recursive LM, however, learns to output both the symbolic commands that govern the recursion (push, pop, expand, merge) and the linguistic content, all through a single differentiable process. The training loss must simultaneously optimize for answer correctness and for the efficiency of the recursive schedule, indirectly, through the reward of reaching the right answer within a finite computational budget. As a result, the emergent trajectories we see are not a designer’s speculation; they are the result of an end‑to‑end optimization that balances accuracy and cost in a way no hand‑crafted scaffold can replicate. This is the first language model where the recursion is learned rather than prescribed, and therefore it deserves the label “natively recursive.”
The visual below captures the essence of this transition. It contrasts the flat, bounded trajectory of a standard LLM with the branching, depth‑variable trajectory of the natively recursive model. On the left, a standard model produces tokens in a straight line, its reasoning depth fixed by the number of serial decoding steps. On the right, the recursive LM’s path splits, loops back, and adapts to the complexity of the query—simple problems receive shallow recursion, while intricate problems trigger deeper symbolic refinement. The diagram also highlights the three emergent phenomena we just described: depth modulation (the varying number of recursive cycles), backtracking (loops that revisit earlier symbolic states), and autonomous decomposition (parallel branches that merge). By showing these trajectories as hand‑drawn, sketchy lines, the figure emphasizes that these behaviors were not engineered into the model’s blueprint; they emerged organically from the interplay of architecture and training objective, and they point toward a new class of language models whose reasoning processes are as unbounded as the problems they are asked to solve.

14. Limitations and Future Work

Even though Recursive Language Models (RLMs) finally break free from the fixed context window and output-length ceilings that constrain standard autoregressive decoders, their current designs introduce a set of sharp practical bottlenecks. These hurdles do not negate the paradigm’s promise; rather, they spotlight the engineering and algorithmic gaps that must be filled to turn a compelling concept into a deployable system. Four interconnected limitations dominate today’s RLM implementations, each with a clear root cause and a direct path toward mitigation.
The first and most tangible bottleneck is synchronous blocking sub‑calls. Every time an RLM emits a symbolic sub_RLM_M command to delegate a sub‑problem to a fresh LLM instance, the root model halts—waiting for the sub‑call to finish its entire execution (including its own recursive call tree) before processing the REPL output and moving to the next token. This serialized dependency causes high wall‑clock latency that multiplies with the number of sequential sub‑calls, even if the underlying hardware could handle many of them in parallel. For a three‑step algebraic derivation that calls two independent sub‑derivations, an ideal system would dispatch them concurrently and combine the results; a synchronous RLM instead forces them into a linear wait, squandering the inherent parallelism of the problem graph. Asynchronous execution would dramatically improve throughput, but it demands a careful orchestration layer that captures dependencies and merges REPL states correctly—an open system‑design challenge.
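The difference between blocking and concurrent dispatch can be illustrated with Python's `asyncio`. In this sketch `sub_call` is a hypothetical stand‑in for a sub‑model invocation, with `asyncio.sleep` simulating model latency; it assumes the two sub‑derivations are independent and ignores the harder problem of merging REPL state.

```python
import asyncio
from typing import List, Tuple

async def sub_call(name: str, seconds: float) -> str:
    """Stand-in for a sub-model call; the sleep simulates model latency."""
    await asyncio.sleep(seconds)
    return f"result:{name}"

async def run_sequential(tasks: List[Tuple[str, float]]) -> List[str]:
    # Each call waits for the previous one: total latency is the sum.
    return [await sub_call(name, seconds) for name, seconds in tasks]

async def run_concurrent(tasks: List[Tuple[str, float]]) -> List[str]:
    # Independent calls are dispatched together: total latency is the maximum.
    # gather() returns results in the order the calls were passed in.
    return list(await asyncio.gather(*(sub_call(n, s) for n, s in tasks)))

tasks = [("left", 0.05), ("right", 0.05)]
sequential = asyncio.run(run_sequential(tasks))
concurrent = asyncio.run(run_concurrent(tasks))
print(sequential == concurrent, concurrent)
```

Both strategies return the same results in the same order; only the waiting time differs, which is why the orchestration challenge lies in tracking dependencies rather than in the dispatch itself.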
A second, subtler limitation lies in the fixed system prompt that governs the meta‑level decision of when to invoke recursion. An RLM relies on a carefully engineered prompt that includes the function signature and usage examples to decide whether to call sub_RLM_M or to answer directly. In practice, this prompt is not a one‑size‑fits‑all artifact; model‑specific behavior diverges sharply. For example, Qwen3‑Coder displays an over‑triggering problem: it emits sub_RLM_M for trivial steps that the model could easily solve in a single forward pass, inflating recursion depth and latency without improving accuracy. Tuning the prompt per model family restores balance but breaks the universality that the RLM abstraction aims for. A more robust solution would either embed the recursive decision‑making into the model’s own weights (a natively recursive LM) or learn a lightweight gating mechanism that suppresses unnecessary calls based on internal confidence signals.
The third limitation is the maximum recursion depth of one enforced in all current RLM experiments. While a single hop of delegation can already expand context and offload memory, many real‑world long‑horizon tasks—theorem proving, multi‑step planning, code refactoring—naturally decompose into hierarchies deeper than two levels. Restricting to depth‑1 sub‑calls means the system cannot recursively divide a problem until atomic chunks are reached; instead, every sub‑task must be solved by a sub‑model with a full context, which may itself benefit from further decomposition. Extending to deeper recursion introduces new difficulties: maintaining global coherence across nested scopes, preventing infinite regress, and managing the accumulation of REPL state without bloating the root context. The fact that deeper recursion is currently unexplored signals a frontier where theoretical guarantees about correctness and termination will become essential.
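The effect of a depth cap can be seen on a toy problem. The sketch below sums a list by recursive halving and reports the largest chunk that had to be solved directly; `solve`, `atomic_size`, and the summation task are assumptions chosen for illustration and do not correspond to any experiment in the source.

```python
from typing import List, Tuple

def solve(items: List[int], max_depth: int, depth: int = 0,
          atomic_size: int = 2) -> Tuple[int, int]:
    """Sum `items` by recursive halving.

    Returns (total, largest_leaf), where largest_leaf is the biggest chunk
    that had to be solved directly without further delegation.
    """
    # Stop when the chunk is atomic or the depth cap is hit; the cap also
    # guarantees termination, so there is no infinite regress.
    if len(items) <= atomic_size or depth >= max_depth:
        return sum(items), len(items)
    mid = len(items) // 2
    left_total, left_leaf = solve(items[:mid], max_depth, depth + 1, atomic_size)
    right_total, right_leaf = solve(items[mid:], max_depth, depth + 1, atomic_size)
    return left_total + right_total, max(left_leaf, right_leaf)

data = list(range(16))
print(solve(data, max_depth=1))  # (120, 8)
print(solve(data, max_depth=4))  # (120, 2)
```

With a cap of one, each delegate still has to handle half of the input on its own; with a deeper cap, the problem is divided until the chunks are atomic.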
The fourth and perhaps most insidious issue is the root model’s tendency to discard REPL state. Even when a sub‑call correctly computes a result and returns it inside a <final>...</final> tag, the root model sometimes ignores that output and falls back to a hallucinated autoregressive continuation. In Qwen3‑Coder trajectories, the model might properly receive a verified factorization via a sub‑call, yet still emit an answer flavored by its own prior distribution, effectively overriding the symbolic ground truth. This behavior exposes a deep friction between the symbolic, external‑memory style of REPL state and the model’s internal next‑token probability. Bridging this gap will require mechanisms that give the REPL output higher salience in the attention mechanism, or training schemes that reward the model for conditioning on retrieved facts rather than substituting its own guesses.
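A far simpler stopgap than the mechanisms proposed above, and one that is not described in the source, is a guard outside the model that prefers a tagged result whenever one exists. The sketch below assumes results are wrapped in a literal `<final>...</final>` tag; the function names are hypothetical.

```python
import re
from typing import Optional

# Non-greedy match so that several tagged results in one output stay separate.
FINAL_TAG = re.compile(r"<final>(.*?)</final>", re.DOTALL)

def extract_final(repl_output: str) -> Optional[str]:
    """Return the last tagged result in the REPL output, or None if there is none."""
    matches = FINAL_TAG.findall(repl_output)
    return matches[-1].strip() if matches else None

def choose_answer(repl_output: str, model_guess: str) -> str:
    """Prefer the symbolic result over the model's free-form continuation."""
    final = extract_final(repl_output)
    return final if final is not None else model_guess

print(choose_answer("step log...\n<final> 3 * 7 </final>", "21 is prime"))  # 3 * 7
print(choose_answer("no tagged result here", "best guess"))                 # best guess
```

Such a guard only helps when the discarded value is the final answer; it does nothing for intermediate results that the root model ignores while it is still reasoning.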
These limitations are not dead ends; they point directly toward a vibrant research agenda. Training native RLMs with reinforcement learning on recursive rollouts could embed the decision to call sub‑models into the policy itself, eliminating the need for brittle prompt engineering and enabling emergent strategies like adaptive depth. Hybrid symbolic‑neural attention could directly incorporate REPL state into the attention keys and values, making the retrieved information as influential as any other token in the context, thereby preventing state discard. And extending beyond the original long‑context motivation would apply RLMs to general long‑horizon reasoning: planning in physical environments, interactive theorem proving, and large‑scale code generation, where the ability to decompose and recombine knowledge sources is the primary cognitive lever.
The accompanying diagram captures this dual landscape concisely. On the left, a column of descending red boxes enumerates the four limitations, each paired with a crisp label and an arrow indicating their dampening effect on RLM effectiveness: synchronous blocking, prompt sensitivity, shallow recursion, and discarded REPL state. On the right, three green arrows surge upward, converging into a central circle labeled “Native RLM.” Each arrow represents a future direction—RL‑based rollouts, hybrid attention, and long‑horizon generalization—that can transform today’s bottlenecks into solved design dimensions. The visual contrast between the red constraints and the green pathways makes the “bottlenecks and opportunities” framing immediately legible, while the sketchy Excalidraw style signals that these are active research concerns, not closed chapters. It serves as both a summary of the section and a roadmap for the open‑source and academic communities that are now building the next generation of recursive language models.

15. Key Takeaways: The RLM Paradigm

Having surveyed the limitations of current long-context inference (context rot, rigid bounded windows, and the steep $O(N^2)$ attention cost), it becomes clear that simply extending the transformer’s input buffer is not enough. The core bottleneck is not so much the number of tokens a model can see at once, but the semantic depth it can sustain over those tokens without losing coherence. Recursive Language Models (RLMs) dissolve these constraints by rethinking what it means to process a prompt. They treat language model calls as programmable components inside an iterative, symbolic execution loop, turning a single forward pass into a persistent, structured computation. The result is a paradigm where input length, output length, and reasoning depth become unbounded, not by enlarging the model but by changing how we interact with it.
The key insight is to treat the prompt $P$ as a mutable variable inside a persistent execution environment $E$, rather than as a static input to a neural network. In standard inference, the prompt is fixed before the first token is generated; the model has no way to revisit, revise, or expand its own premises. An RLM instead places the prompt inside a REPL-like wrapper where it can be updated, appended, or replaced after each recursive sub-call. This gives the system the ability to ingest truly massive documents (10 million tokens and beyond) because the prompt can be grown incrementally, chunked, and recombined across recursive steps without hitting a context-window ceiling. The BrowseComp+ benchmark, which demands synthesizing information from very long pages, saw a leap to 91.3% accuracy with GPT‑5 operating under this paradigm. Unbounded input emerges not from a larger attention window, but from the symbolic flexibility to treat the prompt as an extensible memory.
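The idea of a prompt living in an environment rather than in the model's input can be sketched in a few lines of Python. This is an illustrative toy, not the actual RLM runtime: the class `PromptEnvironment`, its methods, and the variable name `P` are assumptions, and running generated code with `exec` is unsafe outside a sandbox.

```python
class PromptEnvironment:
    """Toy REPL-like environment that stores the prompt as a variable `P`."""

    def __init__(self, prompt: str) -> None:
        # All REPL state, including the prompt itself, lives in this namespace.
        self.variables = {"P": prompt}

    def metadata(self) -> str:
        """What a root model might be shown instead of the full prompt."""
        return f"P is a string of {len(self.variables['P'])} characters"

    def run(self, code: str) -> None:
        """Execute code against the stored state (unsafe for untrusted code)."""
        exec(code, self.variables)


# A prompt far larger than this toy would ever pass to a model directly.
env = PromptEnvironment("x" * 1_000_000 + "SECRET=42" + "y" * 1_000_000)
summary_before = env.metadata()

# Code a model might emit: locate the relevant span symbolically.
env.run("idx = P.find('SECRET')\nsnippet = P[idx:idx + 9]")

# The prompt is mutable: later steps can append to it or replace it.
env.run("P = P + ' END'")
summary_after = env.metadata()
```

Only `summary_before` and the nine-character `snippet` would need to enter a model's context here; the two million surrounding characters never do.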
The second design choice endows the system with unbounded semantic horizon through symbolic recursion: explicit loops that invoke the language model repeatedly, using sub-routines like $\texttt{sub\_RLM\_M}$. A single forward pass of even a large transformer can perform at most $O(N)$ work relative to the input length; its reasoning depth is limited by the number of layers. Recursive chaining, by contrast, can execute $\Omega(N^2)$ total work across recursive steps, because each sub-call can re-process the entire (expanded) prompt, and the number of recursive calls can grow with problem complexity. This quadratic scaling of work is what enables the model to connect pieces of information that are arbitrarily far apart in the original text, without the intermediate memory decay that plagues long-context attention. The OOLONG‑Pairs task, which requires matching pairs of sentences separated by thousands of tokens, is a striking demonstration: the base model achieves near-zero F1 (0.1), while the RLM-guided version reaches 58.0 F1 by recursively refining its reasoning over multiple calls. Symbolic recursion effectively turns a shallow, bounded inference into a deep, unbounded reasoning process.
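A pairwise-matching loop shows where the quadratic work comes from. The sketch below is a simplified stand-in, not the OOLONG‑Pairs task itself: `find_pairs` is a hypothetical helper, and the `judge` callable plays the role of a sub-model call, replaced here by a trivial string comparison.

```python
from itertools import combinations
from typing import Callable, List, Tuple


def find_pairs(
    items: List[str],
    judge: Callable[[str, str], bool],
) -> Tuple[List[Tuple[int, int]], int]:
    """Ask `judge` about every unordered pair; return the matches and call count."""
    pairs: List[Tuple[int, int]] = []
    calls = 0
    # N items give N * (N - 1) / 2 pairs, so the number of sub-calls grows
    # quadratically with the number of items.
    for i, j in combinations(range(len(items)), 2):
        calls += 1
        if judge(items[i], items[j]):
            pairs.append((i, j))
    return pairs, calls


def same_subject(a: str, b: str) -> bool:
    # Stub judge (an assumption): two sentences match if they share a first word.
    return a.split()[0] == b.split()[0]


sentences = [
    "alice likes tea",
    "bob likes jazz",
    "alice owns a cat",
    "carol runs",
    "bob codes",
]
matched, num_calls = find_pairs(sentences, same_subject)
```

Each individual call sees only two short sentences, so no single context grows with the distance between them; the cost shows up instead as the number of calls.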
The third design choice addresses output length. In a standard generation, the model produces a single sequential stream, and the only way to get a very long answer is to ask the model to keep generating, risking repetition and drift. An RLM stores the final answer in a variable $\texttt{Final}$ that lives inside the execution environment and can be built up incrementally. Each recursive step can add to the answer, stitch together pieces from earlier sub-outputs, or conditionally branch, without relying on a single autoregressive stream. This yields unbounded output length in practice: the OOLONG‑Pairs trajectories produced long, correctly stitched pair lists that would be difficult to generate in one pass under a fixed context. The answer variable becomes a composable data structure, freeing output from the constraints of single-pass generation.
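The incremental construction of the answer variable can be illustrated as follows. This is a sketch under stated assumptions: `build_final`, `max_output_chars`, and the upper-casing stub are hypothetical, and truncation is used here only to simulate a per-call output limit.

```python
from typing import Callable, List


def build_final(
    chunks: List[str],
    sub_call: Callable[[str], str],
    max_output_chars: int,
) -> str:
    """Assemble a long answer from many short sub-call outputs."""
    final_parts: List[str] = []
    for chunk in chunks:
        # Simulate a per-call output cap by truncating each piece.
        piece = sub_call(chunk)[:max_output_chars]
        final_parts.append(piece)
    # The assembled answer is ordinary program state, not one generated stream.
    return "\n".join(final_parts)


def shout(text: str) -> str:
    # Stub sub-call (an assumption): upper-case the input.
    return text.upper()


final_answer = build_final(["alpha", "beta", "gamma"], shout, max_output_chars=5)
```

No single call emits more than five characters here, yet the assembled answer is sixteen characters long; the same pattern scales to answers far beyond any one call's output limit.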
Importantly, none of this requires a fundamentally new model architecture. The RLM paradigm is model-agnostic: it can wrap any language model that can be called with a prompt and produce a continuation. An 8-billion-parameter model fine-tuned as a native RLM—learning to use the symbolic scaffolding during its own training—shows a +28% average improvement across a suite of long-context tasks. This signals that even moderate-sized models, when taught to think recursively, can dramatically exceed the performance of much larger models that lack the symbolic outer loop. The paradigm decouples reasoning depth from parameter count, and it can be bootstrapped by fine-tuning on a relatively small set of recursive-trajectory data.
Underneath all three design choices is a deeper conceptual shift, captured by the motto: “Treat the prompt as an external object, not as input to the neural network.” Once the prompt is treated as a stateful variable that can be read, written, and modified by a surrounding program, the language model becomes a callable reasoning primitive inside a general computational framework. The neural network still does the heavy lifting of language understanding, but the scale of its application is now governed by symbolic recursion, not by the raw size of the model or the width of an attention window. In this light, achieving true long-context reasoning is not about making the model bigger, but about giving it a proper memory architecture that lives outside the tensor graph.
The visual below condenses these takeaways into a single, readable table. It lists the three design choices (treating $P$ as a mutable variable, symbolic recursion via loops and $\texttt{sub\_RLM\_M}$, and storing the final answer in $\texttt{Final}$) alongside the exact property each choice unlocks and the empirical evidence that confirms it. The table also highlights the model-agnostic nature of the paradigm, with the +28% gain of a fine-tuned 8B model, and grounds the entire summary with the core quote. It serves not as a new explanation, but as a compact memory aid for the essential architectural insights that turn a standard language model into a recursive, unbounded reasoning system.