A New York University paper introduces Randomized YaRN, a training method that combines YaRN positional extrapolation, randomized positional encodings, and a length curriculum. The authors say it improves long-context reasoning on BABILong and MRCR, with the strongest gains at far out-of-distribution lengths.
New York University researchers say a training method called Randomized YaRN can improve how language models reason over long inputs, especially when test-time context lengths are far beyond what the model saw during training.
The paper, “Randomized YaRN Improves Length Generalization for Long-Context Reasoning,” was posted to arXiv on June 22, 2026. Its authors are Manas Mehta, Fangcong Yin, and Greg Durrett.
What Randomized YaRN changes
The method combines YaRN-based positional extrapolation with randomized positional encodings and a length curriculum. In the paper’s setup, tokens seen during short-context training are assigned YaRN positional encodings sampled from a larger position range.
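The position-sampling idea can be illustrated with a short Python sketch. It is not the authors' code: the function name, the uniform sampling scheme, and the example sizes are assumptions chosen for illustration. The sketch draws sorted position indices from a range much larger than the training sequence, so a short sequence is exposed to position values it would otherwise only meet at long context lengths.

```python
import random


def sample_position_ids(seq_len, max_position, rng):
    """Pick seq_len distinct positions from [0, max_position), in increasing order.

    Hypothetical helper: sorting keeps token order intact while the gaps
    between neighbouring positions vary from one training example to the next.
    """
    if seq_len > max_position:
        raise ValueError("seq_len must not exceed max_position")
    return sorted(rng.sample(range(max_position), seq_len))


# Illustrative sizes only: an 8-token sequence mapped into a 128-position range.
rng = random.Random(0)
position_ids = sample_position_ids(seq_len=8, max_position=128, rng=rng)
print(position_ids)
```

In a real training loop these indices would stand in for the usual 0, 1, 2, ... positions when the rotary embeddings are computed. How the paper samples and schedules its positions may differ from this uniform draw.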
The idea is to teach the model not just to accept longer inputs, but to generalize its reasoning when the context length shifts well beyond the training distribution. That distinction matters in long-context work, where models often degrade when they are evaluated far outside the sequence lengths they were optimized on.
YaRN itself is a context-extension technique for RoPE-based models. Randomized YaRN builds on that base by introducing randomness during training, while the curriculum component gradually changes the lengths the model sees.
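For readers unfamiliar with YaRN, the sketch below shows the general shape of its frequency adjustment as described in the original YaRN work, not in this paper. Low-frequency rotary dimensions are interpolated by the scale factor, high-frequency dimensions are left unchanged, and a linear ramp blends the two in between. The function names, the ramp thresholds `alpha=1` and `beta=32`, the RoPE base of 10000, and the 8K-to-128K example are assumptions for illustration.

```python
import math


def yarn_inv_freq(head_dim, scale, orig_ctx, base=10000.0, alpha=1.0, beta=32.0):
    """Return YaRN-style adjusted RoPE inverse frequencies (simplified sketch)."""
    inv_freq = []
    for i in range(0, head_dim, 2):
        theta = base ** (-i / head_dim)  # standard RoPE frequency for this pair
        # Number of full rotations this dimension completes over the original context.
        rotations = orig_ctx * theta / (2 * math.pi)
        # Ramp is 0 for slow dimensions (interpolate) and 1 for fast ones (keep).
        ramp = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        inv_freq.append((1.0 - ramp) * theta / scale + ramp * theta)
    return inv_freq


def yarn_attention_scale(scale):
    """Length-dependent factor of the form 0.1 * ln(s) + 1 used by YaRN."""
    return 0.1 * math.log(scale) + 1.0


# Illustrative setting: stretch an 8K-token context by a factor of 16.
freqs = yarn_inv_freq(head_dim=128, scale=16.0, orig_ctx=8192)
print(freqs[0], freqs[-1], yarn_attention_scale(16.0))
```

In this setting the fastest dimension keeps its original frequency, while the slowest is divided by the full scale factor. Randomized YaRN's own hyperparameters are not given in this article and may differ.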
The paper frames length generalization as the core problem, rather than simple retrieval. It argues that long-context systems need to preserve reasoning quality when the input length changes, not merely avoid truncation.
Benchmarks and models
The authors evaluate the method on BABILong and Multi-Round Coreference Resolution, or MRCR. The reported tables cover two models, Qwen2.5-7B-Instruct and Olmo3-7B-Instruct.
According to the paper, the training data stay under 8K context, but the results are measured from 16K through 128K context lengths. The strongest gains appear at the most out-of-distribution lengths, which is where many long-context methods are most brittle.
On BABILong, Randomized YaRN outperforms the compared baselines for both model families in the reported results. The same pattern appears on MRCR, where the paper again reports better out-of-distribution performance than the baselines.
The paper positions its approach against standard fine-tuning, vanilla LoRA, and inference-time methods. Its claim is that the training recipe itself can matter as much as simply extending the nominal context window.
Why it matters
Long-context reasoning remains a practical bottleneck for language models trained on short sequences. Many systems can technically handle more tokens after a context-extension upgrade, but their performance can still slip when the input distribution changes too much.
If the paper’s results hold up, the work could influence how teams train models for 16K, 32K, 64K, and 128K-plus context windows. The larger implication is that positional-encoding strategy and curriculum design may be as important as raw context length.
That is a meaningful takeaway for model builders because it shifts attention from window size alone to how the model is taught to use that window.
What happens next
The paper says the code and data are available in the authors’ GitHub repository, which should make the method easier to inspect and reproduce.
The key unanswered questions are whether independent groups reproduce the gains, whether the approach holds on additional benchmarks, and whether it scales cleanly to larger foundation models.
Those follow-ups matter because this is currently a first-publication result from the authors’ own paper. The next round of evidence will show whether Randomized YaRN becomes a broadly useful recipe for long-context training or stays a narrower benchmark win.