
Building the Missing Training Data for Coding Agents

March 18, 2026

Launching Reasoning Infrastructure on SN33

PlanSearch (ICLR 2025 Spotlight) demonstrated something important: reasoning in natural language before generating code dramatically improves output diversity and correctness on coding benchmarks. Yang et al. (2025) reinforce this in their survey "Code to Think, Think to Code," showing that training with reasoning-rich data improves coding performance, and coding data improves reasoning. The two are deeply linked.
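To make the plan-then-code idea concrete, here is a minimal sketch of the control flow: draft several distinct natural-language plans first, then generate one candidate program per plan. This is an illustration of the general technique, not PlanSearch's actual algorithm; the `llm` function is a hypothetical stand-in for a real model call.

```python
def llm(prompt: str) -> str:
    # Placeholder: a real implementation would call a model API here.
    return f"[model output for prompt: {prompt!r}]"

def plan_then_code(task: str, n_plans: int = 4) -> list[str]:
    # Step 1: reason in natural language, asking for distinct approaches.
    plans = [
        llm(f"Approach #{i}: describe, in prose, one way to {task}")
        for i in range(n_plans)
    ]
    # Step 2: generate one candidate program per plan. Diverse plans
    # tend to yield more diverse candidate solutions.
    return [llm(f"Write code implementing this plan: {plan}") for plan in plans]

candidates = plan_then_code("reverse a linked list")
```

The key property is that diversity is injected at the plan level, in natural language, rather than by sampling the same code prompt repeatedly.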

The Data Problem

This matters because coding agents have a well-documented data problem. The paper behind "Immersion in the GitHub Universe" (Feb 2026) states directly that agent advancement is "fundamentally constrained by the scarcity of high-quality training data."

SWE-rebench (NeurIPS 2025) echoes this, noting that existing datasets are "either limited to one-shot code generation or comprise small, manually curated collections," lacking scale and diversity. A separate study on synthetic coding data calls out a "persistent bottleneck: the lack of training data that jointly captures diversity, reasoning, and functional correctness at scale."

The Performance Gap on Hard Tasks

The gap is especially visible on hard tasks. GPT-5 with OpenHands scores 65% on SWE-Bench Verified but drops to 21% on SWE-EVO, which tests sustained, multi-file reasoning: the kind that requires understanding why a system is designed a certain way, not just generating a function.

The Key Question

If natural language reasoning improves coding performance, and the field is starved for reasoning-rich training data, where does that data come from?

Our Answer: Deep Technical Conversations

We believe the answer is deep technical conversations. Podcasts, architecture discussions, long-form developer dialogue. This is where engineers reason through tradeoffs, explain system design, and work through complex problems out loud. It's a rich, largely untapped source of exactly the signal the research says is missing.

We've now built the infrastructure to extract this data at scale. Our latest release on Subnet 33 introduces specialized enrichment logic to pull architectural reasoning and contextual knowledge from conversational data.
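As a rough illustration of what "enrichment" can mean here, the sketch below flags transcript utterances that carry design-reasoning signals. The marker list, scoring threshold, and function names are illustrative assumptions for this post, not the actual Subnet 33 logic.

```python
import re

# Hypothetical markers of architectural reasoning in conversation.
REASONING_MARKERS = re.compile(
    r"\b(because|tradeoff|trade-off|instead of|the reason|so that|"
    r"downside|bottleneck|we chose|we decided)\b",
    re.IGNORECASE,
)

def extract_reasoning(utterances: list[str], min_hits: int = 1) -> list[str]:
    """Keep utterances containing at least `min_hits` reasoning markers."""
    return [
        u for u in utterances
        if len(REASONING_MARKERS.findall(u)) >= min_hits
    ]

transcript = [
    "Let's get started, thanks everyone for joining.",
    "We chose a message queue instead of direct RPC because the "
    "downside of tight coupling outweighed the latency tradeoff.",
    "Cool, next agenda item.",
]
kept = extract_reasoning(transcript)
```

A production pipeline would go well beyond keyword matching, but the shape is the same: filter conversational data down to the segments where engineers are actually reasoning through tradeoffs.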

What's Next

Early results are promising, and we'll be sharing benchmark improvements on some of the hardest coding evaluations soon.


Follow @ReadyAI_