Non-Deterministic Enrichment: A New Primitive for Scalable Dataset Generation
Abstract
This document describes a new miner task primitive deployed on SN33 (ReadyAI): context-aware, non-deterministic data enrichment. Unlike traditional deterministic tasks, where a fixed input produces a fixed output, enrichment tasks generate valid but variable outputs depending on configurable inference parameters. This allows effectively unbounded dataset generation from finite source material, a property particularly valuable for training data synthesis and domain-specific knowledge augmentation.
1. The Determinism Constraint
Most subnet task designs are deterministic by necessity. Given input X, miners produce output Y, and validators score based on proximity to expected Y. This works well for inference, compute, and verification tasks.
But determinism constrains dataset generation. If you want to enrich a source document with supplemental research, there's no single "correct" enrichment. The value lies in generating relevant supplemental data, not identical supplemental data.
The constraint: How do you design a miner task where multiple valid outputs exist for the same input, while maintaining verifiable quality?
2. Mechanism: Persona-Driven Enrichment
Architecture
→ Inference Layer (persona prompt + source context)
→ Search Execution (N queries × M results each)
→ Structured Output (enriched dataset with provenance)
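The three stages above can be sketched as a small pipeline. This is a minimal illustration, not the SN33 implementation: the names `SearchResult`, `EnrichedRecord`, `enrich`, `stub_generate`, and `stub_search` are hypothetical, and the inference and search steps are replaced with stubs so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SearchResult:
    query: str    # provenance: the query that produced this result
    url: str
    snippet: str


@dataclass
class EnrichedRecord:
    source_id: str
    persona: str
    queries: List[str] = field(default_factory=list)
    results: List[SearchResult] = field(default_factory=list)


def enrich(source_id, source_text, persona, generate_queries, search,
           n_queries=10, m_results=10):
    """Run inference, then search, then collect a structured record."""
    # Inference layer: persona prompt + source context -> up to N queries.
    queries = list(generate_queries(persona, source_text, n_queries))[:n_queries]
    record = EnrichedRecord(source_id=source_id, persona=persona, queries=queries)
    # Search execution: keep at most M results per query.
    for query in queries:
        record.results.extend(search(query, m_results)[:m_results])
    return record


# Stubs standing in for a real model call and a real search backend (assumptions).
def stub_generate(persona, source_text, n):
    return [f"{persona} query {i}" for i in range(n)]


def stub_search(query, m):
    return [SearchResult(query, f"https://example.com/{i}", "stub snippet")
            for i in range(m)]


record = enrich("doc-1", "transcript text", "Opportunistic Investor",
                stub_generate, stub_search)
print(len(record.queries), len(record.results))  # 10 100
```

Passing the inference and search functions in as arguments keeps the pipeline shape independent of any particular model or search provider.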
Non-Determinism by Design
The key insight: same source + different persona = different valid queries.
A roughly 60,000-character city council transcript viewed through an "Opportunistic Investor" lens produces queries about rezoning, distressed assets, and cyclical inflection points. The same transcript viewed through a "Real Estate Lender" lens produces queries about refinancing risk, sponsor strength, and debt market conditions.
Both outputs are valid. Both are useful. Neither is "correct" in an exclusive sense.
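One way to picture "same source + different persona" is as two inference requests that share the source text but differ in the persona framing. The sketch below is illustrative only: `PERSONA_FOCUS`, `build_request`, the prompt wording, and the `temperature` default are all assumptions, and the focus strings simply reuse the themes named above.

```python
# Hypothetical persona definitions; the focus text reuses the themes named above.
PERSONA_FOCUS = {
    "Opportunistic Investor": "rezoning, distressed assets, and cyclical inflection points",
    "Real Estate Lender": "refinancing risk, sponsor strength, and debt market conditions",
}


def build_request(persona, source_text, n_queries=10, temperature=0.8):
    """Build one inference request; temperature > 0 allows varied outputs."""
    focus = PERSONA_FOCUS[persona]
    prompt = (
        f"You are acting as: {persona}.\n"
        f"Typical concerns: {focus}.\n"
        f"Write {n_queries} web search queries that this persona would run "
        f"after reading the source document below.\n\n"
        f"SOURCE DOCUMENT:\n{source_text}"
    )
    return {"persona": persona, "prompt": prompt, "temperature": temperature}


source = "Council Bill: proposed rezoning of the industrial waterfront district."
investor = build_request("Opportunistic Investor", source)
lender = build_request("Real Estate Lender", source)

# Same source text, different framing, so the model sees different prompts.
print(investor["prompt"] != lender["prompt"])  # True
```

Variation therefore comes from two places: the persona framing, which changes what the model is asked, and the sampling parameters, which change how it answers.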
Validation Approach
Rather than scoring against a ground truth, validation checks:
- Relevance: Do generated queries relate to source document themes?
- Persona alignment: Do queries reflect the specified perspective?
- Search executability: Do queries return meaningful results?
- Structural compliance: Does output conform to expected schema?
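Three of these checks can be approximated with simple heuristics, sketched below. This is an assumption-laden illustration rather than the validator's actual scoring: the function names, the output schema (`persona` plus a list of `queries`), and the lexical-overlap relevance measure are all hypothetical. Persona alignment is omitted because it would likely need a model-based judge.

```python
import re


def _tokens(text):
    """Lowercased words of four or more letters, as a crude content vocabulary."""
    return set(re.findall(r"[a-z]{4,}", text.lower()))


def check_structure(output, expected_queries=10):
    """Structural compliance: a persona string and the expected number of non-empty queries."""
    queries = output.get("queries")
    return (
        isinstance(output.get("persona"), str)
        and isinstance(queries, list)
        and len(queries) == expected_queries
        and all(isinstance(q, str) and q.strip() for q in queries)
    )


def relevance_score(query, source_text):
    """Relevance: fraction of query terms that also appear in the source."""
    query_terms = _tokens(query)
    if not query_terms:
        return 0.0
    return len(query_terms & _tokens(source_text)) / len(query_terms)


def is_executable(result_count, min_results=1):
    """Search executability: the query returned at least min_results results."""
    return result_count >= min_results


source = "The council voted to approve rezoning of the industrial waterfront district."
output = {"persona": "Opportunistic Investor",
          "queries": ["waterfront rezoning investment opportunities"]}

print(check_structure(output, expected_queries=1))      # True
print(relevance_score(output["queries"][0], source))    # 0.5
print(is_executable(result_count=0))                    # False
```

Because none of these checks compares against a single expected answer, two different but valid outputs for the same source can both pass.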
3. Implementation: Real Estate Enrichment
Twenty domain-specific personas are currently deployed, each generating 10 contextual search queries per source document.
POC Run: Seattle City Council Minutes
Source: City Council meeting transcript (2025-12-16), 59,121 characters
Persona: Opportunistic Investor
Inference model: claude-sonnet-4-5-20250929
Cost: $0.064
Output: 10 search queries
4. Scaling Properties
For a single source document:
- 20 personas × 10 queries = 200 unique search queries
- 200 queries × 10 results = 2,000 enrichment data points
For a corpus of 1,000 source documents:
- 1,000 × 200 = 200,000 unique queries
- 200,000 × 10 = 2,000,000 enrichment data points
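The persona library is extensible, and output grows linearly with it: adding 10 new personas (20 → 30) raises output by 50% without changing source material, and doubling output requires 20 more. These figures are upper bounds, since they assume no duplicate queries across personas and that every query returns its full 10 results.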
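The arithmetic in this section reduces to a product of four factors. The helper below is a minimal sketch (the function name `enrichment_volume` and its parameter names are illustrative) that reproduces the figures above and shows how output changes with the size of the persona library.

```python
def enrichment_volume(documents, personas=20, queries_per_persona=10,
                      results_per_query=10):
    """Return (queries, data_points) as upper bounds, ignoring duplicates."""
    queries = documents * personas * queries_per_persona
    data_points = queries * results_per_query
    return queries, data_points


print(enrichment_volume(1))               # (200, 2000)
print(enrichment_volume(1000))            # (200000, 2000000)
print(enrichment_volume(1, personas=30))  # (300, 3000)
print(enrichment_volume(1, personas=40))  # (400, 4000)
```

Every factor scales the total linearly, so growing the persona library is as effective as growing the corpus by the same ratio.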
5. Domain Expansion
The architecture generalizes beyond real estate:
- Real Estate (current): Transcripts, filings, offering memos → 20 investor/operator types
- llms.txt: Web crawl, documentation, repos → Developer, researcher, enterprise, compliance
- M&A: Deal docs, earnings calls, filings → Buyer, seller, lender, advisor
- Public Policy: Legislation, regulatory filings → Advocacy, compliance, investment
6. The llms.txt Opportunity
The llms.txt standard creates a new source corpus for enrichment. Raw llms.txt files contain documentation, API references, product information, and organizational knowledge.
Enrichment transforms this into training-ready data by generating question-answer pairs, creating domain-specific query-response datasets, and building retrieval corpora with relevance labels.
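Before enrichment, an llms.txt file has to be broken into source items. The sketch below assumes the file follows the Markdown layout described in the llms.txt proposal (H2 sections containing `- [title](url): notes` link lines). The parser name `parse_llms_txt`, the sample content, and the example.com URLs are invented for illustration.

```python
import re

# Matches "- [title](url)" with optional ": notes" after the link.
LINK_LINE = re.compile(
    r"^\s*-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<notes>.*))?$"
)


def parse_llms_txt(text):
    """Extract link entries, tagged with the H2 section they appear under."""
    section = None
    entries = []
    for line in text.splitlines():
        if line.startswith("## "):
            section = line[3:].strip()
            continue
        match = LINK_LINE.match(line)
        if match:
            entries.append({
                "section": section,
                "title": match["title"],
                "url": match["url"],
                "notes": match["notes"] or "",
            })
    return entries


sample = """# Example Project
> Short summary of the project.

## Docs
- [Quick start](https://example.com/quickstart.md): Setup steps
- [API reference](https://example.com/api.md)
"""

entries = parse_llms_txt(sample)
print(len(entries))           # 2
print(entries[0]["section"])  # Docs
```

Each entry could then serve as a source item for the persona-driven pipeline, with the section and title carried along as provenance.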
7. Conclusion
Non-deterministic enrichment tasks address a structural problem in dataset generation: how to create an effectively unbounded set of valid outputs from finite inputs while maintaining verifiable quality. The mechanism, persona-driven inference producing contextual search queries, is simple, scalable, and domain-agnostic.
SN33 is currently processing real estate source material for production customers. The architecture extends to any domain where supplemental research adds value to source documents, with llms.txt enrichment as the next target application.