Technical Paper

Non-Deterministic Enrichment: A New Primitive for Scalable Dataset Generation

January 13, 2026

Abstract

This document describes a new miner task primitive deployed on SN33 (ReadyAI): context-aware, non-deterministic data enrichment. Unlike traditional deterministic tasks where a fixed input produces a fixed output, enrichment tasks generate valid but variable outputs depending on configurable inference parameters. This enables infinite dataset generation from finite source material — a property particularly valuable for training data synthesis and domain-specific knowledge augmentation.

1. The Determinism Constraint

Most subnet task designs are deterministic by necessity. Given input X, miners produce output Y, and validators score based on proximity to expected Y. This works well for inference, compute, and verification tasks.

But determinism constrains dataset generation. If you want to enrich a source document with supplemental research, there's no single "correct" enrichment. The value lies in generating relevant supplemental data, not identical supplemental data.

The constraint: How do you design a miner task where multiple valid outputs exist for the same input, while maintaining verifiable quality?

2. Mechanism: Persona-Driven Enrichment

Architecture

Source Document (transcript, filing, crawl)
  → Inference Layer (persona prompt + source context)
  → Search Execution (N queries × M results each)
  → Structured Output (enriched dataset with provenance)
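The pipeline above can be sketched in a few lines. This is an illustrative skeleton only, not the deployed implementation: `run_inference` and `run_search` are hypothetical injected callables standing in for the model call and the search backend.

```python
from dataclasses import dataclass, field


@dataclass
class EnrichmentResult:
    """Structured output: enriched dataset with provenance back to each query."""
    persona: str
    queries: list[str]
    results: list[dict] = field(default_factory=list)


def enrich(source_text: str, persona_prompt: str,
           run_inference, run_search,
           n_queries: int = 10, m_results: int = 5) -> EnrichmentResult:
    """Sketch of the pipeline: source → inference → search → structured output.

    run_inference and run_search are hypothetical interfaces, injected so the
    skeleton stays backend-agnostic.
    """
    # Inference layer: persona prompt + source context → N contextual queries
    queries = run_inference(persona_prompt, source_text, n_queries)
    result = EnrichmentResult(persona=persona_prompt, queries=queries)
    # Search execution: N queries × M results each, provenance preserved
    for query in queries:
        for hit in run_search(query, limit=m_results):
            result.results.append({"query": query, "hit": hit})
    return result
```

Keeping the originating query attached to every search result is what makes the final dataset "enriched with provenance" rather than a bag of unattributed snippets.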

Non-Determinism by Design

The key insight: same source + different persona = different valid queries.

A 60,000-character city council transcript viewed through an "Opportunistic Investor" lens produces queries about rezoning, distressed assets, and cyclical inflection points. The same transcript through a "Real Estate Lender" lens produces queries about refinancing risk, sponsor strength, and debt market conditions.

Both outputs are valid. Both are useful. Neither is "correct" in an exclusive sense.
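The "same source + different persona" property is easiest to see at the prompt level. A minimal sketch, with invented persona descriptions condensed from the examples above (the actual deployed prompts are not shown in this paper):

```python
# Hypothetical persona library entries, condensed from the examples above.
PERSONAS = {
    "Opportunistic Investor":
        "Focus on rezoning, distressed assets, and cyclical inflection points.",
    "Real Estate Lender":
        "Focus on refinancing risk, sponsor strength, and debt market conditions.",
}


def build_prompt(persona: str, source_text: str, n_queries: int = 10) -> str:
    """Same source, different persona → a different (but equally valid) prompt,
    and therefore a different set of generated queries."""
    return (
        f"You are a {persona}. {PERSONAS[persona]}\n"
        f"Generate {n_queries} search queries grounded in the source below.\n\n"
        f"SOURCE:\n{source_text}"
    )
```

Swapping the persona key is the only change needed to steer the same 60,000-character transcript toward an entirely different slice of supplemental research.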

Validation Approach

Rather than scoring against a single expected answer, validation checks properties of the output itself: that the generated queries are relevant to the source document, that the structured output is well-formed, and that each result carries provenance back to the query that produced it.
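One way such ground-truth-free validation could look is sketched below. These checks (query count plus a crude lexical-overlap grounding test) are hypothetical illustrations, not the deployed SN33 scorer:

```python
def validate_enrichment(source_text: str, queries: list[str],
                        min_queries: int = 10, min_overlap: int = 1) -> bool:
    """Hypothetical structural validation: accepts any output that has enough
    queries and shows lexical grounding in the source, without comparing
    against an expected answer."""
    if len(queries) < min_queries:
        return False
    source_terms = set(source_text.lower().split())
    for query in queries:
        # Each query must share at least min_overlap terms with the source
        overlap = len(set(query.lower().split()) & source_terms)
        if overlap < min_overlap:
            return False
    return True
```

Because the check constrains the shape and grounding of the output rather than its exact content, two miners can submit entirely different query sets and both pass.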

3. Implementation: Real Estate Enrichment

Twenty domain-specific personas are currently deployed, each generating 10 contextual search queries per source document:

1. Institutional Core Investor
2. Value-Add Investor
3. Opportunistic Investor
4. PE Real Estate Fund Manager
5. Family Office Investor
6. Ground-Up Developer
7. Adaptive Reuse Developer
8. Merchant Builder
9. Mixed-Use Developer
10. Affordable Housing Developer
11. Multifamily Owner-Operator
12. Office Investor
13. Industrial/Logistics Investor
14. Retail Investor
15. Hospitality Investor
16. Real Estate Lender
17. Distressed Asset Specialist
18. ESG-Focused Investor
19. Macro/Strategy Advisor
20. Portfolio Risk Manager

POC Run: Seattle City Council Minutes

Source: City Council meeting transcript (2025-12-16), 59,121 characters

Persona: Opportunistic Investor

Inference model: claude-sonnet-4-5-20250929

Cost: $0.064

Output: 10 search queries

4. Scaling Properties

For a single source document: 20 personas × 10 queries each yields 200 distinct search queries, at roughly $1.28 in inference cost (extrapolating from the $0.064 POC run).

For a corpus of 1,000 source documents: the same persona library yields 200,000 queries from unchanged infrastructure, at roughly $1,280 in inference cost.

The persona library is extensible. Adding another 20 personas doubles output without changing source material.
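The scaling arithmetic follows directly from the numbers reported above, under the assumption that the $0.064 per-persona-run cost from the Seattle POC generalizes across documents:

```python
# Figures from the paper; the per-run cost is from the single Seattle POC
# and is assumed here to be representative.
PERSONAS = 20
QUERIES_PER_PERSONA = 10
COST_PER_RUN_USD = 0.064
CORPUS_SIZE = 1_000

per_doc_queries = PERSONAS * QUERIES_PER_PERSONA            # 200 queries/document
per_doc_cost = round(PERSONAS * COST_PER_RUN_USD, 2)        # $1.28/document
corpus_queries = CORPUS_SIZE * per_doc_queries              # 200,000 queries
corpus_cost = round(CORPUS_SIZE * per_doc_cost, 2)          # $1,280.00
```

Output grows multiplicatively in personas × documents, while cost grows only linearly in inference runs.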

5. Domain Expansion

The architecture generalizes beyond real estate:

6. The llms.txt Opportunity

The llms.txt standard creates a new source corpus for enrichment. Raw llms.txt files contain documentation, API references, product information, and organizational knowledge.

Enrichment transforms this into training-ready data by generating question-answer pairs, creating domain-specific query-response datasets, and building retrieval corpora with relevance labels.
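As a concrete sketch of the first step, the link entries of a raw llms.txt file (which is plain markdown: an H1 title, an optional summary, then sections of `- [Title](url): description` link lists) can be extracted and turned into seed records for question-answer generation. The record shape and question template here are hypothetical illustrations, not a ReadyAI schema:

```python
import re


def llms_txt_links(text: str) -> list[tuple[str, str]]:
    """Extract (title, url) pairs from the markdown link lists of an
    llms.txt file."""
    return re.findall(r"-\s*\[([^\]]+)\]\(([^)]+)\)", text)


def qa_seed(title: str, url: str) -> dict:
    """Hypothetical seed record for a question-answer pair; the answer field
    would be filled by the persona-driven enrichment pipeline."""
    return {
        "question": f"What does the '{title}' documentation cover?",
        "source_url": url,   # provenance back to the llms.txt entry
        "answer": None,
    }
```

Each seed record then flows through the same inference-and-search pipeline described in Section 2, with documentation-oriented personas in place of real estate ones.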

7. Conclusion

Non-deterministic enrichment tasks solve a structural problem in dataset generation: how to create infinite valid outputs from finite inputs while maintaining verifiable quality. The mechanism — persona-driven inference producing contextual search queries — is simple, scalable, and domain-agnostic.

SN33 is currently processing real estate source material for production customers. The architecture extends to any domain where supplemental research adds value to source documents, with llms.txt enrichment as the next target application.