Non-Deterministic Enrichment

Abstract

This document describes a new miner task primitive deployed on SN33 (ReadyAI): context-aware, non-deterministic data enrichment. Unlike traditional deterministic tasks, where a fixed input produces a fixed output, enrichment tasks generate valid but variable outputs depending on configurable inference parameters such as the persona prompt. This lets a finite set of source material yield an open-ended supply of enrichment data, a property particularly valuable for training data synthesis and domain-specific knowledge augmentation.

1. The Determinism Constraint

Most subnet task designs are deterministic by necessity. Given input X, miners produce output Y, and validators score based on proximity to expected Y. This works well for inference, compute, and verification tasks.

But determinism constrains dataset generation. If you want to enrich a source document with supplemental research, there's no single "correct" enrichment. The value lies in generating relevant supplemental data, not identical supplemental data.

The resulting design problem: how do you design a miner task where multiple valid outputs exist for the same input, while keeping quality verifiable?

2. Mechanism: Persona-Driven Enrichment

Architecture

Source Document (transcript, filing, crawl)
  → Inference Layer (persona prompt + source context)
  → Search Execution (N queries × M results each)
  → Structured Output (enriched dataset with provenance)
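
The flow above can be sketched as a small pipeline. This is a minimal sketch under stated assumptions, not the SN33 implementation: `generate_queries` and `search` are hypothetical callables standing in for the inference and search layers, and the record fields are illustrative choices for carrying provenance.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SearchResult:
    url: str
    snippet: str


@dataclass
class EnrichedRecord:
    # Provenance: which source, persona, and query produced these results.
    source_id: str
    persona: str
    query: str
    results: List[SearchResult] = field(default_factory=list)


def enrich(
    source_id: str,
    source_text: str,
    persona: str,
    generate_queries: Callable[[str, str, int], List[str]],  # hypothetical inference layer
    search: Callable[[str, int], List[SearchResult]],        # hypothetical search layer
    n_queries: int = 10,
    m_results: int = 10,
) -> List[EnrichedRecord]:
    """Run source -> inference -> search -> structured output for one persona."""
    queries = generate_queries(persona, source_text, n_queries)[:n_queries]
    return [
        EnrichedRecord(source_id, persona, q, search(q, m_results)[:m_results])
        for q in queries
    ]


# Stub layers so the sketch runs without a model or a search API.
def fake_generate(persona: str, text: str, n: int) -> List[str]:
    return [f"{persona} query {i}" for i in range(n)]


def fake_search(query: str, m: int) -> List[SearchResult]:
    return [SearchResult(f"https://example.com/{i}", query) for i in range(m)]


records = enrich("doc-1", "transcript text", "Opportunistic Investor",
                 fake_generate, fake_search)
print(len(records), sum(len(r.results) for r in records))
```

With the stub layers, one persona yields 10 records carrying 100 results in total. Swapping in a real model call and search client would leave the record shape unchanged.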

Non-Determinism by Design

The key insight: same source + different persona = different valid queries.

A 60,000-character city council transcript viewed through an "Opportunistic Investor" lens produces queries about rezoning, distressed assets, and cyclical inflection points. The same transcript through a "Real Estate Lender" lens produces queries about refinancing risk, sponsor strength, and debt market conditions.

Both outputs are valid. Both are useful. Neither is "correct" in an exclusive sense.
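
One way to picture this is to treat the persona as part of the task input. The sketch below is illustrative only: the template wording and the `build_prompt` helper are hypothetical, and the focus strings are lifted from the two examples above rather than from the deployed persona definitions.

```python
# Focus areas taken from the two examples in the text above.
PERSONA_FOCUS = {
    "Opportunistic Investor": "rezoning, distressed assets, cyclical inflection points",
    "Real Estate Lender": "refinancing risk, sponsor strength, debt market conditions",
}

# Hypothetical template; the deployed prompt wording is not given in this document.
TEMPLATE = (
    "You are a {persona}. Your priorities: {focus}.\n"
    "Read the source document and write {n} web search queries that this "
    "persona would run to research it further.\n\n"
    "SOURCE:\n{source}"
)


def build_prompt(persona: str, source: str, n: int = 10) -> str:
    """Combine a persona and a source document into one inference prompt."""
    return TEMPLATE.format(
        persona=persona, focus=PERSONA_FOCUS[persona], n=n, source=source
    )


source = "Council discussed a rezoning proposal for the downtown core."
investor_prompt = build_prompt("Opportunistic Investor", source)
lender_prompt = build_prompt("Real Estate Lender", source)

# Same source, different persona: the prompts (and hence the queries) diverge.
print(investor_prompt != lender_prompt)
```

Because the source text is identical in both prompts, any difference in the generated queries comes from the persona framing and from sampling in the model.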

Validation Approach

Rather than scoring against a ground truth, validation checks:

Relevance: Do generated queries relate to source document themes?
Persona alignment: Do queries reflect the specified perspective?
Search executability: Do queries return meaningful results?
Structural compliance: Does output conform to expected schema?
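
Two of these checks can be sketched without a model in the loop. The following is a rough sketch, not the validator's actual scoring: the output schema is an assumption, and lexical overlap is only a crude stand-in for relevance. Persona alignment and search executability would need a model and a live search call, so they are omitted.

```python
import re
from typing import Dict, Set

STOPWORDS = {"the", "a", "an", "of", "for", "in", "on", "and", "to", "is"}


def tokens(text: str) -> Set[str]:
    """Lowercase word set with a few stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def structurally_compliant(output: Dict, expected_queries: int = 10) -> bool:
    """Check the assumed schema: {'persona': str, 'queries': [str, ...]}."""
    queries = output.get("queries")
    return (
        isinstance(output.get("persona"), str)
        and isinstance(queries, list)
        and len(queries) == expected_queries
        and all(isinstance(q, str) and bool(q.strip()) for q in queries)
    )


def relevance(query: str, source: str) -> float:
    """Fraction of query terms that also appear in the source (crude proxy)."""
    q = tokens(query)
    return len(q & tokens(source)) / len(q) if q else 0.0


source = "The council voted on rezoning the industrial waterfront for housing."
output = {
    "persona": "Opportunistic Investor",
    "queries": ["waterfront rezoning housing", "unrelated celebrity gossip"],
}

print(structurally_compliant(output, expected_queries=2))
print([relevance(q, source) for q in output["queries"]])
```

The first query shares all of its terms with the source and the second shares none, so a threshold on this score would accept one and reject the other. A production validator would more plausibly use embeddings or a judging model, since useful enrichment queries often introduce terms the source never mentions.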

3. Implementation: Real Estate Enrichment

Twenty domain-specific personas are currently deployed, each generating 10 contextual search queries per source document:

  1. Institutional Core Investor
  2. Value-Add Investor
  3. Opportunistic Investor
  4. PE Real Estate Fund Manager
  5. Family Office Investor
  6. Ground-Up Developer
  7. Adaptive Reuse Developer
  8. Merchant Builder
  9. Mixed-Use Developer
  10. Affordable Housing Developer
  11. Multifamily Owner-Operator
  12. Office Investor
  13. Industrial/Logistics Investor
  14. Retail Investor
  15. Hospitality Investor
  16. Real Estate Lender
  17. Distressed Asset Specialist
  18. ESG-Focused Investor
  19. Macro/Strategy Advisor
  20. Portfolio Risk Manager

Proof-of-Concept (POC) Run: Seattle City Council Minutes

Source: City Council meeting transcript (2025-12-16), 59,121 characters
Persona: Opportunistic Investor
Inference model: claude-sonnet-4-5-20250929
Cost: $0.064
Output: 10 search queries
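
The POC cost gives a rough way to budget inference for larger runs. The estimate below is a back-of-the-envelope sketch that assumes the $0.064 figure covers one persona on one document, that cost stays roughly constant across personas and similarly sized documents, and that search execution is billed separately. None of these assumptions is confirmed by the run above.

```python
COST_PER_RUN_USD = 0.064  # POC figure: one persona, one ~59k-character document
PERSONAS = 20


def inference_cost(documents: int, personas: int = PERSONAS,
                   cost_per_run: float = COST_PER_RUN_USD) -> float:
    """Estimated inference cost in USD, assuming a constant per-run cost."""
    return documents * personas * cost_per_run


print(f"{inference_cost(1):.2f}")
print(f"{inference_cost(1000):.2f}")
```

Under these assumptions, running all 20 personas costs about $1.28 per document and about $1,280 per 1,000 documents. Shorter or longer sources would shift the figure roughly in proportion to their token count.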

4. Scaling Properties

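For a single source document:

20 personas × 10 queries = up to 200 distinct search queries
200 queries × 10 results = up to 2,000 enrichment data points

For a corpus of 1,000 source documents:

1,000 × 200 = up to 200,000 search queries
200,000 × 10 = up to 2,000,000 enrichment data points

These are upper bounds: personas can produce overlapping queries, and different queries can return the same results.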
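
The arithmetic above generalizes to a single product. The helper below is a small sketch with a hypothetical function name; its defaults mirror the figures in this section, and the counts it returns are ceilings, since duplicate queries or overlapping search results would lower the real totals.

```python
from typing import Tuple


def enrichment_volume(documents: int, personas: int = 20,
                      queries_per_persona: int = 10,
                      results_per_query: int = 10) -> Tuple[int, int]:
    """Return (max search queries, max enrichment data points)."""
    queries = documents * personas * queries_per_persona
    return queries, queries * results_per_query


print(enrichment_volume(1))     # (200, 2000)
print(enrichment_volume(1000))  # (200000, 2000000)
```

Every factor is multiplicative, so output scales linearly with corpus size, persona count, queries per persona, and results per query.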

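The persona library is extensible. Adding 10 new personas to the current 20 increases output by 50% (from 200 to 300 queries per document) without changing source material; doubling output requires 20 new personas.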

5. Domain Expansion

The architecture generalizes beyond real estate:

Real Estate (current): Transcripts, filings, offering memos → 20 investor/operator types
llms.txt: Web crawl, documentation, repos → Developer, researcher, enterprise, compliance
M&A: Deal docs, earnings calls, filings → Buyer, seller, lender, advisor
Public Policy: Legislation, regulatory filings → Advocacy, compliance, investment
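
Since only the source types and persona library change between domains, the mapping above could live in configuration. The registry below is an illustrative sketch: the key names and the `plan_tasks` helper are hypothetical, and only real estate is described as deployed in this document.

```python
from itertools import product
from typing import Dict, Iterator, List, Tuple

# Illustrative registry built from the list above; not a deployed configuration.
DOMAINS: Dict[str, Dict[str, List[str]]] = {
    "real_estate": {
        "sources": ["transcripts", "filings", "offering memos"],
        "personas": ["Opportunistic Investor", "Real Estate Lender"],  # 2 of the 20
    },
    "llms_txt": {
        "sources": ["web crawl", "documentation", "repos"],
        "personas": ["Developer", "Researcher", "Enterprise", "Compliance"],
    },
    "m_and_a": {
        "sources": ["deal docs", "earnings calls", "filings"],
        "personas": ["Buyer", "Seller", "Lender", "Advisor"],
    },
    "public_policy": {
        "sources": ["legislation", "regulatory filings"],
        "personas": ["Advocacy", "Compliance", "Investment"],
    },
}


def plan_tasks(domain: str, source_ids: List[str]) -> Iterator[Tuple[str, str]]:
    """Yield one (source_id, persona) enrichment task per combination."""
    return product(source_ids, DOMAINS[domain]["personas"])


tasks = list(plan_tasks("m_and_a", ["deal-001", "deal-002"]))
print(len(tasks), tasks[0])
```

Two documents and four M&A personas give eight tasks. Adding a domain is then a matter of adding a registry entry rather than changing pipeline code.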

6. The llms.txt Opportunity

The proposed llms.txt standard creates a new source corpus for enrichment. llms.txt files summarize and link to a site's documentation, API references, product information, and organizational knowledge in a form intended for language models.

Enrichment transforms this into training-ready data by generating question-answer pairs, creating domain-specific query-response datasets, and building retrieval corpora with relevance labels.
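
As one possible shape for such training-ready data, the sketch below serializes labeled query-passage pairs as JSON Lines. The field names, the binary relevance scale, and the example values are all assumptions made for illustration; this document does not specify an output schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable


@dataclass
class RetrievalExample:
    query: str
    passage: str
    relevance: int   # assumed scale: 0 = irrelevant, 1 = relevant
    persona: str
    source_url: str  # provenance of the passage


def to_jsonl(examples: Iterable[RetrievalExample]) -> str:
    """Serialize examples as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in examples)


examples = [
    RetrievalExample(
        query="how do I authenticate API requests",
        passage="Send the API key in the Authorization header.",
        relevance=1,
        persona="Developer",
        source_url="https://example.com/llms.txt",
    ),
    RetrievalExample(
        query="how do I authenticate API requests",
        passage="Our office is closed on public holidays.",
        relevance=0,
        persona="Developer",
        source_url="https://example.com/llms.txt",
    ),
]

jsonl = to_jsonl(examples)
print(jsonl)
```

Keeping the persona and source URL on every record preserves provenance, so a downstream consumer can filter or reweight examples by perspective or origin.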

7. Conclusion

Non-deterministic enrichment tasks address a structural problem in dataset generation: how to produce many distinct, valid outputs from finite inputs while maintaining verifiable quality. The mechanism, persona-driven inference producing contextual search queries, is simple, scalable, and domain-agnostic.

SN33 is currently processing real estate source material for production customers. The architecture extends to any domain where supplemental research adds value to source documents, with llms.txt enrichment as the next target application.
