
SN33 Miners Outperform Mechanical Turk by 86% at 660x Lower Cost

2025

Introduction

This study evaluates the performance of SN33 miners running their own fine-tuned models against Mechanical Turk (MTurk), the benchmark widely treated as the gold standard when comparing LLMs on annotation tasks. Previous work, such as Gilardi et al. (2023), found that GPT-3.5 outperformed MTurk workers by 25% on average.

Key Results

  • 86% better than Mechanical Turk
  • 51% better than GPT-4o with no further optimizations
  • 92% improvement in top miner performance over 3 months
  • 660x cheaper than MTurk annotation costs

Structured data is the foundation of successful AI models and applications. Annotations are critical in transforming raw information into high-quality, organized datasets that fuel AI development and performance. The process of data annotation faces significant challenges of cost inefficiency, temporal constraints, limited scalability, and annotation inconsistency.

Methodology

Sample Size and Data Collection

A sample of 1,270 transcriptions was used for this study. Each transcription was broken up into conversation windows, each of which represents a distinct piece of dialogue that miners must annotate.

Participants

Three groups of annotators were compared: SN33 miners running their own fine-tuned models, GPT-4o, and human Mechanical Turk workers.

Evaluation Metrics

Participants are presented with a data window extracted from a larger podcast conversation and tasked with generating tags that accurately describe and categorize its content. Each tag is scored via a cosine distance calculation comparing the tag's embedding against the ground-truth embeddings.
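As a minimal sketch of this scoring step, the snippet below computes cosine similarity (equivalently, one minus cosine distance) between a tag embedding and a set of ground-truth embeddings. The embedding model is not specified in the post, and taking the maximum over ground-truth embeddings is an assumption about how multiple references are aggregated.

```python
import numpy as np

def cosine_score(tag_vec: np.ndarray, truth_vecs: np.ndarray) -> float:
    """Score one tag embedding against a matrix of ground-truth embeddings.

    Returns the highest cosine similarity. The max-aggregation is an
    assumption; the post does not say how multiple ground-truth
    embeddings are combined into a single tag score.
    """
    # Normalize so that the dot product equals cosine similarity
    tag_vec = tag_vec / np.linalg.norm(tag_vec)
    truth_vecs = truth_vecs / np.linalg.norm(truth_vecs, axis=1, keepdims=True)
    return float(np.max(truth_vecs @ tag_vec))
```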

The final score is a weighted combination of four components:

  • Top Unique Tags (TUT), 55% weight: mean score of the top 3 unique tags
  • Overall Mean (OM), 25% weight: mean score of all submitted tags
  • Median Score (MS), 10% weight: median score of all submitted tags
  • Top Score (TS), 10% weight: highest individual tag score
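Putting the four components together, here is a sketch of the weighted combination. The linear weighting follows the stated percentages; treating "unique" as deduplicated scores is an assumption, since the post does not give the exact deduplication rule over tag strings.

```python
from statistics import mean, median

# Weights as stated in the scoring rubric
WEIGHTS = {"TUT": 0.55, "OM": 0.25, "MS": 0.10, "TS": 0.10}

def final_score(tag_scores: list[float]) -> float:
    """Combine per-tag cosine scores into the weighted final score."""
    # Top Unique Tags: mean of the 3 highest deduplicated scores
    # (proxy for "top 3 unique tags"; exact uniqueness rule is assumed)
    tut = mean(sorted(set(tag_scores), reverse=True)[:3])
    om = mean(tag_scores)    # Overall Mean of all submitted tags
    ms = median(tag_scores)  # Median Score of all submitted tags
    ts = max(tag_scores)     # Top Score: highest individual tag
    return (WEIGHTS["TUT"] * tut + WEIGHTS["OM"] * om
            + WEIGHTS["MS"] * ms + WEIGHTS["TS"] * ts)
```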

Results

Participant Group    Adjusted Score
Miners (Average)     0.96
GPT-4o               0.65
Mechanical Turk      0.52

Additional Context

The average MTurk time per task was 3 minutes and 57 seconds, at a cost of $0.12 per task. SN33 miners produce the same annotations at 660x lower cost.
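The implied per-task cost for SN33 miners follows directly from the two reported figures:

```python
# Figures reported in the post
mturk_cost_usd = 0.12  # average MTurk cost per annotation task
cost_ratio = 660       # reported SN33 cost advantage

# Implied SN33 cost per task: roughly $0.00018
sn33_cost_usd = mturk_cost_usd / cost_ratio
```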

Recent research from the Swiss Federal Institute of Technology (EPFL) suggests that between 33% and 46% of crowd workers may be leveraging LLMs to complete their tasks (Veselovsky et al., 2023). Enterprises are thus paying human-annotation prices for LLM-generated work that is 660x cheaper to produce.


Sources

Gilardi, Fabrizio, Meysam Alizadeh, and Maël Kubli. 2023. "ChatGPT Outperforms Crowd Workers for Text-Annotation Tasks." Proceedings of the National Academy of Sciences 120(30).

Veselovsky, V., Horta Ribeiro, M., and West, R. 2023. "Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks." Swiss Federal Institute of Technology.