SN33 Miners Outperform Mechanical Turk by 86% at 660x Lower Cost
Introduction
This study evaluates the performance of miners using their own fine-tuned models against the Mechanical Turk (MTurk) benchmark, the gold standard for comparing LLMs on annotation tasks. Previous work (Gilardi et al., 2023) found that GPT-3.5 outperformed the MTurk benchmark by 25% on average.
Key Results
- 86% better than Mechanical Turk
- 51% better than GPT-4o with no further optimizations
- 92% improvement in top miner performance over 3 months
- 660x cheaper than MTurk annotation costs
Structured data is the foundation of successful AI models and applications. Annotation is critical in transforming raw information into the high-quality, organized datasets that fuel AI development and performance. Yet the data annotation process faces significant challenges: cost inefficiency, slow turnaround, limited scalability, and annotation inconsistency.
Methodology
Sample Size and Data Collection
A sample of 1,270 transcriptions was used for this study. Each transcription was broken into conversation windows, each of which represents a distinct piece of dialogue that miners must annotate.
Participants
- SN33 Miners: Top 5 miner scores historically, each using their own fine-tuned models
- GPT-4o: Control group
- Mechanical Turk workers: Each transcription sent to 3 MTurk workers (3,810 workers total)
Evaluation Metrics
Participants are presented with a data window extracted from a larger podcast conversation. Their task is to generate tags that accurately describe and categorize the content. Tags are scored via a cosine distance calculation comparing the tag's embedding against the ground-truth embeddings.
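The tag-scoring step can be sketched as follows. This is a minimal illustration, not the subnet's actual scoring code: the source says "cosine distance," but since higher scores are better in the results table, the sketch uses cosine similarity, and `score_tag`, the embedding dimensionality, and the max-over-ground-truth aggregation are all assumptions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_tag(tag_embedding: np.ndarray, ground_truth_embeddings: np.ndarray) -> float:
    """Score a tag as its best similarity against any ground-truth embedding
    (assumed aggregation; the actual validator may aggregate differently)."""
    return max(cosine_similarity(tag_embedding, gt) for gt in ground_truth_embeddings)

# Toy 3-dimensional embeddings for illustration only.
ground_truth = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
tag = np.array([0.9, 0.1, 0.0])
print(round(score_tag(tag, ground_truth), 3))  # 0.994
```

In practice the embeddings would come from a sentence-embedding model, and each submitted tag receives one score in [0, 1] that feeds into the weighted combination below.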
The final score is a weighted combination of four components:
- Top Unique Tags (TUT), 55% weight: mean score of the top 3 unique tags
- Overall Mean (OM), 25% weight: mean score of all submitted tags
- Median Score (MS), 10% weight: median score of all submitted tags
- Top Score (TS), 10% weight: highest individual tag score
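The weighting above can be sketched in a few lines. This is a hypothetical implementation assuming each submitted tag has already been reduced to a single similarity score and duplicates have been collapsed; `final_score` is not the subnet's actual function name.

```python
import statistics

def final_score(tag_scores: list[float]) -> float:
    """Weighted combination of the four scoring components:
    0.55 * TUT + 0.25 * OM + 0.10 * MS + 0.10 * TS."""
    ordered = sorted(tag_scores, reverse=True)
    tut = statistics.mean(ordered[:3])   # Top Unique Tags: mean of best 3
    om = statistics.mean(tag_scores)     # Overall Mean: mean of all tags
    ms = statistics.median(tag_scores)   # Median Score
    ts = ordered[0]                      # Top Score: best single tag
    return 0.55 * tut + 0.25 * om + 0.10 * ms + 0.10 * ts

print(final_score([0.9, 0.8, 0.7, 0.6]))  # TUT=0.8, OM=0.75, MS=0.75, TS=0.9
```

The heavy TUT weight rewards a few strong, distinct tags rather than many mediocre ones, which discourages padding a submission with near-duplicate tags.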
Results
| Participant Group | Adjusted Score |
|---|---|
| Miners (Average) | 0.96 |
| GPT-4o | 0.65 |
| Mechanical Turk | 0.52 |
Additional Context
The average MTurk time per task was 3 minutes and 57 seconds, at a cost of $0.12 per task. SN33 miners produce these annotations at 660x lower cost.
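As a back-of-envelope check, inverting the stated ratio gives the implied per-task miner cost (the source does not state this figure directly):

```python
mturk_cost_per_task = 0.12           # USD per task, from the study
cost_ratio = 660                     # miners are reported as 660x cheaper
miner_cost_per_task = mturk_cost_per_task / cost_ratio
print(f"${miner_cost_per_task:.5f} per task")  # roughly $0.00018
```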
Recent research from the Swiss Federal Institute of Technology suggests that between 33% and 46% of crowd workers may be leveraging LLMs to complete their tasks (Veselovsky et al., 2023). Enterprises are thus paying human rates for LLM-produced work that is 660x cheaper to generate directly.
Key Findings
- Top Miners outperformed Mechanical Turk by 71% and GPT-4o by 37%
- Mechanical Turk annotations cost ~$0.12 per task — roughly 660x more than miner costs
- Top Miners have continued to increase annotation quality over time
- LLMs are more capable than humans on annotation tasks critical for AI expansion
Sources
Gilardi, Fabrizio, Meysam Alizadeh, and Maël Kubli. 2023. "ChatGPT Outperforms Crowd Workers for Text-Annotation Tasks." Proceedings of the National Academy of Sciences 120 (30).
Veselovsky, Veniamin, Manoel Horta Ribeiro, and Robert West. 2023. "Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks." Swiss Federal Institute of Technology.