SN33 Miners Outperform Mechanical Turk by 86% at 660x Lower Cost
Introduction
This study evaluates the performance of miners using their own fine-tuned models against the Mechanical Turk (MTurk) benchmark, the gold standard for comparing LLMs on annotation tasks. Previous work (Gilardi et al., 2023) found that GPT-3.5 outperformed the MTurk benchmark by 25% on average.
Key Results
- 86% better than Mechanical Turk
- 51% better than GPT-4o with no further optimizations
- 92% improvement in top miner performance over 3 months
- 660x cheaper than MTurk annotation costs
Structured data is the foundation of successful AI models and applications. Annotation is critical in transforming raw information into the high-quality, organized datasets that fuel AI development and performance. Yet the data annotation process faces significant challenges: cost inefficiency, slow turnaround, limited scalability, and annotation inconsistency.
Methodology
Sample Size and Data Collection
A sample of 1,270 transcriptions was used for this study. Each transcription was broken into conversation windows, each of which represents a distinct piece of dialogue that miners must annotate.
Participants
- SN33 Miners: Top 5 miner scores historically, each using their own fine-tuned models
- GPT-4o: Control group
- Mechanical Turk workers: Each transcription sent to 3 MTurk workers (3,810 workers total)
Evaluation Metrics
Participants are presented with a data window extracted from a larger podcast conversation. Their task is to generate tags that accurately describe and categorize the content. Tags are scored via a cosine distance calculation comparing the tag's embedding against the ground-truth embeddings.
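The tag-scoring step can be sketched as follows. This is a minimal illustration, not the subnet's actual scoring code: the source says "cosine distance," but since higher scores are better in the results table, the sketch uses cosine similarity, and `score_tag`, the embedding dimensionality, and the max-over-ground-truth aggregation are all assumptions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_tag(tag_embedding: np.ndarray, ground_truth_embeddings: np.ndarray) -> float:
    """Score a tag as its best similarity against any ground-truth embedding
    (assumed aggregation; the actual validator may aggregate differently)."""
    return max(cosine_similarity(tag_embedding, gt) for gt in ground_truth_embeddings)

# Toy 3-dimensional embeddings for illustration only.
ground_truth = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
tag = np.array([0.9, 0.1, 0.0])
print(round(score_tag(tag, ground_truth), 3))  # 0.994
```

In practice the embeddings would come from a sentence-embedding model, and each submitted tag receives one score in [0, 1] that feeds into the weighted combination below.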
The final score is a weighted combination of four components:
- Top Unique Tags (TUT), 55% weight: mean score of the top 3 unique tags
- Overall Mean (OM), 25% weight: mean score of all submitted tags
- Median Score (MS), 10% weight: median score of all submitted tags
- Top Score (TS), 10% weight: highest individual tag score
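The weighting above can be sketched in a few lines. This is a hypothetical implementation assuming each submitted tag has already been reduced to a single similarity score and duplicates have been collapsed; `final_score` is not the subnet's actual function name.

```python
import statistics

def final_score(tag_scores: list[float]) -> float:
    """Weighted combination of the four scoring components:
    0.55 * TUT + 0.25 * OM + 0.10 * MS + 0.10 * TS."""
    ordered = sorted(tag_scores, reverse=True)
    tut = statistics.mean(ordered[:3])   # Top Unique Tags: mean of best 3
    om = statistics.mean(tag_scores)     # Overall Mean: mean of all tags
    ms = statistics.median(tag_scores)   # Median Score
    ts = ordered[0]                      # Top Score: best single tag
    return 0.55 * tut + 0.25 * om + 0.10 * ms + 0.10 * ts

print(final_score([0.9, 0.8, 0.7, 0.6]))  # TUT=0.8, OM=0.75, MS=0.75, TS=0.9
```

The heavy TUT weight rewards a few strong, distinct tags rather than many mediocre ones, which discourages padding a submission with near-duplicate tags.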
Results
| Participant Group | Adjusted Score |
|---|---|
| Miners (Average) | 0.96 |
| GPT-4o | 0.65 |
| Mechanical Turk | 0.52 |
Additional Context
The average MTurk time per task was 3 minutes and 57 seconds, at a cost of $0.12 per task. SN33 miners produce these annotations at 660x lower cost.
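As a back-of-envelope check, inverting the stated ratio gives the implied per-task miner cost (the source does not state this figure directly):

```python
mturk_cost_per_task = 0.12           # USD per task, from the study
cost_ratio = 660                     # miners are reported as 660x cheaper
miner_cost_per_task = mturk_cost_per_task / cost_ratio
print(f"${miner_cost_per_task:.5f} per task")  # roughly $0.00018
```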
Recent research from the Swiss Federal Institute of Technology suggests that between 33% and 46% of crowd workers may be leveraging LLMs to complete their tasks (Veselovsky et al., 2023). Enterprises are thus paying human rates for LLM-produced work that is 660x cheaper to generate directly.
Key Findings
- Top Miners outperformed Mechanical Turk by 71% and GPT-4o by 37%
- Mechanical Turk annotations cost ~$0.12 per task — roughly 660x more than miner costs
- Top Miners have continued to increase annotation quality over time
- LLMs are more capable than humans on annotation tasks critical for AI expansion
Sources
Gilardi, Fabrizio, Meysam Alizadeh, and Maël Kubli. 2023. "ChatGPT Outperforms Crowd Workers for Text-Annotation Tasks." Proceedings of the National Academy of Sciences 120 (30).
Veselovsky, Veniamin, Manoel Horta Ribeiro, and Robert West. 2023. "Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks." Swiss Federal Institute of Technology.