SLM-Bench

Production-Grade Edge AI Evaluation

Independent benchmarking for Small Language Models deployed at the edge. We evaluate what matters for production: function calling, structured extraction, and intent classification across real-world hardware platforms.

📊

Production-Ready Benchmarks

EdgeJSON for structured extraction, EdgeIntent for classification, EdgeFuncCall for tool use. Evaluate models on tasks that reflect real-world deployment requirements.

Explore benchmarks →
๐Ÿ†

Independent Rankings

Compare leading SLMs across standardized tasks. Open methodology, reproducible results, no pay-to-play rankings. Transparency by design.

View leaderboard →
🔬

Evaluation Service

Independent testing for your SLM. Comprehensive reports with energy measurement, cross-platform validation, and competitive benchmarking.

Request evaluation →

EdgeJSON Benchmark Leaderboard

Real-world structured JSON extraction performance

Updated: Nov 27, 2025
| Rank | Model | Size | JSONExact | Field F1 | Hardware | License |
|------|-------|------|-----------|----------|----------|---------|
| 🥇 1 | Maaza SLM-360M-JSON v1 CCT✓ | 360M | 55.1% | 0.729 | Laptop, Pi 5 | Apache 2.0 |
| 🥈 2 | Maaza MLM-135M-JSON v1 CCT✓ | 135M | 46.8% | 0.534 | Pi 5, Browser | Apache 2.0 |
| 🥉 3 | DeepSeek-R1-Distill-Qwen-1.5B | 1.5B | 16.0% | 0.317 | GPU | MIT |
| 4 | Qwen 2.5-0.5B | 500M | 14.6% | 0.195 | CPU/GPU | Apache 2.0 |
| 5 | SmolLM2-360M (base) | 360M | 11.4% | 0.240 | Laptop | Apache 2.0 |
| 6 | Qwen 2.5-3B Instruct | 3B | 6.0% | 0.105 | GPU | Apache 2.0 |
| 7 | Phi-3.5-Mini Instruct | 3.8B | 2.0% | 0.031 | GPU | MIT |
| 8 | SmolLM2-135M (base) | 135M | 0.6% | 0.024 | Pi 5, Browser | Apache 2.0 |

Metrics: JSONExact = exact JSON match accuracy (primary metric) • Field F1 = per-field precision/recall • All models evaluated on EdgeJSON v3 (158 test cases, 24 schemas)
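The two metrics can be sketched in a few lines. The exact EdgeJSON scoring code is not published, so the function names and the field-matching rule below (a field counts when both key and value match the gold object exactly) are assumptions, not the official implementation:

```python
import json

def json_exact(pred: str, gold: dict) -> bool:
    """JSONExact: the prediction must parse and equal the gold object exactly."""
    try:
        return json.loads(pred) == gold
    except json.JSONDecodeError:
        return False

def field_f1(pred: str, gold: dict) -> float:
    """Field F1: harmonic mean of per-field precision and recall.

    A predicted field is correct when its key exists in gold with an
    equal value (assumed matching rule; real rubric may differ).
    """
    try:
        pred_obj = json.loads(pred)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(pred_obj, dict) or not pred_obj or not gold:
        return 0.0
    correct = sum(1 for k, v in pred_obj.items() if k in gold and gold[k] == v)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_obj)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under this rule, a model that gets one of two fields right scores Field F1 = 0.5 but JSONExact = 0, which is why the two columns in the table can diverge sharply.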

Latest Update (Nov 27, 2025): Added DeepSeek-R1-1.5B (JSON-optimized), Qwen 2.5-3B, and Phi-3.5-Mini baselines. Key finding: Maaza-360M (fine-tuned) outperforms DeepSeek-R1-1.5B (4.2× larger, JSON mode) by 3.4× overall.

Performance by Complexity: Maaza excels on simple (2-4 fields: 78.9%) and medium schemas (5-8 fields: 51.4%). DeepSeek-R1 achieved 0.0% on medium schemas despite "JSON mode" training, validating the specialist fine-tuning approach.

Testing Methodology: All models are benchmarked on the same hardware with standardized prompts. Zero-shot models (DeepSeek, Qwen, Phi) were tested with temperature=0.0 for deterministic output.
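A minimal harness for this kind of evaluation can be sketched as follows. The real EdgeJSON harness is not published, so `run_edgejson_eval` and the test-case shape are hypothetical; `generate` stands in for any model-inference callable (e.g. a wrapper around greedy decoding):

```python
import json
from typing import Callable

def run_edgejson_eval(generate: Callable[[str], str],
                      cases: list[dict]) -> dict:
    """Aggregate exact-match accuracy over a list of test cases.

    Each case is assumed to look like {"prompt": str, "gold": dict};
    `generate` maps a prompt to the model's raw text output.
    """
    exact = 0
    for case in cases:
        try:
            pred = json.loads(generate(case["prompt"]))
        except json.JSONDecodeError:
            pred = None  # unparseable output counts as a miss
        exact += pred == case["gold"]
    return {"json_exact": exact / len(cases), "n": len(cases)}
```

Because decoding is deterministic (temperature=0.0), re-running this loop with the same prompts should reproduce the reported scores exactly.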

Have a model to evaluate? Submit it to the leaderboard.

Request Evaluation

Professional SLM Evaluation

Independent, rigorous, transparent

Single Verification

$20
  • CCT✓ Verified badge on leaderboard
  • Official certificate (PDF)
  • Detailed evaluation report
  • 48-hour turnaround
  • Email support

Enterprise

$499 / month
  • Unlimited verifications
  • API access (automated evaluation)
  • Private leaderboard
  • Priority support (24hr response)
  • Custom benchmarks (1/quarter)
  • White-label reports

Interested in getting your model evaluated?