SLM-Bench

Production-Grade Edge AI Evaluation

Independent benchmarking for Small Language Models deployed at the edge. We evaluate what matters for production: function calling, structured extraction, and intent classification across real-world hardware platforms.

📊

Production-Ready Benchmarks

EdgeJSON for structured extraction, EdgeIntent for classification, EdgeFuncCall for tool use. Evaluate models on tasks that reflect real-world deployment requirements.

Explore benchmarks →
๐Ÿ†

Independent Rankings

Compare leading SLMs across standardized tasks. Open methodology, reproducible results, no pay-to-play rankings. Transparency by design.

View leaderboard →
🔬

Evaluation Service

Independent testing for your SLM. Comprehensive reports with energy measurement, cross-platform validation, and competitive benchmarking.

Request evaluation →

EdgeJSON Benchmark Leaderboard

Real-world structured JSON extraction performance

Updated: Nov 27, 2025
| Rank | Model | Size | JSONExact | Field F1 | Hardware | License |
|------|-------|------|-----------|----------|----------|---------|
| 🥇 1 | Maaza SLM-360M-JSON v1 CCT✓ | 360M | 55.1% | 0.729 | Laptop, Pi 5 | Apache 2.0 |
| 🥈 2 | Maaza MLM-135M-JSON v1 CCT✓ | 135M | 46.8% | 0.534 | Pi 5, Browser | Apache 2.0 |
| 🥉 3 | DeepSeek-R1-Distill-Qwen-1.5B | 1.5B | 16.0% | 0.317 | GPU | MIT |
| 4 | Qwen 2.5-0.5B | 500M | 14.6% | 0.195 | CPU/GPU | Apache 2.0 |
| 5 | SmolLM2-360M (base) | 360M | 11.4% | 0.240 | Laptop | Apache 2.0 |
| 6 | Qwen 2.5-3B Instruct | 3B | 6.0% | 0.105 | GPU | Apache 2.0 |
| 7 | Phi-3.5-Mini Instruct | 3.8B | 2.0% | 0.031 | GPU | MIT |
| 8 | SmolLM2-135M (base) | 135M | 0.6% | 0.024 | Pi 5, Browser | Apache 2.0 |

Metrics: JSONExact = exact JSON match accuracy (primary metric) • Field F1 = per-field precision/recall • All models evaluated on EdgeJSON v3 (158 test cases, 24 schemas)
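The two metrics can be sketched in a few lines. The exact EdgeJSON scoring code is not published, so the function names and the field-matching rule below (a field counts when both key and value match the gold object exactly) are assumptions, not the official implementation:

```python
import json

def json_exact(pred: str, gold: dict) -> bool:
    """JSONExact: the prediction must parse and equal the gold object exactly."""
    try:
        return json.loads(pred) == gold
    except json.JSONDecodeError:
        return False

def field_f1(pred: str, gold: dict) -> float:
    """Field F1: harmonic mean of per-field precision and recall.

    A predicted field is correct when its key exists in gold with an
    equal value (assumed matching rule; real rubric may differ).
    """
    try:
        pred_obj = json.loads(pred)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(pred_obj, dict) or not pred_obj or not gold:
        return 0.0
    correct = sum(1 for k, v in pred_obj.items() if k in gold and gold[k] == v)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_obj)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under this rule, a model that gets one of two fields right scores Field F1 = 0.5 but JSONExact = 0, which is why the two columns in the table can diverge sharply.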

Latest Update (Nov 27, 2025): Added DeepSeek-R1-1.5B (JSON-optimized), Qwen 2.5-3B, and Phi-3.5-Mini baselines. Key finding: Maaza-360M (fine-tuned) outperforms DeepSeek-R1-1.5B (4.2× larger, JSON mode) by 3.4× overall.

Performance by Complexity: Maaza excels on simple (2-4 fields: 78.9%) and medium schemas (5-8 fields: 51.4%). DeepSeek-R1 achieved 0.0% on medium schemas despite "JSON mode" training, validating the specialist fine-tuning approach.

Testing Methodology: All models are benchmarked on the same hardware with standardized prompts. Zero-shot models (DeepSeek, Qwen, Phi) were tested with temperature=0.0 for deterministic output.
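A minimal harness for this kind of evaluation can be sketched as follows. The real EdgeJSON harness is not published, so `run_edgejson_eval` and the test-case shape are hypothetical; `generate` stands in for any model-inference callable (e.g. a wrapper around greedy decoding):

```python
import json
from typing import Callable

def run_edgejson_eval(generate: Callable[[str], str],
                      cases: list[dict]) -> dict:
    """Aggregate exact-match accuracy over a list of test cases.

    Each case is assumed to look like {"prompt": str, "gold": dict};
    `generate` maps a prompt to the model's raw text output.
    """
    exact = 0
    for case in cases:
        try:
            pred = json.loads(generate(case["prompt"]))
        except json.JSONDecodeError:
            pred = None  # unparseable output counts as a miss
        exact += pred == case["gold"]
    return {"json_exact": exact / len(cases), "n": len(cases)}
```

Because decoding is deterministic (temperature=0.0), re-running this loop with the same prompts should reproduce the reported scores exactly.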

Have a model to evaluate? Submit it to the leaderboard.

Request Evaluation

Professional SLM Evaluation

Independent, rigorous, transparent

Single Verification

$20
  • CCT✓ Verified badge on leaderboard
  • Official certificate (PDF)
  • Detailed evaluation report
  • 48-hour turnaround
  • Email support

Enterprise

$499 / month
  • Unlimited verifications
  • API access (automated evaluation)
  • Private leaderboard
  • Priority support (24hr response)
  • Custom benchmarks (1/quarter)
  • White-label reports

Interested in getting your model evaluated?