Production-Grade Edge AI Evaluation
Independent benchmarking for Small Language Models deployed at the edge. We evaluate what matters for production: function calling, structured extraction, and intent classification across real-world hardware platforms.
EdgeJSON for structured extraction, EdgeIntent for classification, EdgeFuncCall for tool use. Evaluate models on tasks that reflect real-world deployment requirements.
Explore benchmarks →
Compare leading SLMs across standardized tasks. Open methodology, reproducible results, no pay-to-play rankings. Transparency by design.
View leaderboard →
Independent testing for your SLM. Comprehensive reports with energy measurement, cross-platform validation, and competitive benchmarking.
Request evaluation →
Real-world structured JSON extraction performance
| Rank | Model | Size | JSONExact | Field F1 | Hardware | License |
|---|---|---|---|---|---|---|
| 🥇 1 | Maaza SLM-360M-JSON v1 CCT | 360M | 55.1% | 0.729 | Laptop, Pi 5 | Apache 2.0 |
| 🥈 2 | Maaza MLM-135M-JSON v1 CCT | 135M | 46.8% | 0.534 | Pi 5, Browser | Apache 2.0 |
| 🥉 3 | DeepSeek-R1-Distill-Qwen-1.5B | 1.5B | 16.0% | 0.317 | GPU | MIT |
| 4 | Qwen 2.5-0.5B | 500M | 14.6% | 0.195 | CPU/GPU | Apache 2.0 |
| 5 | SmolLM2-360M (base) | 360M | 11.4% | 0.240 | Laptop | Apache 2.0 |
| 6 | Qwen 2.5-3B Instruct | 3B | 6.0% | 0.105 | GPU | Apache 2.0 |
| 7 | Phi-3.5-Mini Instruct | 3.8B | 2.0% | 0.031 | GPU | MIT |
| 8 | SmolLM2-135M (base) | 135M | 0.6% | 0.024 | Pi 5, Browser | Apache 2.0 |
Metrics: JSONExact = exact JSON match accuracy (primary metric) • Field F1 = per-field precision/recall • All models evaluated on EdgeJSON v3 (158 test cases, 24 schemas)
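The exact scoring code for EdgeJSON is not shown here, but metrics like these are straightforward to compute. A minimal sketch, assuming JSONExact means the prediction parses and equals the gold object, and Field F1 treats a field as correct only when both key and value match:

```python
import json

def json_exact(pred: str, gold: dict) -> bool:
    """True if the prediction parses as JSON and matches gold exactly."""
    try:
        return json.loads(pred) == gold
    except (json.JSONDecodeError, TypeError):
        return False

def field_f1(pred: str, gold: dict) -> float:
    """Per-field F1: precision over predicted fields, recall over gold fields."""
    try:
        pred_obj = json.loads(pred)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    if not isinstance(pred_obj, dict) or not pred_obj or not gold:
        return 0.0
    correct = sum(1 for k, v in pred_obj.items() if gold.get(k) == v)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_obj)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, a prediction with one of two fields correct scores `field_f1('{"name": "Ada", "age": 35}', {"name": "Ada", "age": 36})` = 0.5, while any unparseable output scores 0 on both metrics, which is why raw base models score so low on JSONExact.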
Latest Update (Nov 27, 2025): Added DeepSeek-R1-1.5B (JSON-optimized), Qwen 2.5-3B, and Phi-3.5-Mini baselines. Key finding: Maaza-360M (fine-tuned) outperforms DeepSeek-R1-1.5B (4.2× larger, JSON mode) by 3.4× overall.
Performance by Complexity: Maaza excels on simple schemas (2-4 fields: 78.9%) and medium schemas (5-8 fields: 51.4%). DeepSeek-R1 scored 0.0% on medium schemas despite "JSON mode" training, validating the specialist fine-tuning approach.
Testing Methodology: All models benchmarked on the same hardware with standardized prompts. Zero-shot models (DeepSeek, Qwen, Phi) tested with temperature=0.0 for deterministic output.
Have a model to evaluate? Submit it to the leaderboard.
Request Evaluation
Independent, rigorous, transparent
Interested in getting your model evaluated?