Benchmark Leaderboard PREVIEW

Multi-level evaluation results across the Bench Labs benchmark suite, measuring true generalization, transfer, and out-of-distribution reasoning.

๐Ÿ† June 2026 ยท 3 Benchmarks ยท 3 Models

๐Ÿ† Overall Rankings

Models ranked by normalized accuracy (acc_norm) across their respective benchmarks

| Rank | Model | Parameters | Benchmark | Difficulty | acc | acc_norm | Samples |
|---|---|---|---|---|---|---|---|
| 1 | Qwen2.5-0.5B | 0.5B | bench-mid-6-2026 | Mid | 58.06% ±3.98% | 60.00% ±3.95% | 155 |
| 2 | Glint-1.3 | ~1M | bench-effortless-6-2026 | Effortless | 37.50% ±3.10% | 32.90% ±3.00% | 240 |
| 3 | Qwen2.5-1.5B-Instruct | 1.5B | bench-easy-6-2026 | Easy | 75–85%* | n/a | 269 |

Key Insight

Direct comparison across difficulty tiers is not straightforward: a model scoring 60% on the Mid tier demonstrates stronger reasoning than one scoring 60% on Effortless. The tier system is designed as a progression: Effortless → Easy → Mid → Hard → UltraHard.

Benchmark Suite Overview

3 Benchmarks · 664 Total Samples · 6 Core Categories · 3 Models Tested

| Benchmark | Tier | Samples | Categories | Focus |
|---|---|---|---|---|
| bench-effortless-6-2026 | Effortless | 240 | Math, Logic, Language, Knowledge, Commonsense, Pattern Recognition | Basic reasoning & sanity check |
| bench-easy-6-2026 | Easy | 269 | 15+ subcategories | Easy-tier reasoning for small models |
| bench-mid-6-2026 | Mid | 155 | 17 subcategories across 6 domains | Generalization & abstraction |

Glint-1.3 on bench-effortless-6-2026

~982,656 parameters · Base model (no SFT) · CPU evaluation
Effortless

Evaluation via lm-eval-harness 0.4.12 · Multiple-choice loglikelihood scoring · Zero-shot

37.5% acc · 32.9% acc_norm · 25.3% Chance Baseline · 240 Samples

Per-Category Breakdown

Language 54.8% (acc) / 42.9% (acc_norm) · N=42
Commonsense 54.1% (acc) / 2.7% (acc_norm) · N=37
Pattern Recognition 35.1% (acc) / 35.1% (acc_norm) · N=37
Logic 35.7% (acc) / 64.3% (acc_norm) · N=42
Knowledge 26.2% (acc) / 31.0% (acc_norm) · N=42
Math 20.0% (acc) / 17.5% (acc_norm) · N=40
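
One way to judge whether a category score is distinguishable from guessing is a one-sample z-score against the chance baseline. The sketch below is illustrative only: it assumes the overall 25.3% baseline applies to every category (per-category baselines are not reported here), it uses a normal approximation, and the function name `z_vs_chance` is hypothetical.

```python
import math

CHANCE = 0.253  # overall chance baseline reported above; assumed to hold per category


def z_vs_chance(accuracy: float, n: int, chance: float = CHANCE) -> float:
    """Normal-approximation z-score of an observed accuracy against chance."""
    standard_error = math.sqrt(chance * (1.0 - chance) / n)
    return (accuracy - chance) / standard_error


# Raw acc and N copied from the per-category breakdown.
categories = {
    "Language": (0.548, 42),
    "Commonsense": (0.541, 37),
    "Pattern Recognition": (0.351, 37),
    "Logic": (0.357, 42),
    "Knowledge": (0.262, 42),
    "Math": (0.200, 40),
}

z_scores = {name: z_vs_chance(acc, n) for name, (acc, n) in categories.items()}
for name, z in z_scores.items():
    print(f"{name:20s} z = {z:+.2f}")
```

With roughly 40 items per category, a score that sits close to the baseline (Knowledge's raw acc, for example) produces a z-score well under 1, so it is hard to separate from noise at this sample size.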

Reading the acc vs acc_norm Gap

Commonsense (54% → 2.7%): Raw acc ties the majority-class baseline (always answering the most frequent polarity). Once length-normalized, it flips to systematic anti-correlation, which indicates the model is not reading the question.

Logic (36% → 64%): Flips the other way, with genuine signal hidden under length effects. The model lands on the correct answer more often once scores are length-normalized.

Genuine signal lives in Language, Pattern Recognition, and Knowledge: all three stay above the 25.3% chance baseline under both metrics, although Knowledge's raw acc (26.2%) clears it only narrowly.

| Category | acc | acc_norm | N | Signal |
|---|---|---|---|---|
| Language | 54.8% | 42.9% | 42 | ✓ Genuine |
| Commonsense | 54.1% | 2.7% | 37 | ✗ Frequency bias |
| Pattern Recognition | 35.1% | 35.1% | 37 | ✓ Genuine |
| Logic | 35.7% | 64.3% | 42 | ✓ Hidden signal |
| Knowledge | 26.2% | 31.0% | 42 | ✓ Above chance |
| Math | 20.0% | 17.5% | 40 | ✗ Below chance |
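
The difference between the two metrics can be sketched in a few lines. This is a minimal illustration, assuming acc picks the choice with the highest summed log-likelihood and acc_norm first divides each log-likelihood by the character length of its choice (which is how lm-eval-harness's multiple-choice metrics are generally understood to work). The choices and log-likelihood values are made up for illustration, and the helper names are hypothetical.

```python
def pick_raw(loglikelihoods):
    """Index of the choice with the highest summed log-likelihood (acc)."""
    return max(range(len(loglikelihoods)), key=lambda i: loglikelihoods[i])


def pick_normalized(loglikelihoods, choices):
    """Index of the choice with the highest per-character log-likelihood (acc_norm)."""
    return max(
        range(len(choices)),
        key=lambda i: loglikelihoods[i] / len(choices[i]),
    )


# Hypothetical item: a short choice and a long one.
choices = ["yes", "no, because the premise contradicts it"]
loglikelihoods = [-4.0, -12.0]  # invented summed token log-probabilities

raw_pick = pick_raw(loglikelihoods)
norm_pick = pick_normalized(loglikelihoods, choices)
print(f"raw pick: {choices[raw_pick]!r}, normalized pick: {choices[norm_pick]!r}")
```

The raw score favors the short choice because it has fewer characters to pay for. Per-character normalization removes that advantage, which is why a category can move in either direction between acc and acc_norm.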

Qwen2.5-1.5B-Instruct on bench-easy-6-2026

1.5B parameters · Instruct-tuned · 269 samples across 15+ categories
Easy

Easy-tier reasoning benchmark for small AI systems · Tests commonsense, language, knowledge, and pattern tasks

75–85% Strong Categories · 0–33% Weak Categories · 15+ Subcategories · 269 Samples

Strong Performance (75–85% accuracy)

Commonsense Simulation 75–85%
Commonsense Causality & Reasoning 75–85%
Language Transformation & Comprehension 75–85%
Knowledge Definitions 75–85%

Weak Performance (0–33% accuracy)

Symbolic Logic 0–33%
Math Pattern Continuation Low
Pattern Matching (raw) Low (format issues)

Key Finding

Many apparent "failures" were actually format compliance issues rather than reasoning errors. The model understood the patterns but wrapped answers in verbose explanations, failing exact-match evaluation. This highlights the importance of distinguishing between capability failures and protocol failures in benchmark design.
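
The distinction between a capability failure and a protocol failure can be made concrete by scoring the same output twice: once with strict exact match and once after extracting the final answer. The sketch below is a hypothetical scorer, not the one used for this benchmark; the extraction heuristic (look for "answer is" or "answer:", otherwise take the last non-empty line) and all function names are assumptions.

```python
import re

# Hypothetical heuristic: capture whatever follows "answer is" / "answer:" up to the end.
ANSWER_RE = re.compile(r"answer\s*(?:is|:)\s*(.+?)[.\n]*$", re.IGNORECASE)


def normalize(text: str) -> str:
    """Lowercase and drop surrounding whitespace and trailing periods."""
    return text.strip().rstrip(".").strip().lower()


def strict_match(output: str, gold: str) -> bool:
    """Exact match on the whole output, as in a strict generative protocol."""
    return normalize(output) == normalize(gold)


def extract_answer(output: str) -> str:
    """Pull a final answer out of a verbose response."""
    match = ANSWER_RE.search(output.strip())
    if match:
        return match.group(1)
    lines = [line for line in output.strip().splitlines() if line.strip()]
    return lines[-1] if lines else ""


def lenient_match(output: str, gold: str) -> bool:
    """Exact match on the extracted answer only."""
    return normalize(extract_answer(output)) == normalize(gold)


verbose_output = "The pattern doubles each time, so the answer is 16."
print(strict_match(verbose_output, "16"), lenient_match(verbose_output, "16"))
```

An item that fails `strict_match` but passes `lenient_match` points to a format problem, while an item that fails both is more likely a reasoning error. Reporting the two counts separately keeps the two failure types apart.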


Qwen2.5-0.5B on bench-mid-6-2026

0.5B parameters · Base model · CPU, bf16 · lm-eval 0.4.12
Mid

Evaluation of Qwen/Qwen2.5-0.5B on bench-labs/bench-mid-6-2026 using multiple-choice loglikelihood scoring

58.06% ±3.98% acc · 60.00% ±3.95% acc_norm · 58.71% ±3.94% soft_score · 61.03% ±3.88% soft_score_norm
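
The ± values for acc and acc_norm are consistent with a standard error of the mean over per-item 0/1 scores. The sketch below assumes an n−1 denominator and infers the correct-answer counts (90 and 93 of 155) from the reported percentages. Neither assumption is stated on this page, and `sample_stderr` is a hypothetical helper name.

```python
import math


def sample_stderr(correct: int, total: int) -> float:
    """Standard error of a proportion, using the n-1 (sample) denominator."""
    p = correct / total
    return math.sqrt(p * (1.0 - p) / (total - 1))


# Counts inferred from 58.06% and 60.00% of 155 items (assumption).
acc_stderr = sample_stderr(90, 155)
acc_norm_stderr = sample_stderr(93, 155)

print(f"acc      {100 * 90 / 155:.2f}% ±{100 * acc_stderr:.2f}%")
print(f"acc_norm {100 * 93 / 155:.2f}% ±{100 * acc_norm_stderr:.2f}%")
```

At 155 items the standard error is close to four percentage points, so gaps of a few points between metrics or models on this benchmark fall within one standard error.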

Full Category Breakdown (17 subcategories)

| Category | N | acc | acc_norm | soft_score | soft_score_norm |
|---|---|---|---|---|---|
| Commonsense: Causality | 5 | 100.0% | 100.0% | 100.0% | 100.0% |
| Commonsense: Reasoning | 10 | 70.0% | 60.0% | 71.0% | 62.0% |
| Commonsense: Simulation | 10 | 30.0% | 50.0% | 30.0% | 50.0% |
| Knowledge: Basic | 8 | 100.0% | 100.0% | 100.0% | 100.0% |
| Knowledge: Definitions | 10 | 70.0% | 90.0% | 70.0% | 90.0% |
| Language: Comprehension | 10 | 60.0% | 80.0% | 60.0% | 80.0% |
| Language: Structure | 10 | 0.0% | 10.0% | 0.0% | 10.0% |
| Language: Transformation | 10 | 50.0% | 60.0% | 59.0% | 64.0% |
| Logic: Consistency | 5 | 20.0% | 20.0% | 20.0% | 20.0% |
| Logic: Deduction | 10 | 50.0% | 50.0% | 50.0% | 50.0% |
| Logic: Pattern | 10 | 40.0% | 40.0% | 40.0% | 40.0% |
| Math: Arithmetic | 8 | 62.5% | 62.5% | 62.5% | 62.5% |
| Math: Pattern | 10 | 80.0% | 80.0% | 80.0% | 80.0% |
| Math: Reasoning | 10 | 60.0% | 50.0% | 60.0% | 50.0% |
| Pattern: Generation | 9 | 77.8% | 66.7% | 77.8% | 66.7% |
| Pattern: Matching | 10 | 100.0% | 80.0% | 100.0% | 90.0% |
| Pattern: Recognition | 10 | 30.0% | 30.0% | 30.0% | 30.0% |
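
As a consistency check, the headline acc and acc_norm can be recovered from the subcategory rows by micro-averaging, that is, summing correct answers across rows and dividing by the total item count. The sketch below assumes each percentage corresponds to a whole number of correct items, so `n * score` is rounded to the nearest integer. `micro_average` is a hypothetical helper.

```python
# (category, N, acc %, acc_norm %) copied from the subcategory table.
rows = [
    ("Commonsense: Causality", 5, 100.0, 100.0),
    ("Commonsense: Reasoning", 10, 70.0, 60.0),
    ("Commonsense: Simulation", 10, 30.0, 50.0),
    ("Knowledge: Basic", 8, 100.0, 100.0),
    ("Knowledge: Definitions", 10, 70.0, 90.0),
    ("Language: Comprehension", 10, 60.0, 80.0),
    ("Language: Structure", 10, 0.0, 10.0),
    ("Language: Transformation", 10, 50.0, 60.0),
    ("Logic: Consistency", 5, 20.0, 20.0),
    ("Logic: Deduction", 10, 50.0, 50.0),
    ("Logic: Pattern", 10, 40.0, 40.0),
    ("Math: Arithmetic", 8, 62.5, 62.5),
    ("Math: Pattern", 10, 80.0, 80.0),
    ("Math: Reasoning", 10, 60.0, 50.0),
    ("Pattern: Generation", 9, 77.8, 66.7),
    ("Pattern: Matching", 10, 100.0, 80.0),
    ("Pattern: Recognition", 10, 30.0, 30.0),
]


def micro_average(table, score_index):
    """Item-weighted accuracy (%) recovered from per-row percentages."""
    total_items = sum(row[1] for row in table)
    # Round to whole items because the table percentages are themselves rounded.
    total_correct = sum(round(row[1] * row[score_index] / 100.0) for row in table)
    return 100.0 * total_correct / total_items


overall_acc = micro_average(rows, 2)
overall_acc_norm = micro_average(rows, 3)
print(f"acc {overall_acc:.2f}%  acc_norm {overall_acc_norm:.2f}%")
```

The item counts sum to 155, and the recovered values match the reported 58.06% and 60.00%. With only 5 to 10 items per row, a single item moves a subcategory score by 10 to 20 percentage points.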

Performance Summary

Perfect scores (100%): Commonsense-causality and Knowledge-basic under both metrics; Pattern-matching on raw acc only (80% acc_norm)

Strong (>70%): Knowledge-definitions (90% norm), Math-pattern (80%), Language-comprehension (80% norm), Pattern-generation (77.8% acc)

Critical failures: Language-structure (0% acc, 10% norm), Logic-consistency (20%)

Cross-Benchmark Analysis

Comparing model capabilities across different difficulty tiers and evaluation protocols

Commonsense Reasoning

Qwen2.5-1.5B-Instruct (Easy) 75–85%
Qwen2.5-0.5B (Mid) 30–100%
Glint-1.3 (Effortless) 2.7–54.1%

Language Understanding

Qwen2.5-1.5B-Instruct (Easy) 75–85%
Qwen2.5-0.5B (Mid) 0–80%
Glint-1.3 (Effortless) 42.9–54.8%

Mathematics

Qwen2.5-1.5B-Instruct (Easy) Low
Qwen2.5-0.5B (Mid) 50–80%
Glint-1.3 (Effortless) 17.5–20%

Logic & Reasoning

Qwen2.5-1.5B-Instruct (Easy) 0–33%
Qwen2.5-0.5B (Mid) 20–50%
Glint-1.3 (Effortless) 35.7–64.3%
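
The Qwen2.5-0.5B ranges in this section appear to be the minimum and maximum over each domain's subcategory scores, pooling acc and acc_norm. That reading is an inference from the Mid-tier subcategory table rather than a stated method. The sketch below reproduces it with a hypothetical `domain_ranges` helper.

```python
from collections import defaultdict

# (domain, acc %, acc_norm %) for the Mid-tier subcategories in four domains.
mid_scores = [
    ("Commonsense", 100.0, 100.0),
    ("Commonsense", 70.0, 60.0),
    ("Commonsense", 30.0, 50.0),
    ("Language", 60.0, 80.0),
    ("Language", 0.0, 10.0),
    ("Language", 50.0, 60.0),
    ("Logic", 20.0, 20.0),
    ("Logic", 50.0, 50.0),
    ("Logic", 40.0, 40.0),
    ("Math", 62.5, 62.5),
    ("Math", 80.0, 80.0),
    ("Math", 60.0, 50.0),
]


def domain_ranges(scores):
    """Min and max per domain, pooling acc and acc_norm (assumed method)."""
    pooled = defaultdict(list)
    for domain, acc, acc_norm in scores:
        pooled[domain].extend([acc, acc_norm])
    return {domain: (min(values), max(values)) for domain, values in pooled.items()}


ranges = domain_ranges(mid_scores)
for domain, (low, high) in ranges.items():
    print(f"{domain:12s} {low:g}–{high:g}%")
```

Because each range mixes two metrics and several small subcategories, its width says more about spread within a domain than about a single comparable score.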

Key Takeaways

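1. Parameter count ≠ capability: Qwen2.5-0.5B scores higher overall on the harder Mid tier (60.00% acc_norm) than Glint-1.3 (~1M params) does on Effortless (32.90%), but not in every category: Glint-1.3's Logic range (35.7–64.3%) sits above Qwen2.5-0.5B's (20–50%). The two models also differ in architecture and training data, so the gap cannot be attributed to parameter count alone.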

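2. Instruction tuning and format compliance: Qwen2.5-1.5B-Instruct scores 75–85% on commonsense, language, and knowledge tasks but 0–33% on symbolic logic, a common failure mode for instruction-tuned models on formal reasoning tasks. Some of its low scores reflect verbose answers failing exact match rather than reasoning errors.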

3. Base models reveal hidden capabilities: Glint-1.3's Logic score jumps from 36% to 64% with length normalization, suggesting the model has genuine signal that's masked by evaluation artifacts.

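4. The tier system is built to expose one gap: a model that scores well on Effortless but poorly on Mid is "avoiding failure" rather than showing "true generalization." Each model here was run on a single tier, so that comparison has not yet been made within one model.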

Evaluation Protocol Comparison

| Protocol | Used By | Pros | Cons |
|---|---|---|---|
| lm-eval (MC loglikelihood) | Glint-1.3, Qwen2.5-0.5B | Standard, reproducible, handles base models correctly | Requires multiple-choice format; length bias in raw acc |
| Generative (exact match) | Qwen2.5-1.5B-Instruct | Tests real-world output format | Penalizes verbose models; format ≠ capability |