Multi-level evaluation results across the Bench Labs benchmark suite, measuring true generalization, transfer, and out-of-distribution reasoning.
June 2026 · 3 Benchmarks · 3 Models

| Rank | Model | Parameters | Benchmark | Difficulty | acc | acc_norm | Samples |
|---|---|---|---|---|---|---|---|
| 1 | Qwen2.5-0.5B | 0.5B | bench-mid-6-2026 | Mid | 58.06% ±3.98% | 60.00% ±3.95% | 155 |
| 2 | Glint-1.3 | ~1M | bench-effortless-6-2026 | Effortless | 37.50% ±3.10% | 32.90% ±3.00% | 240 |
| 3 | Qwen2.5-1.5B-Instruct | 1.5B | bench-easy-6-2026 | Easy | 75–85%* | - | 269 |
Direct comparison across difficulty tiers is not straightforward: a model scoring 60% on Mid-tier demonstrates stronger reasoning than one scoring 60% on Effortless. The tier system is designed as a progression: Effortless → Easy → Mid → Hard → UltraHard.
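The ± values in the results table are consistent with the sample standard error of a mean of 0/1 scores. The sketch below assumes an n - 1 denominator; that is an inference from the reported numbers, not something the table states, and `accuracy_stderr` is an illustrative name.

```python
import math


def accuracy_stderr(acc: float, n: int) -> float:
    """Standard error of the mean for n binary (0/1) scores.

    Assumption: sample variance with an n - 1 denominator.
    """
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


# Mid-tier row from the results table: 155 samples.
print(f"acc      ±{accuracy_stderr(0.5806, 155):.2%}")  # ±3.98%
print(f"acc_norm ±{accuracy_stderr(0.6000, 155):.2%}")  # ±3.95%
```

With 155 samples, an interval of about ±4 points means that small differences between runs on this tier are within noise.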
| Benchmark | Tier | Samples | Categories | Focus |
|---|---|---|---|---|
| bench-effortless-6-2026 | Effortless | 240 | Math, Logic, Language, Knowledge, Commonsense, Pattern Recognition | Basic reasoning & sanity check |
| bench-easy-6-2026 | Easy | 269 | 15+ subcategories | Easy-tier reasoning for small models |
| bench-mid-6-2026 | Mid | 155 | 17 subcategories across 6 domains | Generalization & abstraction |
Commonsense (54% → 2.7%): Raw acc ties the majority-class baseline (always answering the most frequent polarity). Once length-normalized, it flips to systematic anti-correlation: the model is not reading the question.
Logic (36% → 64%): Flips the other way, with genuine signal hidden under length effects. The model lands on correct answers more often when normalized.
Genuine signal lives in Language, Pattern Recognition, and Knowledge, which show consistent above-chance performance under both metrics.
| Category | acc | acc_norm | N | Signal |
|---|---|---|---|---|
| Language | 54.8% | 42.9% | 42 | Genuine |
| Commonsense | 54.1% | 2.7% | 37 | Frequency bias |
| Pattern Recognition | 35.1% | 35.1% | 37 | Genuine |
| Logic | 35.7% | 64.3% | 42 | Hidden signal |
| Knowledge | 26.2% | 31.0% | 42 | Above chance |
| Math | 20.0% | 17.5% | 40 | Below chance |
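The gap between acc and acc_norm in the table above comes from how the predicted choice is selected. This is a minimal sketch, assuming acc takes the choice with the highest total log-likelihood and acc_norm divides each log-likelihood by the choice's length in characters. The function names and the toy numbers are illustrative and are not taken from the benchmark.

```python
def pick_raw(loglikelihoods: list[float]) -> int:
    """Index of the choice with the highest total log-likelihood (acc)."""
    return max(range(len(loglikelihoods)), key=lambda i: loglikelihoods[i])


def pick_length_normalized(loglikelihoods: list[float], choices: list[str]) -> int:
    """Index of the choice with the highest per-character log-likelihood (acc_norm).

    Assumption: normalization by character count of the choice text.
    """
    return max(
        range(len(choices)),
        key=lambda i: loglikelihoods[i] / len(choices[i]),
    )


# Toy item: the short choice wins on raw score only because it has fewer tokens to pay for.
choices = ["no", "yes, because the premise entails it"]
loglikelihoods = [-4.0, -9.0]

print(pick_raw(loglikelihoods))                         # 0
print(pick_length_normalized(loglikelihoods, choices))  # 1
```

When answer lengths correlate with the correct label, the two rules can disagree in either direction, which is one way to read the opposite swings in the Commonsense and Logic rows.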
Many apparent "failures" were actually format compliance issues rather than reasoning errors. The model understood the patterns but wrapped answers in verbose explanations, failing exact-match evaluation. This highlights the importance of distinguishing between capability failures and protocol failures in benchmark design.
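To make the capability-versus-protocol distinction concrete, here is a minimal sketch that contrasts strict exact-match scoring with a more lenient check that looks for the gold answer as a whole token inside a verbose response. Both functions and the sample output are hypothetical illustrations, not the scoring code used for these results.

```python
import re


def exact_match(output: str, gold: str) -> bool:
    """Strict protocol: the whole output must equal the gold answer."""
    return output.strip().lower() == gold.strip().lower()


def lenient_match(output: str, gold: str) -> bool:
    """Lenient protocol: the gold answer appears as a standalone token."""
    pattern = r"(?<!\w)" + re.escape(gold.strip().lower()) + r"(?!\w)"
    return re.search(pattern, output.lower()) is not None


verbose = "The next number in the sequence is 16, because each term doubles."
print(exact_match(verbose, "16"))    # False: a protocol failure
print(lenient_match(verbose, "16"))  # True: the answer is present
```

Reporting both scores side by side separates items the model got wrong from items it answered in the wrong format. The lenient check can over-credit outputs that mention the gold answer only in passing, so it complements the strict score rather than replacing it.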
| Category | N | acc | acc_norm | soft_score | soft_score_norm |
|---|---|---|---|---|---|
| Commonsense - Causality | 5 | 100.0% | 100.0% | 100.0% | 100.0% |
| Commonsense - Reasoning | 10 | 70.0% | 60.0% | 71.0% | 62.0% |
| Commonsense - Simulation | 10 | 30.0% | 50.0% | 30.0% | 50.0% |
| Knowledge - Basic | 8 | 100.0% | 100.0% | 100.0% | 100.0% |
| Knowledge - Definitions | 10 | 70.0% | 90.0% | 70.0% | 90.0% |
| Language - Comprehension | 10 | 60.0% | 80.0% | 60.0% | 80.0% |
| Language - Structure | 10 | 0.0% | 10.0% | 0.0% | 10.0% |
| Language - Transformation | 10 | 50.0% | 60.0% | 59.0% | 64.0% |
| Logic - Consistency | 5 | 20.0% | 20.0% | 20.0% | 20.0% |
| Logic - Deduction | 10 | 50.0% | 50.0% | 50.0% | 50.0% |
| Logic - Pattern | 10 | 40.0% | 40.0% | 40.0% | 40.0% |
| Math - Arithmetic | 8 | 62.5% | 62.5% | 62.5% | 62.5% |
| Math - Pattern | 10 | 80.0% | 80.0% | 80.0% | 80.0% |
| Math - Reasoning | 10 | 60.0% | 50.0% | 60.0% | 50.0% |
| Pattern - Generation | 9 | 77.8% | 66.7% | 77.8% | 66.7% |
| Pattern - Matching | 10 | 100.0% | 80.0% | 100.0% | 90.0% |
| Pattern - Recognition | 10 | 30.0% | 30.0% | 30.0% | 30.0% |
Perfect scores (100% acc): Commonsense-causality, Knowledge-basic, Pattern-matching (80% norm)
Strong (>70%): Knowledge-definitions (90% norm), Math-pattern (80%), Language-comprehension (80% norm), Pattern-generation (77.8%)
Critical failures: Language-structure (0% acc, 10% norm), Logic-consistency (20%)
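As a consistency check, the subcategory rows can be pooled back into an overall score by weighting each row by its N. The sketch below assumes the per-subcategory table reports the bench-mid-6-2026 run (its N values sum to 155) and that each percentage corresponds to a whole number of correct items; `pooled` is an illustrative helper.

```python
# (N, acc %, acc_norm %) for each subcategory row, in table order.
rows = [
    (5, 100.0, 100.0), (10, 70.0, 60.0), (10, 30.0, 50.0),
    (8, 100.0, 100.0), (10, 70.0, 90.0), (10, 60.0, 80.0),
    (10, 0.0, 10.0), (10, 50.0, 60.0), (5, 20.0, 20.0),
    (10, 50.0, 50.0), (10, 40.0, 40.0), (8, 62.5, 62.5),
    (10, 80.0, 80.0), (10, 60.0, 50.0), (9, 77.8, 66.7),
    (10, 100.0, 80.0), (10, 30.0, 30.0),
]


def pooled(rows, index):
    """Recover (correct, total) by converting each row's percentage to a count."""
    total = sum(row[0] for row in rows)
    correct = sum(round(row[0] * row[index] / 100) for row in rows)
    return correct, total


for name, index in (("acc", 1), ("acc_norm", 2)):
    correct, total = pooled(rows, index)
    print(f"{name}: {correct}/{total} = {correct / total:.2%}")
# acc: 90/155 = 58.06%
# acc_norm: 93/155 = 60.00%
```

Both pooled values match the Mid-tier row of the results table. With only 5 to 10 items per subcategory, a single item moves a row by 10 to 20 points, so the per-row percentages are best read as rough indicators.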
1. Parameter count ≠ capability: Qwen2.5-0.5B outperforms Glint-1.3 (~1M params) across all categories on harder tasks, showing architecture and training data matter more than raw parameters for base models.
2. Instruction tuning helps format compliance: Qwen2.5-1.5B-Instruct shows strong reasoning but struggles with symbolic logic, a common failure mode for instruction-tuned models on formal reasoning tasks.
3. Base models reveal hidden capabilities: Glint-1.3's Logic score jumps from 36% to 64% with length normalization, suggesting the model has genuine signal that's masked by evaluation artifacts.
4. The tier system works: Models that score well on Effortless but poorly on Mid reveal the gap between "avoiding failure" and "true generalization."
| Protocol | Used By | Pros | Cons |
|---|---|---|---|
| lm-eval (MC loglikelihood) | Glint-1.3, Qwen2.5-0.5B | Standard, reproducible, handles base models correctly | Requires multiple-choice format, length bias in raw acc |
| Generative (exact match) | Qwen2.5-1.5B-Instruct | Tests real-world output format | Penalizes verbose models, format ≠ capability |