Bench Labs — Benchmark Leaderboard

🏆 Overall Rankings

Models ranked by normalized accuracy (acc_norm) across their respective benchmarks

Rank	Model	Parameters	Benchmark	Difficulty	acc ↕	acc_norm ↕	Samples
1	Qwen2.5-0.5B	0.5B	bench-mid-6-2026	Mid	58.06% ±3.98%	60.00% ±3.95%	155
2	Glint-1.3	~1M	bench-effortless-6-2026	Effortless	37.50% ±3.10%	32.90% ±3.00%	240
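3	Qwen2.5-1.5B-Instruct	1.5B	bench-easy-6-2026	Easy	75–85%*	—	269
* Range across the model's strong categories only; this model was scored by generative exact match, so no acc_norm is reported.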

💡 Key Insight

Direct comparison across difficulty tiers is not straightforward — a model scoring 60% on Mid-tier demonstrates stronger reasoning than one scoring 60% on Effortless. The tier system is designed as a progression: Effortless → Easy → Mid → Hard → UltraHard.

📐 Benchmark Suite Overview

Benchmarks 3 · Total Samples 664 · Core Categories 6 · Models Tested 3

Benchmark	Tier	Samples	Categories	Focus
bench-effortless-6-2026	Effortless	240	Math, Logic, Language, Knowledge, Commonsense, Pattern Recognition	Basic reasoning & sanity check
bench-easy-6-2026	Easy	269	15+ subcategories	Easy-tier reasoning for small models
bench-mid-6-2026	Mid	155	17 subcategories across 6 domains	Generalization & abstraction

Glint-1.3 on bench-effortless-6-2026

~982,656 parameters · Base model (no SFT) · CPU evaluation

Effortless

Evaluation via lm-eval-harness 0.4.12 · Multiple-choice loglikelihood scoring · Zero-shot

acc 37.5% · acc_norm 32.9% · Chance Baseline 25.3% · Samples 240

Per-Category Breakdown

Language 54.8% (acc) / 42.9% (acc_norm) · N=42

Commonsense 54.1% (acc) / 2.7% (acc_norm) · N=37

Pattern Recognition 35.1% (acc) / 35.1% (acc_norm) · N=37

Logic 35.7% (acc) / 64.3% (acc_norm) · N=42

Knowledge 26.2% (acc) / 31.0% (acc_norm) · N=42

Math 20.0% (acc) / 17.5% (acc_norm) · N=40
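As a consistency check, the per-category figures can be pooled back into the headline numbers. The sketch below assumes each percentage is a rounded ratio of whole correct answers to N; it recovers the integer counts and sums them. The helper name `pooled_accuracy` is illustrative and not part of any evaluation tool.

```python
# Per-category results for Glint-1.3 as listed above: (acc %, acc_norm %, N).
CATEGORIES = {
    "Language": (54.8, 42.9, 42),
    "Commonsense": (54.1, 2.7, 37),
    "Pattern Recognition": (35.1, 35.1, 37),
    "Logic": (35.7, 64.3, 42),
    "Knowledge": (26.2, 31.0, 42),
    "Math": (20.0, 17.5, 40),
}


def pooled_accuracy(rows, metric_index):
    """Recover integer correct-counts from rounded percentages, then pool them."""
    total_correct = 0
    total_n = 0
    for row in rows.values():
        n = row[2]
        total_correct += round(row[metric_index] / 100 * n)
        total_n += n
    return total_correct, total_n


for name, idx in (("acc", 0), ("acc_norm", 1)):
    correct, n = pooled_accuracy(CATEGORIES, idx)
    print(f"{name}: {correct}/{n} = {100 * correct / n:.1f}%")
```

Under that assumption the pooled counts are 90/240 for acc and 79/240 for acc_norm, which match the reported 37.5% and 32.9%.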

🔍 Reading the acc vs acc_norm Gap

Commonsense (54% → 2.7%): Raw acc ties the majority-class baseline (always answering the most frequent polarity). Once length-normalized, it flips to systematic anti-correlation — the model is not reading the question.

Logic (36% → 64%): Flips the other way — genuine signal hidden under length effects. The model lands on correct answers more often when normalized.

Language and Pattern Recognition stay above the 25.3% chance baseline under both metrics. Knowledge does too, but only marginally on raw acc (26.2%, N=42), so treat it as weak evidence.
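To make the acc versus acc_norm distinction concrete: in multiple-choice loglikelihood scoring, acc picks the choice with the highest total log-likelihood, while acc_norm divides each log-likelihood by the choice's length first. The sketch below is a simplified illustration, not the harness's code. It normalizes by character count, and the item, choices, and log-likelihood values are made up.

```python
def pick(choices, loglikelihoods, normalize=False):
    """Return the index of the highest-scoring choice.

    With normalize=True each log-likelihood is divided by the choice's
    length in characters, which offsets the penalty long answers pay
    for simply containing more text.
    """
    scores = [
        ll / len(choice) if normalize else ll
        for choice, ll in zip(choices, loglikelihoods)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical item: the correct answer (index 1) is the longest choice.
choices = ["No", "Yes, because both premises hold", "Maybe"]
loglikelihoods = [-4.0, -9.0, -6.5]  # made-up values for illustration

raw_choice = pick(choices, loglikelihoods)                   # short answer wins
norm_choice = pick(choices, loglikelihoods, normalize=True)  # long answer wins
print(raw_choice, norm_choice)
```

In this toy item the raw argmax lands on the short wrong answer and the normalized argmax lands on the long correct one, the same direction as the Logic flip. When correct answers tend to be the shorter ones, normalization can hurt instead, as in Commonsense.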

Category	acc	acc_norm	N	Signal
Language	54.8%	42.9%	42	✓ Genuine
Commonsense	54.1%	2.7%	37	✗ Frequency bias
Pattern Recognition	35.1%	35.1%	37	✓ Genuine
Logic	35.7%	64.3%	42	✓ Hidden signal
Knowledge	26.2%	31.0%	42	✓ Above chance
Math	20.0%	17.5%	40	✗ Below chance

Qwen2.5-1.5B-Instruct on bench-easy-6-2026

1.5B parameters · Instruct-tuned · 269 samples across 15+ categories

Easy

Easy-tier reasoning benchmark for small AI systems · Tests commonsense, language, knowledge, and pattern tasks

Strong Categories 75–85% · Weak Categories 0–33% · Subcategories 15+ · Samples 269

Strong Performance (75–85% accuracy)

Commonsense Simulation 75–85%

Commonsense Causality & Reasoning 75–85%

Language Transformation & Comprehension 75–85%

Knowledge Definitions 75–85%

Weak Performance (0–33% accuracy)

Symbolic Logic 0–33%

Math Pattern Continuation Low

Pattern Matching (raw) Low (format issues)

🔑 Key Finding

Many apparent "failures" were actually format compliance issues rather than reasoning errors. The model understood the patterns but wrapped answers in verbose explanations, failing exact-match evaluation. This highlights the importance of distinguishing between capability failures and protocol failures in benchmark design.
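The difference between a protocol failure and a capability failure can be sketched with two scorers. This is a hypothetical illustration, not the scoring code used for this benchmark: `exact_match` requires the whole output to equal the target, while `extracted_match` compares only the last number or standalone A–D letter it finds. The sample outputs are invented.

```python
import re


def exact_match(output: str, target: str) -> bool:
    """Strict protocol: the whole output must equal the target."""
    return output.strip() == target.strip()


def extracted_match(output: str, target: str) -> bool:
    """Lenient protocol: compare the last number or standalone A-D letter found."""
    tokens = re.findall(r"-?\d+(?:\.\d+)?|\b[A-D]\b", output)
    return bool(tokens) and tokens[-1] == target.strip()


# Invented outputs: one verbose but correct, one verbose and wrong.
verbose_right = "The sequence doubles each step, so the next term is 32."
verbose_wrong = "The sequence adds 14 each step, so the next term is 30."

for text in (verbose_right, verbose_wrong):
    print(exact_match(text, "32"), extracted_match(text, "32"))
```

Strict matching rejects both outputs, while the lenient scorer accepts only the one whose final answer is right. The lenient rule is deliberately crude (an output that ends with an unrelated number would fool it), so any real extraction rule would need to be fixed before evaluation and reported alongside the scores.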

Qwen2.5-0.5B on bench-mid-6-2026

0.5B parameters · Base model · CPU, bf16 · lm-eval 0.4.12

Mid

Evaluation of Qwen/Qwen2.5-0.5B on bench-labs/bench-mid-6-2026 using multiple-choice loglikelihood scoring

acc 58.06% ±3.98% · acc_norm 60.00% ±3.95% · soft_score 58.71% ±3.94% · soft_score_norm 61.03% ±3.88%
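The ± values on acc and acc_norm are consistent with a standard error of the mean over 0/1 item scores. The sketch below assumes the sample (n − 1) variance is used and that the counts are 90 and 93 correct out of 155, which is what the reported 58.06% and 60.00% imply. The soft_score errors cannot be reproduced this way because the per-item soft scores are not listed.

```python
import math


def binomial_stderr(correct: int, n: int) -> float:
    """Standard error of a mean of 0/1 scores, using the sample (n - 1) variance."""
    p = correct / n
    return math.sqrt(p * (1 - p) / (n - 1))


acc_se = binomial_stderr(90, 155)   # counts inferred from 58.06% of 155
norm_se = binomial_stderr(93, 155)  # counts inferred from 60.00% of 155
print(f"acc      {100 * 90 / 155:.2f}% ±{100 * acc_se:.2f}%")
print(f"acc_norm {100 * 93 / 155:.2f}% ±{100 * norm_se:.2f}%")
```

Both results round to the figures shown: ±3.98% for acc and ±3.95% for acc_norm.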

Full Category Breakdown (17 subcategories)

Category ↕	N ↕	acc ↕	acc_norm ↕	soft_score ↕	soft_score_norm ↕
Commonsense — Causality	5	100.0%	100.0%	100.0%	100.0%
Commonsense — Reasoning	10	70.0%	60.0%	71.0%	62.0%
Commonsense — Simulation	10	30.0%	50.0%	30.0%	50.0%
Knowledge — Basic	8	100.0%	100.0%	100.0%	100.0%
Knowledge — Definitions	10	70.0%	90.0%	70.0%	90.0%
Language — Comprehension	10	60.0%	80.0%	60.0%	80.0%
Language — Structure	10	0.0%	10.0%	0.0%	10.0%
Language — Transformation	10	50.0%	60.0%	59.0%	64.0%
Logic — Consistency	5	20.0%	20.0%	20.0%	20.0%
Logic — Deduction	10	50.0%	50.0%	50.0%	50.0%
Logic — Pattern	10	40.0%	40.0%	40.0%	40.0%
Math — Arithmetic	8	62.5%	62.5%	62.5%	62.5%
Math — Pattern	10	80.0%	80.0%	80.0%	80.0%
Math — Reasoning	10	60.0%	50.0%	60.0%	50.0%
Pattern — Generation	9	77.8%	66.7%	77.8%	66.7%
Pattern — Matching	10	100.0%	80.0%	100.0%	90.0%
Pattern — Recognition	10	30.0%	30.0%	30.0%	30.0%

📊 Performance Summary

Perfect scores (100%): Commonsense-causality and Knowledge-basic on every metric; Pattern-matching on raw acc only (80% norm)

Strong (>70%): Knowledge-definitions (90% norm), Math-pattern (80%), Language-comprehension (80% norm), Pattern-generation (77.8% acc, 66.7% norm)

Critical failures: Language-structure (0% acc, 10% norm), Logic-consistency (20%)
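Several of these subcategories have only 5 to 10 items, so a single answer moves the score by 10 to 20 points. One way to express that uncertainty is a Wilson score interval. The sketch below is an illustration, not part of the reported evaluation; the counts are derived from the percentages and N values in the table, and z = 1.96 corresponds to roughly 95% coverage.

```python
import math


def wilson_interval(correct: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


for label, correct, n in [
    ("Commonsense-causality", 5, 5),
    ("Logic-consistency", 1, 5),
    ("Language-structure (acc)", 0, 10),
]:
    low, high = wilson_interval(correct, n)
    print(f"{label}: {correct}/{n} -> {100 * low:.1f}% to {100 * high:.1f}%")
```

For a 5/5 result the lower bound comes out at roughly 57%, so a "perfect" score on five items is compatible with a much lower true accuracy.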

⚖️ Cross-Benchmark Analysis

Comparing model capabilities across different difficulty tiers and evaluation protocols

🧠 Commonsense Reasoning

Qwen2.5-1.5B (Easy) 75–85%

Qwen2.5-0.5B (Mid) 30–100%

Glint-1.3 (Effortless) 2.7–54.1%

🔤 Language Understanding

Qwen2.5-1.5B (Easy) 75–85%

Qwen2.5-0.5B (Mid) 0–80%

Glint-1.3 (Effortless) 42.9–54.8%

🧮 Mathematics

Qwen2.5-1.5B (Easy) Low

Qwen2.5-0.5B (Mid) 50–80%

Glint-1.3 (Effortless) 17.5–20%

🔗 Logic & Reasoning

Qwen2.5-1.5B (Easy) 0–33%

Qwen2.5-0.5B (Mid) 20–50%

Glint-1.3 (Effortless) 35.7–64.3%
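The Qwen2.5-0.5B ranges above appear to be the minimum and maximum over each domain's subcategories, taken across both acc and acc_norm. That reading is an inference from the numbers, not something the page states. The sketch below recomputes the ranges from the subcategory table under that assumption; `score_range` is an illustrative helper.

```python
# (acc %, acc_norm %) per subcategory for Qwen2.5-0.5B, copied from the
# full category breakdown for bench-mid-6-2026.
QWEN_05B = {
    "Commonsense": [(100.0, 100.0), (70.0, 60.0), (30.0, 50.0)],
    "Language": [(60.0, 80.0), (0.0, 10.0), (50.0, 60.0)],
    "Logic": [(20.0, 20.0), (50.0, 50.0), (40.0, 40.0)],
    "Math": [(62.5, 62.5), (80.0, 80.0), (60.0, 50.0)],
}


def score_range(pairs):
    """Min and max over every acc and acc_norm value in a domain."""
    values = [value for pair in pairs for value in pair]
    return min(values), max(values)


RANGES = {domain: score_range(pairs) for domain, pairs in QWEN_05B.items()}
for domain, (low, high) in RANGES.items():
    print(f"{domain}: {low:g}–{high:g}%")
```

The recomputed ranges agree with the four listed values (30–100%, 0–80%, 20–50%, 50–80%). A range this wide mostly reflects spread between subcategories, so it should not be compared directly with another model's single score.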

📋 Key Takeaways

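1. Parameter count and capability: Qwen2.5-0.5B has roughly 500× the parameters of Glint-1.3 (~1M) and scores higher on a harder tier in Mathematics and at the top of its Commonsense and Language ranges, but not in Logic, where Glint-1.3 reaches 64.3% acc_norm against Qwen2.5-0.5B's 20–50%. The two models ran on different benchmarks, so these results cannot separate the effect of parameter count from architecture and training data.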

2. Instruction tuning does not guarantee format compliance: Qwen2.5-1.5B-Instruct scores 75–85% on commonsense, language, and knowledge categories, yet some of its apparent failures came from verbose answers that failed exact-match scoring. Symbolic logic (0–33%) remains its weakest category.

3. Base models reveal hidden capabilities: Glint-1.3's Logic score jumps from 36% to 64% with length normalization, suggesting the model has genuine signal that's masked by evaluation artifacts.

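4. The tier system is meant to expose generalization gaps: a model that scores well on Effortless but poorly on Mid would show the difference between "avoiding failure" and "true generalization." No model here was run on more than one tier, so this has yet to be tested directly.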

Evaluation Protocol Comparison

Protocol	Used By	Pros	Cons
lm-eval (MC loglikelihood)	Glint-1.3, Qwen2.5-0.5B	Standard, reproducible, handles base models correctly	Requires multiple-choice format, length bias in raw acc
Generative (exact match)	Qwen2.5-1.5B-Instruct	Tests real-world output format	Penalizes verbose models, format ≠ capability