Applied AI · Evaluation · Education

TutorBench: Evaluating Large Language Models as Adaptive Tutors

ScaleJade Research, A. Pradana, M. Wibowo, S. Tan

Preprint · May 12, 2026

Abstract

Large language models are increasingly deployed as tutors, yet existing benchmarks reward correct final answers rather than effective teaching. We introduce TutorBench, a benchmark of 3,200 multi-turn tutoring sessions spanning mathematics, programming, and reading comprehension, annotated for pedagogical quality. TutorBench evaluates models along three axes — adaptivity to a learner's evolving state, correctness of guidance, and pedagogical soundness — using a rubric validated against expert educators. Across 14 frontier and open models, we find that answer accuracy correlates only weakly with teaching quality, and that the strongest models still over-explain, reveal answers prematurely, and fail to diagnose misconceptions. We release the benchmark, rubric, and an automated judge to support reproducible progress on AI tutoring.

Motivation

Tutoring is a teaching task, not a question-answering task. A model that immediately reveals the answer may score well on accuracy benchmarks while being a poor tutor. We set out to measure the gap between answering and teaching.

The Benchmark

TutorBench contains 3,200 multi-turn sessions across mathematics, programming, and reading comprehension. Each turn is annotated for adaptivity, correctness, and pedagogical soundness using a rubric validated against expert educators.
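To make the annotation structure concrete, here is a minimal sketch of how one annotated session might be represented and summarized. The class names, field names, and the integer rating scale are illustrative assumptions; the released data format may differ.

```python
from dataclasses import dataclass, field
from statistics import mean

# The three rubric axes named in the text. The attribute names below are
# hypothetical and not taken from the released benchmark.
AXES = ("adaptivity", "correctness", "pedagogical_soundness")


@dataclass
class TurnAnnotation:
    # Assumed integer ratings; the actual rubric scale is not stated here.
    adaptivity: int
    correctness: int
    pedagogical_soundness: int


@dataclass
class Turn:
    learner_message: str
    tutor_message: str
    annotation: TurnAnnotation


@dataclass
class Session:
    session_id: str
    subject: str  # e.g. "mathematics", "programming", "reading_comprehension"
    turns: list[Turn] = field(default_factory=list)


def session_axis_means(session: Session) -> dict[str, float]:
    """Average each rubric axis over the turns of one session."""
    if not session.turns:
        raise ValueError("session has no turns to average")
    return {
        axis: mean(getattr(turn.annotation, axis) for turn in session.turns)
        for axis in AXES
    }
```

Averaging per axis rather than collapsing to a single score keeps the three dimensions separable, which matters if a model is correct but pedagogically weak.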

An automated judge reproduces expert ratings with high agreement, enabling low-cost, reproducible evaluation of new models.
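The text does not say which agreement statistic is used. As one common option, the sketch below computes Cohen's kappa between expert and judge ratings of the same turns; the function name and inputs are assumptions for illustration, not the benchmark's evaluation code.

```python
from collections import Counter


def cohens_kappa(expert: list[int], judge: list[int]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    if len(expert) != len(judge) or not expert:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(expert)

    # Fraction of items on which the two raters give the same label.
    observed = sum(e == j for e, j in zip(expert, judge)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    expert_counts = Counter(expert)
    judge_counts = Counter(judge)
    expected = sum(expert_counts[c] * judge_counts[c] for c in expert_counts) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined, so this sketch reports full agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Unweighted kappa treats every disagreement as equally severe. For ordinal rubric scores, a weighted variant that penalizes near-misses less may be a better fit.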

Findings

Across 14 frontier and open models, final-answer accuracy correlates only weakly with teaching quality. The strongest models still over-explain, reveal answers prematurely, and fail to diagnose learner misconceptions.
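One way to quantify such a relationship is a rank correlation between per-model accuracy and per-model teaching-quality scores. The sketch below implements Spearman's rho in plain Python; the choice of Spearman and the function names are assumptions, since the text does not specify the statistic used.

```python
from statistics import mean


def average_ranks(values: list[float]) -> list[float]:
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho: Pearson correlation computed on ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))
```

Calling `spearman(accuracy_scores, teaching_scores)` with one entry per model yields a value in [-1, 1]; values near zero would correspond to the weak relationship described above. With only 14 models, any such estimate carries wide uncertainty.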

Prompting for Socratic behavior helps modestly but does not close the gap, suggesting that teaching is a capability that must be measured and trained for directly.

Cite

@article{scalejade2026tutorbench,
  title  = {TutorBench: Evaluating Large Language Models as Adaptive Tutors},
  author = {ScaleJade Research and Pradana, A. and Wibowo, M. and Tan, S.},
  year   = {2026},
  note   = {Preprint},
  url    = {https://www.scalejade.com/research/tutorbench}
}