TutorBench: Evaluating Large Language Models as Adaptive Tutors
ScaleJade Research, A. Pradana, M. Wibowo, S. Tan
Preprint · May 12, 2026
Abstract
Large language models are increasingly deployed as tutors, yet existing benchmarks reward correct final answers rather than effective teaching. We introduce TutorBench, a benchmark of 3,200 multi-turn tutoring sessions spanning mathematics, programming, and reading comprehension, annotated for pedagogical quality. TutorBench evaluates models along three axes — adaptivity to a learner's evolving state, correctness of guidance, and pedagogical soundness — using a rubric validated against expert educators. Across 14 frontier and open models, we find that answer accuracy correlates only weakly with teaching quality, and that the strongest models still over-explain, reveal answers prematurely, and fail to diagnose misconceptions. We release the benchmark, rubric, and an automated judge to support reproducible progress on AI tutoring.
Motivation
Tutoring is a teaching task, not a question-answering task. A model that immediately reveals the answer may score well on accuracy benchmarks while being a poor tutor. We set out to measure the gap between answering and teaching.
The Benchmark
TutorBench contains 3,200 multi-turn sessions across mathematics, programming, and reading comprehension. Each turn is annotated for adaptivity, correctness, and pedagogical soundness using a rubric validated against expert educators.
An automated judge reproduces expert ratings with high agreement, enabling low-cost, reproducible evaluation of new models.
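As an illustration of how agreement between an automated judge and expert raters might be quantified, the sketch below computes Cohen's kappa over paired categorical ratings. The source does not say which agreement statistic TutorBench reports, so the choice of kappa, the function name `cohens_kappa`, and the integer rating labels are all assumptions made for this example.

```python
from collections import Counter


def cohens_kappa(judge, expert):
    """Cohen's kappa for two equal-length sequences of categorical ratings.

    Hypothetical helper: the statistic actually used by TutorBench is not
    stated in the source.
    """
    if len(judge) != len(expert) or not judge:
        raise ValueError("rating lists must be non-empty and equal in length")
    n = len(judge)

    # Observed agreement: fraction of turns where both raters gave the same label.
    observed = sum(j == e for j, e in zip(judge, expert)) / n

    # Chance agreement: probability that both raters pick the same label
    # independently, given each rater's own label frequencies.
    judge_counts = Counter(judge)
    expert_counts = Counter(expert)
    expected = sum(judge_counts[k] * expert_counts[k] for k in judge_counts) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label everywhere; kappa is undefined,
        # so treat it as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy ratings for illustration only; these are not TutorBench data.
judge_ratings = [1, 2, 3, 1]
expert_ratings = [1, 2, 3, 1]
print(cohens_kappa(judge_ratings, expert_ratings))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one rating level dominates the annotations.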
Findings
Across 14 frontier and open models, final-answer accuracy correlates only weakly with teaching quality. The strongest models still over-explain, reveal answers prematurely, and fail to diagnose learner misconceptions.
Prompting for Socratic behavior helps modestly but does not close the gap, suggesting that teaching is a capability that must be measured and trained for directly.
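The relationship between final-answer accuracy and teaching quality across models could be checked with a rank correlation. The sketch below implements Spearman's coefficient in plain Python. The source does not specify which correlation measure was used, so Spearman is an assumption, and the function names and per-model score lists are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))


# Toy per-model scores for illustration only; these are not TutorBench results.
accuracy = [0.61, 0.74, 0.82, 0.90]
teaching_quality = [3.1, 2.8, 3.4, 3.0]
print(spearman(accuracy, teaching_quality))
```

A rank-based measure is used here because it does not assume that accuracy and rubric scores are linearly related or share a scale.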
Cite
@article{scalejade2026tutorbench,
title = {TutorBench: Evaluating Large Language Models as Adaptive Tutors},
author = {{ScaleJade Research} and Pradana, A. and Wibowo, M. and Tan, S.},
year = {2026},
note = {Preprint},
url = {https://www.scalejade.com/research/tutorbench}
}