Towards a Multi-Dimensional Benchmark for Historical HTR

Is historical handwritten text recognition (HTR) "solved"? Not even close, but current benchmarks make it hard to see why. This poster proposes a multi-dimensional evaluation framework that moves beyond character error rate (CER) and word error rate (WER) to assess layout analysis quality, semantic accuracy, reliability, efficiency, and environmental footprint, while systematically categorising input sources by script, genre, layout complexity, and carrier condition. The result is a "Living Benchmark": a continuously updated repository of gold-standard materials that gives researchers a clear, visual diagnostic for choosing the right model for their specific corpus, and for seeing what that choice costs.
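
For readers less familiar with the two baseline metrics named above, the following is a minimal sketch of how CER and WER are commonly computed: the Levenshtein edit distance between a reference transcription and a model's output, divided by the reference length in characters or in whitespace-separated words. The function names and the example strings are illustrative assumptions, not part of the proposed framework, and real evaluations usually add normalisation steps (Unicode form, whitespace, abbreviation handling) that are omitted here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edits per reference character."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word-level edits per reference word."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical example: one wrong character in a three-word line.
reference = "the old charter"
hypothesis = "the odd charter"
print(f"CER = {cer(reference, hypothesis):.3f}")
print(f"WER = {wer(reference, hypothesis):.3f}")
```

In this hypothetical line, one substituted character out of 15 gives a CER of 1/15, while the same error makes one word out of three wrong and so gives a WER of 1/3. Both numbers describe transcription accuracy only and say nothing about layout, reliability, or cost, which is the gap the additional dimensions are meant to cover.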