Human in the Middle Annotations — Evaluating LLM- and RAG-based Pipelines for Named Entity Recognition in 18th-Century Documents

How much human correction does an AI-generated entity tag actually require? This study tests leading commercial models (GPT-5, Claude 4.5, Gemini 2.5) against open-source RAG-based systems on the "Book of Orders" (1740), using manual intervention rate rather than F1-score as the primary measure of success. The finding: RAG-based systems that draw on project-specific gold-standard data substantially reduce editorial workload, though open questions remain about context windows, XML schema validity, and the trade-off between commercial convenience and open-source transparency.
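
The summary above names manual intervention rate as the primary metric but does not define it. As a rough illustration only, the sketch below assumes one plausible definition: the number of editorial actions (relabelling a tag, deleting a spurious tag, inserting a missed tag) divided by the number of entities in the gold standard. The `Entity` type, the function name, and the sample spans are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical entity annotation: character offsets plus a label."""
    start: int
    end: int
    label: str


def manual_intervention_rate(predicted, gold):
    """Editorial actions needed per gold entity (assumed definition).

    An action is counted once for each relabelled, deleted, or inserted tag.
    """
    pred_by_span = {(e.start, e.end): e.label for e in predicted}
    gold_by_span = {(e.start, e.end): e.label for e in gold}
    if not gold_by_span:
        raise ValueError("gold standard must contain at least one entity")

    # Span is correct but the label must be changed by the editor.
    relabels = sum(
        1 for span, label in pred_by_span.items()
        if span in gold_by_span and gold_by_span[span] != label
    )
    # Predicted span has no counterpart in the gold standard.
    deletions = sum(1 for span in pred_by_span if span not in gold_by_span)
    # Gold span was missed by the model and must be added by hand.
    insertions = sum(1 for span in gold_by_span if span not in pred_by_span)

    return (relabels + deletions + insertions) / len(gold_by_span)


# Made-up example data, for illustration only.
gold = [
    Entity(0, 4, "person"),
    Entity(10, 16, "place"),
    Entity(20, 24, "date"),
    Entity(30, 35, "person"),
]
predicted = [
    Entity(0, 4, "person"),    # correct
    Entity(10, 16, "person"),  # wrong label: one relabel
    Entity(20, 24, "date"),    # correct
    Entity(40, 45, "place"),   # spurious: one deletion
]                              # (30, 35) missed: one insertion

rate = manual_intervention_rate(predicted, gold)
```

Under this assumed definition the rate can exceed 1.0 when a system produces many spurious tags, which is one way it differs from F1: it tracks the editor's effort, not the overlap between prediction and gold standard.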