Building Databases from Historical Documents with Smart Extract Models

Transcription is just the beginning: the real value of historical documents often lies in the structured data hidden within them. This workshop focuses on Smart Extract Models (SEM), Transkribus's framework for turning handwritten and printed sources directly into database-ready output. The session goes beyond a general introduction to SEM and dives into the part that determines success or failure: schema design. Participants will learn how to define a schema tailored to their source material, working with styles, named entities, nested regions, tables, page classification, and transcription layers, and will come to understand the architectural decisions that shape what a model can and cannot do. The workshop addresses SEM's key constraints honestly: high data requirements, the absence of coordinate output, and the reality that training an SEM is not the right choice for every project. Participants leave with a clear framework for evaluating whether SEM fits their use case and a practical foundation for building one if it does.