2023-10-11, News Room
The British Library had 22,000 pages recording 18th-century parliamentary acts, digitised from microfilm. They wanted these pages indexed and catalogued for a project, and the solution combined machine learning, conventional programming and... human input.
After running ML-based OCR, we devised an approach mixing two techniques: machine-learning image recognition to classify different kinds of pages, and bespoke heuristic programming which analysed the OCR output to segment and extract particular text elements. But when we applied it to the full data set we found some aspects of the problem which weren't susceptible to either technique; it was most efficient to use human eyeballs to answer these questions. So a third aspect was developing simple workbenches (mostly using Google Sheets) that let a human operator play their part as efficiently as possible.
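To give a flavour of the heuristic step and its hand-off to a human, here is a minimal Python sketch. It is an illustration, not the project's actual code: the rule that a title begins with "An Act" and runs to the next blank line, and the names `extract_title` and `Extraction`, are assumptions made for this example.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumption for illustration: a title starts on a line beginning "An Act"
# and continues until the next blank line of OCR output.
TITLE_START = re.compile(r"^\s*An\s+Act\b", re.IGNORECASE)


@dataclass
class Extraction:
    title: Optional[str]   # extracted title text, or None if the rule failed
    needs_human: bool      # True when the page should go to an operator


def extract_title(ocr_lines: List[str]) -> Extraction:
    """Hypothetical heuristic: pull an act title out of one page of OCR lines."""
    collected: List[str] = []
    for line in ocr_lines:
        if not collected:
            # Still looking for the start of the title.
            if TITLE_START.match(line):
                collected.append(line.strip())
        elif line.strip():
            # Inside the title: keep gathering non-blank lines.
            collected.append(line.strip())
        else:
            # First blank line after the title ends it.
            break
    if not collected:
        # The rule did not fire, so queue the page for human review.
        return Extraction(title=None, needs_human=True)
    return Extraction(title=" ".join(collected), needs_human=False)
```

Pages where the rule does not fire are flagged rather than guessed at, which is the point at which a human workbench takes over.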
The end result was a pipeline combining humans and computers which processed 20,000 images to generate over 1,500 catalogue entries.
In this session, we will describe the challenge, how our solution worked - and where it still falls short. We will discuss the expected and unexpected messiness of the data, the need for programmers who can do data entry, and when pragmatic reality should override programmer hubris! We also wish to discuss the importance of tooling, creating workbenches for evaluation and adjusting approaches, and how Google Sheets can be a powerful assistant in this kind of work, especially in combination with the IIIF imaging standard.
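As a rough sketch of how Google Sheets and IIIF can work together, the Python below builds IIIF Image API URLs for a list of pages and writes a CSV in which each row carries an `=IMAGE(...)` formula, so an operator sees a preview next to an empty answer column once the file is imported into a sheet. The base URL, the identifiers, the `is_title_page` column and the function names are all hypothetical; the URL layout follows the IIIF Image API pattern `{identifier}/{region}/{size}/{rotation}/{quality}.{format}`, and it is assumed that the sheet evaluates formulas on import.

```python
import csv
import io
from typing import Iterable, Iterator, List
from urllib.parse import quote


def iiif_image_url(base: str, identifier: str, region: str = "full",
                   size: str = "600,", rotation: str = "0",
                   quality: str = "default", fmt: str = "jpg") -> str:
    """Build an IIIF Image API URL; the identifier is percent-encoded."""
    ident = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{ident}/{region}/{size}/{rotation}/{quality}.{fmt}"


def review_rows(base: str, identifiers: Iterable[str]) -> Iterator[List[str]]:
    """Yield a header row, then one row per page for a human to fill in."""
    # "is_title_page" is a made-up example question for the operator.
    yield ["identifier", "preview", "is_title_page"]
    for ident in identifiers:
        url = iiif_image_url(base, ident)
        # Google Sheets' IMAGE function renders the picture inside the cell.
        yield [ident, f'=IMAGE("{url}")', ""]


def to_csv(rows: Iterable[List[str]]) -> str:
    """Serialise rows to CSV text suitable for importing into a spreadsheet."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical server and page identifiers.
    print(to_csv(review_rows("https://example.org/iiif", ["page-0001", "page-0002"])))
```

Asking for a fixed-width derivative (`600,`) rather than the full image keeps the sheet responsive while leaving the page legible enough for a quick yes/no decision.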