Ian Gregory
Ian Gregory is Distinguished Professor of Digital Humanities at Lancaster University. His main research interests are in how geospatial technologies can be used to better understand the humanities. He has written seven books, most recently Deep Mapping the Literary Lake District (with Joanna Taylor), and edited two collections. He has also published around 100 other articles, chapters and papers. He has been involved in around 25 funded projects including "Understanding space and time in narratives through qualitative representation, reasoning and visualisation", funded by the UK Economic & Social Research Council and the US National Science Foundation.
Session
There is a tendency in the spatial humanities to focus more on the modern era, where sources, both textual and cartographic, are comparatively rich. Recent studies, such as the Map of Early Modern London project (https://mapoflondon.uvic.ca/) and the Pelagios project (https://pelagios.org/), also show the potential for spatial humanities techniques in the Early Modern and Classical periods respectively. However, as yet, comparatively few similar studies have focused on the medieval period. This may be due in part to the unevenness of textual sources from the Middle Ages, and also to their variety and heterogeneity. This paper aims to address this gap by examining the potential of natural language processing (NLP) techniques for exploring an English medieval corpus.
The paper starts by describing the creation of a suitable corpus for analysis. This consists of a collection of digitised state and administrative records in the form of Calendars of Charter and Patent Rolls, originally printed in the 19th century but now accessible digitally via British History Online. Under the Plantagenet kings of England (12th-15th centuries CE), these records were the main written instruments through which the royal administration and its officials ruled the realm. They provide documentation on legal and bureaucratic decisions concerning a wide range of governance issues, including information on property, people and places. Thanks to a grant from the Joy Welch Fund, the authors have been using the Calendars as the basis for an exploratory study to ‘excavate’ hidden geographies of the Plantagenet realm, and in particular to determine (1) what places were of interest to English government in the 13th and 14th centuries, and (2) what contemporaries have to say about these places in the Calendars.
To answer these questions we began with a multi-pronged approach to simplifying the textual sources for machine processing. As with many NLP pipelines, we first downloaded the corpus data and converted it to consistent UTF-8 formatting and a simple XML schema representing the document structure. Inspired by existing approaches applied to modern texts by the Spatial Narratives project (https://spacetimenarratives.github.io/), we adapted a pipeline of NLP tools to lemmatise the input (match terms to their dictionary headwords), part-of-speech tag it (assign major word classes such as noun, verb, adjective and adverb), and semantically tag it with semantic fields from the PyMUSAS system (https://pypi.org/project/pymusas/) to group words and phrases together into major topic areas. We also cross-referenced locations in the text using the Survey of English Place Names (https://epns.nottingham.ac.uk), combining the modern and historic geographic names to create a geographic lemma.
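The idea of a geographic lemma can be illustrated with a minimal sketch. It assumes a small hand-made lookup table in place of the Survey of English Place Names, and the variant spellings, the `GAZETTEER` table and the function names are all hypothetical, invented for illustration rather than taken from the project's pipeline or its data.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical gazetteer: lower-cased variant spelling -> geographic lemma.
# The entries are illustrative only and are not drawn from the EPNS data.
GAZETTEER = {
    "lancastre": "Lancaster",
    "lancaster": "Lancaster",
    "everwyk": "York",
    "york": "York",
}


@dataclass
class Token:
    text: str                        # surface form as it appears in the entry
    geo_lemma: Optional[str] = None  # shared lemma for all variants of a place


def tag_places(text: str) -> List[Token]:
    """Split on whitespace and attach a geographic lemma where one is known."""
    tokens = []
    for raw in text.split():
        word = raw.strip(".,;:()")
        tokens.append(Token(word, GAZETTEER.get(word.lower())))
    return tokens


def places_mentioned(text: str) -> List[str]:
    """Return the distinct geographic lemmas found in a piece of text."""
    return sorted({t.geo_lemma for t in tag_places(text) if t.geo_lemma})
```

Calling `places_mentioned` on an entry that uses a historic spelling returns the same lemma as one that uses the modern spelling, which is what allows a later search to refer to a location in any of its forms.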
The resulting network of text, associated annotations and alternate forms was stored in a graph database (Neo4j) for subsequent exploratory analysis via collocation and other relationships. This enables us to take a very source-led approach to understanding the corpus and the geographies that it contains, as well as helping to support more hypothesis-driven approaches. Taking one example, that of medieval town formation, searching the graph from a list of manually selected seed terms relevant to town formation enabled us to retrieve the associated locations whichever of their forms appears in the text. In Figure 1 we present a fragment of the much larger graph, showing a subset of the full annotations available within our data, as the density of data even in a relatively small corpus precludes representing the entire graph.
Figure 1: A subgraph of a larger corpus-graph demonstrating the associated tags and their relations.
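A seed-term search of this kind might be sketched as follows. The sketch substitutes a plain in-memory adjacency structure for the graph database, and the entries, node labels and function names are hypothetical assumptions made for illustration; it is not the project's schema or query code.

```python
from collections import defaultdict

# Hypothetical calendar entries, already reduced to (term, place-or-None) pairs.
ENTRIES = {
    "entry-1": [("grant", None), ("market", None), ("Lancastre", "Lancaster")],
    "entry-2": [("pardon", None), ("Everwyk", "York")],
    "entry-3": [("borough", None), ("charter", None), ("York", "York")],
}


def build_graph(entries):
    """Link each term to the entries it occurs in and to its geographic lemma."""
    graph = defaultdict(set)
    for entry_id, tokens in entries.items():
        entry = ("entry", entry_id)
        for term_text, place in tokens:
            term = ("term", term_text.lower())
            graph[term].add(entry)
            graph[entry].add(term)
            if place:
                node = ("place", place)
                graph[term].add(node)
                graph[node].add(term)
    return graph


def places_near_seeds(graph, seeds):
    """For each seed term, collect places that occur in the same entries."""
    found = defaultdict(set)
    for seed in seeds:
        for entry in graph.get(("term", seed.lower()), ()):
            if entry[0] != "entry":
                continue
            for term in graph[entry]:
                for node in graph[term]:
                    if node[0] == "place":
                        found[node[1]].add(entry[1])
    return {place: sorted(ids) for place, ids in found.items()}
```

With the toy data above, `places_near_seeds(build_graph(ENTRIES), ["market", "borough"])` reports Lancaster and York by their lemmas, together with the entries in which they co-occur with a seed term.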
Using these techniques we can extract paths through the data connecting individual concepts, places, people, or other entities along either corpus tokens (the source texts) or conceptual relationships (links between times, places, normal forms and others). The next stage of work is to define graphlet or fragment templates and evaluate their relative accuracy for extracting the information. This paper reports on these findings and their significance, as well as the next steps for visualisations of social networks of the Plantagenet realm. Through these we seek to understand and demonstrate the potential digital tools have for exploring how geographically dispersed people and places across medieval Britain and Ireland were governed through spatial connections that linked them to the centralised rule of the monarch. In so doing, the paper also illustrates the potential for spatial humanities approaches in helping to understand medieval geographies.
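Path extraction of this kind can be illustrated with a breadth-first search over a toy graph. This is a sketch under stated assumptions: the edges, node names and the `shortest_path` function are hypothetical, and a graph database would normally supply its own path queries rather than this hand-written search.

```python
from collections import deque

# Hypothetical undirected edges. Two entries use different spellings of one
# place, and both spellings link to a shared normal form ("place:Lancaster").
EDGES = [
    ("entry-1", "market"),
    ("entry-1", "Lancastre"),
    ("Lancastre", "place:Lancaster"),
    ("entry-2", "Loncastre"),
    ("Loncastre", "place:Lancaster"),
    ("entry-2", "burgess"),
]


def shortest_path(edges, start, goal):
    """Breadth-first search; returns a list of nodes, or None if unconnected."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbours.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None
```

In this toy graph the only route from "market" to "burgess" runs through the shared normal form, so two terms that never co-occur in a single entry are still connected by a conceptual relationship.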