Spatial Humanities 2024

MEHDIE - Data Integration Tools for Spatial Humanities in the Middle East
2024-09-27, MG1/02.05

Ancient civilizations in the Middle East overlapped one another in time and space. Many of their writings survive as unstructured text in ancient languages or ancient dialects of modern languages. Humanities scholars studying these civilizations must therefore specialize in one or a few of these languages, which isolates them from other researchers and limits them to the sources they can read. Recent efforts to extract information from historical sources into databases of place names (toponyms) and of people (prosopographies) have followed this pattern, focusing on one civilization or even one scholar. For example, the Syriaca project (Gibson et al., 2017) presents a comprehensive gazetteer extracted from sources in the Syriac language. If we wish to allow scholars to interact with the works of others and integrate them with their own, we need a common database. A prominent initiative to create such a database is the World Historical Gazetteer (WHG; Grossner and Mostern, 2021), which allows researchers to upload their toponyms to a common repository. However, beyond linking their data to Wikidata, the WHG does not provide a means to link one scholar’s work to another’s.
The Middle East Heritage Data Integration Endeavor (MEHDIE; Rusinek et al., 2023) is a project that aims to create language-aware spatial data integration tools for aligning and matching datasets created from different historical sources and for different purposes. We use several approaches to perform the matching. A syntactic approach augments the known name variants with variants transliterated into other Semitic languages, then compares variant pairs syntactically within the same script type. A phonetic approach converts the toponym titles to their International Phonetic Alphabet representation. A machine-learning approach uses a shared embedding space created for the languages and scripts of the Middle East, allowing a semantic comparison between the meanings of the names. Finally, a graph-based approach uses related places to assess similarity. Related places are those whose distance or hierarchical relation to a place is known. We perform graph learning over the resulting place-relations graph to calculate the similarity between sub-graphs.
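The syntactic and graph-based ideas above can be illustrated with a minimal sketch. It is not the MEHDIE implementation: the function names are hypothetical, the name variants are assumed to be already transliterated into a single script, a generic string-similarity ratio from the Python standard library stands in for the syntactic comparison, and a plain Jaccard overlap of related-place sets stands in for graph learning.

```python
from difflib import SequenceMatcher
from itertools import product
import unicodedata


def normalize(name: str) -> str:
    """Lowercase, strip diacritics, and drop non-alphanumeric characters."""
    decomposed = unicodedata.normalize("NFKD", name.lower())
    # Combining marks produced by NFKD are not alphanumeric, so they are dropped.
    return "".join(ch for ch in decomposed if ch.isalnum())


def variant_similarity(variants_a, variants_b) -> float:
    """Best string-similarity ratio (0.0 to 1.0) over all variant pairs.

    Assumption: both lists hold variants in the same script.
    """
    best = 0.0
    for a, b in product(variants_a, variants_b):
        score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        best = max(best, score)
    return best


def neighbour_overlap(related_a, related_b) -> float:
    """Jaccard overlap of two related-place sets.

    A deliberately simple stand-in for graph learning over the
    place-relations graph; it only counts shared related places.
    """
    a, b = set(related_a), set(related_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Illustrative variant lists, not taken from any MEHDIE dataset.
    print(variant_similarity(["Katifa"], ["Al-Qatif", "Qaṭīf"]))
    print(neighbour_overlap({"Bahrain", "Hajar"}, {"Bahrain"}))
```

Taking the best score over all variant pairs reflects the role of augmentation in the text: adding transliterated variants to either side can only raise the chance that some pair matches closely.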
The tool is publicly available for humanities scholars to use in matching their own data with that of other scholars. We can extend the tool to handle other language families and hope to pursue such extensions in the near future. Figure 1 shows an example of the tool’s interface, which lets the user see a place (Dendara, in modern-day Egypt) on the map together with a navigable related-place graph of all the places found to be related to it, either by our matching tool or through external references.

Figure 1: an example from the MEHDIE tool: Dendara in the Kima dataset matched to places from other datasets.

By using the MEHDIE tool and hopping between the map and the linked place identities, historians and other humanities researchers can enrich their knowledge with related information. For example (Figure 2), a historian who studies the history of the coast of Arabia in the Persian Gulf can now enrich the military and cultural information she obtains from reading Yaqut Al-Hamawi (a Muslim geographer) about the Qatif oasis with new information about the pearl industry there, provided by Benjamin of Tudela (a medieval Jewish traveler). A scholar of Jewish history, on the other hand, may follow the graph from Qatif to its geographic parent and learn from Yaqut about the history of Jews in the Caliphate country of Bahrain.

Figure 2: an example from the MEHDIE tool: ‘Katifa’ in Benjamin of Tudela on the left, matched with ‘Al-Qatif’ from Yaqut Al-Hamawi, on the right, with a graph visualizing their match and the relation of Qatif to Bahrain.

Keywords: Data Integration, Multi-lingual, Toponyms.

References
Gibson, Nathan P., David A. Michelson, and Daniel L. Schwartz. "From manuscript catalogues to a handbook of Syriac literature: Modeling an infrastructure for Syriaca.org." Journal of Data Mining & Digital Humanities (2017).
Grossner, Karl, and Ruth Mostern. "Linked places in world historical gazetteer." Proceedings of the 5th ACM SIGSPATIAL International Workshop on Geospatial Humanities. 2021.
Rusinek, Sinai, Tomer Sagi, Moran Zaga, Efraim Lev, and Moshe Lavee. "MEHDIE: The Middle East Data Integration Endeavour." Digital Humanities 2023: Book of Abstracts, edited by Anne Baillot, Toma Tasovac, Walter Scholger, and Georg Vogeler, Zenodo, 2023, pp. 551-552. https://doi.org/10.5281/zenodo.7961821

Assistant professor at the Department of Computer Science, Aalborg University. Researches AI-assisted data integration in collaboration with domain experts in a variety of domains: oceanography, medicine, art, history, and geography. Research focus is on creating practical tools for researchers to work with large-scale connected data using a variety of technologies.

Sinai Rusinek is a consultant, project lead, and entrepreneur of various Digital Humanities projects, currently mainly active in Elijah Lab, Haifa University, and OMILab, the Open University. Recent projects that she has initiated, led, or participated in include an OCR improvement pipeline towards Computational Analysis of Historical Hebrew Newspapers, with a recent paper on leveraging vector similarity for place name retrieval from the corpus; training a Named Entity Recognition model for Yiddish at the New Languages for NLP Princeton institute; and data modeling and handwritten Yiddish text recognition at the ERC-funded DYBBUK project. She also took part in three successful applications for DH projects that are currently active: an ISF-granted project on Hasidic stories; two MOST-supported projects, MEHDIE: Middle Eastern Historical Data Integration Endeavor and a project on computational stylistics of a Modern Hebrew literary corpus; and DSRC-granted research on an AI-based pipeline for processing archival material.