Frank Sauerburger
Frank became a self-employed software developer and consultant while studying physics in Freiburg. During his master's studies, he specialized in data analysis for particle physics at CERN and obtained a doctoral degree in 2022 working with the ATLAS collaboration. Since 2023, he has been the AI Technical Leader and AI Engineer Lead at MDPI, one of the largest open-access publishers.
MDPI
AI Technical Leader and AI Engineer Lead
Session
Mapping freeform research affiliations to persistent identifiers such as ROR (Research Organization Registry) is harder than it looks. Institution names appear in many forms, including abbreviations, alternate spellings, local-language variants, and legacy names, which makes reliable mapping difficult to achieve at scale.
In this talk, we present a semantic retrieval pipeline that reframes institution identification as a search problem rather than a string-matching task. Our system combines named entity recognition to extract institution entities, dense embeddings to represent their semantic meaning, and vector search to retrieve the most likely ROR matches. This approach allows us to handle noisy, incomplete, and multilingual inputs while remaining resilient to variation in how institutions are referenced.
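As a rough illustration of the extract, embed, and retrieve flow described above, the following minimal sketch uses only the Python standard library. It is not the pipeline presented in the talk: a comma split stands in for named entity recognition, normalised character-trigram vectors stand in for learned dense embeddings, and a brute-force scan stands in for a vector index. The names `REGISTRY`, `extract_candidates`, `embed`, and `match` are hypothetical, and the registry IDs are placeholders rather than real ROR identifiers.

```python
import math
from collections import Counter

# Hypothetical registry: placeholder IDs (NOT real ROR identifiers)
# mapped to a few known name variants per institution.
REGISTRY = {
    "placeholder-001": [
        "University of Freiburg",
        "Albert-Ludwigs-Universität Freiburg",
        "Uni Freiburg",
    ],
    "placeholder-002": ["CERN", "European Organization for Nuclear Research"],
    "placeholder-003": ["University of Basel", "Universität Basel"],
}


def embed(text, n=3):
    """Stand-in for a dense embedding: L2-normalised character n-gram counts."""
    padded = f"  {text.lower()}  "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {gram: c / norm for gram, c in counts.items()}


def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b.get(gram, 0.0) for gram, weight in a.items())


# Brute-force "vector index": one embedded entry per known name variant.
INDEX = [
    (registry_id, name, embed(name))
    for registry_id, names in REGISTRY.items()
    for name in names
]


def extract_candidates(affiliation):
    """Stand-in for NER: treat each comma-separated segment as a candidate."""
    return [part.strip() for part in affiliation.split(",") if part.strip()]


def match(affiliation, top_k=3):
    """Return up to top_k (registry_id, matched_name, score), best first."""
    best = {}
    for candidate in extract_candidates(affiliation):
        query = embed(candidate)
        for registry_id, name, vector in INDEX:
            score = cosine(query, vector)
            # Keep the best-scoring name variant per registry entry.
            if score > best.get(registry_id, (0.0, ""))[0]:
                best[registry_id] = (score, name)
    ranked = sorted(best.items(), key=lambda item: item[1][0], reverse=True)
    return [(rid, name, round(score, 3)) for rid, (score, name) in ranked[:top_k]]


if __name__ == "__main__":
    for hit in match("Institute of Physics, Universitat Freiburg, Germany"):
        print(hit)
```

Here the misspelled "Universitat Freiburg" segment should rank the Freiburg placeholder entry first, even though no exact string match exists. In a production setting, each stand-in would presumably be replaced by a trained component, and the similarity scores would need a threshold to reject affiliations that have no registry entry.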
By treating institution matching as semantic retrieval, we improve recall and robustness without relying on hand-written heuristics or a continuously expanding set of rules. The system scales naturally as new institutions are added and as naming conventions evolve, making it well suited to a dynamic research environment.
We will share implementation details, evaluation results, and practical lessons learned from deploying this pipeline in a real-world production setting.