EuroSciPy 2026

Finding the Right ROR: Semantic Search for Research Institutions
2026-07-20 , Room 1.19 (Ground Floor, Shannon)

Mapping freeform research affiliations to persistent identifiers such as ROR (Research Organization Registry) is harder than it looks. Institution names appear in many forms, including abbreviations, alternate spellings, local-language names, and legacy names, which makes reliable mapping difficult to achieve at scale.

In this talk, we present a semantic retrieval pipeline that reframes institution identification as a search problem rather than a string-matching task. Our system combines named entity recognition to extract institution entities, dense embeddings to represent their semantic meaning, and vector search to retrieve the most likely ROR matches. This approach allows us to handle noisy, incomplete, and multilingual inputs while remaining resilient to variation in how institutions are referenced.

By treating institution matching as semantic retrieval, we improve recall and robustness without relying on heuristics or on a continuously expanding set of rules. The system scales naturally as new institutions are added and as naming conventions evolve, making it well suited to a research landscape that keeps changing.

We will share implementation details, evaluation results, and practical lessons learned from deploying this pipeline in a real-world production setting.


Research affiliation strings are messy in the real world. The same institution might appear as an acronym, a translated name, an outdated label, or a partially written reference. If you’ve ever tried to map these freeform inputs to persistent identifiers like ROR, you know that simple string matching quickly falls apart.

In this talk, we’ll look at institution matching from a different angle: treating it as a semantic retrieval problem instead of a normalization problem.

I’ll walk through a practical pipeline that uses named entity recognition to extract institutions, embeddings to represent them semantically, and vector search to retrieve the best ROR candidate. The goal isn’t just better accuracy but a system that stays maintainable as new institutions appear and naming conventions evolve.
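
To make the retrieval idea concrete, here is a minimal sketch of the embed-and-search step. It is not the pipeline presented in the talk. The registry entries use placeholder identifiers rather than real ROR IDs, `embed` is a toy stand-in that hashes character trigrams instead of calling a trained embedding model, and a brute-force cosine scan replaces a real vector index. The NER step is skipped, so the query is assumed to be an already extracted institution string.

```python
import hashlib
import math
from collections import Counter

DIM = 256  # size of the toy embedding space (an arbitrary choice)


def embed(text: str) -> list[float]:
    """Toy stand-in for a dense embedding model: hashed character
    trigrams, L2-normalised so a dot product equals cosine similarity."""
    padded = f"  {text.lower()}  "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    vec = [0.0] * DIM
    for gram, n in counts.items():
        bucket = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += n
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


# Hypothetical registry: the identifiers are placeholders, not real ROR IDs.
REGISTRY = {
    "placeholder-id-1": [
        "University of Freiburg",
        "Albert-Ludwigs-Universität Freiburg",
    ],
    "placeholder-id-2": [
        "CERN",
        "European Organization for Nuclear Research",
    ],
}

# One vector per known name or alias, each pointing back to its identifier.
INDEX = [
    (record_id, name, embed(name))
    for record_id, names in REGISTRY.items()
    for name in names
]


def search(query: str, k: int = 3) -> list[tuple[float, str, str]]:
    """Return the k most similar registry names as (score, id, name)."""
    q = embed(query)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), record_id, name)
        for record_id, name, vec in INDEX
    ]
    scored.sort(reverse=True)
    return scored[:k]


if __name__ == "__main__":
    for score, record_id, name in search("Univ. of Freiburg"):
        print(f"{score:.3f}  {record_id}  {name}")
```

Trigram hashing only captures surface similarity, so this toy version would not link a translated name or an acronym to its expansion. A trained multilingual embedding model is what would supply that semantic matching, and the search step would stay the same.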

This session focuses on real implementation experience, not just models. We’ll cover architecture decisions, evaluation strategies, common failure cases, and trade-offs between rule-based and embedding-based approaches. You’ll see what worked, what didn’t, and what we learned from running this in production.

If you’re working on entity resolution, search systems, metadata pipelines, or NLP in production, this talk will give you practical ideas you can reuse.


Expected audience expertise: Domain: some
Expected audience expertise: Python: some
Your relationship with the presented work/project: Original author or co-author, Active contributor, Maintainer of the presented library/project

Senior AI Engineer with 7+ years of experience architecting and deploying end-to-end ML solutions at scale. Specialized in NLP, Generative AI (LLM, RAG), Vector Search, and MLOps.

Frank became a self-employed software developer and consultant while studying Physics in Freiburg. During his Master's studies, he specialized in data analysis for particle physics at CERN, and he obtained a doctoral degree in 2022 working with the ATLAS collaboration. Since 2023, he has been the AI Technical Leader and AI Engineer Lead at MDPI, one of the largest open-access publishers.