Finding the Right ROR: Semantic Search for Research Institutions EuroSciPy 2026

2026-07-20 15:20–15:50, Room 1.19 (Ground Floor, Shannon)

Mapping freeform research affiliations to persistent identifiers such as ROR (Research Organization Registry) is harder than it looks. Institution names appear in many forms, including abbreviations, alternate spellings, local-language variants, and legacy names, which makes reliable mapping difficult to achieve at scale.

In this talk, we present a semantic retrieval pipeline that reframes institution identification as a search problem rather than a string-matching task. Our system combines named entity recognition to extract institution entities, dense embeddings to represent their semantic meaning, and vector search to retrieve the most likely ROR matches. This approach allows us to handle noisy, incomplete, and multilingual inputs while remaining resilient to variation in how institutions are referenced.

By treating institution matching as semantic retrieval, we improve recall and robustness without relying on hand-written heuristics or on a continuously expanding set of rules. The system scales naturally as new institutions are added and as naming conventions evolve, making it well suited to a research landscape that keeps changing.

We will share implementation details, evaluation results, and practical lessons learned from deploying this pipeline in a real-world production setting.

Research affiliation strings are messy in the real world. The same institution might appear as an acronym, a translated name, an outdated label, or a partially written reference. If you’ve ever tried to map these freeform inputs to persistent identifiers like ROR, you know that simple string matching quickly falls apart.

In this talk, we’ll look at institution matching from a different angle: treating it as a semantic retrieval problem instead of a normalization problem.

I’ll walk through a practical pipeline that uses named entity recognition to extract institutions, embeddings to represent them semantically, and vector search to retrieve the best ROR candidate. The goal isn’t just better accuracy; it’s a system that stays maintainable as new institutions appear and naming conventions evolve.
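To make the three stages concrete, here is a minimal sketch in plain Python. It is not the production pipeline from the talk, and every piece of it is an assumption made for illustration: the keyword-based `extract_institution` stands in for a real NER model, the character-trigram `embed` stands in for a dense embedding model, the brute-force cosine ranking stands in for a vector index, and the registry entries use placeholder IDs rather than real ROR identifiers.

```python
import math
import re
from collections import Counter

# Hypothetical mini-registry. The IDs are placeholders, NOT real ROR identifiers.
REGISTRY = {
    "ror-placeholder-1": ["University of Porto", "Universidade do Porto", "U.Porto"],
    "ror-placeholder-2": ["University of Lisbon", "Universidade de Lisboa"],
    "ror-placeholder-3": ["Technical University of Munich", "Technische Universität München"],
}


def extract_institution(affiliation: str) -> str:
    """Toy stand-in for NER: pick the comma-separated segment that looks like an institution."""
    segments = [s.strip() for s in affiliation.split(",")]
    keywords = ("univ", "institut", "college", "school")
    for segment in segments:
        if any(k in segment.lower() for k in keywords):
            return segment
    return segments[0]  # fall back to the first segment


def embed(text: str, n: int = 3) -> Counter:
    """Toy stand-in for a dense embedding: a bag of character n-grams."""
    cleaned = re.sub(r"[^\w]+", " ", text.lower()).strip()
    padded = f" {cleaned} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# "Index" every known name of every organization once, up front.
INDEX = [(ror_id, embed(name)) for ror_id, names in REGISTRY.items() for name in names]


def search(affiliation: str, k: int = 3) -> list[tuple[str, float]]:
    """Return up to k (id, score) candidates, best first, one entry per organization."""
    query = embed(extract_institution(affiliation))
    scored = sorted(((cosine(query, vec), ror_id) for ror_id, vec in INDEX), reverse=True)
    results, seen = [], set()
    for score, ror_id in scored:
        if ror_id not in seen:  # keep only the best-scoring name per organization
            seen.add(ror_id)
            results.append((ror_id, score))
        if len(results) == k:
            break
    return results


if __name__ == "__main__":
    for ror_id, score in search("Dept. of Physics, Universidade do Porto, Portugal"):
        print(f"{ror_id}  {score:.3f}")
```

Character trigrams only capture surface similarity, so this toy version leans on the alternate names listed in the registry to handle translations. Swapping `embed` for a multilingual sentence-embedding model and the sorted scan for an approximate nearest-neighbour index would keep the same overall shape.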

This session focuses on real implementation experience, not just models. We’ll cover architecture decisions, evaluation strategies, common failure cases, and trade-offs between rule-based and embedding-based approaches. You’ll see what worked, what didn’t, and what we learned from running this in production.
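One common way to evaluate a retrieval-style matcher is recall@k: the fraction of affiliation strings whose correct identifier appears among the top k candidates. The sketch below assumes that metric purely for illustration; the talk description does not say which metrics are actually used, and the queries and IDs are made-up placeholders.

```python
def recall_at_k(ranked: dict[str, list[str]], gold: dict[str, str], k: int) -> float:
    """Fraction of queries whose gold ID appears in the top-k ranked candidates."""
    if not ranked:
        raise ValueError("no queries to evaluate")
    hits = sum(1 for query, candidates in ranked.items() if gold[query] in candidates[:k])
    return hits / len(ranked)


# Hypothetical ranked outputs of a matcher, best candidate first (placeholder IDs).
ranked = {
    "affiliation A": ["id-1", "id-2", "id-3"],
    "affiliation B": ["id-2", "id-4", "id-5"],
    "affiliation C": ["id-6", "id-7", "id-3"],
    "affiliation D": ["id-8", "id-9", "id-1"],
}

# Hypothetical hand-labelled ground truth for the same queries.
gold = {
    "affiliation A": "id-1",   # correct at rank 1
    "affiliation B": "id-2",   # correct at rank 1
    "affiliation C": "id-3",   # correct only at rank 3
    "affiliation D": "id-10",  # never retrieved
}

if __name__ == "__main__":
    for k in (1, 3):
        print(f"recall@{k} = {recall_at_k(ranked, gold, k):.2f}")
```

Comparing recall@1 with recall@k for a larger k helps separate two kinds of failure: the right organization being retrieved but ranked too low, versus not being retrieved at all.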

If you’re working on entity resolution, search systems, metadata pipelines, or NLP in production, this talk will give you practical ideas you can reuse.

Expected audience expertise: Domain: some
Expected audience expertise: Python: some
Your relationship with the presented work/project: Original author or co-author, Active contributor, Maintainer of the presented library/project

Diogo Rodrigues

Senior AI Engineer with 7+ years of experience architecting and deploying end-to-end ML solutions at scale. Specialized in NLP, Generative AI (LLM, RAG), Vector Search, and MLOps.

This speaker also appears in:

Boring AI Works: When BERT Beats Billion-Parameter Models
