Building a Scientific Taxonomy at Scale with Graph Clustering, Embeddings, and LLMs EuroSciPy 2026

2026-07-20 12:10–12:30, Room 1.19 (Ground Floor, Shannon)

Scientific organizations struggle to extract actionable insights from publication data tagged with inconsistent, noisy keywords. Turning hundreds of thousands of such keywords into a semantically consistent taxonomy of more than 110,000 concepts, and attaching each concept to a hierarchy at scale, requires more than ad-hoc normalization: it demands careful system design.

This talk presents a production-grade pipeline that extends OpenAlex's 4-level framework (Domain → Field → Subfield → Topic) with a granular Concept layer, resulting in a 5-level scientific taxonomy. The system combines SPECTER2 embeddings to model semantic similarity, Leiden graph clustering to group 100K+ concepts, and Qdrant for efficient vector-based hierarchical attachment.

A central contribution is a strategic, multi-stage integration of LLMs. Rather than using LLMs end-to-end, we deploy them at 5 targeted points where semantic judgment matters most: concept granularity filtering, classification into 26 OpenAlex fields, cluster renaming, explanation generation, and validation of topic assignments using multi-embedding comparisons. Deterministic methods ensure scalability and reproducibility, while LLMs provide semantic precision where embeddings alone fall short.

The resulting taxonomy is used in production to automatically tag millions of publications, enabling real-time trend detection and portfolio-level analytics that support strategic decision-making.

The Problem

Large-scale publication databases rely on author-provided keywords that are noisy, inconsistent, and semantically ambiguous. Variants such as "machine learning," "ML," and "machine-learning" refer to the same concept, while other terms are overloaded or context-dependent. Manual curation does not scale, and simple string matching or rule-based normalization fails to capture semantic structure. To support reliable trend analysis, search, and analytics, organizations need a consistent, hierarchical scientific taxonomy built at scale.
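
To make the limits of rule-based normalization concrete, here is a minimal sketch. It is not the pipeline's actual normalization step; the function names `normalize_keyword` and `group_variants` are hypothetical. It merges spelling and hyphenation variants, but the abbreviation "ML" still ends up in its own group, which is the kind of gap that semantic methods are meant to close.

```python
import re
from collections import defaultdict


def normalize_keyword(raw: str) -> str:
    """Lowercase, turn hyphens/underscores into spaces, collapse whitespace."""
    text = raw.strip().lower()
    text = re.sub(r"[-_]+", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text


def group_variants(keywords):
    """Group raw keywords by their normalized surface form."""
    groups = defaultdict(list)
    for keyword in keywords:
        groups[normalize_keyword(keyword)].append(keyword)
    return dict(groups)


# Illustrative input only: three surface variants plus an abbreviation.
groups = group_variants(
    ["machine learning", "Machine-Learning", "machine  learning", "ML"]
)
for canonical, variants in groups.items():
    print(canonical, "<-", variants)
```

String rules collapse the first three variants into one group, while "ML" stays separate because nothing in its surface form links it to "machine learning".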

This talk presents a production system that transforms hundreds of thousands of raw keywords into a structured, semantically grounded taxonomy with over 110,000 concepts.

Building a 5-Level Scientific Taxonomy

We extend OpenAlex's existing 4-level hierarchy (Domain → Field → Subfield → Topic) with a fifth Concept layer, creating a complete 5-level taxonomy suitable for fine-grained analysis. Raw keywords are normalized into candidate concepts and embedded using SPECTER2 to capture domain-specific semantic relationships.

To group related concepts, we construct a similarity graph over embeddings and apply the Leiden community detection algorithm using igraph, scaling to over 100K nodes while maintaining strong modularity and interpretability. The resulting clusters form the backbone of the concept layer.
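
The graph-construction and clustering step can be sketched as follows. This is an illustration under stated assumptions, not the production code: the k-nearest-neighbour rule, the `min_sim` threshold, and the helper names `knn_edges` and `leiden_membership` are hypothetical, and the toy 2-D vectors stand in for real SPECTER2 embeddings. The brute-force neighbour search is quadratic and only suitable for small examples; at 100K+ nodes an approximate nearest-neighbour index would be needed. The Leiden call uses python-igraph's `Graph.community_leiden` and is imported lazily so the graph-building part runs without igraph installed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_edges(vectors, k=2, min_sim=0.5):
    """Return {(i, j): similarity} with i < j, linking each node to its
    k most similar neighbours whose similarity is at least min_sim."""
    edges = {}
    for i, u in enumerate(vectors):
        sims = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        sims.sort(reverse=True)
        for sim, j in sims[:k]:
            if sim >= min_sim:
                edges[(min(i, j), max(i, j))] = sim
    return edges


def leiden_membership(n_nodes, edges):
    """Run Leiden on the weighted graph; returns one cluster id per node."""
    import igraph as ig  # requires the python-igraph package

    graph = ig.Graph(n=n_nodes, edges=list(edges.keys()))
    graph.es["weight"] = list(edges.values())
    clustering = graph.community_leiden(
        objective_function="modularity", weights="weight"
    )
    return clustering.membership


# Toy embeddings: two obvious groups of two points each.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
edges = knn_edges(vectors, k=1)
print(sorted(edges))
```

Passing `edges` and `len(vectors)` to `leiden_membership` would then yield the cluster assignment for each concept.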

Strategic Multi-Stage LLM Integration

A key design challenge was deciding when to rely on deterministic methods and when LLMs add unique value. Rather than using LLMs end-to-end, we integrate them at five targeted stages where semantic judgment is critical:

Granularity Filtering – Filtering candidate concepts to the appropriate Concept-level granularity, excluding terms that are too broad or too specific.
Field Classification – Assigning concepts to one of 26 OpenAlex fields in cases where embedding similarity alone is ambiguous.
Semantic Cluster Renaming – Generating interpretable, human-readable labels for concept clusters.
Explanation Generation – Producing concise semantic descriptions for each concept to support downstream validation and analytics.
Topic Assignment Validation – Validating hierarchical attachment to Level-4 topics using a combination of multi-embedding similarity and LLM-based classification, with support for multi-label assignments.
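
One way to decide when embedding similarity is "ambiguous" enough to call an LLM is a margin rule between the two best-scoring candidates. The sketch below assumes that rule; the function name `route_field_assignment`, the `min_margin` value, and the similarity scores are all hypothetical and not taken from the talk.

```python
def route_field_assignment(similarities, min_margin=0.05, n_candidates=3):
    """Decide whether embedding similarity alone settles the field.

    similarities: dict mapping field name -> similarity score.
    Returns a dict with the chosen route and the relevant field(s).
    """
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    best_field, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else float("-inf")

    if best_score - runner_up >= min_margin:
        # Clear winner: keep the deterministic, cheap decision.
        return {"route": "embedding", "field": best_field}

    # Too close to call: hand a short candidate list to the LLM stage.
    candidates = [field for field, _ in ranked[:n_candidates]]
    return {"route": "llm", "candidates": candidates}


# Made-up similarity scores for two concepts.
clear = route_field_assignment(
    {"Computer Science": 0.82, "Mathematics": 0.61, "Engineering": 0.58}
)
ambiguous = route_field_assignment(
    {"Materials Science": 0.71, "Chemistry": 0.69, "Physics and Astronomy": 0.40}
)
print(clear)
print(ambiguous)
```

Under this kind of gate, LLM cost scales with the number of ambiguous concepts rather than with the size of the whole taxonomy.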

This hybrid approach preserves scalability and reproducibility while leveraging LLMs for nuanced semantic decisions that deterministic methods struggle with.

Technical Infrastructure

SPECTER2 embeddings provide domain-aware semantic representations trained on scientific citation networks.
Leiden clustering (igraph) enables scalable community detection over large similarity graphs.
Qdrant supports efficient vector search for hierarchical attachment and large-scale similarity queries.
Azure OpenAI is used for structured LLM inference with prompt patterns designed for consistency and cost control.
A validation framework combines human review, automated consistency checks, and AI-assisted quality control before concepts are finalized.
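
Hierarchical attachment by vector search can be illustrated with a small brute-force stand-in. This sketch does not use the Qdrant client; it only mimics the idea of a top-k similarity query with a score threshold, which allows a concept to attach to more than one topic. The names `attach_to_topics`, `top_k`, and `min_score`, along with the toy vectors and topic labels, are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def attach_to_topics(concept_vec, topic_vecs, top_k=3, min_score=0.5):
    """Return up to top_k (topic, score) pairs scoring at least min_score,
    best first. More than one result means a multi-label attachment."""
    scored = [(cosine(concept_vec, vec), topic) for topic, vec in topic_vecs.items()]
    scored.sort(reverse=True)
    return [(topic, score) for score, topic in scored[:top_k] if score >= min_score]


# Toy Level-4 topic vectors and one concept vector (illustrative values).
topic_vecs = {
    "topic_a": [1.0, 0.0],
    "topic_b": [0.0, 1.0],
    "topic_c": [-1.0, 0.0],
}
attachments = attach_to_topics([0.8, 0.6], topic_vecs)
print(attachments)
```

In a real deployment the linear scan would be replaced by a query against the vector database, and borderline attachments could be passed to the LLM validation stage.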

Production Impact

The resulting taxonomy is deployed in production to automatically tag millions of publications. It powers real-time academic trend detection, cross-journal portfolio analytics, and strategic decision support for research planning and resource allocation.

Key Takeaways

Attendees will learn:

How to decide where LLMs add value in large-scale NLP pipelines
How to scale graph clustering to 100K+ nodes in Python
Practical trade-offs between embeddings, graph methods, and LLMs
How to design hybrid embedding–LLM architectures that balance cost, accuracy, and scalability
Validation strategies for correctness-sensitive semantic systems in production

Expected audience expertise: Domain: none
Expected audience expertise: Python: some
Your relationship with the presented work/project: Original author or co-author

Daniele Raimondi

Daniele is a data scientist with expertise in statistics, data science, and AI, passionate about exploring the intersection of machine learning and financial markets. Since 2023, he has been working at MDPI, one of the largest open-access publishers. He is also a former national 400m sprinter.

This speaker also appears in:

Automating Scientific Paper Classification at Scale with Retrieval–Reranking and LLMs

Feichi Lu

Feichi Lu is a Data Scientist at MDPI in Basel, where she works on building data-driven analytics for scientific publishing. She holds a Master’s degree in Data Science from ETH Zürich. Her experience spans large-scale data analysis, semantic modeling, and applied AI.