Building a Scientific Taxonomy at Scale with Graph Clustering, Embeddings, and LLMs PyData London 2026

Building a Scientific Taxonomy at Scale with Graph Clustering, Embeddings, and LLMs
2026-06-06 16:15–17:00, Grand Hall 1

Scientific publishers tag millions of articles with author-provided keywords, but these keywords are noisy, inconsistent, and semantically ambiguous. "Machine learning," "ML," and "machine-learning" all mean the same thing, while other terms shift meaning across disciplines.

This talk presents a production pipeline that extends OpenAlex's 4-level hierarchy with a fifth in-house Concept layer, producing a 115K-concept scientific taxonomy.

SPECTER2 embeddings model semantic similarity, and per-field Leiden clustering with CPM resolution groups 100K+ concepts via mutual kNN graphs — with hyperparameters selected through grid search and custom pair-based evaluation. Qdrant enables vector-based hierarchical attachment.

LLMs are deployed at five targeted stages — granularity filtering, field classification, cluster renaming, explanation generation, and topic-assignment validation — while deterministic methods handle everything else, ensuring scalability and reproducibility.

The resulting taxonomy powers a paper-tagging pipeline where SPECTER2 retrieves ~150 candidates per paper across multiple text-splitting strategies, deterministic filters prune by field/subfield distribution and near-synonym merging, and an LLM reranker selects the final 5–8 concepts. These assignments enable downstream applications such as temporal trend detection over emerging research topics.

Attendees will learn when to integrate LLMs in large-scale NLP pipelines, how to scale graph clustering to 100K+ nodes, and how to design hybrid embedding–LLM systems that turn noisy metadata into reliable scientific intelligence.

The problem

If you've ever tried to make sense of author-provided keywords across millions of papers, you know the pain. "Machine learning", "ML", "machine-learning": same thing, three entries. Other terms look identical but mean completely different things depending on the field. Manual cleanup? Doesn't scale. Regex and string matching? Misses the semantics entirely.

What we built

We took OpenAlex's 4-level hierarchy (Domain → Field → Subfield → Topic) and added a fifth in-house Concept layer: 115K+ fine-grained concepts, each with a clear position in the tree.

The core idea: embed all candidate concepts with SPECTER2, build a mutual kNN similarity graph per field, and cluster it with Leiden (CPM resolution) at 100K+ node scale. We tuned hyperparameters via grid search, scoring each configuration against hand-curated concept pairs: "Cryptocurrency" and "Crypto Currency" must land together, for example, while "Decision Trees" and "Random Forest" must stay apart.
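To make the pair-based grid search concrete, here is a minimal sketch in plain Python. It is not the production pipeline: the 2-D vectors are made-up stand-ins for SPECTER2 embeddings, connected components stand in for Leiden/CPM, and the `min_sim` edge threshold is a hypothetical knob playing the role of the resolution parameter. All function names are illustrative.

```python
import math
from itertools import combinations, product

# Toy 2-D vectors standing in for SPECTER2 embeddings (values are made up).
VECTORS = {
    "Cryptocurrency": (1.0, 0.0),
    "Crypto Currency": (0.99, 0.05),
    "Decision Trees": (0.0, 1.0),
    "Random Forest": (0.3, 0.95),
}
MUST_LINK = [("Cryptocurrency", "Crypto Currency")]
CANNOT_LINK = [("Decision Trees", "Random Forest")]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def mutual_knn_edges(vectors, k, min_sim):
    """Keep edge (i, j) only if each node is in the other's top-k and sim >= min_sim."""
    names = list(vectors)
    top_k = {}
    for i in names:
        ranked = sorted((n for n in names if n != i),
                        key=lambda n: cosine(vectors[i], vectors[n]), reverse=True)
        top_k[i] = set(ranked[:k])
    return [(i, j) for i, j in combinations(names, 2)
            if j in top_k[i] and i in top_k[j]
            and cosine(vectors[i], vectors[j]) >= min_sim]


def component_labels(names, edges):
    """Stand-in for Leiden: label every node with its connected component."""
    adjacency = {n: set() for n in names}
    for i, j in edges:
        adjacency[i].add(j)
        adjacency[j].add(i)
    labels = {}
    for start in names:
        if start in labels:
            continue
        labels[start] = start
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour in adjacency[node]:
                if neighbour not in labels:
                    labels[neighbour] = start
                    stack.append(neighbour)
    return labels


def pair_score(labels, must_link, cannot_link):
    """Fraction of curated pairs whose constraint the clustering satisfies."""
    satisfied = sum(labels[a] == labels[b] for a, b in must_link)
    satisfied += sum(labels[a] != labels[b] for a, b in cannot_link)
    return satisfied / (len(must_link) + len(cannot_link))


def grid_search(vectors, ks, min_sims):
    """Return (score, k, min_sim) for the best-scoring setting in the grid."""
    results = []
    for k, min_sim in product(ks, min_sims):
        edges = mutual_knn_edges(vectors, k, min_sim)
        labels = component_labels(list(vectors), edges)
        results.append((pair_score(labels, MUST_LINK, CANNOT_LINK), k, min_sim))
    return max(results, key=lambda r: r[0])


best = grid_search(VECTORS, ks=[1, 2], min_sims=[0.9, 0.97])
print(best)
```

In this toy setup the looser threshold merges "Decision Trees" with "Random Forest" and violates the cannot-link pair, so the grid search prefers the stricter setting. At real scale the brute-force neighbour search would be replaced by an ANN index and the component step by Leiden with CPM.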

LLMs come in at five specific points where embeddings alone aren't enough: filtering concept granularity, classifying into fields, renaming clusters, generating explanations, and validating topic assignments. Everything else is deterministic: no LLM in the loop means reproducible and cheap.

Paper tagging

Once the taxonomy exists, we use it to tag papers. With SPECTER2 embeddings, we retrieve an initial pool of ~150 candidate concepts per paper, using eight different text-splitting strategies over title, abstract, and keywords. Deterministic filters prune by field/subfield distribution and merge near-synonyms with Jaccard + union-find. Then an LLM reranker selects and ranks the final 5–8 concepts, applying domain verification and keyword mapping.
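The near-synonym merge with Jaccard + union-find can be sketched as follows. The abstract does not say which sets the Jaccard similarity is computed over, so this sketch assumes character trigrams of the normalized concept name; the 0.6 threshold and all names are hypothetical.

```python
from itertools import combinations


def trigrams(text):
    """Character trigrams of the lowercased text with non-alphanumerics removed."""
    s = "".join(ch for ch in text.lower() if ch.isalnum())
    return {s[i:i + 3] for i in range(len(s) - 2)} or {s}


def jaccard(a, b):
    return len(a & b) / len(a | b)


class UnionFind:
    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def merge_near_synonyms(candidates, threshold=0.6):
    """Group candidates whose trigram Jaccard similarity reaches the threshold."""
    grams = {c: trigrams(c) for c in candidates}
    uf = UnionFind(candidates)
    for a, b in combinations(candidates, 2):
        if jaccard(grams[a], grams[b]) >= threshold:
            uf.union(a, b)
    groups = {}
    for c in candidates:
        groups.setdefault(uf.find(c), []).append(c)
    return list(groups.values())


groups = merge_near_synonyms(
    ["machine learning", "machine-learning", "Machine Learning", "random forest"]
)
print(groups)
```

Union-find makes the merge transitive: if A matches B and B matches C, all three end up in one group even when A and C fall below the threshold on their own.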

What comes next

With millions of papers tagged consistently, the obvious next step is trend detection: tracking how concept frequency and co-occurrence shift over time to spot emerging research areas. We'll sketch out the approach.
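As a rough illustration of what tracking frequency and co-occurrence over time could look like, the sketch below computes each concept's yearly share of tagged papers and yearly pair counts. The paper records are invented, and the functions are hypothetical helpers rather than the approach the talk will present.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical (year, assigned concepts) records for tagged papers.
PAPERS = [
    (2022, {"graph clustering", "embeddings"}),
    (2022, {"embeddings"}),
    (2023, {"graph clustering", "embeddings"}),
    (2023, {"large language models", "embeddings"}),
    (2024, {"large language models", "embeddings"}),
    (2024, {"large language models", "graph clustering"}),
]


def yearly_share(papers):
    """Per year, the fraction of papers tagged with each concept."""
    totals = Counter(year for year, _ in papers)
    counts = defaultdict(Counter)
    for year, concepts in papers:
        counts[year].update(concepts)
    return {year: {c: n / totals[year] for c, n in counts[year].items()}
            for year in totals}


def share_change(shares, concept, start, end):
    """Difference in a concept's share between two years (missing counts as 0)."""
    return shares[end].get(concept, 0.0) - shares[start].get(concept, 0.0)


def yearly_cooccurrence(papers):
    """Per year, how often each pair of concepts appears on the same paper."""
    pairs = defaultdict(Counter)
    for year, concepts in papers:
        pairs[year].update(combinations(sorted(concepts), 2))
    return pairs


shares = yearly_share(PAPERS)
cooc = yearly_cooccurrence(PAPERS)
print(share_change(shares, "large language models", 2022, 2024))
```

Using shares rather than raw counts keeps a concept from looking like it is growing just because the total number of papers grew.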

Tech stack

SPECTER2 (embeddings) · igraph + leidenalg (Leiden/CPM clustering) · hnswlib (ANN for kNN graphs) · Qdrant (vector search for hierarchical attachment) · Azure OpenAI (structured LLM inference) · human + automated validation framework

You'll walk away knowing

When LLMs actually help in large-scale NLP pipelines and when they're overkill
How to scale graph clustering to 100K+ nodes in Python
How to evaluate clustering with custom pair-based constraints
Practical trade-offs between embeddings, graph methods, and LLMs

Daniele Raimondi

Daniele is a Data Scientist with expertise in statistics, data science and AI, passionate about exploring the intersection of AI and financial markets.
He has worked at MDPI, one of the largest open-access publishers, since 2023.
He is a former national 400m sprinter.

Feichi Lu

Feichi Lu is a Data Scientist at MDPI in Basel, where she works on building data-driven analytics for scientific publishing. She holds a Master’s degree in Data Science from ETH Zürich. Her experience spans large-scale data analysis, semantic modeling, and applied AI.
