Feichi Lu
Feichi Lu is a Data Scientist at MDPI in Basel, where she works on building data-driven analytics for scientific publishing. She holds a Master’s degree in Data Science from ETH Zürich. Her experience spans large-scale data analysis, semantic modeling, and applied AI.
Mrs.
MDPI
Data Scientist
Sessions
Scientific organizations struggle to extract actionable insights from publication data tagged with inconsistent and noisy keywords. Turning hundreds of thousands of such keywords into a semantically consistent taxonomy of more than 110,000 concepts, and attaching those concepts to a hierarchy at scale, requires more than ad-hoc normalization: it demands careful system design.
This talk presents a production-grade pipeline that extends OpenAlex's 4-level framework (Domain → Field → Subfield → Topic) with a granular Concept layer, resulting in a 5-level scientific taxonomy. The system combines SPECTER2 embeddings to model semantic similarity, Leiden graph clustering to group 100K+ concepts, and Qdrant for efficient vector-based hierarchical attachment.
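The hierarchical attachment step can be illustrated with a minimal sketch. It assumes each concept is attached to the most similar Topic by cosine similarity and inherits that Topic's path. The topic paths, the 3-dimensional vectors, and the `attach_concept` helper are hypothetical stand-ins for SPECTER2 embeddings and a Qdrant nearest-neighbour query, not the pipeline's actual code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical topic nodes: a (Domain, Field, Subfield, Topic) path and a toy vector.
# In the described system these vectors would be embeddings stored in a vector database.
TOPICS = {
    ("Physical Sciences", "Computer Science", "Artificial Intelligence", "Topic Modeling"): [0.9, 0.1, 0.0],
    ("Life Sciences", "Neuroscience", "Cognitive Neuroscience", "Memory"): [0.0, 0.2, 0.9],
}

def attach_concept(concept, vector, topics=TOPICS):
    """Attach a concept under its most similar topic, producing a 5-level path."""
    best_path, best_score = max(
        ((path, cosine(vector, topic_vec)) for path, topic_vec in topics.items()),
        key=lambda item: item[1],
    )
    return best_path + (concept,), best_score

path, score = attach_concept("latent dirichlet allocation", [0.8, 0.2, 0.1])
print(" → ".join(path), round(score, 3))
```

At production scale the linear scan over `TOPICS` would be replaced by an approximate nearest-neighbour query, which is the role the abstract assigns to Qdrant.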
A central contribution is a strategic, multi-stage integration of LLMs. Rather than using LLMs end-to-end, we deploy them at five targeted points where semantic judgment matters most: concept granularity filtering, field classification across 26 domains, cluster renaming, explanation generation, and validation of topic assignments using multi-embedding comparisons. Deterministic methods ensure scalability and reproducibility, while LLMs provide semantic precision where embeddings alone fall short.
The resulting taxonomy is used in production to automatically tag millions of publications, enabling real-time trend detection and portfolio-level analytics that support strategic decision-making.
Scientific organizations struggle to extract actionable insights from publication data tagged with inconsistent and noisy author keywords. Automatically assigning papers to consistent, semantically grounded concepts is essential for reliable trend detection, search, and analytics.
This talk presents a production-grade, two-stage classification pipeline that tags hundreds of thousands of scientific papers against a 110K+ concept taxonomy. Given a fixed hierarchical taxonomy extending OpenAlex's 4-level structure with a granular concept layer, the system combines vector-based retrieval, cross-encoder reranking, and targeted LLM validation to achieve scalable and accurate paper classification.
In Stage 1 (Candidate Retrieval), paper metadata (title, abstract, author keywords) is embedded using SPECTER2 and queried against Qdrant to retrieve a small, high-recall candidate set from over 110,000 concepts. In Stage 2 (Reranking and Filtering), cross-encoder models perform fine-grained semantic matching, while LLMs (Azure OpenAI) are selectively applied to resolve ambiguous cases and produce confidence-scored assignments.
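The two stages can be sketched in a few lines, under loudly stated assumptions: the toy vectors stand in for SPECTER2 embeddings, the in-memory `CONCEPTS` dictionary stands in for a Qdrant collection, the word-overlap `rerank_score` stands in for a cross-encoder, and the `accept`/`reject` thresholds are invented for illustration. Only candidates whose score falls between the thresholds are flagged for LLM review.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical concept index (a stand-in for 110K+ concept vectors in a vector database).
CONCEPTS = {
    "graph clustering": [0.9, 0.1, 0.1],
    "community detection": [0.8, 0.3, 0.0],
    "protein folding": [0.0, 0.1, 0.9],
    "vector search": [0.2, 0.9, 0.1],
}

def retrieve(paper_vec, k=3):
    """Stage 1: return the k concepts most similar to the paper vector."""
    ranked = sorted(CONCEPTS, key=lambda c: cosine(paper_vec, CONCEPTS[c]), reverse=True)
    return ranked[:k]

def rerank_score(paper_text, concept):
    """Stand-in for a cross-encoder: fraction of concept words found in the text."""
    words = set(paper_text.lower().split())
    terms = concept.split()
    return sum(term in words for term in terms) / len(terms)

def classify(paper_text, paper_vec, k=3, accept=0.9, reject=0.1):
    """Stage 2: rerank candidates and route only ambiguous ones to LLM review."""
    results = []
    for concept in retrieve(paper_vec, k):
        score = rerank_score(paper_text, concept)
        if score >= accept:
            decision = "accept"
        elif score <= reject:
            decision = "reject"
        else:
            decision = "needs_llm_review"  # the LLM call itself is omitted here
        results.append((concept, score, decision))
    return sorted(results, key=lambda r: r[1], reverse=True)

paper_text = "Leiden graph clustering for community structure in citation networks"
paper_vec = [0.85, 0.2, 0.05]
results = classify(paper_text, paper_vec)
for concept, score, decision in results:
    print(f"{concept}: {score:.2f} ({decision})")
```

Keeping the LLM behind a score band means its cost scales with the number of ambiguous candidates rather than with the full corpus, which matches the selective use described above.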
Deployed on millions of publications, the system standardizes noisy keywords and enriches paper metadata with semantically consistent concept tags, enabling downstream analytics at scale.