Feichi Lu
Feichi Lu is a Data Scientist at MDPI in Basel, where she works on building data-driven analytics for scientific publishing. She holds a Master’s degree in Data Science from ETH Zürich. Her experience spans large-scale data analysis, semantic modeling, and applied AI.
Session
Scientific publishers tag millions of articles with author-provided keywords, but these keywords are noisy, inconsistent, and semantically ambiguous. "Machine learning," "ML," and "machine-learning" all mean the same thing, while other terms shift meaning across disciplines.
This talk presents a production pipeline that extends OpenAlex's 4-level hierarchy (domain, field, subfield, topic) with a fifth in-house Concept layer, producing a 115K-concept scientific taxonomy.
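SPECTER2 embeddings model semantic similarity between concepts. Within each field, a mutual kNN graph is built over these embeddings and clustered with the Leiden algorithm under the Constant Potts Model (CPM) objective, grouping 100K+ concepts in total. Hyperparameters, including the CPM resolution, are selected through grid search and a custom pair-based evaluation. Qdrant enables vector-based hierarchical attachment.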
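As a rough illustration of the mutual kNN idea, the sketch below keeps an edge between two concepts only when each appears in the other's k nearest neighbours by cosine similarity. It is a toy, pure-Python version under stated assumptions: the 2-dimensional vectors stand in for SPECTER2 embeddings, the function names are hypothetical, and the brute-force neighbour search would be replaced by an approximate index at 100K+ nodes. It is not the pipeline's actual implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn(vectors, k):
    """Map each index to the set of its k most similar other indices."""
    neighbours = {}
    for i, u in enumerate(vectors):
        scored = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        # Highest similarity first; ties broken by the lower index.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        neighbours[i] = {j for _, j in scored[:k]}
    return neighbours

def mutual_knn_edges(vectors, k):
    """Keep edge (i, j) only if i is in j's kNN set and j is in i's."""
    nbrs = knn(vectors, k)
    return sorted(
        (i, j) for i in nbrs for j in nbrs[i] if i < j and i in nbrs[j]
    )

# Toy stand-ins for concept embeddings (assumption: real ones are SPECTER2 vectors).
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.7, 0.7]]
edges = mutual_knn_edges(toy, k=1)
print(edges)
```

The mutual requirement drops one-sided links such as the ambiguous fifth vector, which points at a neighbour that does not point back. The resulting edge list could then be handed to a community-detection library, for instance `leidenalg` with its `CPMVertexPartition` and a `resolution_parameter`, although the abstract does not say which implementation is used.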
LLMs are deployed at five targeted stages — granularity filtering, field classification, cluster renaming, explanation generation, and topic-assignment validation — while deterministic methods handle everything else, ensuring scalability and reproducibility.
The resulting taxonomy powers a paper-tagging pipeline in three stages: SPECTER2 retrieves ~150 candidates per paper across multiple text-splitting strategies, deterministic filters prune them by field/subfield distribution and merge near-synonyms, and an LLM reranker selects the final 5–8 concepts. These assignments enable applications such as temporal trend detection over emerging research topics.
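The deterministic middle stage of such a tagging pipeline might look roughly like the sketch below: candidates retrieved under different text-splitting strategies are merged by keeping the best score per concept, and concepts from fields that account for only a small share of the candidates are dropped before any LLM call. Everything here is an assumption made for illustration: the `Candidate` type, the function names, the `min_share` threshold, and the sample data are hypothetical, and near-synonym merging and the LLM reranker are left out.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    concept: str
    field: str
    score: float  # retrieval similarity; higher is better

def merge_candidates(runs):
    """Union candidates from several retrieval runs, keeping the best score per concept."""
    best = {}
    for run in runs:
        for cand in run:
            if cand.concept not in best or cand.score > best[cand.concept].score:
                best[cand.concept] = cand
    return list(best.values())

def prune_by_field(candidates, min_share=0.25):
    """Drop candidates whose field holds less than min_share of all candidates."""
    counts = Counter(c.field for c in candidates)
    total = sum(counts.values())
    keep = {field for field, n in counts.items() if n / total >= min_share}
    return [c for c in candidates if c.field in keep]

# Hypothetical retrieval output for one paper under two splitting strategies.
title_run = [
    Candidate("graph clustering", "Computer Science", 0.91),
    Candidate("community detection", "Computer Science", 0.88),
    Candidate("protein folding", "Biology", 0.41),
]
abstract_run = [
    Candidate("graph clustering", "Computer Science", 0.95),
    Candidate("taxonomy induction", "Computer Science", 0.80),
    Candidate("sentence embeddings", "Computer Science", 0.77),
]

merged = merge_candidates([title_run, abstract_run])
shortlist = sorted(prune_by_field(merged), key=lambda c: -c.score)
for cand in shortlist:
    print(f"{cand.score:.2f}  {cand.concept}")
```

Running cheap, reproducible filters first keeps the shortlist small, so the LLM reranker only has to choose among plausible concepts.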
Attendees will learn when to integrate LLMs in large-scale NLP pipelines, how to scale graph clustering to 100K+ nodes, and how to design hybrid embedding–LLM systems that turn noisy metadata into reliable scientific intelligence.