EuroSciPy 2026

Automating Scientific Paper Classification at Scale with Retrieval–Reranking and LLMs
2026-07-21, Room 1.38 (Ground Floor, Turing)

Scientific organizations struggle to extract actionable insights from publication data tagged with inconsistent and noisy author keywords. Automatically assigning papers to consistent, semantically grounded concepts is essential for reliable trend detection, search, and analytics.

This talk presents a production-grade, two-stage classification pipeline that tags hundreds of thousands of scientific papers against a 110K+ concept taxonomy. Given a fixed hierarchical taxonomy extending OpenAlex's 4-level structure with a granular concept layer, the system combines vector-based retrieval, cross-encoder reranking, and targeted LLM validation to achieve scalable and accurate paper classification.

In Stage 1 (Candidate Retrieval), paper metadata (title, abstract, author keywords) is embedded using SPECTER2 and queried against Qdrant to retrieve a small, high-recall candidate set from over 110,000 concepts. In Stage 2 (Reranking and Filtering), cross-encoder models perform fine-grained semantic matching, while LLMs (Azure OpenAI) are selectively applied to resolve ambiguous cases and produce confidence-scored assignments.

Deployed on millions of publications, the system standardizes noisy keywords and enriches paper metadata with semantically consistent concept tags, enabling downstream analytics at scale.


The Problem

Scientific papers are typically tagged with author-provided keywords that are inconsistent, ambiguous, and poorly aligned with standardized taxonomies. Variants such as "machine learning," "ML," and "machine-learning" refer to the same concept, while other terms are overloaded or context-dependent. Manual curation does not scale, and naive string matching fails to capture semantic meaning.

This talk focuses on the paper classification problem: given a large, fixed taxonomy, how can we automatically and reliably tag papers at scale?

The Foundation: A Large-Scale Concept Taxonomy

The classification pipeline assumes a 110K+ concept scientific taxonomy that extends OpenAlex's 4-level hierarchy with a granular concept layer. This structured taxonomy provides the semantic backbone that makes large-scale, consistent paper tagging possible. (Taxonomy construction is treated as given context; the focus of this talk is on classification and deployment.)

Stage 1: Candidate Retrieval with Bi-Encoders

Input Processing. We extract paper metadata (title, abstract, author keywords) and embed it with SPECTER2, a bi-encoder model trained on scientific text. Because a bi-encoder embeds papers and concepts independently, concept embeddings can be pre-computed and cached.

Vector Search. Using Qdrant, we retrieve the top-N candidate concepts (typically N = 50–100) via cosine similarity. This step reduces the search space from over 110,000 concepts to a manageable candidate set while maintaining high recall.
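
The top-N step can be illustrated with a minimal sketch. It assumes toy 3-dimensional vectors in place of SPECTER2 embeddings and a brute-force scan in place of a Qdrant index. The names `cosine`, `top_n_candidates`, and the example vectors are hypothetical and not taken from the pipeline itself.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_n_candidates(paper_vec, concept_vecs, n):
    """Return the n concepts most similar to the paper, best first."""
    scored = [(cid, cosine(paper_vec, vec)) for cid, vec in concept_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:n]


# Toy vectors (assumption): real embeddings would come from the encoder.
concept_vecs = {
    "machine learning": [0.9, 0.1, 0.0],
    "graph theory": [0.1, 0.9, 0.1],
    "organic chemistry": [0.0, 0.1, 0.9],
}
paper_vec = [0.8, 0.3, 0.0]

candidates = top_n_candidates(paper_vec, concept_vecs, n=2)
for concept, score in candidates:
    print(f"{concept}: {score:.3f}")
```

A vector database replaces the linear scan with an approximate nearest-neighbour index, which is what makes the same operation feasible over 110,000+ concepts.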

Threshold Tuning. We discuss similarity threshold strategies that balance recall (avoiding missed relevant concepts) and precision (limiting noise passed to later stages).
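
One way to study this trade-off is to sweep the threshold over a small labelled sample and record recall alongside the average number of candidates kept. The sketch below assumes hypothetical similarity scores and gold labels; `recall_and_size` is an illustrative helper, not the tuning procedure presented in the talk.

```python
def recall_and_size(scored_by_paper, gold_by_paper, threshold):
    """Return (recall, mean candidates kept per paper) at a given threshold."""
    hits = total_gold = kept = 0
    for paper_id, scored in scored_by_paper.items():
        selected = {concept for concept, score in scored if score >= threshold}
        gold = gold_by_paper[paper_id]
        hits += len(selected & gold)
        total_gold += len(gold)
        kept += len(selected)
    return hits / total_gold, kept / len(scored_by_paper)


# Hypothetical Stage-1 scores and gold concept labels for two papers.
scored_by_paper = {
    "p1": [("a", 0.9), ("b", 0.6), ("c", 0.3)],
    "p2": [("a", 0.5), ("d", 0.8), ("e", 0.2)],
}
gold_by_paper = {"p1": {"a", "b"}, "p2": {"d"}}

for threshold in (0.1, 0.55, 0.7):
    recall, size = recall_and_size(scored_by_paper, gold_by_paper, threshold)
    print(f"threshold={threshold}: recall={recall:.2f}, mean kept={size:.1f}")
```

Raising the threshold shrinks the candidate set passed to Stage 2 but eventually drops gold concepts, so the operating point is usually chosen as the highest threshold that still meets a recall target.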

Stage 2: Reranking with Cross-Encoders and LLMs

Why Cross-Encoders? Bi-encoders scale well but miss fine-grained interactions between paper content and concept descriptions. Cross-encoders jointly encode paper–concept pairs, capturing nuanced semantic relationships at higher computational cost.

Reranking Architecture. Cross-encoder models score each candidate pair, producing a high-precision ranking over Stage-1 results.
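
The reranking loop itself is simple; the cost lies in the pairwise scorer. The sketch below keeps the scorer pluggable and uses a toy token-overlap function as a stand-in, purely so the example runs without a model. In a real system `score_pair` would wrap a cross-encoder forward pass; `rerank`, `toy_overlap_score`, and the concept descriptions are hypothetical.

```python
def rerank(paper_text, candidates, score_pair):
    """Score every (paper, concept description) pair and sort best first."""
    scored = [
        (concept, score_pair(paper_text, description))
        for concept, description in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def toy_overlap_score(paper_text, concept_text):
    """Stand-in scorer (assumption): fraction of concept tokens found in the paper."""
    paper_tokens = set(paper_text.lower().split())
    concept_tokens = set(concept_text.lower().split())
    return len(paper_tokens & concept_tokens) / len(concept_tokens)


paper = "graph neural networks for molecule property prediction"
stage1_candidates = [
    ("graph theory", "study of graph structures and combinatorics"),
    ("graph neural networks", "neural networks operating on graph data"),
    ("organic chemistry", "chemistry of carbon compounds"),
]

ranked = rerank(paper, stage1_candidates, toy_overlap_score)
for concept, score in ranked:
    print(f"{concept}: {score:.2f}")
```

Because every pair needs its own scorer call, the cost grows linearly with the candidate count, which is why Stage 1 keeps N small.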

LLM-Based Filtering. For ambiguous cases, we integrate Azure OpenAI for context-aware validation. LLMs help detect non-core mentions (e.g., negative references or future work) and resolve borderline assignments.
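
Selective application can be sketched as a routing rule: accept high-scoring candidates directly, reject low-scoring ones, and send only the band in between to the LLM. The thresholds, the `route` function, and the stub validator below are assumptions for illustration; the stub stands in for a real LLM call and does not reflect the Azure OpenAI API.

```python
def route(scored, accept_at, reject_below, llm_validate):
    """Accept, reject, or escalate each candidate based on its reranker score.

    Returns the accepted concepts and the number of LLM calls made.
    """
    accepted = []
    llm_calls = 0
    for concept, score in scored:
        if score >= accept_at:
            accepted.append(concept)  # confident: no LLM needed
        elif score >= reject_below:
            llm_calls += 1  # ambiguous: ask the validator
            if llm_validate(concept):
                accepted.append(concept)
        # below reject_below: dropped without an LLM call
    return accepted, llm_calls


def stub_validator(concept):
    """Hypothetical stand-in for an LLM check (e.g. 'is this a core topic?')."""
    return concept != "quantum computing"  # pretend it was only future work


scored = [
    ("graph neural networks", 0.95),
    ("drug discovery", 0.60),
    ("quantum computing", 0.55),
    ("organic chemistry", 0.10),
]
accepted, llm_calls = route(scored, accept_at=0.8, reject_below=0.4,
                            llm_validate=stub_validator)
print(accepted, llm_calls)
```

Widening the ambiguous band raises LLM spend; narrowing it shifts more decisions onto the reranker score alone.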

Final Assignment. The system outputs ranked, multi-label concept assignments with calibrated confidence scores, selecting top-k concepts per paper based on learned thresholds.
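
As a rough illustration, the sketch below maps raw reranker scores to probabilities with a logistic (Platt-style) function, then keeps at most k concepts above a threshold. The calibration method, the parameters `a` and `b`, and the threshold value are assumptions; the abstract does not specify how calibration or thresholds are learned.

```python
import math


def calibrate(raw_score, a=1.0, b=0.0):
    """Logistic mapping from a raw score to a confidence in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(a * raw_score + b)))


def assign(ranked, k, threshold, a=1.0, b=0.0):
    """Keep up to k concepts whose calibrated confidence meets the threshold."""
    calibrated = [(concept, calibrate(score, a, b)) for concept, score in ranked]
    kept = [(concept, p) for concept, p in calibrated if p >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:k]


# Hypothetical raw reranker scores for one paper.
raw_scores = [("a", 2.0), ("b", 0.5), ("c", -1.0), ("d", 1.0)]
assignments = assign(raw_scores, k=2, threshold=0.6)
for concept, confidence in assignments:
    print(f"{concept}: {confidence:.3f}")
```

In practice `a` and `b` would be fitted on held-out labelled pairs so that a reported confidence of 0.8 corresponds to roughly 80% of such assignments being correct.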

Production Deployment and Impact

The pipeline is deployed on millions of papers, standardizing noisy author keywords and enriching metadata with semantically consistent concept tags. This enables:

  • Real-time trend detection (identifying emerging topics weeks early)
  • Cross-journal portfolio analytics
  • Data-driven strategic decision-making

We also discuss operational challenges, including batch processing, GPU utilization, cost–accuracy trade-offs for cross-encoders, evaluation metrics (Precision@k, NDCG), and production monitoring.
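
For reference, both metrics can be computed in a few lines for binary relevance. This is a minimal sketch using the standard definitions; the function names and the example ranking are illustrative, and a real evaluation would average over many papers.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked concepts that are relevant."""
    return sum(1 for concept in ranked[:k] if concept in relevant) / k


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, concept in enumerate(ranked[:k])
        if concept in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["a", "x", "b", "y"]  # system output, best first
relevant = {"a", "b"}          # gold concepts for this paper

p_at_3 = precision_at_k(ranked, relevant, k=3)
ndcg_at_3 = ndcg_at_k(ranked, relevant, k=3)
print(f"P@3={p_at_3:.3f}, NDCG@3={ndcg_at_3:.3f}")
```

Precision@k ignores order within the top k, while NDCG rewards placing relevant concepts earlier, so the two are often reported together.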

Key Takeaways

Attendees will learn:

  • Retrieval–reranking design patterns for large label spaces
  • When to use bi-encoders vs. cross-encoders in production
  • Practical Qdrant optimization for large-scale vector search
  • Cross-encoder deployment and cost trade-offs
  • Selective LLM integration for context-aware filtering
  • Thresholding and confidence calibration for multi-label classification
  • Batch processing and GPU optimization strategies
  • How structured taxonomies improve classification accuracy at scale

Audience

Data scientists, ML engineers, and NLP practitioners working on document classification, retrieval systems, or production NLP pipelines. Familiarity with embeddings, transformers, and vector databases is expected.


Expected audience expertise: Domain: none
Expected audience expertise: Python: some
Your relationship with the presented work/project: Original author or co-author

Daniele is a data scientist with expertise in statistics, data science, and AI, and is passionate about exploring the intersection of machine learning and financial markets. Since 2023, he has been working at MDPI, one of the largest open-access publishers. He is a former national 400m sprinter.


Feichi Lu is a Data Scientist at MDPI in Basel, where she works on building data-driven analytics for scientific publishing. She holds a Master’s degree in Data Science from ETH Zürich. Her experience spans large-scale data analysis, semantic modeling, and applied AI.
