BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//pretalx//pretalx.com//pydata-london-2026//talk//ZFR8VH
BEGIN:VTIMEZONE
TZID:GMT
BEGIN:STANDARD
DTSTART:20001029T030000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=10
TZNAME:GMT
TZOFFSETFROM:+0100
TZOFFSETTO:+0000
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:20000326T020000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=3
TZNAME:BST
TZOFFSETFROM:+0000
TZOFFSETTO:+0100
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
UID:pretalx-pydata-london-2026-ZFR8VH@pretalx.com
DTSTART;TZID=GMT:20260606T161500
DTEND;TZID=GMT:20260606T170000
DESCRIPTION:Scientific publishers tag **millions of articles** with author-
 provided keywords\, but these keywords are noisy\, inconsistent\, and sema
 ntically ambiguous. *"Machine learning\," "ML\," and "machine-learning"* a
 ll mean the same thing\, while other terms shift meaning across discipline
 s.\n\nThis talk presents a **production pipeline** that extends [OpenAlex]
 (https://openalex.org/)'s 4-level hierarchy with a fifth in-house **Concep
 t** layer\, producing a **115K-concept scientific taxonomy**.\n\n**SPECTE
 R2** embeddings model semantic similarity\, and per-field **Leiden cluster
 ing with CPM resolution** groups 100K+ concepts via mutual kNN graphs — 
 with hyperparameters selected through **grid search** and **custom pair-ba
 sed evaluation**. **Qdrant** enables vector-based hierarchical attachment.
 \n\n**LLMs are deployed at five targeted stages** — granularity filterin
 g\, field classification\, cluster renaming\, explanation generation\, and
  topic-assignment validation — while **deterministic methods handle ever
 ything else**\, ensuring scalability and reproducibility.\n\nThe resulting
  taxonomy powers a **paper-tagging pipeline** where SPECTER2 retrieves ~15
 0 candidates per paper across multiple text-splitting strategies\, determi
 nistic filters prune by field/subfield distribution and near-synonym mergi
 ng\, and an **LLM reranker selects the final 5–8 concepts**. These assig
 nments enable applications such as **temporal trend detection** over emerg
 ing research topics.\n\n**Attendees will learn** when to integrat
 e LLMs in large-scale NLP pipelines\, how to scale graph clustering to 100
 K+ nodes\, and how to design hybrid embedding–LLM systems that turn nois
 y metadata into reliable scientific intelligence.
DTSTAMP:20260602T223345Z
LOCATION:Grand Hall 1
SUMMARY:Building a Scientific Taxonomy at Scale with Graph Clustering\, Emb
 eddings\, and LLMs - Daniele Raimondi\, Feichi Lu
URL:https://pretalx.com/pydata-london-2026/talk/ZFR8VH/
END:VEVENT
END:VCALENDAR