EuroSciPy 2024

A Qdrant and Specter2 framework for tracking resubmissions of rejected manuscripts in academia
2024-08-28 , Room 7

This presentation introduces a framework built on the Qdrant vector database and the Specter2 model to identify whether a rejected academic manuscript is later published in a competing journal. Our method combines AI, data science and analytics to identify manuscripts and authors reliably. The findings offer insights into resubmission patterns, enhancing our understanding of academic publishing dynamics. The system is implemented in Python.


Understanding what happens to rejected manuscripts is crucial in academic publishing. We developed a system that uses machine-learning techniques to track whether rejected manuscripts are later published in competing journals. We compute embeddings of rejected manuscripts with the Specter2 model, store them in a vector database, and compare them with published articles, focusing on title and abstract similarity. Author similarity and other checks ensure accurate identification despite variations in author names.
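The two comparisons described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the system presented in the talk: the three-dimensional vectors stand in for Specter2 embeddings, the brute-force search stands in for a Qdrant query, and the helper names (`cosine_similarity`, `best_match`, `name_key`, `author_similarity`) as well as the "given name first, surname last" convention are hypothetical.

```python
import math
import unicodedata


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query, candidates):
    """Return (candidate_id, score) for the most similar candidate vector.

    A vector database would answer this query with an index; the linear
    scan here is only a stand-in for small toy data.
    """
    return max(
        ((cid, cosine_similarity(query, vec)) for cid, vec in candidates.items()),
        key=lambda pair: pair[1],
    )


def name_key(name):
    """Reduce a name to (surname, first initial), ignoring accents and case.

    Assumes the order "Given Surname"; other orders need extra handling.
    """
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    parts = ascii_name.replace(".", " ").lower().split()
    return (parts[-1], parts[0][0]) if parts else ("", "")


def author_similarity(authors_a, authors_b):
    """Jaccard overlap of normalised author keys, in [0, 1]."""
    keys_a = {name_key(n) for n in authors_a}
    keys_b = {name_key(n) for n in authors_b}
    if not keys_a or not keys_b:
        return 0.0
    return len(keys_a & keys_b) / len(keys_a | keys_b)


# Toy data: made-up vectors in place of title+abstract embeddings.
published = {"pub-1": [0.9, 0.1, 0.0], "pub-2": [0.0, 1.0, 0.2]}
rejected = [1.0, 0.0, 0.0]

match_id, manuscript_score = best_match(rejected, published)
author_score = author_similarity(
    ["José Müller", "Anna Rossi"], ["J. Muller", "A. Rossi"]
)
print(match_id, round(manuscript_score, 3), author_score)
```

Reducing each name to a surname plus first initial is one simple way to tolerate abbreviated given names and missing diacritics; it will also merge distinct authors who share that key, which is one reason to combine it with the manuscript score rather than use it alone.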
Our system generates two key scores: manuscript similarity and author similarity. A machine-learning approach then classifies each pair of papers as the same or different, with thresholds fine-tuned through manual labelling and scatter-plot analysis.
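As a rough illustration of how two scores and a set of manually labelled pairs can yield decision thresholds, the sketch below runs a small grid search. It is a simplified stand-in, not the classifier used in the talk: the rule "both scores must reach their threshold", the function names, the grid and the labelled pairs are all assumptions made for the example.

```python
def is_same_paper(manuscript_sim, author_sim, manuscript_thr, author_thr):
    """Hypothetical decision rule: both scores must reach their threshold."""
    return manuscript_sim >= manuscript_thr and author_sim >= author_thr


def tune_thresholds(labelled_pairs, grid):
    """Return (accuracy, manuscript_thr, author_thr) with the highest accuracy.

    labelled_pairs holds (manuscript_sim, author_sim, is_same) tuples taken
    from manual labelling. Ties keep the first combination found.
    """
    best = None
    for m_thr in grid:
        for a_thr in grid:
            correct = sum(
                is_same_paper(m, a, m_thr, a_thr) == label
                for m, a, label in labelled_pairs
            )
            accuracy = correct / len(labelled_pairs)
            if best is None or accuracy > best[0]:
                best = (accuracy, m_thr, a_thr)
    return best


# Invented labelled pairs: (manuscript similarity, author similarity, same paper?)
labelled = [
    (0.97, 0.90, True),
    (0.93, 0.75, True),
    (0.95, 0.10, False),  # similar topic, different authors
    (0.60, 0.80, False),  # overlapping authors, different paper
    (0.40, 0.05, False),
]

accuracy, manuscript_thr, author_thr = tune_thresholds(labelled, [0.5, 0.7, 0.9])
print(accuracy, manuscript_thr, author_thr)
```

Plotting the labelled pairs with one score per axis shows the same information visually: the chosen thresholds correspond to a vertical and a horizontal line that separate the "same paper" points from the rest.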
This approach combines AI, data science and analytics, providing valuable insights into resubmission patterns and enhancing our understanding of academic publishing dynamics.


Abstract as a tweet:

Our Python-based system tracks rejected academic manuscripts to see if they're published elsewhere, enhancing our understanding of publishing dynamics.

Category [Data Science and Visualization]:

Data Analysis and Data Engineering

Expected audience expertise: Domain:

some

Expected audience expertise: Python:

some

Daniele is a data scientist with expertise in statistics, data science and finance, passionate about exploring the intersection of machine learning and financial markets.
Since 2023, he has been working at MDPI, one of the largest open-access publishers.
He is also a former national 400m sprinter.