EuroSciPy 2025

Efficient processing pipelines for large scale molecular datasets in Python
2025-08-21 , Room 1.19 (Ground Floor)

We introduce an extensible Python framework for automated generation and preprocessing of large-scale chemical datasets. It is based on parallelized and distributed Dask processing for building molecular pipelines. RDKit, written in C++ with Python interface, is leveraged for molecular processing and computation of structural properties. This allows us to process hundreds of millions of molecules on regular-size server units. We also included a suite of analysis scripts for comparing dataset cardinality, scaffold diversity, and chemical space metrics. Created software enables efficient pretraining and benchmarking of molecular foundation models, applicable for varying applications in chemoinformatics.


Problem Statement
Public molecular collections are fragmented across multiple repositories, e.g. PubChem, UniChem, or COCONUT. Each has a distinct data representation, inclusion criteria, and other idiosyncrasies. This makes it difficult to assemble a single, high-quality dataset, e.g. for pretraining molecular foundation models, aimed at varied chemoinformatics applications.

Software
Our framework accepts one or more raw chemical datasets as inputs which are then delegated into their respective pipelines (described below). The output is a single, cleaned master dataset.

Pipeline Overview
Pipeline consumes raw data from each source and executes a sequence of steps:
1. Download: fetch original files
2. Preprocess: parse and normalize input formats
3. Standardize: sanitize and canonicalize SMILES via RDKit
4. Deduplicate: remove duplicates using InChI keys (per source, then globally)
The final output is a merged “master” dataset ready for downstream analysis or modeling.

Technologies & Tools
- Dask - distributed and parallel computing framework. Pipelines require significant use of Python libraries, hence we need a tool which handles parallel Python interpreters very well.
- RDKit - for chemical operations. Thanks to its core written in C++,the operations are high-performance.
- scikit-fingerprints - for molecular filters

Intended Uses
This software enables users to:
- Pretrain large ML models on unified chemical space
- Compare dataset cardinality, functional-group distributions, Bemis–Murcko scaffolds, and Circles metrics
- Benchmark self-supervised architectures (e.g., Mol2Vec, MolFormer) across varied dataset sizes and chemical domains
- Integrate new data sources or custom filters as requirements evolve


Expected audience expertise: Domain:

some

Expected audience expertise: Python:

some

Your relationship with the presented work/project:

Original author or co-author, Active contributor, Developed the presented feature, Maintainer of the presented library/project

Software engineer at Software Mansion
Computer Science engineering undergraduate (3rd year)
Mainly used technologies: Python, Rust