Franciszek Job
Software engineer at Software Mansion
Computer Science engineering undergraduate (3rd year) at AGH
Mainly used technologies: Python, Rust
Session
We introduce an extensible Python framework for the automated generation and preprocessing of large-scale chemical datasets. Molecular pipelines are built on Dask, which parallelizes and distributes the processing. RDKit, written in C++ with a Python interface, handles molecular processing and the computation of structural properties. Together, these allow us to process hundreds of millions of molecules on regular-size server units. We also include a suite of analysis scripts for comparing dataset cardinality, scaffold diversity, and chemical space metrics. The resulting software enables efficient pretraining and benchmarking of molecular foundation models, with applications across chemoinformatics.
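To make the idea concrete, here is a minimal sketch of how a Dask plus RDKit preprocessing step might look. It is not the framework presented in the session: the function names `process_smiles`, `run_pipeline`, and `scaffold_diversity` are hypothetical, the choice of Dask Bag and the local threaded scheduler is an assumption made for brevity, and scaffold diversity is assumed here to mean the fraction of distinct Bemis-Murcko scaffolds among the processed molecules.

```python
import dask.bag as db
from rdkit import Chem, RDLogger
from rdkit.Chem import Descriptors
from rdkit.Chem.Scaffolds import MurckoScaffold

RDLogger.DisableLog("rdApp.*")  # silence RDKit parse errors for invalid input


def process_smiles(smiles):
    """Parse one SMILES string; return a record, or None if it is invalid."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return {
        "smiles": Chem.MolToSmiles(mol),  # canonical SMILES
        "scaffold": MurckoScaffold.MurckoScaffoldSmiles(mol=mol),
        "mol_wt": Descriptors.MolWt(mol),
    }


def run_pipeline(smiles_list, npartitions=2):
    """Hypothetical pipeline: parse, drop invalid entries, deduplicate."""
    bag = db.from_sequence(smiles_list, npartitions=npartitions)
    records = bag.map(process_smiles).filter(lambda r: r is not None)
    unique = records.distinct(key=lambda r: r["smiles"])
    # Assumption: a local threaded scheduler; a real deployment would more
    # likely submit the same graph to a distributed cluster.
    return unique.compute(scheduler="threads")


def scaffold_diversity(records):
    """Assumed metric: distinct scaffolds divided by number of molecules.

    Acyclic molecules have an empty scaffold, counted as one scaffold here.
    """
    if not records:
        return 0.0
    return len({r["scaffold"] for r in records}) / len(records)


if __name__ == "__main__":
    data = ["CCO", "OCC", "c1ccccc1", "Cc1ccccc1", "not_a_smiles"]
    result = run_pipeline(data)
    print(sorted(r["smiles"] for r in result))
    print(scaffold_diversity(result))
```

In this sketch, `CCO` and `OCC` collapse to a single record because both canonicalize to the same SMILES, and the invalid string is dropped before deduplication. Partitioning the input lets each worker parse its share independently, which is the property that makes this kind of per-molecule processing scale across cores or machines.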