06.12.2025, Main Stream. Language: English
What if the tools we use to predict disease risk don't work for entire populations, simply because very few studies have tested them there? That's the case for many polygenic risk score (PRS) models and pipelines, which often fail when applied to African genomes. A PRS is a number that estimates how likely someone is to develop a disease based on their DNA. In this talk, I'll walk through how I am using Python to build something better: a reproducible pipeline for genomic tools that center inclusion.
More than a technical talk, this is a story about learning, persistence and impact. I'll share how I use accessible Python libraries like argparse, pandas, subprocess and shutil to automate preprocessing, handle containerised genomic tools with Docker, and stitch everything together using workflow tools like Nextflow. The tools I'm building are still a work in progress, but this talk will spotlight the imperfect, iterative process of building pipelines (while learning) that don't exclude underrepresented populations by design. I'll share how Python makes it possible to go from fragmented data to actionable results.
This talk is for anyone curious about bioinformatics, passionate about global inclusion or simply learning Python.
This talk is a personal and technical journey into using Python as a tool for representation in genomics. It shares the story of how I (under the guidance of excellent mentors) began building a pipeline that challenges the norms of who gets represented in disease risk prediction tools.
Many widely used genomics tools for analysing DNA, like those used to calculate polygenic risk scores (PRS), perform poorly on African datasets. In addition to data scarcity, this is a design issue rooted in various assumptions, some of which were never questioned.
This talk highlights how I'm using Python to question those assumptions, as a learner using accessible tools, libraries and open-source principles to create change from my corner of the world.
5 KEY POINTS THIS TALK WILL COVER
- Relevance of Talk:
I begin by grounding the audience in the problem: how underrepresentation of African genomes leads to poor health predictions and inequitable science. Most people outside of genetics don't realise that common tools perform dramatically worse on diverse populations, and this can have serious consequences for things like disease risk scoring, diagnosis and treatment decisions. The problem extends beyond data diversity: it also lies in the inflexibility of existing pipelines, which cannot process such diversity effectively. Many genomic tools weren't built to work for African data, and in some cases they break down completely. I'll share a few brief, easy-to-understand examples of where this happens in real datasets.
- How Python Helps Me Build Solutions:
The heart of the talk is a practical, beginner-friendly look at how Python enables me to solve this problem, step-by-step. I'll introduce the libraries I use daily:
i. argparse: For making my scripts flexible and user-friendly
ii. pandas: For cleaning, transforming and exploring genetic summary statistics and phenotype data
iii. subprocess: For calling Docker containers and orchestrating commands programmatically
iv. shutil: For managing folders and moving data during preprocessing
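To give a flavour of how three of these libraries can fit together, here is a minimal sketch. It is an illustration only, not the actual pipeline code: the image name `example/prs-tool:latest`, the `--input` flag passed to the container, and all function names are hypothetical placeholders. argparse reads the options, shutil stages the input file into a working folder, and subprocess hands a `docker run` command to Docker.

```python
import argparse
import shutil
import subprocess
from pathlib import Path


def parse_args(argv=None):
    """Read command-line options (all option names here are illustrative)."""
    parser = argparse.ArgumentParser(
        description="Stage an input file and run a containerised tool on it."
    )
    parser.add_argument("--sumstats", required=True, type=Path,
                        help="path to a summary statistics file")
    parser.add_argument("--workdir", default=Path("work"), type=Path,
                        help="folder that will be mounted into the container")
    parser.add_argument("--image", default="example/prs-tool:latest",
                        help="Docker image to run (hypothetical name)")
    parser.add_argument("--dry-run", action="store_true",
                        help="print the Docker command instead of running it")
    return parser.parse_args(argv)


def stage_input(src, workdir):
    """Copy the input into the working folder and return the new path."""
    workdir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(src, workdir / src.name))


def build_docker_command(image, workdir, staged):
    """Build the argument list for `docker run`.

    `--rm` and `-v` are standard Docker flags; `--input` is a made-up
    flag standing in for whatever the containerised tool expects.
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{workdir.resolve()}:/data",
        image,
        "--input", f"/data/{staged.name}",
    ]


def main(argv=None):
    args = parse_args(argv)
    staged = stage_input(args.sumstats, args.workdir)
    cmd = build_docker_command(args.image, args.workdir, staged)
    if args.dry_run:
        print(" ".join(cmd))
        return 0
    # check=True raises CalledProcessError if the container exits non-zero.
    return subprocess.run(cmd, check=True).returncode
```

Calling `main(["--sumstats", "my_sumstats.tsv", "--dry-run"])` prints the command without needing Docker installed, which is a handy way to check paths before a long run. Passing the command as a list rather than a single string avoids shell quoting problems with unusual file names.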
Finally, a quick mention of how this integrates with workflow tools like Nextflow.
- Learning in Public, Building While Learning:
One of the key themes of this talk is that you don't have to be an expert to build something impactful.
I'll share:
- What it looked like to go from having basic Python knowledge to writing scripts that trigger entire bioinformatics pipelines
- The many (many!) mistakes I made while figuring out how to manage data from diverse sources
- How I broke the big problem down into scripts and containers, even while learning pipeline management, Docker and bash along the way
- Why starting before you're ready is the most powerful thing you can do as a Python learner
- Challenges I'm Solving With Python:
Instead of detailing the benchmarking methodology, I'm spotlighting the problems my code is built to solve:
- Handling messy metadata: How Python scripts help me detect and resolve inconsistent column names and formats across datasets.
- Reproducibility on a shoestring: Why using Python + Docker gives me the ability to share reproducible, portable pipelines.
- Debugging inclusivity: How scripting helped me spot patterns of exclusion and fix issues that most pipelines would miss (e.g. default parameters that don't suit African linkage disequilibrium (LD) structures or allele frequency ranges).
All of this is still a work in progress, and that's part of the point.
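As a small illustration of the messy-metadata problem, the sketch below uses pandas to map differently spelled column headers onto one set of names. It is an assumption-laden example rather than the code from the pipeline: the alias table, the canonical names and the function `harmonise_columns` are all hypothetical, and a real dataset will need its own list of aliases.

```python
import pandas as pd

# Hypothetical alias table: canonical column name -> spellings that may
# appear in different summary statistics files (already lower-cased).
COLUMN_ALIASES = {
    "snp": {"snp", "rsid", "variant_id", "markername"},
    "chr": {"chr", "chrom", "chromosome"},
    "pos": {"pos", "bp", "position"},
    "effect_allele": {"a1", "effect_allele", "ea"},
    "beta": {"beta", "effect", "b"},
    "p": {"p", "pval", "p_value", "pvalue"},
}


def harmonise_columns(df):
    """Rename known columns to canonical names.

    Returns the renamed DataFrame and a list of columns that matched no
    alias, so they can be reviewed by hand instead of silently dropped.
    """
    # Invert the alias table into a flat alias -> canonical lookup.
    lookup = {
        alias: canonical
        for canonical, aliases in COLUMN_ALIASES.items()
        for alias in aliases
    }
    renamed, unknown = {}, []
    for col in df.columns:
        # Normalise case, surrounding spaces, hyphens and inner spaces.
        key = col.strip().lower().replace("-", "_").replace(" ", "_")
        if key in lookup:
            renamed[col] = lookup[key]
        else:
            unknown.append(col)
    return df.rename(columns=renamed), unknown


raw = pd.DataFrame({
    "rsID": ["rs1"], "CHROM": [1], "BP": [100],
    "A1": ["A"], "Effect": [0.1], "P-value": [0.05], "INFO": [0.9],
})
clean, unknown = harmonise_columns(raw)
print(list(clean.columns))
print(unknown)
```

Returning the unmatched columns, rather than dropping them, keeps the step transparent: anything the script does not recognise is surfaced for a human to decide on.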
- Significance of Talk:
We often celebrate tools and technologies after they've become polished and complete. But in this talk, I want to celebrate the act of building while learning, especially when it's in service of equity and inclusion.
For too long, people from underrepresented regions and learners new to coding have been made to feel like they don't belong in technical spaces unless they're already experts.
This talk says otherwise.
Python isn't just a programming language. For me, it's a way to claim space, ask hard questions, and build tools that make health and data inclusive for all.
Chioma Oselu (Onyido) is a computational biologist developing reproducible pipelines to improve genetic risk prediction across ancestries. Her work bridges statistical genetics and software design, with a focus on building tools that perform reliably in underrepresented populations, especially African cohorts. She contributes to open-source efforts like BugSigDB in Bioconductor, co-leads global training initiatives, and is advancing a more transparent, population-aware approach to genomic method development.
