PyCon DE & PyData 2026

Shiva Banasaz Nouri

Shiva Banasaz Nouri is a Senior Data Scientist based in Berlin, Germany, working on applied machine learning with a focus on Python, NLP, computer vision, and generative AI. She builds production-grade AI systems across healthcare, legal, and enterprise domains using open-source technologies.

She is the Berlin Chapter Lead of Women in AI, where she fosters community building, knowledge sharing, and inclusive participation in the AI and Python ecosystems.


Session

04-15
15:00
45min
Building Non-Biased Synthetic Datasets: What Actually Works (and What Fails)
Shiva Banasaz Nouri

Synthetic data is often presented as an easy fix for missing or sensitive datasets, but in practice it can silently introduce bias, leakage, and misleading evaluation results. This talk presents a practical, end-to-end pipeline for creating synthetic datasets that are reproducible, task-aligned, and bias-aware. We will walk through the design decisions that matter: template-based vs. free-form generation, entity balancing, controlling distributional skew, filtering failure cases, and validating dataset quality before training any model. The session emphasizes what works in real pipelines, common failure modes that look fine at first glance, and concrete best practices that Python developers can apply when building synthetic datasets for machine learning, NLP, or evaluation.
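
As a rough illustration of two ideas named in the abstract, template-based generation and entity balancing, the sketch below fills a small set of templates so that every (group, job) pair appears equally often, then checks that balance before the data is used. It is not taken from the talk: the templates, name pools, job list, and the `generate` and `is_balanced` helpers are all hypothetical and chosen only to keep the example small.

```python
import itertools
import random
from collections import Counter

# Hypothetical templates and entity pools, invented for illustration only.
TEMPLATES = [
    "{name} works as a {job}.",
    "The {job} named {name} reviewed the report.",
]
NAMES = {"female": ["Aylin", "Maria"], "male": ["Jonas", "Omar"]}
JOBS = ["nurse", "engineer", "lawyer"]


def generate(seed=0):
    """Build every template x group x job combination exactly once."""
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    rows = []
    for template, group, job in itertools.product(TEMPLATES, sorted(NAMES), JOBS):
        name = rng.choice(NAMES[group])  # only the surface name is sampled
        rows.append(
            {"text": template.format(name=name, job=job), "group": group, "job": job}
        )
    rng.shuffle(rows)
    return rows


def is_balanced(rows):
    """Check that every (group, job) pair is present and equally frequent."""
    counts = Counter((r["group"], r["job"]) for r in rows)
    per_pair = [counts[pair] for pair in itertools.product(NAMES, JOBS)]
    return min(per_pair) > 0 and len(set(per_pair)) == 1


dataset = generate(seed=42)
```

Enumerating the full cross product, rather than sampling groups and jobs independently, is one simple way to avoid accidental skew between an entity group and a label. Running the same balance check on a filtered or truncated subset is a cheap guard against a later pipeline step reintroducing that skew.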

PyData: Generative AI & Synthetic Data
Helium [3rd Floor]