PyData London 2026

Querying the queries: SQL Metaprogramming in Python
2026-06-07 , Doddington Forum

Large SQL codebases inevitably accumulate duplication, inconsistency, deep nesting, and subtle logic errors, making refactoring slow, risky, and often unrealistic to do by hand. This talk shows how Python metaprogramming can turn SQL itself into data that can be analyzed and transformed safely and automatically.

Instead of relying on fragile regex patterns or manual inspection, we use Python to parse queries into Abstract Syntax Trees (represented as nested dictionaries) using libraries such as sqloxide. Once SQL itself is encoded as data, entirely new workflows become possible.

The session walks through practical examples of treating SQL programmatically via tree operations in Python: computing subquery depth for linting, wrapping all denominators in NULLIF() with a simple AST rewrite, auto‑aliasing aggregate expressions, and generating dependency graphs of temporary tables used across pipelines, among others. Each example highlights how metaprogramming enables precise, automatable refactors that would be error‑prone or impossible through text manipulation alone. This talk is designed for analytics and data engineers who work with large SQL codebases.


SQL sits at the heart of most analytics and data engineering work, yet the way we maintain SQL rarely scales with the complexity of our pipelines. As codebases grow, SQL tends to accumulate structural debt: duplicated logic, subtle inconsistencies, deeply nested subqueries, and transformations that are difficult to apply reliably. Teams often end up relying on manual pattern‑matching, ad‑hoc scripts, or one‑off rewrites, approaches that are fragile and nearly impossible to generalise.

This talk presents a more systematic solution: treat queries as manipulable data through metaprogramming in Python. Instead of working with SQL as raw text, we use Python to parse queries into Abstract Syntax Trees (ASTs), unlocking the ability to inspect, analyze, and modify SQL with precision at scale.

After introducing the intuition behind SQL ASTs, we walk through what they look like in practice using Python libraries such as sqloxide. With queries represented as nested dictionaries, we can traverse them, detect patterns, and apply targeted modifications without breaking syntactic structure. The session demonstrates several real examples that highlight the power of this approach: evaluating subquery depth for complexity diagnostics, adding defensive transformations such as wrapping denominators in NULLIF(), generating consistent aliases for aggregation expressions, and extracting table references to infer dependency graphs across staging or temporary‑table‑heavy pipelines.

Rather than offering a single tool or framework, this talk focuses on the underlying metaprogramming techniques that empower engineers to build their own SQL analysis and refactoring utilities. Attendees will leave with a clear mental model of how SQL parsing works, how ASTs can be manipulated in Python, and how these patterns can be applied to enforce standards, build linters, or automate large‑scale refactors.

Background required:
- Intermediate familiarity with Python (nested dictionaries, basic tree algorithms).
- Intermediate familiarity with SQL (CTEs, subqueries, aggregates)
- No prior knowledge of compiler theory or ASTs is assumed

Outline:
- 0–3 min — Motivation: Why SQL Refactoring Is Hard
-- Structural debt in real SQL codebases: duplication, inconsistencies, nested logic
-- Why regex and manual review fail at scale

3–8 min — Key Idea: Treat SQL as Data
-- What is an Abstract Syntax Tree (AST)?
-- Using Python libraries (e.g., sqloxide) to parse SQL into manipulable structures

8–15 min — Demo: Exploring Real SQL ASTs in Python
-- Show nested dictionaries representing SQL structure
-- Simple tree traversal patterns

15–25 min — Practical Refactoring Examples
-- Computing subquery depth (complexity linting)
-- Auto‑aliasing aggregate expressions
-- Wrapping denominators with NULLIF()
-- Extracting table references for dependency graphs

25–32 min — Building Custom SQL Tooling
-- How these patterns generalize
-- Enforcing standards, writing linters, automating bulk rewrites
-- When AST‑based tooling is worth it

32–40 min — Lessons Learned & Limits + Q&A
-- Homoiconicity (Python vs Lisp for AST manipulation)

Michel Semaan is the Analytics Lead for Transaction Banking at Allica Bank, previously a Senior Analytics Engineer at Amazon. Beyond his day job, Michel teaches as a DataCamp instructor with two published SQL courses and as a Python and data science mentor with Great Learning and Springboard.