2026-06-07, Hardwick Hub
Text-to-SQL makes great demos, but in real systems generating queries is rarely the hard part; understanding the data is. Modern data is increasingly S3-first and multimodal, and its meaning is defined by Python workflows rather than table schemas.
To work reliably, both agents and people need data context across multiple layers: storage context (what exists and where), metadata context (what’s inside files), dataset context (how files are grouped and versioned), and code context (the transformations that define semantics).
In this talk, I’ll share a practical framework for building these context layers in Python-first systems, and show how DataChain makes multimodal workflows agent-ready in domains like Physical AI and biotech.
Text-to-SQL is often presented as the future interface for AI-driven analytics: connect an LLM to your warehouse, ask questions, get answers. The demo works. But production systems reveal a deeper issue: SQL can query structure, but it cannot provide the context required to understand what data actually means.
After years of building data infrastructure, I’ve learned that context is the real bottleneck, for both people and agents. The problem becomes unavoidable in S3-first, multimodal environments that hold video, audio, medical scans, sensor streams, and model outputs. In these projects, the source of truth is object storage, and meaning is defined by Python pipelines.
To reason correctly, you need data context across multiple layers:
- Storage context - what exists, where it lives, and how it changes
- Metadata context - what’s inside files, extracted signals, and hierarchical structure
- Dataset context - how files are grouped, reused across datasets, and versioned
- Code context - the Python transformations that define semantics and intent
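As a rough illustration of the four layers above, the sketch below models them as plain Python records attached to a single file. Every class name, field name, and example value (the bucket, the dataset names, the transform path) is hypothetical and is not DataChain's API; standard-library dataclasses stand in here for a typed-schema library such as Pydantic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageContext:
    """What exists and where it lives (hypothetical fields)."""
    uri: str
    size_bytes: int
    etag: str  # changes whenever the object's content changes


@dataclass(frozen=True)
class MetadataContext:
    """What is inside the file, plus signals extracted from it."""
    media_type: str
    signals: dict[str, float]


@dataclass(frozen=True)
class DatasetContext:
    """A named, versioned dataset that includes the file."""
    name: str
    version: str


@dataclass(frozen=True)
class CodeContext:
    """The Python transformation that gives the file its meaning."""
    transform: str
    description: str


@dataclass(frozen=True)
class DataContext:
    """All four layers for one object; a file may belong to many datasets."""
    storage: StorageContext
    metadata: MetadataContext
    datasets: tuple[DatasetContext, ...]
    code: CodeContext


def describe(ctx: DataContext) -> str:
    """Render the layers as text that a person or an agent could read."""
    datasets = ", ".join(f"{d.name}@{d.version}" for d in ctx.datasets)
    signals = ", ".join(
        f"{key}={value}" for key, value in sorted(ctx.metadata.signals.items())
    )
    return "\n".join([
        f"storage:  {ctx.storage.uri} ({ctx.storage.size_bytes} bytes)",
        f"metadata: {ctx.metadata.media_type}; {signals}",
        f"datasets: {datasets}",
        f"code:     {ctx.code.transform}: {ctx.code.description}",
    ])


# Illustrative values only; none of these names refer to real resources.
example = DataContext(
    storage=StorageContext(
        uri="s3://example-bucket/clips/0001.mp4",
        size_bytes=1_048_576,
        etag="abc123",
    ),
    metadata=MetadataContext(media_type="video/mp4", signals={"duration_s": 12.5}),
    datasets=(
        DatasetContext(name="driving-clips", version="1.0.0"),
        DatasetContext(name="night-scenes", version="0.3.0"),
    ),
    code=CodeContext(
        transform="pipelines.detect_pedestrians",
        description="adds per-frame pedestrian boxes",
    ),
)

summary = describe(example)
print(summary)
```

A SQL query over a files table could return the URI and size, but the dataset membership and the transformation that defines the labels are only visible when all four records travel together.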
In this talk, I’ll present a practical framework for collecting and using these layers systematically. Using DataChain as a concrete example, I’ll show how typed schemas (e.g., Pydantic), vectorized metadata operations, and scalable Python execution make multimodal workflows understandable, reusable, and agent-ready, especially in Physical AI and biotech.
Attendees will leave with a clear mental model for building data platforms where meaning lives in code, and agents can operate with real context rather than isolated queries.
Dmitry Petrov is the creator of the open-source tool DVC (Data Version Control). He holds a PhD in Computer Science, previously worked as a Data Scientist at Microsoft, and is now the founder of DataChain.ai, a Python-first data platform for Physical AI.