PyCon Hong Kong 2024

Yang Cen

I'm a senior software engineer at LanceDB, working on the efficient open source columnar Lance format, vector search algorithms, and a database for AI. I have a strong passion for the open source community and contribute to multiple open source projects, including Lance, Golang, Milvus, and Arrow.

I focus on vector search, aiming to make model inference and similarity search more efficient and accurate.


Country / City

China

Company / Organisation

LanceDB


Session

11-16
15:15
15min
Power PyTorch Training with Centralized AI Data Lake and Advanced Data Selection Techniques
Yang Cen

AI data is often stored in separate silos: records in databases, Parquet/ORC files in cloud storage, and embeddings in vector databases. This fragmentation complicates data management.

To address this issue, the Lance columnar format is designed specifically for multimodal AI. It offers a unique combination of capabilities, including fast scans and point queries, inline storage of large blobs, and zero-cost schema evolution. Together these enable a centralized, massive-scale, all-in-one data lake that stores all kinds of AI data (structured, unstructured, and embeddings) in one cohesive dataset.

The Lance-PyTorch Dataset uses Lance’s embedded query engine. Written in Rust, the engine can quickly identify the most relevant and useful data for training without ever materializing such datasets in external systems. PyTorch training can leverage this unified data lake to seamlessly access and train on all data types, facilitating the creation of high-quality models.

This approach allows organizations to train or fine-tune foundation models that encompass comprehensive organizational knowledge, and significantly accelerates the training process while maintaining model quality.

Libraries / Tools
LT7