Language: English
11-16, 15:15–15:30 (Asia/Hong_Kong), LT7
AI data is often stored in separate silos: databases, Parquet/ORC files in cloud storage, and embeddings in vector databases. This fragmentation complicates data management.
To address this issue, the Lance columnar format is designed specifically for multimodal AI. It offers a unique combination of capabilities, including fast scans and point queries, inline storage of large blobs, and zero-cost schema evolution. Together these enable a centralized, massive-scale, all-in-one data lake that stores all kinds of AI data (structured, unstructured, and embeddings) in one cohesive dataset.
The Lance-PyTorch Dataset uses Lance’s embedded query engine. Written in Rust, the engine can quickly identify the most relevant and useful data for training without first materializing that data as a separate dataset in an external system. PyTorch training can then read all data types directly from this unified data lake, which facilitates the creation of high-quality models.
This approach allows organizations to train or fine-tune foundation models that encompass comprehensive organizational knowledge, and significantly accelerates the training process while maintaining model quality.
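As a rough illustration of selecting training data without materializing an intermediate dataset, the sketch below lazily filters and projects a toy in-memory columnar table and yields training batches. It is plain Python and does not use the Lance or Lance-PyTorch API; `scan_batches`, the column names, and the quality threshold are all hypothetical.

```python
from typing import Callable, Dict, Iterator, List

# Toy columnar table: each column is a list, and row i is the i-th
# entry of every column. This is a stand-in, not the Lance format.
Columns = Dict[str, List]


def scan_batches(
    columns: Columns,
    project: List[str],
    filter_column: str,
    keep: Callable[[object], bool],
    batch_size: int,
) -> Iterator[Columns]:
    """Lazily yield batches of the projected columns for matching rows."""
    batch: Columns = {name: [] for name in project}
    for i, value in enumerate(columns[filter_column]):
        if not keep(value):
            continue  # skipped rows never touch the projected columns
        for name in project:
            batch[name].append(columns[name][i])
        if len(batch[project[0]]) == batch_size:
            yield batch
            batch = {name: [] for name in project}
    if project and batch[project[0]]:
        yield batch  # final partial batch


# Hypothetical multimodal table: text, a quality score, and embeddings.
data = {
    "caption": ["cat", "dog", "blurry", "bird", "noise"],
    "quality": [0.9, 0.8, 0.2, 0.95, 0.1],
    "embedding": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]],
}

# Keep only high-quality rows and read only the columns training needs.
batches = list(
    scan_batches(data, ["caption", "embedding"], "quality", lambda q: q >= 0.5, 2)
)
for b in batches:
    print(b["caption"])
```

In a real pipeline, a generator like this could be wrapped in a PyTorch `IterableDataset`, so the filter and projection run while batches are streamed rather than in a separate preprocessing job.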
I'm a senior software engineer at LanceDB, working on the efficient open source Lance columnar format, vector search algorithms, and a database for AI. I have a strong passion for the open source community and contribute to multiple open source projects, including Lance, Golang, Milvus, and Arrow.
I focus on vector search, aiming to make model inference and similarity search more efficient and accurate.