2025-04-24, Titanium3
The cloud native revolution has impacted all aspects of engineering, and data engineering is not exempt. One of the ongoing challenges in the data engineering world is working with both local and distributed cloud native storage. In this talk we’ll explore working with distributed file systems in Python through an intro to fsspec: a popular Python library that is well-positioned to address the growing challenge of interacting with storage systems of different kinds in a consistent way.
In this talk we’ll show hands-on examples of working with fsspec with some of the most popular data tools in the Python community: Pandas, TensorFlow and PyArrow. We’ll demonstrate a real-world implementation of fsspec and how it provides easy extensibility through open source tooling.
You’ll come away from this session with a better understanding of how to implement and extend fsspec to work with different cloud native storage systems.
1. Setting the Stage: Local vs. Distributed Storage (5 minutes)
- What’s the Big Deal with Storage?
- First, let’s talk about the shift from local storage (where we keep files on our own machines) to cloud-native storage (where data is spread across servers in the cloud).
- This shift is awesome but comes with new challenges: distributed systems can be tricky to work with, especially when you need to access them in a consistent way.
2. Enter fsspec: A Game Changer for File Systems (10 minutes)
- What is fsspec?
- fsspec is a Python library that makes working with any kind of file system, whether it's local, in the cloud, or on a distributed system, much easier.
- It does this by giving us a unified way to interact with storage, no matter where the files actually live.
- Why is fsspec Awesome?
- It simplifies file operations (like opening and reading files) across different storage systems, saving us time and mental energy.
- Plus, it’s open-source, which means you can extend it and make it work for your own unique storage setup.
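As a rough illustration of the "unified way to interact with storage" idea, the sketch below runs the same helper against a local temporary file and against fsspec's built-in in-memory filesystem. The helper name `write_then_read` and the paths are made up for this example; only `fsspec.open` and the `memory://` protocol come from the library.

```python
import os
import tempfile

import fsspec


def write_then_read(url: str, text: str) -> str:
    """Write text to a URL, then read it back, using only fsspec calls."""
    with fsspec.open(url, "w") as f:
        f.write(text)
    with fsspec.open(url, "r") as f:
        return f.read()


# A plain path with no protocol is handled by the local filesystem.
local_path = os.path.join(tempfile.mkdtemp(), "hello.txt")
local_result = write_then_read(local_path, "hello from disk")

# Same helper, different backend: only the URL changes.
memory_result = write_then_read("memory://demo/hello.txt", "hello from memory")

print(local_result)
print(memory_result)
```

With the matching backend package installed (for example s3fs for `s3://` URLs), the same helper should work against cloud storage, since the protocol prefix selects the implementation.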
3. fsspec in Action: How It Works with Popular Python Tools (15 minutes)
A. Using fsspec with Pandas
- Pandas & fsspec:
- If you work with Pandas, you’re probably familiar with loading and saving data. fsspec helps make this process smoother by letting you pull data from cloud storage (like AWS S3) with no fuss.
- We’ll see how this works in practice, making it easy to work with large datasets in the cloud.
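A minimal sketch of the Pandas side, assuming fsspec is installed alongside pandas. It uses the in-memory `memory://` protocol so it runs without cloud credentials; the column names and values are invented for illustration.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({"city": ["Pittsburgh", "Tel Aviv"], "visits": [3, 5]})

# pandas hands URLs with a protocol prefix to fsspec behind the scenes.
df.to_csv("memory://demo/visits.csv", index=False)
loaded = pd.read_csv("memory://demo/visits.csv")

print(loaded)
```

For S3, the URL would instead look like `s3://<bucket>/<key>` with s3fs installed, and credentials or endpoint settings can be passed through the `storage_options` argument of `read_csv` and `to_csv`.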
B. Using fsspec with TensorFlow
- TensorFlow & fsspec:
- If you’re building machine learning models, TensorFlow needs to access training data and models, sometimes stored in the cloud.
- With fsspec, TensorFlow can seamlessly interact with cloud storage, making your ML pipelines more streamlined and less frustrating.
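One possible way to connect the two, sketched under assumptions: read training rows through fsspec in a plain Python generator and hand that generator to TensorFlow's input pipeline. The block below shows only the fsspec side and deliberately does not import TensorFlow; the function name `iter_csv_rows`, the file layout, and the sample values are hypothetical.

```python
import fsspec


def iter_csv_rows(url):
    """Yield (feature, label) pairs from a two-column CSV at any fsspec URL."""
    with fsspec.open(url, "r") as f:
        next(f)  # skip the header line
        for line in f:
            x, y = line.strip().split(",")
            yield float(x), int(y)


# Create a tiny illustrative training file in the in-memory filesystem.
with fsspec.open("memory://demo/train.csv", "w") as f:
    f.write("x,y\n0.5,1\n1.5,0\n")

rows = list(iter_csv_rows("memory://demo/train.csv"))
print(rows)
```

A generator like this could then be wrapped with `tf.data.Dataset.from_generator` and a matching `output_signature`, so that swapping the URL for a cloud location leaves the training code unchanged.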
C. Using fsspec with PyArrow
- PyArrow & fsspec:
- PyArrow is great for high-performance data processing. When working with big data files like Parquet, fsspec makes it easy to load and save them from cloud storage without missing a beat.
4. Extending fsspec: Building Your Own Solutions (5 minutes)
- What if I Need Something Custom?
- Sometimes, you need to work with storage systems that aren’t “out of the box.” The cool part about fsspec is that it’s highly extensible.
- I’ll walk through how you can easily extend fsspec to work with your own custom storage systems, using a real-world example of how we did this.
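To give a feel for what extending fsspec can look like, here is a minimal sketch of a custom backend. It is not the real-world implementation covered in the talk: `DictFileSystem`, the `dictfs` protocol name, and the sample data are all hypothetical, and the backend is read-only with a flat namespace. It subclasses `AbstractFileSystem`, implements `ls` and `_open`, and registers the protocol.

```python
import io

import fsspec
from fsspec.spec import AbstractFileSystem


class DictFileSystem(AbstractFileSystem):
    """Hypothetical read-only filesystem backed by an in-process dict."""

    protocol = "dictfs"
    cachable = False  # always build a fresh instance for this sketch

    def __init__(self, files=None, **kwargs):
        super().__init__(**kwargs)
        # Maps path (no leading slash) -> bytes content.
        self.files = dict(files or {})

    def ls(self, path, detail=True, **kwargs):
        # Simplification: flat listing of every key under the prefix,
        # without synthesizing directory entries.
        prefix = self._strip_protocol(path).strip("/")
        entries = [
            {"name": name, "size": len(data), "type": "file"}
            for name, data in self.files.items()
            if not prefix or name == prefix or name.startswith(prefix + "/")
        ]
        return entries if detail else [e["name"] for e in entries]

    def _open(self, path, mode="rb", **kwargs):
        if mode != "rb":
            raise NotImplementedError("this sketch is read-only")
        if path not in self.files:
            raise FileNotFoundError(path)
        return io.BytesIO(self.files[path])


# Make the protocol name known to fsspec.
fsspec.register_implementation("dictfs", DictFileSystem, clobber=True)

fs = fsspec.filesystem(
    "dictfs", files={"data/a.txt": b"alpha", "data/b.txt": b"beta"}
)
listing = sorted(fs.ls("data", detail=False))
with fs.open("dictfs://data/a.txt", "rb") as f:
    content = f.read()

print(listing, content)
```

A production backend would typically also implement writes, proper directory handling, and metadata lookups; the base class supplies many higher-level methods on top of these primitives.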
5. Wrap-Up & Key Takeaways (5 minutes)
- The Big Picture:
- fsspec is a simple yet powerful tool for making cloud-native storage work seamlessly with Python data tools like Pandas, TensorFlow, and PyArrow.
- It’s the tool you didn’t know you needed to simplify your cloud storage tasks.
- Final Thought:
- With fsspec, working with distributed storage doesn’t have to be hard. It makes everything feel like you’re working with local files, even when they’re scattered across the cloud.
6. Q&A Session (5 minutes)
Intermediate
Expected audience expertise: Python: Intermediate
Dr. Einat Orr has 20+ years of experience building R&D organizations and leading the technology vision at multiple companies, the latest being Similarweb, which went public on the NYSE last May. Currently she serves as Co-founder and CEO of Treeverse, the company behind lakeFS, an open source platform that delivers a Git-like experience to object-storage-based data lakes. She received her PhD in Mathematics from Tel Aviv University, in the field of optimization in graph theory.