Efficient data storage is an integral part of successful data applications. Cloud object stores are an efficient choice, but they come with downsides when storing structured, tabular data. There is a way out, though.
Storing and processing data efficiently is an integral part of successful data-driven applications. An efficient and scalable way to store big data is in the object stores of public cloud providers like ABS, S3 or GCS. These stores, however, come with downsides that make managing tabular data distributed over many objects a non-trivial task.
Kartothek is a recently open-sourced Python library we develop and use at Blue Yonder – JDA Software to manage tabular data in cloud object stores. It is built on Apache Arrow and Apache Parquet and is powered by Dask. Its specification is compatible with the de-facto standard storage layouts used by other big data processing tools like Apache Spark and Hive, but it offers a native, seamless integration into the Python ecosystem.
What Kartothek offers includes (see the sketch after this list):
* Consistent dataset state at all times.
* Atomic addition and removal of files.
* Reads without any locking mechanism.
* Strongly typed and enforced table schemas using Apache Arrow.
* O(1) remote storage calls to plan job dispatching.
* Inverted indices for fast and efficient querying.
* Integrated predicate pushdown on row group level using Apache Parquet.
* Seamless integration with pandas, Dask and the Python ecosystem.
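As a rough illustration of these points, here is a minimal sketch using Kartothek's eager backend with a local filesystem store via storefact. The store URL, dataset name and dataframe contents are made up for the example, and exact signatures may vary between Kartothek versions.

```python
from functools import partial

import pandas as pd
from storefact import get_store_from_url
from kartothek.io.eager import store_dataframes_as_dataset, read_table

# A store factory; in production this would point at ABS, S3 or GCS
# instead of the local filesystem (path here is made up for the demo
# and assumed to exist).
store_factory = partial(get_store_from_url, "hfs:///tmp/kartothek_demo")

df = pd.DataFrame({"city": ["Berlin", "Hamburg"], "temperature": [19.4, 17.2]})

# Write the dataframe as a new dataset in a single atomic operation;
# the Arrow schema is stored alongside and enforced on later writes.
store_dataframes_as_dataset(
    store=store_factory,
    dataset_uuid="weather",
    dfs=[df],
)

# Read back with a predicate; Parquet row groups that cannot match
# are skipped entirely (predicate pushdown).
berlin = read_table(
    dataset_uuid="weather",
    store=store_factory,
    predicates=[[("city", "==", "Berlin")]],
)
print(berlin)
```

The same dataset can then be processed in parallel through the Dask-powered backends in kartothek.io.dask, without changing the storage layout.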
At the end of this talk, I want you to leave knowing what the struggles of modern big data storage are and how to deal with them.
Domain Expertise: some
Python Skill Level: none
Abstract as a tweet: Kartothek - Table management for cloud object stores powered by @ApacheArrow and @dask_dev
Domains: Big Data, Data Engineering
Public link to supporting material: