vtext: fast text processing in Python using Rust
09-05, 15:45–16:00 (UTC), Track 3 (Oteiza)

In this talk, we present some of the benefits of writing extensions for Python in Rust. We then illustrate this approach with the vtext project, which aims to be a high-performance library for text processing.


Scientific Python has historically relied on compiled extensions for performance-critical parts of the code. In this talk, we outline how to write Rust extensions for Python using the rust-numpy project. The advantages and limitations of this approach compared with Cython or wrapping Fortran, C, or C++ are also discussed.

In the second part, we introduce the vtext project, which allows fast text processing in Python using Rust. In particular, we consider the problems of text tokenization and (parallel) token counting, which results in a sparse vector representation of documents. These representations can then be used as input to machine learning or information retrieval applications. We outline the approach used in vtext and compare it with existing solutions to these problems in the Python ecosystem.
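
To make the tokenization and token counting steps concrete, here is a minimal pure-Python sketch of the general technique. It is not vtext's API or implementation: the regular expression, the function names `tokenize` and `count_vectorize`, and the sample documents are all assumptions chosen for illustration, and the counting is sequential rather than parallel.

```python
import re
from collections import Counter

# Assumed token pattern for illustration: runs of two or more word characters.
TOKEN_RE = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Split lowercased text into word tokens (hypothetical helper)."""
    return TOKEN_RE.findall(text.lower())


def count_vectorize(documents):
    """Count tokens per document and return CSR-style sparse arrays.

    Returns the vocabulary (token -> column index) plus the three
    arrays of a compressed sparse row matrix: data, indices, indptr.
    """
    vocabulary = {}
    data, indices, indptr = [], [], [0]
    for doc in documents:
        counts = Counter(tokenize(doc))
        for token, n in counts.items():
            # Assign the next free column index to unseen tokens.
            col = vocabulary.setdefault(token, len(vocabulary))
            indices.append(col)
            data.append(n)
        # Each row ends where the next one begins.
        indptr.append(len(indices))
    return vocabulary, data, indices, indptr


docs = ["the cat sat on the mat", "the dog sat"]
vocabulary, data, indices, indptr = count_vectorize(docs)
```

The `data`, `indices`, and `indptr` lists follow the compressed sparse row layout, so they could be passed to `scipy.sparse.csr_matrix((data, indices, indptr))` to obtain a document-term matrix suitable for machine learning or information retrieval code.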


Project Homepage / Git

https://github.com/rth/vtext

Abstract as a tweet

vtext: fast text processing and vectorization in Python using Rust

Python Skill Level

professional

Domain Expertise

none

Domains

General-purpose Python, Open Source

Roman Yurchak has a background in computational physics and is currently working
as an independent consultant on data science projects. He also contributes to
several open-source projects, mostly in Python.