PyConDE & PyData Berlin 2019

vtext: text processing in Rust with Python bindings

Roman Yurchak

Wednesday 15:30 in Saal 10

Type/Track: Talk / PyData

Scientific Python has historically relied on compiled extensions for the performance-critical parts of the code. In this talk, we outline how to write Rust extensions for Python using the rust-numpy project. We also discuss the advantages and limitations of this approach compared to Cython or to wrapping Fortran, C, or C++.

In the second part, we introduce the vtext project, which enables fast text processing in Python using Rust. In particular, we consider the problems of text tokenization and (parallel) token counting, which produce a sparse vector representation of documents. These vectors can then be used as input to machine learning or information retrieval applications. We outline the approach used in vtext and compare it to existing solutions to these problems in the Python ecosystem.
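To make the tokenize-then-count pipeline concrete, here is a minimal pure-Python sketch of the idea. It is not vtext's API or implementation. The function names `tokenize` and `count_tokens` and the regular expression are assumptions chosen for illustration, and the counting is sequential rather than parallel. The result is stored in a CSR-style layout (`indptr`, `indices`, `data`), the same three arrays that `scipy.sparse.csr_matrix` uses.

```python
import re
from collections import Counter

# Assumption: a token is a run of two or more word characters.
TOKEN_RE = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into tokens (hypothetical helper)."""
    return TOKEN_RE.findall(text.lower())


def count_tokens(documents):
    """Count tokens per document and return a CSR-style sparse representation.

    Row i of the sparse matrix is described by
    indices[indptr[i]:indptr[i + 1]] (column ids, one per distinct token)
    and data[indptr[i]:indptr[i + 1]] (the matching counts).
    """
    vocabulary = {}  # token -> column id, assigned on first sight
    indptr, indices, data = [0], [], []
    for doc in documents:
        counts = Counter(tokenize(doc))
        for token, n in sorted(counts.items()):
            col = vocabulary.setdefault(token, len(vocabulary))
            indices.append(col)
            data.append(n)
        indptr.append(len(indices))
    return vocabulary, indptr, indices, data


docs = ["The cat sat on the mat", "The dog sat"]
vocab, indptr, indices, data = count_tokens(docs)
print(vocab)
print(indptr, indices, data)
```

Only the non-zero counts are stored, which is why this representation stays compact even when the vocabulary grows to hundreds of thousands of tokens.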

Tags: Natural Language Processing

Level: Domain Expertise: some; Python Skill Level: basic

Roman Yurchak

Roman Yurchak has a background in computational physics and currently works as an independent consultant on data science projects. He is also a scikit-learn core developer.

Visit the speaker at: Twitter • GitHub