Rust vs Python on Data Science, but why not both?

There is no discussion that Python is one of the most popular programming languages for Data Scientists, and it makes sense. Python, and more specifically the Python Package Index (PyPI), has an impressive number of data science libraries and packages: NumPy, SciPy, Pandas, Matplotlib, and the list goes on and on. Put that together with a massive developer community and a language with a relatively low learning curve (sorry if you got offended by this last part, it is what it is, get over it), and Python becomes a great choice for Data Science.

After taking a better look into some of these libraries, I found out that a lot of them are actually implemented in C and C++, for obvious reasons: better overall performance. They then provide foreign function interfaces (FFIs) or Python bindings so you can call those functions from Python itself. It's no secret that pure Python is not the most performant of programming languages; I don't know the exact number, and hey, don't go quoting me on it, but I've heard that in some cases Python can be 100x slower than C or C++. Anyway, getting back to the point: these lower-level implementations offer better execution time and memory management. Putting those two together makes everything more scalable, and therefore cheaper too. So if you can write more performant code to accomplish data science tasks and then integrate it with your Python code, why not?

This is where Rust comes in! The Rust language is known for a lot of things, and based on what I described in the previous paragraph, it aligns quite well with languages like C and C++. In fact, performance-wise Rust is directly comparable with C and C++, and in a lot of ways it is even better, because it provides (almost) total memory safety, extensive thread safety, and no runtime overhead. That makes it a perfect candidate for Data Science problems: lots and lots of data processing.

In this post, my plan is to take a simple task and compare it across five different scenarios:

  • Pure Python Code

  • Python Code using data science libraries

  • Python Code invoking Pure Rust code compiled into a lib.

  • [Update] Python with NumPy and Numba

  • [Update] Python with NumPy and NumExpr

Here it is, our Data Science POC

Well, since data science is a very broad subject and I am definitely not an expert on it, I decided to go with a simple data science task: computing the information entropy of a byte sequence. This is the formula for calculating entropy in bits (source: Wikipedia: Entropy):

H(X) = -\sum_i P_X(x_i) \log_2 P_X(x_i)

In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent in the variable's possible outcomes.

Anyway, this is a somewhat simple task, but it is a widely used tool in the world of data science and machine learning: it serves as a basis for techniques such as feature selection, building decision trees, and, more generally, fitting classification models. Anyhow, this is what we are going to do.

Based on our formula ("our formula", haha), to compute the entropy of a random variable X we first count the occurrences of each possible byte value (x_i) and divide by the total number of occurrences, which gives the probability of that particular value occurring (P_X(x_i)). Then we take the negative of the weighted sum of each probability P_X(x_i) multiplied by its so-called self-information, log_2 P_X(x_i). We use log base 2 because we are working with bits.
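To make the formula concrete, here is a quick sanity check in plain Python (a toy example of my own, not part of the benchmark code):

import math

# Toy example: b'aabb' has two distinct byte values, each with probability 0.5.
data = b'aabb'
counts = {byte: data.count(byte) for byte in set(data)}
probabilities = [count / len(data) for count in counts.values()]

# H(X) = -(0.5 * log2(0.5) + 0.5 * log2(0.5)) = 1.0 bit
entropy = -sum(p * math.log2(p) for p in probabilities)
print(entropy)  # 1.0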

I know this is a simplistic assessment and my goal here is not to attack Python or any of the Python popular data science libraries. The goal is just to see how Rust would do against them, even on a simple scenario like this. And who knows what the future might bring?!

In these tests we will compile Rust into a C-compatible library that we can import from Python. All tests were run on macOS Catalina.

Pure Python

We can start by creating a new file called entropy.py where we will have our main code. In this first part we will import the standard library module math and use it to create a new function that calculates the entropy of a bytearray. This function is not optimized in any way and provides a baseline for our performance measurements.

import math

def compute_entropy_pure_python(data):
    """Compute entropy on bytearray `data`."""
    counts = [0] * 256
    entropy = 0.0
    length = len(data)

    for byte in data:
        counts[byte] += 1

    for count in counts:
        if count != 0:
            probability = float(count) / length
            entropy -= probability * math.log(probability, 2)

    return entropy
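As a quick sanity check of my own (not part of the benchmark code), the function returns the values the formula predicts for a few simple inputs:

# Quick sanity checks (values follow from the formula above):
print(compute_entropy_pure_python(bytearray(b'aabb')))       # 1.0
print(compute_entropy_pure_python(bytearray(range(256))))    # 8.0 (uniform)
print(compute_entropy_pure_python(bytearray(b'\x00' * 10)))  # 0.0 (no surprise)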

Python with Data Science Libraries (NumPy and SciPy)

Here we will just continue in the same file as before, entropy.py, and add a couple of imports and another function, this time making use of the libraries we imported. As you can imagine, SciPy already has a function that calculates entropy. We will just use NumPy's bincount() function to calculate the byte frequencies first. To be honest, comparing the performance of SciPy's functions against pure Python is not even fair, but who said that life is fair? So let's keep going.

import numpy as np
from scipy.stats import entropy as scipy_entropy

def compute_entropy_scipy_numpy(data):
    """Compute entropy on bytearray `data` with SciPy and NumPy."""
    counts = np.bincount(bytearray(data), minlength=256)
    return scipy_entropy(counts, base=2)
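One detail worth knowing: scipy.stats.entropy normalizes its input to a probability distribution, which is why we can hand it raw counts directly. A small illustration (my own, not part of the benchmark code):

# scipy.stats.entropy normalizes its input, so raw counts and
# explicit probabilities give the same result:
counts = np.array([2, 2])                            # byte counts for b'aabb'
print(scipy_entropy(counts, base=2))                 # 1.0
print(scipy_entropy(counts / counts.sum(), base=2))  # 1.0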

Python with NumPy and Numba

The first version of this experiment did not actually include this one, but thanks to the amazing open source community I got a PR on the GitHub repo that added it, and man, this was a nice surprise. This is where being a subject matter expert, in this case in Data Science, is very important. Originally I approached the experiment with my knowledge of programming languages and very little knowledge of Data Science. Numba translates Python functions to optimized machine code at runtime using the industry-standard LLVM compiler library. Numba-compiled numerical algorithms in Python can even approach the speeds of C or FORTRAN. Anyway, now it seems we are having a fair fight! Let's add this function to our entropy.py.

import numba

@numba.njit
def compute_entropy_numpy_numba(data):
    """Compute entropy on bytearray `data`. using Numba"""
    counts = np.zeros(256, dtype=np.uint64)
    entropy = 0.0
    length = len(data)

    for byte in data:
        counts[byte] += 1

    for count in counts:
        if count != 0:
            probability = float(count) / length
            entropy -= probability * np.log2(probability)

    return entropy
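One caveat worth flagging: the first call to an @numba.njit function triggers JIT compilation, so it pays to warm the function up before timing it (pytest-benchmark's repeated calls amortize this, but a standalone timing would not). A minimal warm-up sketch, my own addition:

# Warm up the JIT so later timings measure execution, not compilation:
warmup = np.random.randint(0, 256, size=(1000,), dtype=np.uint8)
compute_entropy_numpy_numba(warmup)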

Python with Rust

Now the fun part! Sorry! Just kidding, Python is also fun.

Now we will go step by step through the Rust implementation and everything necessary to make Rust work with Python.

The first step is to create a new Rust library project. I did it in the same directory as my entropy.py file, to make things easier.

$ cargo new rust_entropy --lib

This will create a new directory called rust_entropy, and the --lib flag tells Cargo to create a library project.

Now we need to make some necessary modifications to our Cargo.toml manifest file.

Cargo.toml

[package]
name = "rust_entropy"
version = "0.1.0"
authors = ["YOUR NAME "]
edition = "2018"

[lib]
name = "rust_entropy_lib"
crate-type = ["cdylib"]

[dependencies]
cpython = { version = "0.5.2", features = ["extension-module"] }

Here we are defining the library name and crate-type (cdylib, so the build produces a C-compatible dynamic library that Python can load), as well as the dependency that makes the Rust code work together with Python: cpython, available on crates.io, the Rust Package Registry, like NPM but better! I also used Rust v1.48.0, the latest release available at the time of writing this post.

The Rust implementation is fairly straightforward. Just like in the pure Python implementation, we initialize an array of counts for each possible byte value and iterate over the data to populate those counts. To finish it off, we calculate and return the negative sum of the probabilities multiplied by the log base 2 of the probabilities.

lib.rs

/// Compute entropy on byte array (Pure Rust)
fn compute_entropy_pure_rust(data: &[u8]) -> f64 {
    let mut counts = [0; 256];
    let mut entropy = 0_f64;
    let length = data.len() as f64;

    // collect byte counts
    for &byte in data.iter() {
        counts[usize::from(byte)] += 1;
    }

    // make entropy calculation
    for &count in counts.iter() {
        if count != 0 {
            let probability = f64::from(count) / length;
            entropy -= probability * probability.log2();
        }
    }

    entropy
}
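Before wiring this up to Python, a small sanity test (my own addition, not from the original project) can confirm the math; drop it at the bottom of lib.rs and run cargo test:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn entropy_sanity_checks() {
        // A uniform distribution over all 256 byte values -> 8 bits.
        let uniform: Vec<u8> = (0u8..=255).collect();
        assert!((compute_entropy_pure_rust(&uniform) - 8.0).abs() < 1e-10);

        // A constant sequence carries no information -> 0 bits.
        let zeroes = vec![0u8; 1024];
        assert!(compute_entropy_pure_rust(&zeroes).abs() < 1e-10);
    }
}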

The bulk of the work is done! Now all that is left for us to do is build the mechanism to call our pure Rust function from Python.

First we will import some packages into our lib.rs

use cpython::{py_fn, py_module_initializer, PyResult, Python};

The next thing to do is to include in our lib.rs a CPython-aware function that calls our pure Rust function. This design gives us some separation: we maintain a single pure Rust implementation and provide a CPython-friendly wrapper around it.

/// Rust-CPython aware wrapper around the pure Rust function
fn compute_entropy_cpython(_: Python, data: &[u8]) -> PyResult<f64> {
    let entropy = compute_entropy_pure_rust(data);
    Ok(entropy)
}

We also need to use the py_module_initializer! macro from rust-cpython (https://github.com/dgrunwald/rust-cpython) to actually initialize the Python module and expose the Rust function to an external Python application. Also in our lib.rs:

// initialize Python module and add the Rust-CPython aware function
py_module_initializer!(
    rust_entropy_lib,
    initrust_entropy_lib,
    PyInit_rust_entropy_lib,
    |py, m| {
        m.add(py, "__doc__", "Entropy module implemented in Rust")?;
        m.add(
            py,
            "compute_entropy_cpython",
            py_fn!(py, compute_entropy_cpython(data: &[u8])),
        )?;
        Ok(())
    }
);

Now let's compile this code and generate a library we can use in our Python code.

$ cargo build --release

If you are on macOS like me, you will need to create a file called config inside a directory called .cargo (which you may also need to create) in your Rust project, with the following content:

[target.x86_64-apple-darwin]
rustflags = [
    "-C", "link-arg=-undefined",
    "-C", "link-arg=dynamic_lookup",
]

The build will generate a file called librust_entropy_lib.dylib inside the ./target/release directory. To make things easier, copy this file to where your entropy.py file is and rename it to rust_entropy_lib.so.
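For example (assuming you run this from the directory that holds entropy.py, with the Rust project in the rust_entropy subdirectory, as set up earlier):

$ cp rust_entropy/target/release/librust_entropy_lib.dylib ./rust_entropy_lib.so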

Calling our Rust Code from Python

Now it's time to finally call our Rust implementation from Python, in our case from the entropy.py file again. The first thing to do is to add an import of our newly created library at the top of entropy.py.

import rust_entropy_lib

Then all we have to do is call the exported library function we specified earlier when we initialized the Python module with the py_module_initializer! macro in our Rust code. Again, in our entropy.py file.

def compute_entropy_rust_from_python(data):
    """Compute entropy on bytearray `data` with Rust."""
    return rust_entropy_lib.compute_entropy_cpython(data)

At this point, we have a single Python module that includes functions to call all of our entropy calculation implementations.
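Before benchmarking, a quick smoke test (a hypothetical check of my own, not in the repo) confirms the extension module loads and agrees with the pure Python baseline:

# The Rust and pure Python implementations should agree
# (up to floating point) on the same input:
sample = bytearray(b'hello entropy')
print(compute_entropy_pure_python(sample))
print(rust_entropy_lib.compute_entropy_cpython(sample))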

Now it's game on! (Performance tests)

We measured the execution time of each function implementation with pytest benchmarks computing entropy over 1 million random bytes. All implementations were presented with the same data. The benchmark tests (also included in entropy.py) are shown below.

# ### BENCHMARKS ###
# generate some random bytes to test w/ NumPy
NUM = 1000000
VAL = np.random.randint(0, 256, size=(NUM, ), dtype=np.uint8)

def test_pure_python(benchmark):
    """Test pure Python."""
    benchmark(compute_entropy_pure_python, VAL)

def test_pure_numpy_numba(benchmark):
    """Test implementation using Numba."""
    benchmark(compute_entropy_numpy_numba, VAL)

def test_python_scipy_numpy(benchmark):
    """Test pure Python with SciPy."""
    benchmark(compute_entropy_scipy_numpy, VAL)

def test_rust(benchmark):
    """Test Rust implementation called from Python."""
    benchmark(compute_entropy_rust_from_python, VAL)
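The benchmark fixture used above comes from the pytest-benchmark plugin, so it needs to be installed alongside pytest before running the suite:

$ pip install pytest pytest-benchmark
$ pytest entropy.py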

And for a different scenario, I made a separate script for each entropy calculation method and added them to the root of the project, at the same directory level as our entropy.py.

entropy_pure_python.py

import entropy

# test.img is a 10 MB binary file generated with
# dd if=/dev/zero of=test.img bs=1024 count=0 seek=$[1024*10]
# (a zero-filled test file; see the note after the scripts)
with open('test.img', 'rb') as f:
    DATA = f.read()

# Repeat the calculation 100 times, for our pure Python method
for _ in range(100):
    entropy.compute_entropy_pure_python(DATA)

entropy_python_data_science.py

import entropy

# test.img is a 10 MB binary file generated with
# dd if=/dev/zero of=test.img bs=1024 count=0 seek=$[1024*10]
# (a zero-filled test file; see the note after the scripts)
with open('test.img', 'rb') as f:
    DATA = f.read()

# Repeat the calculation 100 times, for our Python using NumPy and SciPy
for _ in range(100):
    entropy.compute_entropy_scipy_numpy(DATA)

entropy_rust.py

import entropy

# test.img is a 10 MB binary file generated with
# dd if=/dev/zero of=test.img bs=1024 count=0 seek=$[1024*10]
# (a zero-filled test file; see the note after the scripts)
with open('test.img', 'rb') as f:
    DATA = f.read()

# Repeat the calculation 100 times, for our Python calling Rust
for _ in range(100):
    entropy.compute_entropy_rust_from_python(DATA)

The test.img file is a 10 MB binary file generated with the following command (note that reading from /dev/zero with count=0 and a seek offset produces a sparse, zero-filled file, not random data):

dd if=/dev/zero of=test.img bs=1024 count=0 seek=$[1024*10]
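If you want a test file with genuinely random content, you could instead read from /dev/urandom (the results in this post were produced with the zero-filled file from the command above):

$ dd if=/dev/urandom of=test.img bs=1024 count=10240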

And the script repeats the calculations 100 times in order to simplify capturing memory usage data.

Script Results:

# entropy_pure_python.py

$ gtime python entropy_pure_python.py
74.70user 0.64system 1:13.92elapsed 101%CPU (0avgtext+0avgdata 60180maxresident)k
0inputs+0outputs (436major+14770minor)pagefaults 0swaps

# entropy_python_data_science.py

$ gtime python entropy_python_data_science.py
5.61user 1.15system 0:05.37elapsed 126%CPU (0avgtext+0avgdata 151896maxresident)k
0inputs+0outputs (2074major+36061minor)pagefaults 0swaps

# entropy_rust.py

$ gtime python entropy_rust.py
3.01user 0.53system 0:02.06elapsed 171%CPU (0avgtext+0avgdata 60104maxresident)k
0inputs+0outputs (2074major+13115minor)pagefaults 0swaps

I used the GNU time utility (gtime) to measure the performance of the scripts above.

Benchmark Test Results:

$ pytest entropy.py

As you can see, both the SciPy/NumPy and Rust implementations showed really strong performance, easily outperforming the pure Python version. I was actually pleasantly surprised by how close the SciPy/NumPy performance was to the Rust implementation. But the results confirmed what I expected from the start: pure Python is dramatically slower, not even in the same ballpark as Rust, and implementations done in Rust can compete head to head with the already optimized Python libraries, which are likely written in C (even in this super simple benchmark).

On a side note, the Numba run, which was added later, brought a really good insight. Numba fares very well, and on my machine it just edges out Rust, and the effort to achieve that was low. All you have to do is import the library, apply the decorator to your functions, and let Numba do the rest.

Conclusion

This being the first time I tried Rust on a data science task, I was truly impressed with the performance of invoking Rust from Python. I have done this before with other languages like C and Go, but never with Rust, so it was a good exercise. And based on our brief and simple test, our Rust implementation's performance can go head to head with the underlying C code running in the SciPy and NumPy packages. Rust is ready for battle for efficient processing at scale. Bring it on!

Rust was not only performant in execution time, as you can see in the script tests above, but its memory overhead was also minimal. As I mentioned at the beginning of the post, good execution time combined with low memory utilization makes it ideal for scalability. That said, the performance of SciPy and NumPy was not bad; they were actually right behind. But I ran the tests with a 1 MB file and a 10 MB file, and what I noticed is that the difference grows as the data size grows. On top of that, Rust provides additional benefits that C and C++ do not, like memory and thread safety, making Rust really attractive.

If we are talking only about performance, C can offer similar execution times, but it does not have the memory safety and definitely not the thread safety. Choices, right?! That is what life is all about.

Ok, you can achieve those with external libraries for C, but the onus of getting it right is entirely on you, the developer. Rust checks for memory safety, thread safety, and race conditions at compile time, and the standard library offers a range of tools for concurrency, including channels, locks, and reference counting.

I am not here to try to convince you to rewrite all these libraries and port everything to Rust; I can barely convince the guys at the company I work for to use Rust. Plus, some of these libraries, like NumPy and SciPy, are already heavily optimized packages with a lot of support from the community. All I am saying is that there is a lot out there in the world, and it would not hurt to start using Rust, maybe by rewriting some pure Python code that isn't already optimized into a high-performance library.

My take is that Rust would be a great alternative for Data Science given its speed and safety guarantees. What do you think?

If you liked the post, how about a like?

For more content like this, subscribing would not be a bad idea.

Github repo: https://github.com/joaoh82/python_rust_data_science_bench

Additional Resources: