TODAY'S ISSUE
TODAY’S DAILY DOSE OF DATA SCIENCE
Accelerate Pandas 20x using FireDucks
Two of the biggest problems with Pandas are that:
- It is limited to single-core computation on the CPU.
- It creates bulky DataFrames.
Moreover, since Pandas follows an eager execution mode (every operation triggers immediate computation), it cannot prepare a smart execution plan that optimizes the entire sequence of operations.
FireDucks is a heavily optimized, drop-in alternative to Pandas that exposes exactly the same API and addresses these limitations.
Let’s learn more about it today!
How to use it?
First, install the library:
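For reference, FireDucks ships on PyPI, so installation is a one-liner (per the official docs):

```
pip install fireducks
```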
Next, there are three ways to use it:
- If you are using IPython or a Jupyter Notebook, load the FireDucks extension (shown in the snippet after this list).
- Additionally, FireDucks provides a pandas-like module (fireducks.pandas), which can be imported instead of Pandas. Thus, to use FireDucks in an existing Pandas pipeline, replace the standard import statement with the one from FireDucks (see the snippet after this list).
- Lastly, if you have a Python script, executing it as shown below will automatically replace the Pandas import statement with FireDucks:
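Roughly, the three options look like this (based on my reading of the official docs; your_script.py is just a placeholder name):

```python
# 1) IPython / Jupyter: load the extension in a cell, then keep "import pandas as pd" as-is.
%load_ext fireducks.pandas
import pandas as pd

# 2) Existing pipeline: swap the import for FireDucks' pandas-like module.
import fireducks.pandas as pd

# 3) Standalone script: run it through FireDucks from the shell (no code changes needed):
#    $ python3 -m fireducks.pandas your_script.py
```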
Done!
It’s that simple to use FireDucks.
The speedup is evident from the GIF below, taken from my personal experimentation:
Speedups typically vary from system to system since FireDucks leverages multiple cores. The same code, run on a system with more CPU cores, will most likely see an even greater speedup. In my experimentation, I used the standard Google Colab runtime and FireDucks 1.0.3.
As per FireDucks’ official benchmarks, it can be ~20x faster than Pandas and ~2x faster than Polars, as shown below on several queries:
In the above benchmarks, FireDucks outperforms Polars on 14 of the 22 queries.
Another point in FireDucks’ favor is that, unlike Polars, it requires no code changes.
How does it work?
Whenever %load_ext fireducks.pandas is executed, the “import pandas as pd” statement does not import the original Pandas library, which we use all the time.
Instead, it imports another library that contains accelerated and optimized implementations of all Pandas methods.
This is evident from the image below:
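A quick way to see this for yourself is to print the module after loading the extension (a sketch; the exact module path in the output may differ):

```python
%load_ext fireducks.pandas
import pandas as pd

print(pd)  # expected to point at fireducks.pandas rather than the stock pandas package
```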
This alternative implementation preserves the entire syntax of Pandas. So, if you know Pandas, you already know how to use FireDucks.
Moreover, unlike Pandas, FireDucks is driven by lazy execution.
This means that transformations do not produce immediate results.
Instead, the computations are deferred until an action is triggered, such as:
- Viewing/printing the data.
- Writing the data to a storage source.
- Converting the data to Python lists, etc.
By lazily evaluating DataFrame transformations and executing them ONLY WHEN THEY ARE NEEDED, FireDucks can build a logical execution plan and apply possible optimizations.
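As a rough sketch of what this looks like in practice (file and column names here are hypothetical):

```python
import fireducks.pandas as pd

df = pd.read_csv("sales.csv")                 # lazy: nothing is read yet
agg = df.groupby("region")["revenue"].sum()   # lazy: just extends the plan
print(agg)                                    # action: the whole plan executes here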
For instance:
- In the following code, df2 is never used. Thus, FireDucks will never read from the CSV. Pandas, however, will load it anyway.
- In the following code, assume df contains 10 columns, but we are only making use of two of them. Pandas will perform the sort operation on the entire DataFrame, requiring more memory. FireDucks, however, being driven by lazy evaluation, will build the following optimal plan instead to generate the result. (A rough code sketch of both examples appears after this list.)
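Since the original code appears as screenshots, here is a rough, hypothetical sketch of both situations (file and column names are made up):

```python
import fireducks.pandas as pd

# Example 1: df2 is never used, so FireDucks can skip reading "other.csv" entirely.
df1 = pd.read_csv("data.csv")
df2 = pd.read_csv("other.csv")      # deferred, and ultimately never executed
print(df1.head())                   # only "data.csv" is actually read

# Example 2: df has 10 columns, but only two are needed.
# FireDucks can push the column selection before the sort, so it reads and
# sorts only "a" and "b"; Pandas would sort the full 10-column DataFrame.
df = pd.read_csv("wide.csv")
result = df.sort_values("a")[["a", "b"]]
print(result.head())
```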
Of course, several other optimizations are involved, which I haven’t covered here, but I hope you get the point.
This way, FireDucks turns out to be much more efficient than Pandas.
That said, FireDucks does support an eager execution mode like Pandas if you prefer to use that. Here’s how to enable it:
One slight limitation is that FireDucks is currently available only for Linux on the x86_64 architecture. As per the official docs, Windows and macOS versions are under development, and their respective beta versions will be released soon.
There is, however, a way to use it on Windows, which you can find here.
You can find the code here: Google Colab.
FireDucks documentation is available here: FireDucks docs.
🙌 A big thanks to FireDucks, who very kindly partnered with us on this post and let us share our thoughts openly.
👉 Over to you: What are some other ways to accelerate Pandas operations in general?
TRULY REPRODUCIBLE ML
Data Version Control
Versioning GBs of datasets is practically impossible with GitHub because it imposes an upper limit on the file size we can push to its remote repositories.
That is why Git is best suited for versioning codebases, which are primarily composed of lightweight files.
However, ML projects are not solely driven by code.
Instead, they also involve large data files, and across experiments, these datasets can vastly vary.
To ensure proper reproducibility and experiment traceability, it is also necessary to version datasets.
Data version control (DVC) solves this problem.
The core idea is to integrate another version control system with Git, dedicated specifically to large files.
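In practice, the workflow looks roughly like this (a minimal sketch with hypothetical paths; it assumes a DVC remote has already been configured):

```
git init && dvc init
dvc add data/train.csv                      # DVC tracks the large file and writes a small .dvc pointer
git add data/train.csv.dvc data/.gitignore  # Git versions only the lightweight pointer
git commit -m "Track dataset v1 with DVC"
dvc push                                    # upload the actual data to the configured remote (S3, GCS, etc.)
```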
Here's everything you need to know (with implementation) about building 100% reproducible ML projects →
MODEL OPTIMIZATION
Model compression to optimize models for production
Model accuracy alone (or an equivalent performance metric) rarely determines which model will be deployed.
Much of the engineering effort goes into making the model production-friendly.
Typically, the model that gets shipped is NEVER determined solely by performance, which is a misconception many people hold.
Instead, we also consider several operational and feasibility metrics, such as:
- Inference Latency: Time taken by the model to return a prediction.
- Model size: The memory occupied by the model.
- Ease of scalability, etc.
For instance, consider the image below. It compares the accuracy and size of a large neural network I developed to its pruned (or reduced/compressed) version:
Looking at these results, wouldn't you strongly prefer deploying the model that is 72% smaller yet still (almost) as accurate as the large model?
Of course, this depends on the task, but in most cases, it makes little sense to deploy the large model when a heavily pruned version performs (almost) equally well.
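To make the idea concrete, here is a minimal sketch of magnitude-based pruning using PyTorch's built-in utilities; the toy model below is a stand-in, not the network from the image, and pruning is just one of the six techniques covered in the article:

```python
import torch
import torch.nn.utils.prune as prune

# Toy model purely for illustration.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
)

# Remove ~70% of the weights in every Linear layer by L1 magnitude.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.7)
        prune.remove(module, "weight")  # bake the zeros into the weight tensor

# Compare total vs. surviving (non-zero) parameters.
total = sum(p.numel() for p in model.parameters())
nonzero = sum((p != 0).sum().item() for p in model.parameters())
print(f"{nonzero}/{total} parameters remain non-zero")
```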
We discussed and implemented 6 model compression techniques in the article ​here​, which ML teams regularly use to save 1000s of dollars in running ML models in production.
​Learn how to compress models before deployment with implementation →
THAT'S A WRAP
No-Fluff Industry ML Resources to Succeed in DS/ML Roles
At the end of the day, all businesses care about impact. That’s it!
- Can you reduce costs?
- Drive revenue?
- Can you scale ML models?
- Predict trends before they happen?
We have discussed several other topics (with implementations) in the past that align with these goals.
Here are some of them:
- Learn sophisticated graph architectures and how to train them on graph data in this crash course.
- So many real-world NLP systems rely on pairwise context scoring. Learn scalable approaches here.
- Run large models on small devices using Quantization techniques.
- Learn how to generate prediction intervals or sets with strong statistical guarantees for increasing trust using Conformal Predictions.
- Learn how to identify causal relationships and answer business questions using causal inference in this crash course.
- Learn how to scale and implement ML model training in this practical guide.
- Learn 5 techniques with implementation to reliably test ML models in production.
- Learn how to build and implement privacy-first ML systems using Federated Learning.
- Learn 6 techniques with implementation to compress ML models.
All these resources will help you cultivate key skills that businesses and companies care about the most.
SPONSOR US
Advertise to 450k+ data professionals
Our newsletter puts your products and services directly in front of an audience that matters — thousands of leaders, senior data scientists, machine learning engineers, data analysts, etc., around the world.