15 Pandas ↔ Polars ↔ SQL ↔ PySpark Translations
Become a Quadrilingual Data Scientist.

TODAY'S ISSUE
I created the following visual, which depicts the 15 most common tabular operations in Pandas and their corresponding translations in SQL, Polars, and PySpark.
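To make the comparison concrete, here is one representative operation from the visual, a group-by aggregation, written in all four APIs. This is a minimal sketch: the tiny `category`/`sales` dataset is made up for illustration, and the SQL version is run here through Spark's SQL engine.

```python
import pandas as pd
import polars as pl
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

data = {"category": ["A", "B", "A", "B"], "sales": [10, 20, 30, 40]}

# Pandas: group by a column and compute the mean of another
pd_df = pd.DataFrame(data)
pd_result = pd_df.groupby("category", as_index=False)["sales"].mean()

# Polars: the same operation with the expression API
pl_df = pl.DataFrame(data)
pl_result = pl_df.group_by("category").agg(pl.col("sales").mean())

# PySpark: the same operation on a distributed DataFrame
spark = SparkSession.builder.master("local[*]").getOrCreate()
sp_df = spark.createDataFrame(pd_df)
sp_result = sp_df.groupBy("category").agg(F.mean("sales").alias("sales"))

# SQL: the equivalent query, executed via Spark's SQL engine
sp_df.createOrReplaceTempView("sales_table")
sql_result = spark.sql(
    "SELECT category, AVG(sales) AS sales FROM sales_table GROUP BY category"
)
```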

While the motivation for Pandas and SQL is clear and well-known, let me tell you why you should care about Polars and PySpark.
Pandas has many limitations that Polars addresses, such as:

1) Pandas runs single-threaded by default, while Polars parallelizes work across all available CPU cores.
2) Pandas executes every operation eagerly, while Polars also offers a lazy API that optimizes the whole query plan before running it.
3) Pandas was not designed for memory efficiency, while Polars is built in Rust on top of Apache Arrow's columnar memory format.
In fact, if we look at the run-time comparison on some common operations, it’s clear that Polars is much more efficient than Pandas:
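If you want to reproduce such a comparison yourself, a minimal timing sketch might look like the following. The absolute numbers will vary with your machine, data size, and library versions; the synthetic columns are made up for illustration.

```python
import time
import numpy as np
import pandas as pd
import polars as pl

# Synthetic data: 10M rows with a low-cardinality key and a float value
n = 10_000_000
keys = np.random.randint(0, 1_000, size=n)
values = np.random.rand(n)

pd_df = pd.DataFrame({"key": keys, "value": values})
pl_df = pl.DataFrame({"key": keys, "value": values})

start = time.perf_counter()
pd_df.groupby("key")["value"].sum()
print(f"Pandas: {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
pl_df.group_by("key").agg(pl.col("value").sum())
print(f"Polars: {time.perf_counter() - start:.3f}s")
```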

While the tabular data space is mainly dominated by Pandas and Sklearn, one can hardly expect any benefit from them beyond a few GBs of data because they process everything on a single node.
A more practical solution is to use distributed computing instead — a framework that disperses the data across many small computers.

Spark is among the best technologies used to quickly and efficiently analyze, process, and train models on big datasets.
That is why most data science roles at big tech demand proficiency in Spark. It’s that important.
We covered this in detail in a recent deep dive as well: Don’t Stop at Pandas and Sklearn! Get Started with Spark DataFrames and Big Data ML using PySpark.
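For a flavor of what that looks like in practice, here is a minimal, hypothetical PySpark ML sketch that trains a logistic regression model on a distributed DataFrame. The parquet path and column names are placeholders, not from the deep dive itself.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("big-data-ml").getOrCreate()

# Hypothetical dataset with two feature columns and a binary "label" column
df = spark.read.parquet("path/to/dataset.parquet")

# MLlib models expect all features packed into a single vector column
assembler = VectorAssembler(
    inputCols=["feature_1", "feature_2"], outputCol="features"
)
train_df = assembler.transform(df)

# Fit a logistic regression model; Spark distributes the work across the cluster
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train_df)
print(model.coefficients)
```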
👉 Over to you: What are some other faster alternatives to Pandas that you are aware of?
Consider the size difference between BERT-large (~340M parameters) and GPT-3 (175B parameters):
I have fine-tuned BERT-large several times on a single GPU using traditional fine-tuning:

But this is impossible with GPT-3, which has 175B parameters. That's 350GB of memory just to store model weights under float16 precision.
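The arithmetic behind that number is a quick back-of-the-envelope check: float16 uses 2 bytes per parameter.

```python
params = 175e9          # GPT-3 parameter count
bytes_per_param = 2     # float16 = 2 bytes per parameter
memory_gb = params * bytes_per_param / 1e9
print(memory_gb)        # 350.0 GB, just for the weights
```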
This means that if OpenAI used traditional fine-tuning within its fine-tuning API, it would have to maintain one model copy per user:
And the problems don't end there:
LoRA (+ QLoRA and other variants) neatly solved this critical business problem.
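To see why LoRA sidesteps the memory problem, here is a minimal PyTorch sketch of the core idea: freeze the pretrained weight W and learn only a low-rank update BA. The layer sizes, rank, and scaling are illustrative, not GPT-3's actual configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (W + B @ A)."""

    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        # Pretrained weight: frozen, never updated during fine-tuning
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features), requires_grad=False
        )
        # Low-rank factors: the only trainable parameters
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # Base output + scaled low-rank correction
        return x @ self.weight.T + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(1024, 1024, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 16,384 trainable vs ~1M frozen parameters
```

Since each user's adapter is tiny, a provider can serve one shared base model and swap in per-user low-rank weights, instead of maintaining one 350GB model copy per user.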
There’s so much data on your mobile phone right now — images, text messages, etc.
And this is just about one user — you.
But applications can have millions of users. The amount of data we can train ML models on is unfathomable.
The problem?
This data is private.
So consolidating this data in a single place to train a model is not an option.
The solution?
Federated learning is a smart way to address this challenge.
The core idea is to ship models to devices, train the model on the device, and retrieve the updates:
But this isn't as simple as it sounds.
1) Since the model is trained on the client side, how do we keep it small enough to run on a device?
2) How do we aggregate the different model updates received from the clients? (See the sketch after this list.)
3) [IMPORTANT] Privacy-sensitive datasets are always biased by personal likings and beliefs. For instance, in an image-related task, each user's photo library reflects their own interests, so the local datasets are highly non-IID.
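For challenge (2), the classic answer is Federated Averaging (FedAvg): each client trains locally, and the server averages the returned weights, weighted by each client's dataset size. Here is a minimal sketch with simulated clients; the parameter vectors and dataset sizes are made up for illustration.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """FedAvg: average client models, weighted by local dataset size."""
    total = sum(client_sizes)
    # Weighted sum of each client's parameter vector
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Simulated round: 3 clients, each returns an updated parameter vector
client_weights = [np.random.rand(5) for _ in range(3)]
client_sizes = [100, 250, 50]  # local dataset sizes

global_weights = federated_average(client_weights, client_sizes)
print(global_weights)
```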
Learn how to implement federated learning systems (beginner-friendly) →