Accuracy Can Be Deceptive

TODAY’S DAILY DOSE OF DATA SCIENCE

Accuracy can be deceptive while improving probabilistic multiclass-classification models

If some technique improves the model, we can say it was effective:

But at times, you may be making good progress in improving the model even though “Accuracy” is not reflecting that (yet).

I have seen this when building probabilistic multiclass-classification models.

Let's understand!

Pitfall of Accuracy

In probabilistic multiclass classification models, Accuracy is determined using the highest probability output label:

Now imagine this:

In version 1 of the model, the output probability of “Class C” (the true label) was the lowest, as depicted below:

But in version 2 of the same model, the output probability of “Class C” (the true label) was 2nd highest, as depicted below:

In both cases, the final prediction is incorrect, which is okay.

However, going from “Version 1” to “Version 2” did improve the model.

Nonetheless, Accuracy does not consider this since it only cares about the final prediction.
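The scenario above can be sketched in a few lines of plain Python. The class names and probability values here are made up for illustration; they are not taken from any real model. The sketch shows that both versions predict the same (wrong) label, while the rank of the true label improves from 3rd to 2nd.

```python
# Hypothetical probabilities for one sample whose true label is "C".
classes = ["A", "B", "C"]
true_label = "C"

probs_v1 = [0.60, 0.30, 0.10]  # version 1: "C" has the lowest probability
probs_v2 = [0.50, 0.15, 0.35]  # version 2: "C" has the 2nd highest probability


def predicted_label(probs):
    """Return the label with the highest probability (what Accuracy looks at)."""
    best_index = max(range(len(probs)), key=lambda i: probs[i])
    return classes[best_index]


def rank_of_true_label(probs):
    """Return the 1-based rank of the true label when sorted by probability."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order.index(classes.index(true_label)) + 1


for name, probs in [("v1", probs_v1), ("v2", probs_v2)]:
    print(name, predicted_label(probs), rank_of_true_label(probs))
# v1 A 3
# v2 A 2
```

Accuracy only sees the middle column (the predicted label), which is identical for both versions.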

Solution

If you are iteratively improving a probabilistic multiclass classification model, always use the top-k accuracy score.

It checks whether the correct label is among the “k” labels with the highest predicted probabilities.

As you may have guessed, the top-1 accuracy is the traditional accuracy.
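To make the definition concrete, here is a minimal from-scratch sketch in plain Python. The function name top_k_accuracy and the toy data are assumptions made for this illustration, and ties between probabilities are not handled specially.

```python
def top_k_accuracy(y_true, y_prob, k):
    """Fraction of samples whose true class index is among the k most probable classes."""
    hits = 0
    for label, probs in zip(y_true, y_prob):
        # Indices of the k classes with the highest predicted probability.
        top_k = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(y_true)


# Hypothetical toy data: 4 samples, 3 classes (indices 0, 1, 2).
y_true = [2, 0, 1, 2]
y_prob = [
    [0.50, 0.15, 0.35],  # true class 2 is ranked 2nd
    [0.70, 0.20, 0.10],  # true class 0 is ranked 1st
    [0.40, 0.35, 0.25],  # true class 1 is ranked 2nd
    [0.45, 0.30, 0.25],  # true class 2 is ranked 3rd
]

# Traditional accuracy: compare the single most probable class to the true label.
accuracy = sum(
    label == max(range(len(probs)), key=lambda i: probs[i])
    for label, probs in zip(y_true, y_prob)
) / len(y_true)

print(accuracy)                           # 0.25
print(top_k_accuracy(y_true, y_prob, 1))  # 0.25 (same as accuracy)
print(top_k_accuracy(y_true, y_prob, 2))  # 0.75
print(top_k_accuracy(y_true, y_prob, 3))  # 1.0
```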

This is a better indicator for assessing model improvement efforts.

For instance, if the top-3 accuracy score increases from 75% to 90%, this tells us we are headed in the right direction:

Earlier, the correct prediction was in the top 3 labels only 75% of the time.
But now, the correct prediction is in the top 3 labels 90% of the time.

That said, use top-k accuracy only to assess model improvement efforts; the true predictive power is still determined using traditional accuracy.

Ideally, “Top-k Accuracy” will increase with iterations. But Accuracy can stay the same, as depicted below:

Top-k accuracy score is also available in Sklearn as top_k_accuracy_score (in sklearn.metrics).
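A minimal usage sketch with Sklearn, assuming scikit-learn 0.24 or later and NumPy are installed. The two score matrices are hypothetical outputs of two versions of the same model; they are chosen so that Accuracy stays flat while top-2 accuracy goes up.

```python
import numpy as np
from sklearn.metrics import accuracy_score, top_k_accuracy_score

# True class indices for 4 samples (3 classes: 0, 1, 2).
y_true = np.array([2, 0, 1, 2])

# Hypothetical predicted probabilities from two versions of a model.
scores_v1 = np.array([
    [0.60, 0.30, 0.10],
    [0.70, 0.20, 0.10],
    [0.50, 0.20, 0.30],
    [0.45, 0.30, 0.25],
])
scores_v2 = np.array([
    [0.50, 0.15, 0.35],
    [0.70, 0.20, 0.10],
    [0.40, 0.35, 0.25],
    [0.45, 0.30, 0.25],
])

for name, scores in [("v1", scores_v1), ("v2", scores_v2)]:
    acc = accuracy_score(y_true, scores.argmax(axis=1))
    top2 = top_k_accuracy_score(y_true, scores, k=2)
    print(name, acc, top2)
# v1 0.25 0.25
# v2 0.25 0.75
```

Here k is kept below the number of classes, since a k equal to the number of classes trivially gives a perfect score.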

Isn’t that a great way to assess your model improvement efforts?

👉 Over to you: What are some other ways to assess model improvement efforts?

TREE-BASED MODEL BUILDING

Formulating and Implementing XGBoost From Scratch

If you consider the last decade (or 12-13 years) in ML, neural networks have dominated the narrative in most discussions.

In contrast, tree-based methods tend to be perceived as more straightforward, and as a result, they don't always receive the same level of admiration.

However, in practice, tree-based methods frequently outperform neural networks, particularly in structured data tasks.

This is a well-known fact among Kaggle competitors, where XGBoost has become the tool of choice for top-performing submissions.

With XGBoost, one would spend a fraction of the time they would otherwise spend on models like linear/logistic regression, SVMs, etc., to achieve the same performance.

Learn about its internal details by formulating and implementing it from scratch here →

MODEL OPTIMIZATION

Model compression to optimize models for production

Model accuracy alone (or an equivalent performance metric) rarely determines which model will be deployed.

Much of the engineering effort goes into making the model production-friendly.

Typically, the model that gets shipped is never determined solely by performance, contrary to a common misconception.

Instead, we also consider several operational and feasibility metrics, such as:

Inference Latency: Time taken by the model to return a prediction.
Model size: The memory occupied by the model.
Ease of scalability, etc.
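As a rough sketch of how the first two metrics could be measured, the snippet below uses only the Python standard library. The helper names (measure_latency_ms, serialized_size_bytes, make_predict) and the stand-in "models" (plain lists of weights) are hypothetical; a real project would time its actual model and use that framework's own serialization format.

```python
import pickle
import time


def measure_latency_ms(predict_fn, sample, n_runs=100):
    """Average wall-clock time (in milliseconds) for one call to predict_fn."""
    predict_fn(sample)  # warm-up call, excluded from the timing
    start = time.perf_counter()
    for _ in range(n_runs):
        predict_fn(sample)
    return (time.perf_counter() - start) / n_runs * 1000


def serialized_size_bytes(model):
    """Size of the pickled object, a rough proxy for the model's storage footprint."""
    return len(pickle.dumps(model))


# Hypothetical stand-ins: a large weight list and a smaller copy with a quarter of the weights.
large_weights = [0.001 * i for i in range(100_000)]
small_weights = large_weights[::4]


def make_predict(weights):
    """Build a toy prediction function that does work proportional to the model size."""
    return lambda x: sum(w * x for w in weights)


for name, weights in [("large", large_weights), ("small", small_weights)]:
    latency = measure_latency_ms(make_predict(weights), 1.0, n_runs=20)
    size = serialized_size_bytes(weights)
    print(f"{name}: {latency:.3f} ms per prediction, {size} bytes")
```

The exact numbers depend on the machine, so none are shown here; the point is only that both metrics can be tracked alongside accuracy.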

For instance, consider the image below. It compares the accuracy and size of a large neural network I developed to its pruned (or reduced/compressed) version:

Looking at these results, don’t you strongly prefer deploying the model that is 72% smaller, but is still (almost) as accurate as the large model?

Of course, this depends on the task, but in most cases, it makes little sense to deploy the large model when one of its heavily pruned versions performs equally well.

We discussed and implemented 6 model compression techniques in the article here, which ML teams regularly use to save 1000s of dollars in running ML models in production.

Learn how to compress models before deployment with implementation →

No-Fluff Industry ML resources to Succeed in DS/ML roles

At the end of the day, all businesses care about impact. That’s it!

Can you reduce costs?
Drive revenue?
Can you scale ML models?
Predict trends before they happen?

We have discussed several other topics (with implementations) in the past that align with such goals.

Develop Industry ML skills

Here are some of them:

Learn sophisticated graph architectures and how to train them on graph data in this crash course.
So many real-world NLP systems rely on pairwise context scoring. Learn scalable approaches here.
Run large models on small devices using Quantization techniques.
Learn how to generate prediction intervals or sets with strong statistical guarantees for increasing trust using Conformal Predictions.
Learn how to identify causal relationships and answer business questions using causal inference in this crash course.
Learn how to scale and implement ML model training in this practical guide.
Learn 5 techniques with implementation to reliably test ML models in production.
Learn how to build and implement privacy-first ML systems using Federated Learning.
Learn 6 techniques with implementation to compress ML models.

All these resources will help you cultivate key skills that businesses and companies care about the most.

