How to Assess Correlation on Ordinal Data?

The limitations of Pearson correlation.


TODAY'S ISSUE

TODAY’S DAILY DOSE OF DATA SCIENCE

Correlation on ordinal data

If your correlation analysis includes ordinal features (those with a natural order, like t-shirt size, grade, etc.)...

...the choice of encoding can substantially affect the correlation results.

For instance, consider this dataset.

Here, we have:

  • An ordinal categorical feature: t-shirt size (S, M, L, XL).
  • A continuous feature: weight.

The graph above shows a monotonic relationship between the two features.

If we use Pearson correlation (as shown below), the choice of encoding determines the actual correlation value:

  • In the left plot, we have this encoding → S(1), M(2), L(3) and XL(4).
  • In the right plot, we have this encoding → S(1), M(2), L(4) and XL(8).
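To make this concrete, here is a minimal sketch in plain Python. The sizes and weights are made-up toy values (not the dataset from the plots), and `pearson` is a small helper written only for this illustration:

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: weight (kg) goes up with t-shirt size.
sizes = ["S", "S", "M", "M", "L", "L", "XL", "XL"]
weights = [55, 60, 66, 70, 78, 82, 90, 95]

# The two encodings discussed above.
linear_encoding = {"S": 1, "M": 2, "L": 3, "XL": 4}
doubling_encoding = {"S": 1, "M": 2, "L": 4, "XL": 8}

r_linear = pearson([linear_encoding[s] for s in sizes], weights)
r_doubling = pearson([doubling_encoding[s] for s in sizes], weights)

print(f"Pearson with S(1), M(2), L(3), XL(4): {r_linear:.3f}")
print(f"Pearson with S(1), M(2), L(4), XL(8): {r_doubling:.3f}")
```

The order of the categories is identical in both cases, yet the two printed values differ, because Pearson measures how linear the relationship is in the encoded values.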

However, using Spearman correlation solves this issue:

This time, the correlation value is the same for both encodings.

This happens because the Spearman correlation is rank-based.

In other words, since it operates on the “ranks” of the data, any encoding that preserves the order of the categories gives the same result, which makes it more suitable for such cases of correlation analysis.
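Here is a rough sketch of that idea: Spearman correlation computed as Pearson correlation on ranks, with tied values sharing their average rank. The data is made up, and `pearson`, `average_ranks`, and `spearman` are illustrative helpers, not a library API:

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n; tied values get the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: weight (kg) goes up with t-shirt size.
sizes = ["S", "S", "M", "M", "L", "L", "XL", "XL"]
weights = [55, 60, 66, 70, 78, 82, 90, 95]

linear_encoding = {"S": 1, "M": 2, "L": 3, "XL": 4}
doubling_encoding = {"S": 1, "M": 2, "L": 4, "XL": 8}

rho_linear = spearman([linear_encoding[s] for s in sizes], weights)
rho_doubling = spearman([doubling_encoding[s] for s in sizes], weights)

print(f"Spearman with S(1), M(2), L(3), XL(4): {rho_linear:.3f}")
print(f"Spearman with S(1), M(2), L(4), XL(8): {rho_doubling:.3f}")
```

Both encodings produce exactly the same ranks for t-shirt size, so the two printed values match.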

Using Spearman correlation is pretty simple. If you are using Pandas, specify the desired correlation measure as follows:
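A minimal sketch, assuming a DataFrame with made-up values and hypothetical column names (`size`, `size_code`, `size_code_alt`, `weight`); the relevant part is the `method` argument of `DataFrame.corr`:

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative.
df = pd.DataFrame({
    "size": ["S", "S", "M", "M", "L", "L", "XL", "XL"],
    "weight": [55, 60, 66, 70, 78, 82, 90, 95],
})

# Two order-preserving encodings of the same ordinal feature.
df["size_code"] = df["size"].map({"S": 1, "M": 2, "L": 3, "XL": 4})
df["size_code_alt"] = df["size"].map({"S": 1, "M": 2, "L": 4, "XL": 8})

numeric = df[["size_code", "size_code_alt", "weight"]]

# method="pearson" is the default; pass method="spearman" for rank correlation.
pearson_corr = numeric.corr(method="pearson")
spearman_corr = numeric.corr(method="spearman")

print(spearman_corr)
```

In the Spearman matrix, `size_code` and `size_code_alt` have the same correlation with `weight`, while in the Pearson matrix they do not.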

That said, there’s one more thing to take care of when you are training ML models to predict ordinal categorical data.

Ordinal datasets are quite prevalent in the industry, and typical classification models almost always produce suboptimal results in such cases.

With special attention and techniques, however, one can make ML models both more interpretable and more accurate.

We covered them in detail here: You Are Probably Building Inconsistent Classification Models Without Even Realizing.

👉 Over to you: What are some other measures to determine the correlation between categorical data and continuous data?

TRUSTWORTHY ML

Build confidence in a model's predictions with conformal prediction

Conformal prediction has gained quite a bit of traction in recent years, which is evident from the Google Trends results:

The reason is quite obvious.

ML models are becoming increasingly democratized. However, not everyone who relies on them, such as doctors or financial professionals, can inspect their predictions.

Thus, it is the responsibility of the ML team to provide a handy (and layman-oriented) way to communicate the risk associated with a prediction.

For instance, if you are a doctor looking at an MRI, a model output that merely says the person is normal and doesn’t need any treatment is likely not very useful to you.

This is because a doctor's job is to do a differential diagnosis. Thus, what they really care about is knowing whether, based on that MRI, there's a 10% chance that the person has cancer or an 80% chance.

Conformal prediction solves this problem.

A somewhat tricky thing about conformal prediction is that it requires a slight shift in how decisions are made from model outputs.
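To give a feel for that shift, here is a rough sketch of one common variant, split conformal prediction for classification. Instead of a single label, the model returns a set of labels. All numbers, the labels, and the helper names `conformal_threshold` and `prediction_set` are made up for illustration, and this is not necessarily the approach taken in the linked article:

```python
import math


def conformal_threshold(true_label_probs, alpha):
    """Score threshold from a held-out calibration set.

    true_label_probs: probability the model gave to the TRUE label of each
    calibration example. The nonconformity score is 1 - that probability.
    """
    scores = sorted(1.0 - p for p in true_label_probs)
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # 1-based rank of the conformal quantile
    if k > n:
        return 1.0  # too few calibration points: every label gets included
    return scores[k - 1]


def prediction_set(probs, threshold):
    """All labels whose nonconformity score is within the threshold."""
    return [label for label, p in probs.items() if 1.0 - p <= threshold]


# Hypothetical calibration data: probability assigned to each true label.
calibration = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.60, 0.45, 0.30]

# alpha = 0.2 targets sets that contain the true label about 80% of the time.
threshold = conformal_threshold(calibration, alpha=0.2)

confident_scan = {"normal": 0.9, "cancer": 0.1}
ambiguous_scan = {"normal": 0.5, "cancer": 0.5}

print(prediction_set(confident_scan, threshold))
print(prediction_set(ambiguous_scan, threshold))
```

A confident prediction yields a single-label set, while an ambiguous one yields a set with both labels, which tells the doctor that the model cannot rule either out. Under the assumption that calibration and new examples are exchangeable, sets built this way contain the true label with probability at least 1 - alpha on average.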

Nonetheless, this field is definitely something I would recommend keeping an eye on, no matter where you are in your ML career.

​Learn how to practically leverage conformal predictions in your model →​

IN CASE YOU MISSED IT

​Run open-source LLMs locally with Ollama​

It's easier to run an open-source LLM locally than most people think.

Recently, we covered a step-by-step, hands-on demo of this using Ollama.

Here's what the final outcome looks like:

We ran Microsoft's phi-2 using Ollama, a framework to run open-source LLMs (Llama2, Llama3, and many more) directly from a local machine.

​Learn how to run LLMs locally using Ollama →​​

THAT'S A WRAP

No-Fluff Industry ML resources to Succeed in DS/ML roles

At the end of the day, all businesses care about impact. That’s it!

  • Can you reduce costs?
  • Drive revenue?
  • Can you scale ML models?
  • Predict trends before they happen?

We have discussed several topics (with implementations) in the past that align with these goals.

Here are some of them:

  • Learn sophisticated graph architectures and how to train them on graph data in this crash course.
  • So many real-world NLP systems rely on pairwise context scoring. Learn scalable approaches here.
  • Run large models on small devices using Quantization techniques.
  • Learn how to generate prediction intervals or sets with strong statistical guarantees for increasing trust using Conformal Predictions.
  • Learn how to identify causal relationships and answer business questions using causal inference in this crash course.
  • Learn how to scale and implement ML model training in this practical guide.
  • Learn 5 techniques with implementation to reliably test ML models in production.
  • Learn how to build and implement privacy-first ML systems using Federated Learning.
  • Learn 6 techniques with implementation to compress ML models.

All these resources will help you cultivate key skills that businesses and companies care about the most.

Our newsletter puts your products and services directly in front of an audience that matters: thousands of leaders, senior data scientists, machine learning engineers, data analysts, etc., around the world.

Get in touch today →

