TODAY'S ISSUE
TODAY'S DAILY DOSE OF DATA SCIENCE
Correlation on ordinal data
If your correlation analysis includes ordinal features (those with a natural ordering, like t-shirt size, grade, etc.)...
...the choice of encoding can substantially affect the correlation results.
For instance, consider this dataset.
Here, we have:
- An ordinal categorical feature: t-shirt size (S, M, L, XL).
- A continuous feature: weight.
The graph above shows a monotonic relationship between the two features.
If we use Pearson correlation (as shown below), the choice of encoding determines the actual correlation value:
- In the left plot, we have this encoding: S(1), M(2), L(3) and XL(4).
- In the right plot, we have this encoding: S(1), M(2), L(4) and XL(8).
However, using Spearman correlation solves this issue:
This time, the correlation value is the same.
This happens because the Spearman correlation is rank-based.
In other words, since it operates on the "ranks" of the data, it is more suitable for such cases of correlation analysis.
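To make the rank argument concrete, here is a minimal sketch. The sizes and weights are invented for illustration (they are not the dataset from the plots), and the variable names are hypothetical. It maps the same sizes through both encodings, checks that their ranks coincide, and computes Spearman as Pearson applied to the ranks.

```python
import pandas as pd

# Hypothetical data, invented for illustration: t-shirt sizes and weights (kg).
sizes = pd.Series(["S", "S", "M", "M", "L", "L", "XL", "XL"])
weight = pd.Series([52, 58, 63, 70, 74, 81, 88, 95])

linear = {"S": 1, "M": 2, "L": 3, "XL": 4}    # encoding from the left plot
doubling = {"S": 1, "M": 2, "L": 4, "XL": 8}  # encoding from the right plot

enc_a = sizes.map(linear)
enc_b = sizes.map(doubling)

# Both encodings preserve the order S < M < L < XL, so their ranks match
# (tied values receive the average rank by default).
same_ranks = enc_a.rank().equals(enc_b.rank())

# Pearson works on the raw encoded values, so it depends on the encoding.
pearson_a = enc_a.corr(weight)
pearson_b = enc_b.corr(weight)

# Spearman is Pearson computed on the ranks, so the encoding drops out.
spearman_a = enc_a.rank().corr(weight.rank())
spearman_b = enc_b.rank().corr(weight.rank())

print(f"Pearson:  {pearson_a:.4f} vs {pearson_b:.4f}")
print(f"Spearman: {spearman_a:.4f} vs {spearman_b:.4f}")
```

Any encoding that keeps the order of the categories yields the same ranks, which is why the Spearman value does not move.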
Using Spearman correlation is pretty simple. If you are using Pandas, specify the desired correlation measure as follows:
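A minimal sketch of that call, assuming a small hypothetical DataFrame (the column names and values are illustrative, not taken from the dataset above):

```python
import pandas as pd

# Hypothetical DataFrame; column names and values are illustrative only.
df = pd.DataFrame({
    "size_encoded": [1, 1, 2, 2, 4, 4, 8, 8],
    "weight": [52, 58, 63, 70, 74, 81, 88, 95],
})

# `method` accepts "pearson" (the default), "kendall", or "spearman".
spearman_matrix = df.corr(method="spearman")
pearson_matrix = df.corr(method="pearson")

print(spearman_matrix)
```

For a single pair of columns, `df["size_encoded"].corr(df["weight"], method="spearman")` returns just the one coefficient.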
That said, there's one more thing to take care of when you are training ML models to predict ordinal categorical data.
Ordinal datasets are quite prevalent in the industry, and typical classification models almost always produce suboptimal results in such cases.
With special attention and techniques, however, one can make ML models both more interpretable and more accurate.
We covered them in detail here: You Are Probably Building Inconsistent Classification Models Without Even Realizing.
Over to you: What are some other measures to determine the correlation between categorical data and continuous data?
TRUSTWORTHY ML
Build confidence in your model's predictions with conformal predictions
Conformal prediction has gained quite a bit of traction in recent years, which is evident from the Google Trends results:
The reason is quite obvious.
ML models are becoming increasingly democratized lately. However, not everyone who relies on them, such as doctors or financial professionals, can inspect their predictions.
Thus, it is the responsibility of the ML team to provide a handy (and layman-oriented) way to communicate the risk associated with a prediction.
For instance, if you are a doctor looking at this MRI, a model output that merely says the person is normal and doesn't need any treatment is of little use to you.
This is because a doctor's job is to do a differential diagnosis. Thus, what they really care about is knowing whether, based on that MRI, there is a 10% or an 80% chance that the person has cancer.
Conformal predictions solve this problem.
A somewhat tricky thing about conformal prediction is that it requires a slight shift in making decisions based on model outputs.
Nonetheless, this field is definitely something I would recommend keeping an eye on, no matter where you are in your ML career.
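To give a flavor of that shift, here is a minimal sketch of split conformal prediction for classification, one common variant and not necessarily the approach covered in the linked article. The function names, the two-class setup, and the calibration numbers are all hypothetical. The model's probabilities on a held-out calibration set give a score threshold, and every new prediction becomes a set of plausible labels rather than a single label.

```python
import math
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Return the score threshold computed from a held-out calibration set."""
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels, dtype=int)
    n = len(cal_labels)
    # Nonconformity score: 1 minus the probability assigned to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Rank of the (1 - alpha) quantile with the finite-sample correction.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return 1.0  # too few calibration points: every class is kept
    return float(np.sort(scores)[k - 1])

def prediction_set(probs, qhat):
    """Return every class whose score falls within the threshold."""
    probs = np.asarray(probs, dtype=float)
    return [int(c) for c in np.flatnonzero(1.0 - probs <= qhat)]

# Hypothetical calibration outputs of a two-class model (0 = normal, 1 = abnormal).
cal_probs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6], [0.5, 0.5]]
cal_labels = [0, 1, 0, 1, 0]
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.25)

print(prediction_set([0.7, 0.3], qhat))  # a fairly confident case
print(prediction_set([0.5, 0.5], qhat))  # an ambiguous case
```

Under the usual exchangeability assumption, sets built this way contain the true label with probability at least 1 - alpha on average. A set with more than one label is the model's way of saying the case is ambiguous.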
Learn how to practically leverage conformal predictions in your model
IN CASE YOU MISSED IT
Run open-source LLMs locally with Ollama
It's easier to run an open-source LLM locally than most people think.
Recently, we covered a step-by-step, hands-on demo of this using Ollama.
Here's what the final outcome looks like:
We ran Microsoft's phi-2 using Ollama, a framework to run open-source LLMs (Llama2, Llama3, and many more) directly from a local machine.
THAT'S A WRAP
No-Fluff Industry ML resources to Succeed in DS/ML roles
At the end of the day, all businesses care about impact. That's it!
- Can you reduce costs?
- Drive revenue?
- Can you scale ML models?
- Predict trends before they happen?
We have discussed several other topics (with implementations) in the past that align with such goals.
Here are some of them:
- Learn sophisticated graph architectures and how to train them on graph data in this crash course.
- So many real-world NLP systems rely on pairwise context scoring. Learn scalable approaches here.
- Run large models on small devices using Quantization techniques.
- Learn how to generate prediction intervals or sets with strong statistical guarantees for increasing trust using Conformal Predictions.
- Learn how to identify causal relationships and answer business questions using causal inference in this crash course.
- Learn how to scale and implement ML model training in this practical guide.
- Learn 5 techniques with implementation to reliably test ML models in production.
- Learn how to build and implement privacy-first ML systems using Federated Learning.
- Learn 6 techniques with implementation to compress ML models.
All these resources will help you cultivate key skills that businesses and companies care about the most.
SPONSOR US
Advertise to 450k+ data professionals
Our newsletter puts your products and services directly in front of an audience that matters: thousands of leaders, senior data scientists, machine learning engineers, data analysts, etc., around the world.