Identify Drift Using Proxy-labeling

An intuitive way to detect drift.

Almost all real-world ML models gradually degrade in performance due to drift in the feature distribution.

It is a serious problem because we trained the model on one distribution, but it is being used to generate predictions on another distribution in production.

The following visual summarizes a technique I often use to detect drift:

There are four steps:

  • Step 1) Consider two versions of the dataset: the old version (the one on which the model was trained) and the current version (the one on which the model is generating predictions).
  • Step 2) Append a label=1 column to the old dataset and a label=0 column to the current dataset.
  • Step 3) Train a classification model on the combined dataset to predict the appended label column.
  • Step 4) Measure the feature importance of the trained classifier.

The choice of the classification model could be arbitrary, but you should be able to determine feature importance.

Thus, I personally prefer a random forest classifier because it has an inherent mechanism to determine feature importance.
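Here's a minimal sketch of the full procedure using scikit-learn. The file paths and feature columns are placeholders, and the specific model settings are illustrative assumptions rather than recommendations:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Step 1) Two versions of the dataset (hypothetical file paths):
# old_df     -> data the model was trained on
# current_df -> data the model scores in production
old_df = pd.read_csv("old_data.csv")
current_df = pd.read_csv("current_data.csv")

# Step 2) Append the proxy label: 1 for the old data, 0 for the current data.
old_df["label"] = 1
current_df["label"] = 0

# Step 3) Train a classifier to predict which dataset a row came from.
combined = pd.concat([old_df, current_df], ignore_index=True)
X, y = combined.drop(columns=["label"]), combined["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)

# Step 4) Measure feature importance: features that strongly help the model
# separate old from current data are the ones that have likely drifted.
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```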

That said, it is not necessary to use a random forest.

Techniques like shuffle (permutation) feature importance (which we discussed here), illustrated below, can be used with typical classification models as well:
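For instance, continuing from the sketch above, scikit-learn's permutation_importance shuffles one feature at a time on held-out data and measures the drop in score, so it works with any classifier; logistic regression here is just an arbitrary stand-in:

```python
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

# Any classifier works; it does not need built-in feature importances.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Shuffle each feature on the held-out split and record the average drop in score.
result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=42)
perm_importances = pd.Series(result.importances_mean, index=X_test.columns)
print(perm_importances.sort_values(ascending=False).head(10))
```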

Moving on…

If some features turn out to have high importance in this classifier, it means those features have drifted.

Why?

This is because if some features can reliably distinguish between the two versions of the dataset, then it is pretty likely that their distributions conditioned on label=1 and label=0 (the conditional distributions) differ.

  • If there are distributional differences, the model will capture them.
  • If there are no distributional differences, the model will struggle to distinguish between the classes.

This idea makes intuitive sense as well.

Of course, this is not the only technique to determine drift.

Autoencoders can also help. We discussed them here in a recent newsletter issue.

👉 Over to you: What are some other ways you use to determine drift?

CRASH COURSE (37 MINS)

Generate true probabilities with model calibration

Modern neural networks trained today are highly misleading.

They appear to be heavily overconfident in their predictions.

For instance, if a model predicts an event with a 70% probability, then ideally, out of 100 such predictions, approximately 70 should result in the event occurring.

However, many experiments have revealed that modern neural networks appear to be losing this ability, as depicted below:

  • The average confidence of LeNet (an old model) closely matches its accuracy.
  • The average confidence of the ResNet (a relatively modern model) is substantially higher than its accuracy.

Calibration solves this.

A model is calibrated if the predicted probabilities align with the actual outcomes.
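As a minimal, self-contained sketch (the synthetic data and random forest are used purely for illustration), scikit-learn's calibration_curve buckets the predicted probabilities and compares them with the observed fraction of positives, which is exactly the check described above; CalibratedClassifierCV is one common way to fix miscalibration:

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic placeholder data and model, just to make the sketch runnable.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Compare the mean predicted probability with the observed fraction of positives per bin.
# For a well-calibrated model, the two numbers should roughly match in every bin.
probs = clf.predict_proba(X_test)[:, 1]
frac_positives, mean_predicted = calibration_curve(y_test, probs, n_bins=10)
for p, f in zip(mean_predicted, frac_positives):
    print(f"predicted ~{p:.2f} -> observed {f:.2f}")

# One common fix: post-hoc calibration (isotonic or sigmoid/Platt scaling).
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(random_state=0), method="isotonic", cv=3
).fit(X_train, y_train)
```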

Handling this is important because the model will be used in decision-making, and an overly confident model can be fatal.

To exemplify, say a government hospital wants to conduct an expensive medical test on patients.

To ensure that the govt. funding is used optimally, a reliable probability estimate can help the doctors make this decision.

If the model isn't calibrated, it will produce overly confident predictions.

There has been a rising concern in the industry about ensuring that our machine learning models communicate their confidence effectively.

Thus, being able to detect miscalibration and fix it is a super skill one can possess.

Learn how to build well-calibrated models in this crash course →

CRASH COURSE (56 MINS)

Graph Neural Networks

  • Google Maps uses graph ML for ETA prediction.
  • Pinterest uses graph ML (PinSage) for recommendations.
  • Netflix uses graph ML (SemanticGNN) for recommendations.
  • Spotify uses graph ML (HGNNs) for audiobook recommendations.
  • Uber Eats uses graph ML (a GraphSAGE variant) to suggest dishes, restaurants, etc.

The list could go on since almost every major tech company I know employs graph ML in some capacity.

Becoming proficient in graph ML now seems to be far more critical than traditional deep learning to differentiate your profile and aim for these positions.

A significant proportion of our real-world data often exists (or can be represented) as graphs:

  • Entities (nodes) are connected by relationships (edges).
  • Connections carry significant meaning, which, if we know how to model it, can lead to much more robust models.

The field of graph neural networks (GNNs) intends to fill this gap by extending deep learning techniques to graph data.
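To make this concrete, here's a tiny, illustrative sketch (not taken from the crash course) of a single GCN-style message-passing layer in NumPy: each node updates its representation by averaging over its neighbours' features and then applying a learned transformation:

```python
import numpy as np

def gcn_layer(A, H, W):
    # A: (n, n) adjacency matrix, H: (n, d_in) node features, W: (d_in, d_out) weights.
    A_hat = A + np.eye(A.shape[0])                     # add self-loops
    deg_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    H_agg = deg_inv_sqrt @ A_hat @ deg_inv_sqrt @ H    # normalized neighbourhood aggregation
    return np.maximum(0, H_agg @ W)                    # linear transform + ReLU

# Tiny example: 4 nodes on a path graph (1-2-3-4) with 3-dimensional features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.random.randn(4, 3)   # node features
W = np.random.randn(3, 2)   # layer weights
print(gcn_layer(A, H, W).shape)  # -> (4, 2)
```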

Learn sophisticated graph architectures and how to train them on graph data in this crash course →

THAT'S A WRAP

No-Fluff Industry ML Resources to Succeed in DS/ML Roles

At the end of the day, all businesses care about impact. That's it!

  • Can you reduce costs?
  • Drive revenue?
  • Can you scale ML models?
  • Predict trends before they happen?

We have discussed several other topics (with implementations) in the past that align with these goals.

Here are some of them:

  • Learn sophisticated graph architectures and how to train them on graph data in this crash course.
  • So many real-world NLP systems rely on pairwise context scoring. Learn scalable approaches here.
  • Run large models on small devices using Quantization techniques.
  • Learn how to generate prediction intervals or sets with strong statistical guarantees for increasing trust using Conformal Predictions.
  • Learn how to identify causal relationships and answer business questions using causal inference in this crash course.
  • Learn how to scale and implement ML model training in this practical guide.
  • Learn 5 techniques with implementation to reliably test ML models in production.
  • Learn how to build and implement privacy-first ML systems using Federated Learning.
  • Learn 6 techniques with implementation to compress ML models.

All these resources will help you cultivate key skills that businesses and companies care about the most.


