KMeans is an unsupervised clustering algorithm that groups data based on distances. It is widely recognized for its simplicity and effectiveness.

Essentially, the core idea is to partition a dataset into distinct clusters, with each point belonging to the cluster whose centroid is closest to it.
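This assignment rule can be sketched in a few lines of NumPy (a minimal illustration with made-up points and centroids, not a full KMeans implementation):

```python
import numpy as np

def assign_to_nearest(points, centroids):
    """Label each point with the index of its nearest centroid."""
    # Pairwise squared Euclidean distances: shape (n_points, n_centroids)
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

points = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
print(assign_to_nearest(points, centroids))  # → [0 0 1]
```

The full algorithm alternates this assignment step with recomputing each centroid as the mean of its assigned points.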

While its simplicity often makes it the first choice among clustering algorithms, KMeans has many limitations that hinder its effectiveness in many scenarios.

For instance, KMeans does not account for cluster variance and shape.

In other words, one of the primary limitations of KMeans is its assumption of spherical clusters.

One intuitive and graphical way to understand the KMeans algorithm is to place a **circle** at the center of each cluster, which encloses the points.

💡

In 3 dimensions, circles can be replaced with spheres. In higher dimensions, they can be thought of as hyper-spheres.

As KMeans is all about placing circles, its results aren’t ideal when the dataset has irregular shapes or varying sizes, as shown below:

Instead, an ideal clustering should cluster the data as follows:

The same can be observed in the following dataset:

This rigidity of KMeans, which restricts it to globular clusters, often leads to misclassification and suboptimal cluster assignments.

Another significant limitation of the KMeans clustering algorithm is its inherent assumption that every data point must be assigned to a cluster.

More specifically, KMeans operates under the premise that each data point belongs to one and only one cluster, leaving no room for the representation of noise or outliers.

This characteristic can be problematic in scenarios where the dataset contains irregularities, anomalies, or noise that do not conform to clear cluster boundaries.

As a result, KMeans may inadvertently assign data points to clusters even when they do not truly belong to any discernible pattern.

This is depicted in the image below:

This tendency can lead to suboptimal clustering results, particularly in datasets with varying densities, irregular shapes, or the presence of noisy data.

In real-world datasets where the underlying structure is not well-suited to the assumptions of KMeans, the algorithm’s rigid assignment of all data points to clusters can limit its effectiveness.

We should be mindful of this limitation and consider alternative clustering approaches, such as density-based or hierarchical methods, when dealing with datasets that exhibit significant noise or outliers.

One notable constraint of using the K-Means clustering algorithm is the requirement for prior knowledge or assumptions about the exact number of clusters in the dataset.

Unlike some other clustering algorithms that can automatically determine the optimal number of clusters based on the data's inherent structure, K-Means relies on a predefined value for the number of clusters (often denoted as 'k').

Of course, there are various methods to determine a reliable value for the parameter $k$, but this constraint still poses challenges in scenarios where the true number of clusters is not known in advance, leading to potential inaccuracies and suboptimal results.

On a side note, we commonly use the Elbow curve to determine the number of clusters (`k`) for KMeans.

However, the problem with the Elbow curve is that it:

- has a subjective interpretation
- involves ambiguity in determining the Elbow point accurately
- only considers within-cluster distances, among other issues.

Silhouette score is an alternative measure used to evaluate clustering quality, and it is typically found to be more reliable than the Elbow curve.

Measuring it across a range of centroids (`k`) can reveal which clustering results are most promising.
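As a sketch (using scikit-learn; the synthetic dataset and the range of `k` values are illustrative choices), sweeping the Silhouette score over candidate values of `k` might look like this:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 well-separated clusters (illustrative choice)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [0, 6], [6, 6]],
                  cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # expected to peak at the true number of clusters (4 here)
```

Unlike eyeballing an elbow, the `k` with the highest average silhouette gives an objective selection criterion.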

The visual below compares the Elbow curve and the Silhouette plot:

It’s clear that the Elbow curve can be highly misleading and inaccurate here.

In a dataset with 25 clusters:

- The Elbow curve depicts **four** as the optimal number of clusters.
- The Silhouette curve depicts **25** as the optimal number of clusters.

We have already covered this in a couple of previous newsletter issues if you want to get into more details:

Coming back...

The above limitations of KMeans highlight why learning about other algorithms that address them is extremely important.

While KMeans comes under the centroid-based clustering algorithms, there are many different types of clustering algorithms (shown below) that can be used depending on the situation:

We have already covered distribution-based clustering before:

So the focus of this article is **Density-based clustering**, specifically DBSCAN++, which is a significant improvement on the DBSCAN clustering algorithm.

Let’s understand in detail!

But before understanding DBSCAN++, let’s understand what DBSCAN is.

DBSCAN, which stands for **Density-Based Spatial Clustering of Applications with Noise**, is a popular clustering algorithm in machine learning and data mining.

As the name suggests, the core idea behind DBSCAN is to group together data points based on “density”, i.e., points that are close to each other in a high-density region and are separated by lower-density regions.

It does a great job of seeking areas in the data with a high density of data points, versus areas of the data that are not very dense with observations.

The notion of density in DBSCAN lets it sort data into clusters of varying shapes as well, which is its substantial advantage over traditional clustering algorithms such as KMeans.

As a result, it immediately resolves each of the above-discussed limitations of the KMeans clustering algorithm.

- As it clusters based on the notion of “density”, the clusters may not necessarily have a globular shape. Instead, they may have arbitrary shapes, as depicted below:

- Unlike KMeans, DBSCAN does not allocate each data point to a cluster. This is because DBSCAN operates on a fundamentally different principle—density-based clustering. Outliers are regions of very low density, and we will see shortly that DBSCAN can quickly identify regions of low density and classify points in those regions as outliers.
- What’s more, another benefit of DBSCAN is that it doesn’t require specifying the number of clusters in advance. Instead, it identifies clusters based on the density of data points.

We will walk through a simple example to understand how the algorithm works.

Let’s say we have a dataset of points like the following:

Clearly, we have two distinct clusters and one noise point at the center. Our objective is to cluster these points into groups that are densely packed together.

Firstly, we count the number of points in the vicinity of each data point. For example, if we start with the green point, we draw a circle around it.

The radius `epsilon` of the circle is the first parameter that we must specify when using DBSCAN. In other words, this is a hyperparameter.

After drawing the circle, we count the number of data points that fall within that $\epsilon$ radius circle. For instance, for our green point, there are six close points.

Likewise, we count the number of close points for all remaining points.

After counting the number of data points in the vicinity of each data point, we classify every data point into one of the three categories:

**Core point**: A data point that has at least `minPts` data points (including the point itself) at a distance less than or equal to `epsilon`. For instance, if `minPts=5`, the green point in the figure below is a core point:

**Border point**: A data point that does not have at least `minPts` data points (including the point itself) at a distance less than or equal to `epsilon`, but is in the vicinity of a core point. For instance, in the figure below, the green point is a border point.

**Noise point**: A data point that is neither a core point nor a border point is a noise point. For instance, in the figure below, the purple data point is a noise point:

And of course, as you may have already guessed, `minPts` is another hyperparameter of DBSCAN.

After classifying all the data points as core, border, or noise points, we proceed with clustering.

The idea is simple.

- Start with any core point (let’s call it `A`) and assign it to a cluster with cluster-ID, say `1` (to begin).

- All points in the vicinity of the above core point `A` will belong to the same cluster — whose cluster-ID is `1`.

- If a data point in the vicinity of the core point `A` is also a core point, as shown below...

- ...then we include data points in its vicinity in the same cluster — whose cluster-ID is `1`.

- The above step is executed **recursively** under the same cluster-ID until we cannot find a new core point.

- At this point, we may still have some core points in the dataset, but they may not belong to the previous cluster, as shown below. We repeat the same steps as above but with a new cluster ID this time.

The algorithm completes its execution once we are left with no core points.
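The core/border/noise classification described above can be sketched from scratch in NumPy (the points, `eps`, and `min_pts` below are made up for illustration):

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' per the DBSCAN definitions."""
    # Pairwise Euclidean distances between all points
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    neighbors = d <= eps                      # includes the point itself
    is_core = neighbors.sum(axis=1) >= min_pts
    labels = []
    for i in range(len(X)):
        if is_core[i]:
            labels.append("core")
        elif neighbors[i][is_core].any():     # within eps of some core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],   # dense blob
              [2.2, 0.5],                                    # near the blob
              [10, 10]])                                     # isolated
print(classify_points(X, eps=1.5, min_pts=4))
# → ['core', 'core', 'core', 'core', 'core', 'border', 'noise']
```

The clustering step then just connects core points that are within `eps` of each other and attaches their border points.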

In DBSCAN, determining the `epsilon` parameter is often tricky. But the Elbow curve is often helpful in determining it.

To begin, as we discussed above, DBSCAN has two hyperparameters:

- `epsilon`: two points are considered neighbors if they are closer than `epsilon`.
- `minPts`: the minimum number of neighbors required for a point to be classified as a core point.

We can use the Elbow curve to find an optimal value of `epsilon`:

For every data point, plot the distance to its $k^{th}$ nearest neighbor (in increasing order), where $k$ is the `minPts` hyperparameter. This is called the k-distance plot.

The optimal value of `epsilon` is found near the elbow point.
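A minimal NumPy sketch of the k-distance computation behind this plot (the dataset is synthetic and the value of `k` is illustrative):

```python
import numpy as np

def k_distances(X, k):
    """Distance from each point to its k-th nearest neighbor, sorted ascending."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    d.sort(axis=1)              # column 0 is the distance to itself (0.0)
    return np.sort(d[:, k])     # k-th neighbor, sorted for plotting

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),   # one dense cluster
               [[5.0, 5.0]]])                       # one isolated point
dists = k_distances(X, k=4)
# The sharp jump at the end of this curve (the "elbow") comes from the
# isolated point; epsilon is typically chosen just below that jump.
print(dists[-1] > 5 * dists[-2])  # → True
```

Plotting `dists` against the point index gives the k-distance plot described above.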

Why does it work?

Recall that in this plot, we are measuring the distance to a specific ($k^{th}$) neighbor for all points. Thus, the elbow point suggests a distance to a more isolated point or a point in a different cluster.

The point where change is most pronounced hints towards an optimal epsilon. The efficacy is evident from the image above.

As depicted above, selecting the elbow value provides better clustering results than an arbitrary value.

**Effective Density-Based Clustering**:

- Quite intuitively, DBSCAN excels in separating clusters of high density from those of low density within a given dataset.
- By focusing on the density of data points, it can identify clusters with varying shapes and sizes.

**Robust Outlier Handling**:

- DBSCAN is particularly adept at handling outliers within the dataset.
- As discussed earlier, a core point is itself an indicator of high density because of the many points in its vicinity. So if a data point has no core point close to it, we can intuitively say that it is an outlier.
- In fact, in some cases, DBSCAN is primarily employed as an outlier detection technique. The algorithm's ability to designate noise points as outliers contributes to a more comprehensive understanding of the data's structure.
- We don’t even have to proceed with clustering for outlier detection. Instead, we can just determine if a point is a noise point or not using the parameters `epsilon` and `minPts`.
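With scikit-learn's `DBSCAN`, noise points are simply those labeled `-1`, so outlier detection falls out of the clustering itself. A small sketch with illustrative data and parameter values:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Two dense clusters plus one obvious outlier (illustrative data)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 6]],
                  cluster_std=0.5, random_state=0)
X = np.vstack([X, [[20.0, 20.0]]])

labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(X)

print(labels[-1])               # the isolated point gets the noise label: -1
print(len(set(labels) - {-1}))  # number of clusters found
```

No number of clusters was specified anywhere; DBSCAN discovered it from the density of the data.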

Density-based clustering algorithms have greatly impacted a wide range of areas in data analysis, including outlier detection, computer vision, and medical imaging.

In fact, in many practical applications, as data volumes rise, it can become increasingly difficult to collect labels for supervised learning.

**In such cases, non-parametric unsupervised algorithms are becoming increasingly important in understanding large datasets.**

However, one of the biggest issues with DBSCAN is its run time.

For a long time, DBSCAN was believed to have a run-time of $O(n \log n)$, until it was proven to be $O(n^2)$ in the worst case.

Thus, there is an increasing need to establish more efficient versions of DBSCAN.

Although we won’t get into much detail here, it can be proved that DBSCAN can run in $O(n \log n)$ when the dimension is at most $2$, which is rarely the case.

However, it quickly starts to exhibit quadratic behavior in high dimensions and/or when $n$ becomes large.

This can also be verified from the figure below:

In this experiment, we have a simple 2-dimensional dataset with a varying number of data points.

It is clear that DBSCAN possesses a quadratic run-time with respect to the number of data points.

The quadratic run-time for most density-based algorithms can be realized from the fact that DBSCAN implicitly must compute density estimates for each data point, which is linear time in the worst case for each query.

As discussed above, in the case of DBSCAN, such queries are proximity-based, which are computed for each pair of data points.

This is what gives it a quadratic nature of run-time performance.

DBSCAN++ is a step towards a fast and scalable DBSCAN.

Simply put, DBSCAN++ is based on the observation that...

In many practical machine learning (ML) projects, it is a common practice to consolidate the data at a central location.

Subsequently, machine learning engineers leverage this centralized data for:

- Analysis,
- Conducting feature engineering,
- And ultimately proceed with the model training, validation, scaling, deployment, and ongoing production monitoring.

This traditional method is widely accepted and employed in developing ML models.

Nevertheless, a notable challenge associated with this conventional approach is its requirement for data to be physically centralized before any subsequent processing can occur.

Let's understand the issues with this in detail!

Consider that our application has a user base of millions. It’s evident that the quantity of data to deal with can be extremely high.

This data is valuable because modern devices have access to a wealth of data that can be suitable for machine learning models.

This data can significantly improve the user experience on the device.

For instance:

- If it’s text data, then language models can improve speech recognition and text entry
- If it’s image data, then many downstream image models can be improved, and more.

However, the conventional machine learning approach, which involves aggregating all data in a central repository, presents many challenges in such situations.

More specifically, in this approach, transferring data from individual user devices to a central location is both bandwidth and time-intensive, discouraging users from participating.

Even if users were incentivized to contribute data, the redundancy of having the data on both the user's device and the central server could be logistically infeasible because of the amount of data we might be dealing with.

Moreover, the data often contains personal information such as photos, private texts, and voice notes.

Requesting users to upload such sensitive data not only jeopardizes privacy but also raises legal concerns. Storing such data in a centralized database becomes problematic, introducing feasibility issues and privacy violations.


Moving large amounts of data to a central server can be costly in terms of user bandwidth and time.

But the data is still valuable to us, isn’t it? We want to utilize it in some way.

**Federated learning** is an incredible machine learning model training technique that minimizes data transfer, making it suitable for low-bandwidth and high-latency environments.

Let’s understand!

Formally, federated learning represents a decentralized approach to machine learning, wherein the training data remains localized on individual devices, such as smartphones.

Instead of transmitting data to a central server, models are dispatched to devices, trained locally, and only the resultant model updates are gathered and sent back to the server.

In essence, this approach involves leaving the training data on individual devices while learning a shared model by aggregating locally computed gradient updates.
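One common aggregation scheme is Federated Averaging (FedAvg), where the server computes a weighted average of locally trained model weights. A minimal NumPy sketch with made-up client updates:

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Aggregate client model weights, weighted by local dataset size (FedAvg)."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Two clients trained locally; only their weight vectors leave the device.
client_weights = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
client_sizes = [100, 300]   # client 2 has 3x more local data

global_weights = federated_average(client_weights, client_sizes)
print(global_weights)  # → [2.5 3.5]
```

The server never sees any raw data point, only these weight vectors, and the new global weights are then broadcast back to the clients for the next round.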

💡

The terminology "Federated Learning" is derived from the concept that a loose **federation** of participating devices (referred to as clients) collaborates with a central server to solve the learning task.

One of its primary merits lies in enhancing privacy and security by eliminating all dependencies on centralized data collection.

This is because each client possesses a local training dataset that remains exclusively on the device and is never uploaded to the server.

Instead, clients compute updates to the global model maintained by the server, transmitting only the essential "model update."

As a result, the entire model update process occurs on the client side, providing a key advantage by decoupling model training from the necessity for direct access to raw training data.

While a degree of trust in the coordinating server is required, federated learning effectively addresses major concerns associated with the conventional centralized approach to machine learning model training.

By facilitating on-device training and minimizing the need for extensive data transfer, federated learning presents practical solutions to challenges inherent in the traditional model training paradigm.

The primary motivations for using federated learning are:

**Privacy:**

- Safeguarding user data is a top priority, especially as more and more users have started caring about their privacy.
- Centralized data repositories pose inherent privacy risks, while federated learning mitigates these concerns by allowing data to reside exclusively on user devices, minimizing exposure.

**Bandwidth and Latency:**

- As previously discussed, the resource-intensive process of transferring substantial data volumes to a central server can be both time and bandwidth-consuming.
- Federated learning strategically minimizes data transfer, proving particularly advantageous in environments characterized by low bandwidth and high latency.

**Data Ownership:**

- Users maintain control and ownership of their data within the federated learning framework.
- This not only addresses concerns related to data ownership but also ensures the preservation of data rights, offering a user-centric approach to machine learning.

**Scalability:**

- Federated learning exhibits a natural scalability that aligns seamlessly with the increasing number of devices.
- This inherent scalability renders it well-suited for applications on a large scale, spanning mobile devices, IoT devices, and edge computing scenarios.

In essence, federated learning represents a paradigm shift by bringing our models to where the data resides, as opposed to the conventional approach of moving data to the location where the model is situated.

This inversion of the traditional model training process emphasizes the adaptability and efficiency of federated learning in contemporary data-driven applications.

Certainly, at this point, the argument may arise that anonymizing data before uploading it to central servers can address privacy concerns.

Simply put, anonymizing means removing all personally identifiable information (PII) from a dataset.

This typically involves replacing or encrypting specific data elements to prevent the identification of individuals associated with the information.

However, contrary to common belief, even handling anonymized data can introduce privacy issues.

Consider a scenario with a database of cardholders — a highly sensitive dataset.

While masking card numbers is a common practice, additional details such as cardholder addresses, necessary for processing, may still be present.

Thus, anonymizing the dataset does not always guarantee the elimination of privacy concerns.

Federated learning, on the other hand, minimizes the transmission of data-specific information to centralized locations. As discussed above, the information transmitted is minimal, typically containing significantly less raw data.

In this paradigm, only **model updates** are sent to the central server, and remarkably, the aggregation algorithm on the server side does not require knowledge of the source of these updates. Thus, the source information can be entirely ignored.

This lack of dependence on the source information guarantees true anonymity by ensuring that locally generated model updates can be transmitted without revealing any other details that might compromise user privacy.

This creates a mutually beneficial scenario.

- Users are content as their experience is driven by high-quality ML models without compromising their data.
- Simultaneously, teams benefit by successfully addressing various challenges, including:

- **Privacy Concerns:** Federated learning effectively sidesteps privacy issues associated with traditional centralized approaches.
- **Reduced Model Training Cost:** The approach helps mitigate costs associated with centralized model training.
- **Minimized Data Maintenance Cost:** Federated learning significantly diminishes the burden of data maintenance costs.
- **Large Dataset Training:** Teams can train models on expansive datasets without the need for centralized storage.
- **Better User Experience:** High-quality ML models can be developed without centralizing user data.

In essence, federated learning provides a win-win solution for everyone.

In federated learning, the scope of data used for model training extends beyond what centralized data engineering may have collected and managed.

By tapping into the full spectrum of data residing on individual user devices, federated learning enables models to learn from diverse and rich datasets.

This diversity enhances the robustness of models, making them more representative of real-world scenarios.

Federated learning not only improves models through collaborative training but also extends benefits directly to users.

When a user's device participates in model training, it receives updates based on collective knowledge, enhancing the user experience.

For example, in a personalized recommendation system, a user benefits from a model trained on the preferences of a larger user base, leading to more accurate and tailored recommendations.

Unlike traditional centralized approaches that demand substantial computational resources for data processing and model training on a central server, federated learning redistributes most computation to user devices.

This shift brings several advantages:

- **Reduced Server Load**: Central servers require less computational power as they no longer need to process and train on massive amounts of data.
- **Lower Latency**: Users experience lower latency since data doesn’t need to be transmitted to a remote server for processing, thereby improving the overall user experience.
- **Energy Efficiency**: The local computation on user devices can be more energy-efficient.

Before understanding key strategies for federated learning, it is essential to understand that the applicability of federated learning is not a one-size-fits-all proposition.

Rather than adopting it everywhere, understanding the specific situations when federated learning is the optimal approach is critical.

This is because if you understand these specific types of situations and come across one someday, you will immediately know that federated learning is the way to go.

In my experience, ideal problems for federated learning have the following properties:

In my experience, most ML projects lack a dedicated experimentation management/tracking system.

As the name suggests, this helps us track:

- **Model configuration** → critical for reproducibility.
- **Model performance** → critical for comparing different models.

…across all experiments.
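To make the idea concrete, here is a bare-bones, stdlib-only sketch of the kind of record such a tracking system stores per run (all names and values here are made up; tools like DVC provide a far more complete version of this):

```python
import json
import time
import uuid
from pathlib import Path

def log_experiment(params, metrics, log_dir="experiments"):
    """Append one run's configuration and performance to a JSON-lines log."""
    Path(log_dir).mkdir(exist_ok=True)
    record = {
        "run_id": uuid.uuid4().hex[:8],
        "timestamp": time.time(),
        "params": params,    # model configuration -> reproducibility
        "metrics": metrics,  # model performance  -> comparison across runs
    }
    with open(Path(log_dir) / "runs.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

run = log_experiment({"lr": 0.01, "max_depth": 6}, {"accuracy": 0.93})
print(run["params"]["lr"])  # → 0.01
```

A real tracking system additionally hashes the pipeline's inputs so it knows which steps can be skipped on a rerun, which is exactly the gap DVC fills.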

What’s more, consider that our ML pipeline has three steps:

If we only made some changes in model training (step 3), say, we changed a hyperparameter, does it make any sense to rerun the first two steps?

No, right?

Yet, typically, most ML pipelines rerun the entire pipeline, wasting compute resources and time.

Of course, we may set some manual flags to avoid this.

But being manual, it will always be prone to mistakes.

To avoid this hassle and unnecessary friction, an ideal tracking system must be aware of:

- All changes made to an ML pipeline.
- The steps it can avoid rerunning.
- The only steps it must execute to generate the final results.

While the motivation is quite clear, this is a critical skill that most people ignore, and they continue to leverage highly inefficient and manual tracking systems — Sheets, Docs, etc.

To help you develop that critical skill, I'm excited to bring you a special guest post by Bex Tuychiev.

Bex is a Kaggle Master, he’s among the top 10 AI/ML writers on Medium, and I am a big fan of his writing.

In this machine learning deep dive, he will provide a detailed guide on where we left last week — data version control with DVC.

Make sure you have read that article before proceeding ahead:

More specifically, this article will expand on further highly useful features of DVC for machine learning projects.

The article has been divided into two parts, and by the end of this article, you will learn:

- How to efficiently track and log your ML experiments?
- How to build efficient ML pipelines?

Over to Bex!

Keeping track of machine learning experiments is like keeping FIVE dogs in a bathtub.

Without help, at least FOUR of them are bound to slip out of your hands and ruin everything.

A total disaster is what’s going to happen if you don’t have a proper experiment management system.

First, you’ll probably end up with a complete mess of code, with no idea which version of the model is the most recent or the best performing.

You’ll constantly be overwriting and losing important code, and it will be almost impossible to reproduce your results or track your progress.

On top of that, you’ll have no way of keeping track of hyperparameters, metrics, or any other important details of your experiments (unless you want to write them down). You’ll be flying blind.

**In all seriousness, a proper experiment management system is crucial for any machine learning project.**

It allows you to track and compare your experiments, reproduce results, and make informed decisions about the direction of your project.

Without it, you’re just shooting in the dark and hoping for the best.

By finishing this tutorial, you will be able to track your machine-learning experiments **by adding a single line of code to your training script**.

In the end, you will have a table of experiments, which you can sort by any metric or parameter to find the best model for your use case.

Let's begin!

In a recent deep dive into model deployment, we discussed the importance of version-controlling deployments in machine learning (ML) projects:

More specifically, we looked at techniques to version control:

- Our deployment code, and
- Our deployed model.

Moving on, we also looked at various advantages of version-controlling model deployments.

Let’s recap those.

For instance, with version control, one can precisely identify what changed, when it changed, and who changed it — which is crucial information when trying to diagnose and fix issues that arise during the deployment process or if models start underperforming post-deployment.

**Another advantage of version control is effective collaboration.**

For instance, someone in the team might be working on identifying better features for the model, and someone else might be responsible for fine-tuning hyperparameters or optimizing the deployment infrastructure.

And it is well known that with version control, teams can work on the same codebase/data and improve the same models without interfering with each other’s work.

Moreover, one can easily track changes, review each other’s work, and resolve conflicts (if any).

**Lastly, version control also helps in the reproducibility of an experiment.**

It ensures that results can be replicated and validated by others, which improves the overall credibility of our work.

Version control allows us to track the exact code version and configurations used to produce a particular result, making it easier to reproduce results in the future.

This becomes especially useful for open-source data projects that many programmers may use.

**HOWEVER!**

Let me ask you something:

Purely from a reproducibility perspective, do you think model and code versioning are sufficient?

In other words, are these the only requirements to ensure model reproducibility?

See, when we want to reproduce a model:

- First, we need the exact version of the code that was used to train the model.
    - We have access to the code through code versioning.
- Next, we need the exact configuration the model was trained with:
    - This may include the random seed used in the model, the learning rate, the optimizer, etc.
    - Typically, configurations are a part of the code, so we know the configuration as well through code versioning.
- Finally, we need the trained model to compare its performance with the reproduced model.
    - We have access to the model as well through model versioning.

But how would we train the model without having access to the **exact dataset** that was originally used?

In fact, across model updates, our data can largely vary as well.

Thus, it becomes important to track the exact version of the dataset that was used in a specific stage of model development and deployment.

This is precisely what **data version control** is all about.

The motivation for maintaining a data version control system is quite intuitive and straightforward.

Typically, all real-world machine learning models are trained using large datasets.

Before training a model and even during its development, many transformations are regularly applied to the dataset.

This may include:

- Preprocessing
- Transformation
- Feature engineering
- Tokenization, and many more.

As the number of model updates (or iterations) increases, it can get quite difficult to track which specific version of the dataset was used to train the machine learning model.

If that is clear, then the motivation for having a data version control system in place becomes quite intuitive and simple, as it addresses many data-specific challenges:

With a data version control system, we can precisely reproduce the same training dataset for any given version of our machine learning model.

By tracking and linking each dataset version to a specific model version, we can easily recreate the exact conditions under which a model was trained, making it possible to replicate results.
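One simple way to realize this linking is content-addressing: hash the dataset file and store the digest alongside the model version (this is, roughly, what data version control tools do under the hood). A minimal stdlib sketch with a made-up file:

```python
import hashlib
from pathlib import Path

def dataset_version(path):
    """Hash a dataset file; the digest uniquely identifies this data version."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

Path("train.csv").write_text("x,y\n1,2\n")
v1 = dataset_version("train.csv")

Path("train.csv").write_text("x,y\n1,2\n3,4\n")   # the data changed
v2 = dataset_version("train.csv")

print(v1 != v2)  # → True: any change to the data yields a new version id
```

Recording the digest next to each trained model is enough to tell, later, exactly which data that model was trained on.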

Like codebase traceability, data version control enables traceability by documenting the history of dataset changes.

It offers a clear lineage of how the dataset has evolved over time, including information about who made changes, when those changes occurred, and the reasons behind those modifications.

This traceability is crucial for understanding the data's quality and history, ensuring transparency and accountability in our ML pipeline.

In collaborative machine learning projects, team members need to work on the same data, apply transformations, and access shared dataset versions.

Data version control simplifies data sharing and collaboration by providing a centralized repository for datasets.

Team members can easily access and sync with the latest data versions, ensuring that everyone is on the same page.

As we would see ahead in the practical demo, data version control systems help in optimizing storage and bandwidth usage by employing techniques such as:

- data deduplication
- data caching

This ensures that we do not waste resources on storing redundant copies of large datasets, making the storage and transfer of data more efficient.

It’s common for datasets to undergo quality checks and transformations during their preparation.

Data version control can help identify issues or discrepancies in the dataset as it evolves over time.

By comparing different dataset versions, we can spot unexpected changes and revert to previous versions if necessary.

Many industries and organizations have stringent data governance and compliance requirements.

Data version control supports these needs by providing a well-documented history of data changes, which can be crucial for audits and regulatory compliance.

In situations where a newer model version performs worse or encounters unexpected issues in production, having access to previous dataset versions allows for easy rollback to a more reliable dataset.

This can be a lifesaver in many situations.

**Here, you might be wondering why data rollback is of any relevance to us when it’s the model that must be rolled back.**

You are right.

But there are situations where data rollback is useful.

For instance, assume that your deployed machine learning model is a k-nearest neighbor (kNN) model.

**The thing is that we never train a kNN.**

In fact, there are no explicitly trained weights in the case of a kNN. Instead, there’s only the training dataset that is used for inference purposes.

The model is effectively the training data, and predictions are made based on the similarity of new data points to the stored instances.

If anything goes wrong with this dataset during production, having a data version control system in place can help us quickly roll back to a previous reliable dataset.

What could go wrong, you may wonder?

- Maybe the data engineering team has stopped collecting a specific feature due to a compliance issue.
- Maybe there were some human errors, like accidental data deletions, rendering the kNN unusable.
- Maybe there was data corruption — data files or records became corrupted due to a hardware failure, network issues, or software bugs — again rendering the kNN unusable.

In any of these cases, the data version control system lets us quickly roll back to a previous reliable dataset.

In the meantime, we can work in the development environment to investigate the issue and decide on the next steps.

Now that we have understood the motivation for having a data version control system, let’s look at some considerations for a **data version control** system.

We know that Git can manage the versioning of any type of file in a Git repository. These can be code, models, datasets, config files, etc.

These can be hosted on remote repositories like GitHub with a few commands.

So for someone using GitHub to host codebases, they might be tempted to extend the same to version datasets:

- They may manage different versions of the dataset with Git locally.
- They may host the repository on GitHub (or other services) for collaboration.

The rationale behind this idea could be that Git can version-control data as elegantly as it version-controls codebases.

Sounds like a fair thing to do, right?

Well, it’s not!

Recall our objective again.

In the previous section, we discussed using data version control for production systems.

As production systems are much more complex and involve collaboration, the codebase is typically hosted in a remote repository, like on GitHub.

But the problem is that GitHub repositories (and other similar hosting tools like GitLab) always have an upper limit on the file size we can push to these remote repositories.

In other words, GitHub is designed for lightweight code scripts. It is not well-suited for version controlling large files: GitHub blocks pushes of files larger than 100 MB.

Typically, in machine learning projects, the dataset size can be in the order of GBs. Thus, it is impractical to execute data version control with GitHub.

In fact, we can also verify this experimentally.

Consider we have the following project directory:

As shown above, the `data.csv` file takes about 200 MB of space.

Let’s create a local Git repository first before pushing the files to a remote repository hosted on GitHub.

To create a local Git repository:

- First, we shall initialize a git repository with `git init`.
- Next, we will add the files to the staging area with `git add`.
- Finally, we will use `git commit` to commit to the local repo.

This is demonstrated below:
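Concretely, the three steps map to the following commands (file names match the project directory shown earlier):

```shell
# Initialize a local repository in the project directory
git init

# Stage the dataset and the script
git add data.csv project.py

# Commit the staged files to the local repository
git commit -m "Add dataset and training script"

# Review the commit history
git log --oneline
```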

👉

This is not shown here but we will do the same for the project.py file as well.

All good so far. The changes have been committed successfully to the local git repo.

We can verify this by reviewing the commit history of the local git repo with the `git log` command.

Next, let’s try to push these changes to a remote GitHub repository. To do this, we have created a new repository on GitHub: `dvc_project`.

Before pushing the files to the remote repository, let’s also verify whether the local git repository has any uncommitted changes using the `git status` command:

💡

The `git status` command is used to check the status of your git repository. It shows the state of your working directory and helps you see all the files that are untracked by Git, staged, or unstaged.

The output says that the working tree is clean.

Thus, we can push these changes to the remote GitHub repository created above as follows:

As discussed earlier, GitHub repositories always have an upper limit on the file size we can push to these remote repositories. This is precisely what the above error message says.

For this reason, we can only use typical GitHub repositories for data version control as long as every file in the dataset stays below that size limit, which is rarely the case.

Thus, we need a better version control system, especially for large files, that does not have the above limitations.

💡

Before we proceed, a critical point to note here is that as long as we are **only working locally** and wish to maintain data version control in our personal machine learning projects, using the usual Git-based functionalities is not a big problem: you can use Git as long as you keep everything on your local computer. However, versioning large files is often time-consuming with Git, so it is recommended to use tools that are specifically built for this purpose.

An ideal data version control system must fulfill the following requirements:

**It should allow us to track all data changes like Git does with files.**

- As soon as we make any change (adding, deleting, or altering files) in a git-initialized repository, git can always identify those changes.
- The same should be true for our data version control system.

**It should not be limited to tracking datasets.**

- As we discussed above, GitHub sets an upper limit on the file size we can push to these remote repositories.
- But large files are not necessarily limited to datasets. In fact, models can also be large, and pushing model pickles to GitHub repositories can be difficult.

**It must have support for branching and committing.**

- Like Git, this data version control system must provide support for creating branches and commits.

**Its syntax must be similar to Git.**

- Of course, this is optional but good to have.
- Having a data version control system with a syntax similar to Git simplifies its learning curve.

**It must be compatible with Git.**

- In any data-driven project, data, code, and models always work together.
- After adopting the data version control system, it must not happen that we track code with Git and then manage models/datasets with an entirely incompatible tool.
- They must integrate seamlessly to avoid any unnecessary friction.

**It must have collaborative functionalities like Git.**

- Just as Git makes it extremely simple for teams to collaborate by sharing their code, the data version control system must also promote collaboration.

So what can we do here?

Random forest is a pretty powerful and robust model, which is a combination of many different decision trees.

What makes them so powerful over a traditional decision tree model is Bagging:

Anyone who has ever heard of Random Forest has surely heard of Bagging and **how** it works.

This is because, in my experience, there are plenty of resources that neatly describe:

- How Bagging algorithmically works in random forests.
- Experimental demo on how Bagging reduces the overall variance (or overfitting).

However, these resources often struggle to provide an intuition on:

- **Why** Bagging is so effective.
- **Why** do we sample rows from the training dataset **with replacement**.
- The mathematical demonstration that verifies variance reduction.

Thus, in this article, let me address all of these above questions and provide you with a clear and intuitive reasoning on:

- Why bagging makes the random forest algorithm so effective at variance reduction.
- Why does bagging involve sampling with replacement?
- How do we prove variance reduction mathematically?

👉

The code for this article and its practice exercise notebook has been provided towards the end of the article.

Let’s begin!

Decision trees are popular for their interpretability and simplicity.

Yet, unknown to many, they are pretty infamous when it comes to overfitting any data they are given.

This happens because a standard decision tree algorithm greedily selects the best split at each node, making its nodes more and more pure as we traverse down the tree.

Unless we restrict its growth, nothing can stop a decision tree from 100% overfitting the training dataset.

For instance, consider that we have the following dummy data, and we **intentionally** want to **100% overfit** it with, say, a linear regression model.

This task will demand some serious effort by the engineer.

In other words, we can’t just run `linear_model.fit(X, y)` in this case to directly overfit the dataset.

Instead, as mentioned above, this will require some serious feature engineering effort to entirely overfit the given dataset.

For instance, to intentionally overfit this dummy dataset, we would have to explicitly create relevant features, which, in this case, would mostly be higher-degree polynomial features.

This is shown below:
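As a minimal sketch of this idea (the dummy data and degrees are illustrative), we can watch the training fit improve as we hand linear regression higher-degree polynomial features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Dummy 1-D dataset: a noisy sine wave
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(30, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=30)

def train_r2(degree):
    """Fit a polynomial regression of the given degree; return training R^2."""
    X_poly = PolynomialFeatures(degree=degree).fit_transform(X)
    return LinearRegression().fit(X_poly, y).score(X_poly, y)

print(train_r2(1), train_r2(15))  # training fit climbs toward 1.0 with degree
```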

As shown above, as we increase the degree of our feature $x$ in our polynomial regression, the model starts to overfit the dataset more and more.

With a polynomial degree of $40$, the model entirely overfits the dataset.

The point is that overfitting this dataset (on any dataset, for that matter) with linear regression typically demands some engineering effort.

While the above dataset was easy to overfit, a complex dataset with all sorts of feature types may require serious effort to intentionally overfit the data.

**However**,** this is NEVER the case with a decision tree model.**

In fact, overfitting **any dataset** with a decision tree demands no effort from the engineer.

In other words, we can simply run `dtree_model.fit(X, y)` to overfit any dataset, regression or classification.

This happens because a standard decision tree always continues to add new levels to its tree until all leaf nodes are pure.

As a result, it always $100\%$ overfits the dataset by default, as shown below:
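We can confirm this behavior in a few lines (the dataset here is a hypothetical noisy sample):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy dummy regression data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.5, size=100)

# No growth restrictions: the tree splits until every leaf is pure
tree = DecisionTreeRegressor(random_state=0).fit(X, y)
print(tree.score(X, y))  # → 1.0, i.e., a 100% fit on the training data
```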

The same problem is observed in classification datasets as well.

For instance, consider the following dummy binary classification dataset.

It’s clear that there is some serious overlap between the two classes.

Yet, a decision tree does not care about that.

The model will still meticulously create its decision boundary such that it classifies the dataset with 100% accuracy.

This is depicted below:
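A minimal reproduction of the same effect on classification data (class overlap is simulated here via label noise):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Binary data with deliberately overlapping classes (20% flipped labels)
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.2, random_state=0)

# An unrestricted tree still carves a boundary around every noisy point
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.score(X, y))  # → 1.0 despite the heavy class overlap
```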

It is important to address this problem.

Of course, there are many ways to prevent this, such as pruning and ensembling.

👉

The main focus of this article is ensembling, specifically bagging, so we won’t get into much detail about pruning.

Pruning is commonly used in tree-based models, where it involves removing branches (or nodes) to simplify the model.

For instance, we can intentionally restrict the decision tree from growing after a certain depth. In sklearn’s implementation, we can do this by specifying the `max_depth` parameter.

Pruning is also possible by specifying the minimum number of samples required to split an internal node.

Another pruning technique is called the **cost-complexity-pruning (CCP)**.

CCP considers a combination of two factors for pruning a decision tree:

- Cost: the number of misclassifications
- Complexity: the number of nodes

Of course, dropping nodes will result in a drop in the model’s accuracy.

Thus, in the case of decision trees, the core idea is to iteratively drop sub-trees, which, after removal, leads to:

- a minimal increase in classification cost
- a maximum reduction of complexity (or nodes)

This is depicted below:

In the image above, both sub-trees result in the same increase in cost. However, it makes more sense to remove the sub-tree with more nodes to reduce computational complexity.

In sklearn, we can control cost-complexity-pruning using the `ccp_alpha` parameter:

- a large value of `ccp_alpha` → results in underfitting
- a small value of `ccp_alpha` → results in overfitting

The objective is to determine the optimal value of `ccp_alpha`, which gives a better model.

The effectiveness of cost-complexity-pruning is evident from the image below:

- Training the decision tree without any cost-complexity-pruning results in a complex decision region plot, and the model exhibits 100% accuracy.
- However, by tuning the `ccp_alpha` parameter, we prevented overfitting while improving the test set accuracy.
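One way to tune it (a sketch; in practice you would cross-validate rather than peek at a single test split) is to scan the candidate alphas suggested by sklearn’s `cost_complexity_pruning_path`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Candidate alphas come from the pruning path of the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# Score a pruned tree for each candidate alpha on held-out data
scores = {alpha: DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
                 .fit(X_tr, y_tr).score(X_te, y_te)
          for alpha in path.ccp_alphas}

best_alpha = max(scores, key=scores.get)
print(best_alpha, scores[best_alpha])
```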

Another widely used technique to prevent overfitting is ensemble learning.

In a gist, an ensemble combines multiple models to build a more powerful model.

Whenever I wish to intuitively illustrate their immense power, I use the following image:

They are fundamentally built on the idea that by aggregating the predictions of multiple models, the weaknesses of individual models can be mitigated. Combining models is expected to provide better overall performance.

Ensembles are primarily built using two different strategies:

- Bagging
- Boosting

Here’s how it works:

- Bagging creates different subsets of data with replacement (this is called bootstrapping).
- Next, we train one model per subset.
- Finally, we aggregate all predictions to get the final prediction.
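These three steps can be sketched from scratch in a few lines (a regressor version on a dummy noisy sine; the number of models is arbitrary):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.5, size=200)

# 1) Bootstrap: draw row indices WITH replacement, one subset per model
# 2) Train one unrestricted tree per subset
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# 3) Aggregate: average the individual predictions
def bagged_predict(X_new):
    return np.mean([t.predict(X_new) for t in trees], axis=0)

print(bagged_predict(X[:5]))
```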

Some common models that leverage Bagging are:

- Random Forests
- Extra Trees

Here’s how it works:

- Boosting is an iterative training process.
- The subsequent model puts more focus on misclassified samples from the previous model
- The final prediction is a weighted combination of all predictions

Some common models that leverage Boosting are:

- XGBoost,
- AdaBoost, etc.

Overall, ensemble models significantly boost the predictive performance compared to using a single model. They tend to be more robust, generalize better to unseen data, and are less prone to overfitting.

As mentioned above, the focus of this article is specifically **Bagging**.

In my experience, there are plenty of resources that neatly describe:

- How Bagging algorithmically works in random forests.
- Experimental demo on how Bagging reduces the overall variance (or overfitting).

For instance, we can indeed verify variance reduction ourselves experimentally.

The following diagram shows the decision region plot obtained from a decision tree and random forest model:

It’s pretty clear that a random forest does not exhibit as high variance (overfitting) as the decision tree model does.
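This comparison is easy to reproduce (synthetic two-moons data here; the exact numbers will vary):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.35, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# The lone tree memorizes its training data; the forest generalizes better
print("tree:  ", tree.score(X_tr, y_tr), tree.score(X_te, y_te))
print("forest:", forest.score(X_tr, y_tr), forest.score(X_te, y_te))
```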

Typically, these resources explain the idea of Bagging as follows:

Instead of training one decision tree, train plenty of them, each on a different subset of the dataset generated with replacement. Once trained, average the predictions of all individual decision tree models to obtain the final prediction. This reduces the overall variance and increases the model’s generalization.

However, these resources often struggle to provide an intuition on:

- **Why** Bagging is so effective.
- **Why** do we sample rows from the training dataset **with replacement**.
- The mathematical demonstration that verifies variance reduction.

Thus, in this article, let me address all of these above questions and provide you with a clear and intuitive reasoning on:

- Why bagging makes the random forest algorithm so effective at variance reduction.
- Why does bagging involve sampling with replacement?
- How do we prove variance reduction mathematically?

Towards the end, we shall also build an intuition towards the Extra trees algorithm and how it further contributes towards the variance reduction step.

Once we understand the objective bagging tries to solve, we shall also formulate new strategies to build our own bagging algorithms.

Let’s begin!

As shown in an earlier diagram, the core idea in a random forest model is to train multiple decision tree models, each on a different sample of the training dataset.

During inference, we take the average of all predictions to get the final prediction:

And as we saw earlier, training multiple decision trees reduces the model's overall variance.

But why?

**Let’s dive into the mathematics that will explain this.**
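Here is the standard identity we will build toward (a textbook result, stated up front: $\sigma^2$ is the variance of a single tree's prediction and $\rho$ the pairwise correlation between trees):

$$\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x)\right)=\rho\,\sigma^2+\frac{1-\rho}{B}\,\sigma^2$$

As the number of trees $B$ grows, the second term vanishes, so the averaged model's variance falls toward $\rho\sigma^2$. This also hints at why the sampling scheme matters: bootstrapping makes the trees less correlated, shrinking $\rho$ and hence the floor itself.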

The scikit-learn project has possibly been one of the most significant contributions to the data science and machine learning community for building traditional machine learning (ML) models.

Personally speaking, it’s hard to imagine a world without sklearn.

However, things get pretty concerning if we intend to deploy sklearn-driven models in real-world systems.

Let’s understand why.

Scikit-learn models are primarily built on top of NumPy, which, of course, is a fantastic and high-utility library for numerical computations in Python.

Yet, contrary to common belief, NumPy isn’t as optimized as one may hope to have in real-world ML systems.

One substantial reason for this is that **NumPy can only run on a single core of a CPU**.

This provides massive room for improvement as there is no parallelization support in NumPy (yet), and it naturally becomes a **big concern** for data teams to let NumPy drive their production systems.

Traditional ML models do perform well on tabular datasets. But as we discussed in a recent blog on model compression: *“Typically, when we deploy any model to production, the specific model that gets shipped to production is NOT solely determined based on performance. Instead, we must consider several operational metrics that are not ML-related.”*

Another major limitation is that scikit-learn models cannot natively run on Graphics Processing Units (GPUs).

Having GPU support in deployment matters because real-world systems often demand lightning-fast predictions and processing.

However, as discussed above, sklearn models are primarily driven by NumPy, which, disappointingly, can only run on a single core of a CPU. In this context, it is unlikely to have GPU support anytime soon.

In fact, this is also mentioned on Sklearn’s FAQ page:

**Question:** *Will you add GPU support?*

**Answer:** *No, or at least not in the near future. The main reason is that GPU support will introduce many software dependencies and introduce platform-specific issues. scikit-learn is designed to be easy to install on a wide variety of platforms.*

Further, they mention that “*Outside of neural networks, GPUs don’t play a large role in machine learning today, and much larger gains in speed can often be achieved by a careful choice of algorithms.*”

**I don’t entirely agree with this specific statement.**

Consider the enterprise space. Here, the data is primarily tabular. Classical ML techniques such as linear models and tree-based ensemble methods are frequently used to model the tabular data.

In fact, when you have tons of data to model, there’s absolutely no reason to avoid experimenting with traditional ML models first.

Yet, in the current landscape, one is often compelled to train and deploy deep learning-based models just because they offer optimized matrix operations using **tensors**.

We see a clear gap here.

Thus, in this article, let’s learn a couple of techniques today:

- How do we run traditional ML models on large datasets?
- How do we integrate GPU support with traditional ML models in deployment systems?
- While there is no direct way to do this, we must (somehow) compile our machine-learning model to tensor operations, which can be loaded on a GPU for acceleration. We’ll discuss this in the article shortly.

But before that, we must understand a few things.

More specifically:

- What are tensors?
- How are tensors different from a traditional NumPy array?
- Why are tensor computations faster than NumPy operations, and why are tensor operations desired?

Let’s begin!

Many often interpret tensors as a complicated and advanced concept in deep learning.

However, it isn’t.

The only thing that is ever there to understand about Tensors is that, like any NumPy array, Tensors are just another data structure to store multidimensional data.

- When we use NumPy to store numerical data, we create a **NumPy array** — NumPy’s built-in data structure.
- When we use PyTorch (for instance) to store numerical data, we create a **Tensor** — PyTorch’s built-in data structure.

That’s it.

**Tensor, like NumPy array, is just another data structure.**
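To make this concrete (assuming PyTorch is installed), both structures hold the same multidimensional data and behave almost identically:

```python
import numpy as np
import torch

arr = np.array([[1.0, 2.0], [3.0, 4.0]])      # NumPy's data structure
ten = torch.tensor([[1.0, 2.0], [3.0, 4.0]])  # PyTorch's data structure

# Same data, same shape, same element-wise semantics
print(arr.shape, tuple(ten.shape))             # (2, 2) (2, 2)
print((arr * 2).sum(), float((ten * 2).sum())) # 20.0 20.0

# Converting between the two is straightforward
ten_from_arr = torch.from_numpy(arr)           # shares memory on CPU
arr_from_ten = ten.numpy()
```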

Now, an obvious question at this point is:

Why create another data structure when NumPy arrays do the exact same thing of storing multidimensional data, and they are very well integrated with other scientific Python libraries?

There are multiple reasons why PyTorch decided to develop a new data structure.

NumPy undoubtedly offers:

- extremely fast, and
- optimized operations.

This happens through its vectorized operations.

Simply put, vectorization offers run-time optimization:

- when dealing with a batch of data together…
- …by avoiding native Python for-loops (which are slow).
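A quick illustration of the gap between a native Python loop and the vectorized equivalent (exact timings depend on your machine):

```python
import time
import numpy as np

x = np.random.rand(1_000_000)

t0 = time.perf_counter()
total_loop = 0.0
for v in x:                      # native Python for-loop: one element at a time
    total_loop += v * v
t1 = time.perf_counter()

total_vec = float(np.dot(x, x))  # vectorized: one call over the whole batch
t2 = time.perf_counter()

print(f"loop: {t1 - t0:.3f}s, vectorized: {t2 - t1:.3f}s")
```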

But as discussed earlier in this article, NumPy **DOES NOT support parallelism.**

Thus, even though its operations are vectorized, every operation is executed in a single core of the processing unit.

This provides further scope for run-time improvement.

Of course, there are open-source libraries like Numexpr that address this limitation by providing a fast evaluator for NumPy expressions using:

**Multi-threading:**

- It is a parallel computing technique that allows a program to execute multiple threads (smaller units of a process) concurrently.
- In the context of NumPy and libraries like Numexpr, multi-threading accelerates mathematical and numerical operations by dividing the computation across multiple CPU cores.
- This approach is particularly effective when you have a multi-core CPU, as it leverages the available cores for parallelism, leading to faster computation.

**Just-in-time (JIT) compilation:**

- JIT compilation is a technique used to improve the run-time performance of code by compiling it at run-time, just before execution.
- In the context of Numexpr (and similar libraries), JIT compilation involves taking a NumPy expression or mathematical operation and dynamically generating machine code specific to the operation.
- As a result, JIT-compiled code can run much faster than equivalent pure Python code because it is optimized for the specific operation and can make use of low-level hardware features.

The speedup offered by Numexpr is evident from the image below.

According to Numexpr’s documentation, depending upon the complexity of the expression, the speed-ups can range from 0.95x to 20x.

Nonetheless, the biggest problem is that Numexpr can only speed up element-wise operations on NumPy arrays.

This includes:

- Element-wise sum/multiplication, etc.
- Element-wise transformations like `sin`, `log`, etc.
- and more.

**But Numexpr has no parallelization support for matrix multiplications, which, as you may already know, are the backbone of deep learning models.**

This problem gets resolved in PyTorch tensors as they offer parallelized operations.

**An important point to note:**

**GPU parallelization:**

- If you're working with PyTorch tensors on a GPU (using CUDA), the matrix multiplication operation is highly parallelized across the numerous cores of the GPU.
- Modern GPUs consist of thousands of cores designed for parallel computation.
- When you perform a matrix multiplication on a GPU, these cores work together to compute the result much faster than a CPU could.

**CPU parallelization:**

- The extent of parallelization on a CPU may depend on the CPU’s architecture.
- All CPUs these days have multiple cores, and PyTorch is optimized to utilize these cores efficiently for matrix operations.
- While it may not be as parallel as a GPU, you can still expect significant speed improvements over performing the operation in pure Python.

We can also verify experimentally:

- On the left, we create two random NumPy arrays and perform matrix multiplication using the `np.matmul()` method.
- On the right, we create two random PyTorch tensors and perform matrix multiplication using the `torch.matmul()` method.

As depicted above, **PyTorch is over three times faster than NumPy**, which is a massive speed-up.

This proves that PyTorch provides highly optimized vector operations, which the neural network can benefit from, **not only during forward pass but backpropagation as well**.
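A minimal version of that experiment is below; note that the measured gap will vary with your BLAS build, thread count, and hardware:

```python
import time
import numpy as np
import torch

n = 1024
a_np = np.random.rand(n, n)
b_np = np.random.rand(n, n)
a_t = torch.from_numpy(a_np)   # zero-copy view of the same data
b_t = torch.from_numpy(b_np)

t0 = time.perf_counter(); c_np = np.matmul(a_np, b_np); t1 = time.perf_counter()
t2 = time.perf_counter(); c_t = torch.matmul(a_t, b_t); t3 = time.perf_counter()

print(f"NumPy: {t1 - t0:.4f}s, PyTorch: {t3 - t2:.4f}s")
```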

In fact, here’s another reason why tensor operations in PyTorch are faster.

See, as we all know, **NumPy is a general-purpose computing framework** that is designed to handle a wide range of numerical computations across various domains, not limited to just deep learning or machine learning.

In fact, NumPy is not just used by data science and machine learning practitioners, but it is also widely used in various scientific and engineering fields for tasks such as signal processing, image analysis, and simulations in physics, chemistry, and biology.

It’s so popular in Biological use cases that a bunch of folks extended the NumPy package to create BioNumPy:

While its versatility is a key strength, it also means that NumPy’s internal optimizations are geared toward a broad spectrum of use cases.

On the other hand, **PyTorch is purposely built for deep learning and tensor operations**.

This specialized focus allows PyTorch to implement highly tuned and domain-specific optimizations for tensor computations, including matrix multiplications, convolutions, and more.

These optimizations are finely tuned to the needs of deep learning, where large-scale matrix operations are fundamental.

Being niched down to a specific set of users allowed PyTorch developers to optimize tensor operations, including matrix multiplication to a specific application — deep learning.

If you want another motivating example, we discussed this in the newsletter here:

In a gist, the core idea is that the more specific we get, the better we can do compared to a generalized solution.

Deep learning is all about a series of matrix operations applied layer after layer to generate the final output.

For instance, consider the following neural network for regression:

- First, the input received at the input layer $(x_1, x_2, \cdots, x_m)$ is transformed by a set of weights $(W_A)$ and an activation function to get the output of the hidden layer.
- Next, the output of the hidden layer is further transformed by a set of weights $(W_B)$ to get the final output (t).

If we were to use NumPy to represent input, weights, and layer outputs, it would be impossible to tell how a specific array was computed.

For instance, consider the two NumPy arrays `arr1` and `arr2` below:

The NumPy array `arr3` holds no information about how it was computed.

In other words, as long as we don’t manually dig into the code, we can never tell:

- What were the operands?
- What was the operator?

But why do we even care about that information?

See, as long as we are only doing a forward pass in a neural network, we don’t care which specific operation and which arrays generated a particular layer output. We only care about the output in that case.

But that’s not how neural networks are trained, is it?

To train a neural network, we must run backpropagation.

To run backpropagation, we must compute gradients to update the weights.

And to compute gradients of a layer’s weights, we must know the specific arrays that were involved in that computation.

For instance, consider the above neural network again:

To update the weights $W_B$, we must compute the gradient $\Large \frac{\delta L}{\delta W_B}$.

The above gradient depends on the loss value $L$, which in turn depends on $\hat y$.

Thus, we must know the specific vectors that were involved in the computation of $\hat y$.

While this is clear from the above network:

- What if we add another layer?
- What if we change the activation function?
- What if we add more neurons to the layer?
- What if we were to compute the gradient of weight in an earlier layer?

All this can get pretty tedious to manage manually.

However, if (somehow) we can keep track of how each tensor was computed, what operands were involved, and what the operator was, we can simplify gradient computation.

A computational graph helps us achieve this.

Simply put, a computational graph is a directed acyclic graph representing the sequence of mathematical operations that led to the creation of a particular tensor.

💡

A directed acyclic graph is a directed graph with no directed cycles.

During network training, PyTorch forms this computational graph during the forward pass.

For instance, the computational graph for a dummy neural network is shown below:

💡

Here, we have represented the operation as **matmul** for simplicity. In reality, however, PyTorch stores the backward function of that operation in its computational graph.

- First, we perform a matrix multiplication between the input $X$ and the weights $W_A$ to get the output activations $Z_B$ (we are ignoring any activation functions for now).
- Next, we perform a matrix multiplication between the output activations $Z_B$ and the weights $W_B$ to get the network output $Z_C$.

During backpropagation, PyTorch starts scanning the computational graph backward, i.e., from the output node, iteratively computes the gradients, and updates all the weights.

The program that performs all the gradient computation is called PyTorch Autograd.
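Here is Autograd in action on the toy two-layer network above (shapes are illustrative; activations are omitted, as in the text):

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 3)                        # a tiny batch of inputs
W_A = torch.randn(3, 5, requires_grad=True)  # hidden-layer weights
W_B = torch.randn(5, 1, requires_grad=True)  # output-layer weights

Z_B = X @ W_A          # the forward pass builds the computational graph...
Z_C = Z_B @ W_B        # ...each tensor remembers the op that produced it
loss = Z_C.pow(2).mean()

loss.backward()        # walk the graph backward, filling in .grad
print(W_A.grad.shape, W_B.grad.shape)
```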

Large deep learning models demand plenty of computational resources for speeding up model training.

However, NumPy operations are primarily designed to run on the Central Processing Unit (CPU), which is the general-purpose processor in most computers.

While CPUs are versatile and suitable for many tasks, they do not provide the speed and parallel processing capabilities needed for large-scale numerical computations, especially in the context of modern deep learning and scientific computing.

On a side note, if you genuinely want to run NumPy-like computation on a GPU, CuPy is an open-source NumPy alternative that you may try.

It’s a NumPy-compatible array library for GPU-accelerated computing.

The syntax of CuPy is quite compatible with NumPy. To use the GPU, you typically just need to replace `import numpy as np` with `import cupy as cp` (and the `np.` calls with `cp.`) in your code.

Nonetheless, the issue of not being able to track how each array was computed still exists with CuPy.

Thus, even if we wanted to, we could not use CuPy as an alternative to NumPy.

In fact, CuPy, like NumPy, is also a general-purpose scientific computation library. So any deep learning-specific optimizations are still not up to the mark.

These limitations prompted PyTorch developers to create a new data structure, which addressed these limitations.

This also suggests that by somehow compiling machine learning models to tensor computations, we can leverage immense inference speedups.

Before getting into those details, let’s understand how we can train sklearn models on large datasets on a CPU.

So far, we have spent plenty of time understanding the motivation for building traditional ML models on large datasets.

As sklearn can only utilize the CPU, using it for large datasets is still challenging.

Yet, there’s a way.

We know that sklearn provides a standard API across each of its machine learning model implementations.

- Train the model using `model.fit()`.
- Predict the output using `model.predict()`.
- Compute the accuracy using `model.score()`.
- and more.

However, the problem with training a model this way is that the sklearn API expects the entire training data at once.

This means that the entire dataset **must be** available in memory to train the model.

But what if the dataset itself is too large to load in memory? These are called out-of-memory datasets.

In fact, even if we can somehow barely load the dataset in memory, it might still be difficult to train the model because training itself requires additional computations, which, of course, consume memory.

Thus, there’s a high possibility that the program (or Jupyter kernel) may crash.

Nonetheless, there’s a solution to this problem.

In situations where it’s not possible to load the entire data into the memory at once, we can load the data in chunks and fit the training model for each chunk of data.

This is also called incremental learning, and sklearn conveniently provides the flexibility to do so.

More specifically, sklearn implements the `partial_fit()` API for various algorithms, which offers incremental learning.

As the name suggests, the model can learn incrementally from a mini-batch of instances. This sidesteps memory constraints, as only a few instances are loaded in memory at once.

What’s more, by loading and training on a few instances at a time, we can possibly speed up the training of sklearn models.

Why?

Usually, when we use the `model.fit(X, y)` method to train a model in sklearn, the training process is vectorized but on the entire dataset.

While vectorization provides magical run-time improvements when we have a bunch of data, it is observed that the performance may degrade after a certain point.

Thus, by loading fewer training instances at a time into memory and applying vectorization, we can get a better training run-time.

Let’s see this in action!

First, let’s create a dummy classification dataset with:

- 20 Million training instances
- 5 features
- 2 classes

We will use the `make_classification()` method from sklearn to do so:
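A sketch of the dataset creation (scaled down here so it runs quickly; the article's actual run uses 20 million rows, and the column names are illustrative):

```python
import pandas as pd
from sklearn.datasets import make_classification

# Article settings use n_samples=20_000_000; scaled down here for illustration.
X, y = make_classification(
    n_samples=100_000,
    n_features=5,
    n_informative=4,
    n_redundant=1,
    n_classes=2,
    random_state=42,
)

df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])
df["label"] = y
df.to_csv("large_dataset.csv", index=False)  # filename referenced later in the article
```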

After creating a Pandas DataFrame and exporting it to a CSV, the dataset occupies roughly 4 GBs of local storage space:

We train an `SGDClassifier` model using sklearn on the entire dataset as follows:
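A sketch of the full-dataset training (data is generated inline and scaled down so the snippet is self-contained; `model_full` is the name used in the later comparison):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Small inline dataset; the article trains on the full 20M-row CSV instead.
X, y = make_classification(n_samples=10_000, n_features=5, random_state=42)

model_full = SGDClassifier(random_state=42)
model_full.fit(X, y)           # the entire dataset must be in memory here

print(model_full.score(X, y))  # training accuracy
```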

The above training takes about $251$ seconds.

Next, let’s train the same model using the `partial_fit()` API of sklearn.

Here’s what we shall do:

- Load data from the CSV file `large_dataset.csv` in chunks.
    - We can do this by specifying the `chunksize` parameter in the `pd.read_csv()` method.
    - Say `chunksize=400000`; this would mean that Pandas will only load 400,000 rows at a time in memory.
- After loading a specific chunk, we will invoke the `partial_fit()` API on the `SGDClassifier` model.

This is implemented below:
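A self-contained sketch of the chunked approach (a small stand-in CSV is generated inline; the article uses `chunksize=400000` on the 4 GB file):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Create a small stand-in for large_dataset.csv so the sketch runs end to end.
X, y = make_classification(n_samples=50_000, n_features=5, random_state=42)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])
df["label"] = y
df.to_csv("large_dataset.csv", index=False)

model_chunk = SGDClassifier(random_state=42)
all_classes = np.array([0, 1])  # partial_fit needs the full class list up front

# chunksize=400_000 in the article; smaller here to match the toy dataset.
for chunk in pd.read_csv("large_dataset.csv", chunksize=10_000):
    X_chunk = chunk.drop(columns=["label"]).to_numpy()
    y_chunk = chunk["label"].to_numpy()
    model_chunk.partial_fit(X_chunk, y_chunk, classes=all_classes)

print(model_chunk.score(X, y))
```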

**The training time is reduced by ~8 times**, which is massive.

💡

In this case, the `classes` parameter is used to specify all the classes in the training dataset. When using the `partial_fit()` API, a mini-batch may not have instances of all classes (especially the first mini-batch). Thus, the model will be unable to cope with new/unseen classes in subsequent mini-batches. Therefore, we must pass a list of all possible classes in the `classes` parameter.

This validates what we discussed earlier:

While vectorization provides magical run-time improvements when we have a bunch of data, it is observed that the performance may degrade after a certain point. Thus, by loading fewer training instances at a time into memory and applying vectorization, we can get a better training run-time.

Of course, we must also compare the model coefficients and prediction accuracy of the two models.

The following visual depicts the comparison between `model_full` and `model_chunk`:

Both models have similar coefficients and similar performance.

Having said that, it is also worth noting that not all sklearn estimators implement the `partial_fit()` API.

Here's the list of models that do:

Once we have trained our sklearn model (either on a small dataset or large), we may want to deploy it.

However, as discussed earlier, sklearn models are backed by NumPy computations, so they **can only run on a single core of a CPU**.

Thus, in a deployment scenario, this can lead to suboptimal run-time performance.

Nonetheless, it is possible to compile many sklearn models to tensor operations, which can be loaded on a GPU to gain immense speedups.

Let’s understand how.

In recent deep dives, we’ve primarily focused on cultivating skills that can help us develop large machine learning (ML) projects.

For instance, in the most recent deep dive on “**Model Compression**”, we learned many techniques to drastically reduce the size of a model to make it more production-friendly.

In the above article, we saw how these techniques allow us to reduce both the latency and size of the original model, which directly helps in:

- Lowering computation costs.
- Reducing model footprint.
- Improving user experience due to low latency…

**…all of which are critical metrics for businesses.**

However, learning about model compression techniques isn’t sufficient.

In most cases, we would only proceed with model compression when the model is intended to serve an end-user.

And that is only possible when we know how to deploy and manage machine learning in production.

Thus, after learning about model compression techniques, we are set to learn the next critical skill — **deployment**.

In my opinion, many think about deployment as just “deployment” — host the model somewhere, obtain an API endpoint, integrate it into the application, and you are done!

**But that is almost NEVER the case.**

This is because, in reality, plenty of things must be done post-deployment to ensure the model’s reliability and performance.

What are they? Let’s understand!

Deployment is a pivotal stage in the ML project lifecycle. It’s that stage where a real user will rely on your model’s predictions.

Yet, it’s important to recognize that deployment is not the final destination.

After deploying a model, several critical considerations must be addressed to ensure its reliability and performance.

Let’s understand them.

Version control is critical to all development processes. It allows developers to track software changes (code, configurations, data, etc.) over time.

In the context of data teams, version control can be especially crucial when deploying models.

For instance, with version control, one can precisely identify what changed, when it changed, and who changed it — which is crucial information when trying to diagnose and fix issues that arise during the deployment process or if models start underperforming post-deployment.

This goes back to what we discussed in a recent deep dive — “*Machine learning deserves the rigor of any software engineering field.*”

If the model starts underperforming, git-based functionality allows us to quickly roll back to previous versions of the model.

There are many other benefits too.

Effective collaboration becomes increasingly important as data science projects get bigger and bigger.

Someone in the team might be working on identifying better features for the model, and someone else might be responsible for fine-tuning hyperparameters or optimizing the deployment infrastructure.

And it is well known that with version control, teams can work on the same codebase/data and improve the same models without interfering with each other’s work.

Moreover, one can easily track changes, review each other’s work, and resolve conflicts (if any).

Reproducibility is one of the critical aspects of building reliable machine learning.

Imagine this: something that works on one system but does not work on another reflects bad reproducibility practices.

Why is it important, you may wonder?

It ensures that results can be replicated and validated by others, which improves the overall credibility of our work.

Version control allows us to track the exact code version and configurations used to produce a particular result, making it easier to reproduce results in the future.

This becomes especially useful for open-source data projects that many may use.

CI/CD enables teams to build, test, and deploy code quickly and efficiently.

In machine learning, Continuous Integration (CI) may involve automatically building and testing changes to ML models as soon as they are committed to a code repository.

In Continuous Deployment (CD), the objective is to release model changes once they have passed testing.

Consequently, it should seamlessly update the changes to production, making the latest version of the model available to end users.

Model logging is another crucial aspect of post-deployment ML operations.

As the name suggests, logging involves capturing and storing relevant information about model performance, resource utilization, predictions, input data, latency, etc.

There are various reasons why model logging is important and why it’s something that should NEVER be overlooked.

To understand better, imagine you have already deployed a model, and it is serving end-users.

Once deployed, it is very unlikely that nothing will go wrong in production, **especially on the data front**!

Let’s understand in detail.

**Concept drift** happens when the statistical properties of the target variable or the input features of the model change over time.

In simpler terms, the relationship between your model's inputs and outputs evolves, making your model less accurate over time if not addressed.

Concept drift can occur due to various reasons, such as:

- Changes in user behavior
- Shifts in the data source
- Alterations in the underlying data-generating process.

For instance, imagine you are building a spam email classifier. You train the model on a dataset collected over several months.

Initially, the model performs well and accurately classifies spam and non-spam emails.

However, over time, email spamming techniques evolve.

New types of spam emails emerge with different keywords, structures, and techniques.

This change in the underlying concept of “spam” represents concept drift.

That is why it is important to have periodic retraining or continuous training strategies in place.

If your model isn't regularly retrained with up-to-date data, it may start misclassifying the new types of spam emails, leading to decreased performance.

💡

The term ‘Covariates’ refers to the features of your model.

**Covariate shift** is a specific type of distribution shift that occurs when the distribution of the input features (covariates) in your data changes over time, but the true relationship between the target variable and the input remains the same.

In other words, the true (or natural) relationships between the input features and the target variable stay constant, but the distribution of the input features shifts.

For instance, consider this is the true relationship, which is non-linear:

Based on the observed training data, we ended up learning a linear relationship:

However, at the time of inference post-deployment, the distribution of input samples was different from that of the observed distribution:

It leads to poor model performance because the model was trained on one distribution of the data, but now, it is being tested or deployed on a different distribution.

Methods for addressing covariate shifts include reweighting the training data or using domain adaptation techniques to align the source and target distributions.
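A toy sketch of the reweighting idea (the 1-D Gaussians standing in for the training and deployment input distributions are hypothetical; in practice the density ratio must be estimated from data):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x_train = rng.normal(0.0, 1.0, size=1_000)  # training inputs ~ p(x) = N(0, 1)

# Suppose deployment inputs follow q(x) = N(1, 1). Weight each training
# point by the density ratio q(x) / p(x) to emphasize the shifted region.
weights = norm.pdf(x_train, loc=1.0) / norm.pdf(x_train, loc=0.0)

# These weights can then be passed to most sklearn estimators, e.g.:
# model.fit(X, y, sample_weight=weights)
print(weights.mean())  # close to 1 in expectation
```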

💡

To an extent, batch normalization is a remedy for covariate shifts in neural networks.

For instance, suppose you are building a weather forecasting model. You train the model using historical weather data from a specific region, and the training data includes features like temperature, humidity, and wind speed.

However, when you deploy the model to a different region with a distinct climate, the distribution of these features can shift significantly.

For instance, temperature ranges and humidity levels in the new region might be quite different from those in the training data. This covariate shift can cause your model to make inaccurate predictions in the new environment.

When building statistical models, we typically assume that the samples are identically distributed.

**Non-stationarity** refers to the situation where the **probability distribution** of the samples evolves over time in a non-systematic or unpredictable manner.

This can encompass various aspects, including changes in data distributions, trends, seasonality, or other patterns.

**Non-stationarity can be challenging for machine learning models, as they are typically trained while assuming that the data distribution remains constant.**

For instance, assume you are building some wealth predictor. A raw currency amount will typically not serve as a good feature because currency values are affected by inflation.

Models deployed in non-stationary environments may need regular updates or adaptive learning strategies to cope with changing data patterns.

**Unrepresentative training data** is a situation where the data used to train a machine learning model does not adequately represent the real-world conditions or the diversity of scenarios that the model will encounter in production.

When training data is not representative, the model may perform well on the training data but poorly on new, unseen data.

This issue can lead to bias and poor generalization.

For instance, suppose you are building a speech recognition system for a voice assistant.

You collect training data primarily from young adults with clear accents and no speech impairments.

However, in real-world usage, the voice assistant will be used by people of all ages who speak various languages and have different speech patterns.

If your training data is unrepresentative and biased towards a specific demographic, the model may struggle to understand and accurately transcribe speech from a more diverse user base, leading to poor performance in production.

The above four problems that we typically face in production systems highlight the importance of model logging.

As discussed above, addressing these issues often involves continuous monitoring of model performance in the deployment environment, collecting and labeling new data when necessary, and retraining the model to adapt to changing conditions.

Traditionally, in industry, advanced techniques like adaptive learning are employed to update the model with the new data and potentially mitigate the impact of concept drift, covariate shift, non-stationarity, and unrepresentative training data.

👉

Below, we shall discuss a bit about Adaptive learning (mostly conceptual ideas), but we will do a full practical article on this soon.

Traditionally, ML models are trained on some gathered fixed/static dataset and then used to make predictions on unseen data.

While this is how machine learning has been typically (and successfully) approached so far, it becomes infeasible to train a new model from scratch every time we get some new data.

This makes intuitive sense as well.

Adaptive learning is a remedy to this problem.

Adaptive models can adapt and improve their performance as they are exposed to more data, leading to better accuracy and utility.

In situations where the data distribution is constantly changing, adaptive models can adapt and continue to perform well, while non-adaptive models may struggle.

A major advantage of adaptive learning is that the model isn’t trained from scratch: if the previously trained model seeds the next retraining iteration, the additional computational cost of every update is small.

In the realm of solving real-life problems by deploying machine learning models, it is inevitable that, with time, data distribution will change.

As a result, the models trained on old data will likely provide little value going forward.

Adaptive models are a great solution in such situations, as they consistently adapt to the incoming data streams.

In almost all ML use cases, the algorithm is never coded from scratch.

Instead, one uses open-source implementations offered by libraries like PyTorch, Sklearn, and many more.

To ensure reproducibility in production, the production environment should be consistent with the environment in which it was trained.

This involves installing similar versions of libraries used, software dependencies, OS configurations, and many more.

Of course, achieving this consistency need not be a painstaking process; at its core, all you should do is maintain an environment configuration.

Yet, it does require careful environment configuration and management.

This involves documenting and tracking the versions of all software components, libraries, and dependencies used during model development and deployment.

To address these consistency challenges, organizations often use containerization technologies like Docker.

Containers encapsulate the entire environment, including software dependencies, libraries, and configurations, ensuring that the same environment is replicated in both the development and production stages.

ML engineers may not have experience with deployment. They may not have the necessary expertise in areas such as software engineering, MLOps, and infrastructure management.

This can make it difficult for them to effectively deploy and scale models in production environments.

In such cases, organizations hire specialized talents.

However, engineers hired specifically for deployment may not have an in-depth understanding of ML algorithms and techniques.

This makes it difficult for them to understand the code and make the necessary optimizations, leading to issues with scaling, performance, and reliability, which can ultimately impact the effectiveness of the model in production.

The above pain points, along with the data challenges we discussed above, highlight the **necessity for a data scientist to have the necessary deployment expertise**.

Traditional hosting services like Google Cloud, AWS, and Heroku have been go-to options for deploying machine learning models.

However, the process can be challenging and time-consuming, requiring specialized expertise in infrastructure and DevOps.

For data scientists without these skills, deploying models to production can be a significant pain point.

There are several challenges associated with traditional hosting services.

First, data scientists often switch between different tools and environments to manage deployments. This means leaving the comfort of their Jupyter notebooks, where they spend most of their time developing and refining models.

The process can be jarring, and the need to learn new tools and interfaces can slow down productivity.

Second, deploying machine learning models to production environments demands plenty of configuration and management of infrastructure resources, including servers, networking, and security.

This is a specialized area that many data scientists may need to become more familiar with, and it can take a lot of time and effort to get right.

The above pain points highlight a need for a simple and elegant way to deploy machine learning models that doesn’t require specialized expertise and can be done entirely from a Jupyter Notebook.

Modelbit is a deployment service that specifically addresses all these challenges, allowing data scientists to deploy models with just a single command from their notebooks.

With Modelbit, there’s no need to worry about infrastructure, security, or server management — the service takes care of everything, allowing data scientists to focus on what they are supposed to do — building and improving models.

Thus, in this article, let’s understand how to use Modelbit for machine learning model deployment.

Let’s begin!

🗒️

The core objective behind model deployment is to obtain an API endpoint for our deployed model, which can be later used for inference purposes:

Modelbit lets us seamlessly deploy ML models directly from our Python notebooks (or Git, as we would see ahead in this article) and obtain a REST API.

Since Modelbit is a relatively new service, let’s understand the general workflow to generate an API endpoint when deploying a model with Modelbit.

The image below depicts the steps involved in deploying models with Modelbit:

- Step 1) We connect the Jupyter kernel to Modelbit.
- Step 2) Next, we train the ML model.
- Step 3) We define the inference function. Simply put, this function contains the code that will be executed at inference. Thus, it will be responsible for returning the prediction.
- Step 4) [OPTIONAL] Here, we specify the version of Python and other open-source libraries we used while training the model.
- Step 5) Lastly, we send it for deployment.

Once done, Modelbit returns the API endpoint, which we can integrate into any of the applications and serve end-users with.

Let’s implement this!

Training machine learning (ML) models is frequently driven by a relentless pursuit of achieving higher and higher accuracies.

Many create increasingly complex deep learning models, which, without a doubt, do incredibly well “performance-wise.”

However, the complexity severely impacts their real-world utility.

**For years, the primary objective in model development has been to achieve the best performance metrics.**

👉

This, unfortunately, is also a practice that many leaderboard-based competitions promote. Nothing wrong with that, but in my opinion, it overshadows the importance of focusing on the real-world applicability of the solution.

However, it is important to note that when it comes to deploying these models in production (or user-facing) systems, the focus shifts from raw accuracy to considerations such as efficiency, speed, and resource consumption.

Thus, typically, when we deploy any model to production, the specific model that gets shipped to production is NOT solely determined based on performance.

Instead, we must consider several operational metrics that are not ML-related.

**What are they? Let’s understand!**

When a model is deployed into production, certain requirements must be met.

Typically, these “requirements” are not considered during the prototyping phase of the model.

For instance, it is fair to assume that a user-facing model may have to handle plenty of requests from a product/service the model is integrated with.

And, of course, we can never ask users to wait for, say, a minute for the model to run and generate predictions.

Thus, along with “model performance,” we would want to optimize for several other operational metrics:

It’s the time it takes for a model to process a single input and generate a prediction.

It measures the delay between sending a request to the model and receiving the response.

Striving for a low inference latency is crucial for all real-time or interactive applications, as users expect a quick response.

High latency, as you may have guessed, will lead to a poor user experience and will not be suitable for many applications like:

- Chatbots
- Real-time speech-to-text transcription
- Gaming, and many more.
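A rough sketch of how per-request latency can be measured (the `dummy_model` stand-in is hypothetical; in practice you would time `model.predict()` calls):

```python
import time
from statistics import median

def dummy_model(x):  # hypothetical stand-in for model.predict()
    return sum(x) / len(x)

latencies_ms = []
for _ in range(1_000):
    start = time.perf_counter()
    dummy_model([1.0, 2.0, 3.0])
    latencies_ms.append((time.perf_counter() - start) * 1_000)

# Report the median (robust to outliers); p95/p99 are also common in production.
print(f"median latency: {median(latencies_ms):.4f} ms")
```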

Throughput is the number of inference requests a model can handle in a given time period.

It estimates the model’s ability to process multiple requests simultaneously.

Yet again, as you may have guessed, high throughput is essential for applications with a high volume of incoming requests.

These include e-commerce websites, recommendation systems, social media platforms, etc. High throughput ensures that the model can serve many users concurrently without significant delays.

This refers to the amount of memory a model occupies when loaded for inference purposes.

It quantifies the memory footprint required to store all the parameters, configurations, and related data necessary for the model to make predictions or generate real-time outputs.
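A back-of-the-envelope sketch of this footprint is parameter count × bytes per parameter (the 110M figure is BERT-base’s approximate size, used purely as an example):

```python
n_params = 110_000_000  # e.g., BERT-base has roughly 110M parameters
bytes_per_param = 4     # float32

size_mb = n_params * bytes_per_param / 1024**2
print(f"~{size_mb:.0f} MB just for the weights")  # ~420 MB
```

The real footprint is larger still, since activations, buffers, and framework overhead also occupy memory at inference time.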

The significance of model size becomes particularly apparent when deploying models in resource-constrained environments.

Many production environments, such as mobile devices, edge devices, or IoT devices, have limited memory capacity.

It is obvious to guess that in such cases, the model’s size will directly impact whether it can be deployed at all.

Large models may not fit within the available memory, making them impractical for these resource-constrained settings.

This is a famous story.

In 2006, Netflix launched the “Netflix Prize,” a machine learning competition that encouraged ML engineers to build the best algorithm to predict user ratings for films.

**The grand prize was USD 1,000,000.**

After the competition concluded, Netflix awarded a $1 million prize to a developer team in 2009 for an algorithm that increased the accuracy of the company's recommendation engine by **10 percent**.

That’s a lot!

Yet, Netflix never used that solution because it was overly complex.

Here’s what Netflix said:

The increase in accuracy of the winning improvements did not seem to justify the engineering effort needed to bring them into a production environment.

The complexity and resource demands of the developed model made it impractical for real-world deployment. Netflix faced several challenges:

- **Scalability:** The model was not easily scalable to handle the vast number of users and movies on the Netflix platform. It would have required significant computational resources to make real-time recommendations for the millions of users they had.
- **Maintenance:** Managing and updating such a complex model in a production environment would have been a logistical nightmare. Frequent updates and changes to the model would be challenging to implement and maintain.
- **Latency:** The ensemble model's inference latency was far from ideal for a streaming service. Users expect near-instantaneous recommendations, but the complexity of the model made achieving low latency difficult.

You can read more about this story here: Netflix Prize story.

Consequently, Netflix never integrated the winning solution into its production recommendation system. Instead, they continued to use a simplified version of their existing algorithm, which was more practical for real-time recommendations.

This real-life instance from the Netflix Prize was a reminder that we must strive for a delicate balance between model complexity and practical utility.

**While highly complex models may excel in research and competition settings, they may not be suitable for real-world deployment due to scalability, maintenance, and latency concerns.**

In practice, simpler and more efficient models often are a better choice for delivering a seamless user experience in production environments.

Let me ask you this. Which of the following two models would you prefer to integrate into a user-facing product?

I strongly prefer Model B.

If you understand this, you resonate with the idea of keeping things simple in production.

Fortunately, there are various techniques that can help us reduce the size of the model, thereby increasing the speed of model inference.

These techniques are called **Model Compression** methods.

Using these techniques, you can reduce both the latency and size of the original model.

As the name suggests, model compression is a set of techniques used to reduce the size and computational complexity of a model while preserving or even improving its performance.

They aim to make the model smaller — that is why the name “**model compression**.”

Typically, it is expected that a smaller model will:

- Have a lower inference latency as smaller models can deliver quicker predictions, making them well-suited for real-time or low-latency applications.
- Be easy to scale due to their reduced computational demands.
- Have a smaller memory footprint.

In this article, we’ll look at four techniques that help us achieve this:

- Knowledge Distillation
- Pruning
- Low-rank Factorization
- Quantization

As we will see shortly, these techniques attempt to strike a balance between model size and accuracy, making it relatively easier to deploy models in user-facing products.

👉

The Jupyter notebook of this entire article has been provided at the bottom of the article.

Let’s understand them one by one!

This is one of the most common, effective, and reliable techniques to reduce model size, and one of my personal favorites.

Essentially, knowledge distillation involves training a smaller, simpler model (referred to as the “student” model) to mimic the behavior of a larger, more complex model (known as the “teacher” model).

The term can be broken down as follows:

**Knowledge:** Refers to the understanding, insights, or information that a machine learning model has acquired during training. This “knowledge” can be typically represented by the model’s parameters, learned patterns, and its ability to make predictions.

**Distillation:** In this context, distillation means transferring or condensing knowledge from one model to another. It involves training the student model to mimic the behavior of the teacher model, effectively transferring the teacher's knowledge.

This is a two-step process:

- Train the large model as you typically would. This is called the “teacher” model.
- Train a smaller model, which is intended to mimic the behavior of the larger model. This is also called the “student” model.

The primary objective of knowledge distillation is to transfer the knowledge, or the learned insights, from the teacher to the student model.

This allows the student model to achieve comparable performance with fewer parameters and reduced computational complexity.

The technique makes intuitive sense as well.

Of course, as in a real-world teacher-student scenario in an academic setting, the student model may never perform quite as well as the teacher model.

But with consistent training, we can create a smaller model that is **almost** as good as the larger one.

This goes back to the objective we discussed above:

Strike a balance between model size and accuracy, such that it is relatively easier to deploy models in user-facing products.

A classic example of a model developed in this way is DistilBERT. It is a student model of BERT.

We also discussed this in the newsletter here:

DistilBERT is approximately $40\%$ smaller than BERT, which is a massive difference in size.

Still, it retains approximately $97\%$ of the natural language understanding (NLU) capabilities of BERT.

**What’s more, DistilBERT is roughly 60% faster in inference.**

This is something I have personally experienced and verified in one of my research studies on Transformer models:

As shown above, on one of the studied datasets (SensEval-2), BERT achieved the best accuracy of $76.81$. With DistilBERT, it was $75.64$.

On another task (SensEval-3), BERT achieved the best accuracy of $80.96$. With DistilBERT, it was $80.23$.

Of course, DistilBERT isn’t as good as BERT. Yet, the performance difference is small.

Given the run-time performance benefits, it makes more sense to proceed with DistilBERT instead of BERT in a production environment.

💡

If you are interested in learning more about my research study, you can read it here: **A Comparative Study of Transformers on Word Sense Disambiguation**.

One of the biggest downsides of knowledge distillation is that one must still train a larger teacher model first to train the student model.

However, in a resource-constrained environment, it may not be feasible to train a large teacher model.

Assuming we are not resource-constrained at least in the development environment, one of the most common techniques for Knowledge Distillation is **Response-based Knowledge Distillation**.

As the name suggests, in **response-based knowledge distillation**, the focus is on matching the output responses (predictions) of the teacher model and the student model.

In a classification use case, this technique transfers the **probability distributions** of class predictions from the teacher to the student.

It involves training the student to produce predictions that are not only accurate but also mimic the soft predictions (probability scores) of the teacher model.

As we are trying to mimic the **probability distribution of the class predictions of the teacher model**, one ideal candidate for the loss function is KL divergence.

We discussed this in detail in one of the previous articles on t-SNE.

Yet, here’s a quick recap:

The core idea behind KL divergence is to assess how much information is lost when one distribution is used to approximate another.

Thus, the more information is lost, the higher the KL divergence, and the greater the dissimilarity between the two distributions.

KL divergence between two probability distributions $P(x)$ and $Q(x)$ is calculated as follows:
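In symbols:

$$D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$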

The formula for KL divergence can be read as follows:

The KL divergence $D_{KL}(P \| Q)$ between two probability distributions $P$ and $Q$ is calculated by summing the above quantity over all possible outcomes $x$. Here:

- $P(x)$ represents the probability of outcome $x$ occurring according to distribution $P$.
- $Q(x)$ represents the probability of the same outcome occurring according to distribution $Q$.

It measures how much information is lost when using distribution $Q$ to approximate distribution $P$.
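As a quick sanity check, here is a plain-Python sketch of discrete KL divergence (the helper `kl_divergence` is illustrative, not from the article):

```python
import math

# Discrete KL divergence D_KL(P || Q) for two probability lists.
# Terms with P(x) = 0 contribute nothing, so they are skipped.
def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print(kl_divergence(p, p))  # identical distributions -> 0.0
print(kl_divergence(p, q))  # > 0: information is lost approximating P with Q
```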

Imagine this. Say $P$ and $Q$ were identical. This should result in zero loss of information. Let’s verify this from the formula above.

If the probability distributions $P$ and $Q$ are identical, it means that for every $x$, $P(x) = Q(x)$. Thus,
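Substituting $Q(x) = P(x)$:

$$D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{P(x)} = \sum_x P(x) \log 1$$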

Simplifying, we get:
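Since $\log 1 = 0$:

$$D_{KL}(P \| Q) = 0$$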

This is precisely what we intend to achieve in response-based knowledge distillation.

Simply put, we want the probability distribution of the class predictions of the student model to be identical to the probability distribution of the class predictions of the teacher model.

- First, we can train the teacher model as we typically would.
- Next, we can instruct the student model to mimic the probability distribution of the class predictions of the teacher model.

Let’s see how we can practically use response-based knowledge distillation using PyTorch.

More specifically, we shall train a slightly complex neural network on the MNIST dataset. Then, we will build a simpler neural network using the response-based knowledge distillation technique.

First, we import the required packages from PyTorch:
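A minimal set of imports for this walkthrough could look like this (`torchvision` would additionally be needed for the MNIST dataloaders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
```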

Next, we load the MNIST dataset (train and test) and create their respective PyTorch dataloaders.

Now, we shall define a simple CNN-based neural network architecture. This is demonstrated below:
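Here is one possible sketch of such a CNN-based teacher for 28×28 MNIST inputs (the layer sizes and the name `TeacherNet` are assumptions, not a prescribed architecture):

```python
import torch
import torch.nn as nn

class TeacherNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 14x14 -> 7x7
        )
        self.fc = nn.Linear(32 * 7 * 7, 10)        # 10 MNIST classes

    def forward(self, x):
        x = self.conv(x)
        return self.fc(x.flatten(1))               # logits, shape (N, 10)
```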

Moving on, we shall initialize the teacher model and define the loss function to train it, the `CrossEntropyLoss`.

Now, we will train the teacher model.

With this, we are done with the Teacher model.

Next, we must train the Student Model.

We defined the Teacher model as a CNN-based neural network architecture. Let’s define the Student model as a simple feed-forward neural network without any CNN layers:
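A possible sketch of the student network together with a response-based distillation loss (the names `StudentNet` and `distillation_loss` are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentNet(nn.Module):
    def __init__(self):
        super().__init__()
        # A plain feed-forward network: no convolutional layers.
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128), nn.ReLU(),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        return self.net(x)

def distillation_loss(student_logits, teacher_logits):
    # Convert both outputs to (log-)probabilities with softmax,
    # then compute the KL divergence between the two distributions.
    student_log_probs = F.log_softmax(student_logits, dim=1)
    teacher_probs = F.softmax(teacher_logits, dim=1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```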

The above method accepts two parameters:

- The output of the student model (`student_logits`).
- The output of the teacher model (`teacher_logits`).

We convert both outputs to probabilities using the softmax function.

Finally, we find the KL divergence between them and return it.

Moving on, we shall initialize the student model and define the optimizer as we did before.

Finally, it’s time to train the Student model.
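The training loop could then look roughly like this; a synthetic batch stands in for the MNIST dataloader, and the tiny teacher/student networks here are stand-ins for the ones defined earlier:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
teacher = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
student = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-2)

images = torch.randn(64, 1, 28, 28)     # stand-in batch
labels = torch.randint(0, 10, (64,))

for step in range(50):
    optimizer.zero_grad()
    with torch.no_grad():
        teacher_logits = teacher(images)  # teacher is frozen
    student_logits = student(images)
    # Blend the hard-label loss with the distillation (KL) loss.
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits, dim=1),
                  F.softmax(teacher_logits, dim=1),
                  reduction="batchmean")
    loss = ce + kd
    loss.backward()
    optimizer.step()
```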

To recap, the teacher model was a CNN-based neural network architecture. The student model, however, was a simple feed-forward neural network.

The following visual compares the performance of the teacher and the student model:

Before we understand what OOP is, let’s look at the backstory: what led us to formulate “OOP”?

Why did programmers think it was essential to have such a thing? What were the pain points in the traditional way of programming?

Traditional programming (or procedural programming) focuses on breaking a problem into steps or procedures executed one after the other.

While this approach was suitable for small-scale projects or experimentation, it eventually became problematic when projects transformed into large, complex systems.

In other words, it became difficult to manage, understand, and maintain the code as it grew in size and complexity.

One major challenge with traditional programming was the lack of structure and organization.

Before OOP, programs were often written in a linear fashion. This made it difficult to manage large amounts of code and understand the relationships between different parts of the program.

As a result...👇

...traditional programming resulted in code that was harder to maintain and debug. Consequently, it was prone to more bugs and errors.

This resulted in longer development cycles and reduced software quality.

The list of challenges with traditional programming is endless.

To overcome these challenges, some computer scientists envisioned a new way of programming that would allow developers to create more flexible and adaptable software programs.

The thought process behind the new programming paradigm was inspired by real-world entities. They have certain properties and behaviors, and can interact with one another.

Thus, scientists believed that programming could be modeled in a similar fashion.

And this is how OOP was born in the late 70s.

As the name suggests, Object-Oriented Programming (OOP) is a programming paradigm/technique based on the concepts of “objects.” That is why the name “object-oriented.”

This is in contrast to traditional programming where methods are executed in sequence.

The core idea of OOP is to model real-world objects and their interactions in a program.

Thus, each object is treated as a separate entity with different values for the properties.

For example, you can model a `car` object with properties such as `color`, `speed`, and `model`, and methods such as `start`, `stop`, and `accelerate`.

In response to the challenges posed by traditional programming, OOP was designed as a new programming paradigm.

Its primary aim was to improve code organization, reusability, maintainability, and scalability.

OOP introduced objects and classes (discussed below), which made it easier to organize code into manageable and reusable components.

With this organization, it was easier to understand the structure of a program, shorten the development lifecycle, and eliminate bugs and errors.

In addition to this, OOP also introduced some crucial concepts, such as inheritance. This provided a way to create relationships between classes.

Moreover, OOP allowed programmers to easily extend existing components and build complex systems. This drastically reduced the amount of code needed to be written again from scratch.

Next, let’s discuss some basic terminologies around classes and how they are defined in Python.

OOP is defined around some core terminologies and concepts. Let’s understand them below:

A class is a blueprint (or template) for instantiating objects. Recall the `Car` example discussed above. In that context, we can call `Car` a class.

Objects created from the same class will have the same properties and behaviors. However, the values for the properties may differ.

For instance, all car objects created from the `Car` class will have the same property names, such as `cost`, `color`, `speed`, `model`, etc. However, the values for these properties may differ, as shown below:

In Python, we define a class using the `class` keyword, followed by its name:
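For example:

```python
# Defining a (for now empty) class with the `class` keyword:
class Car:
    pass
```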

Inside a class, we define the attributes and methods that its object will have.

The variables defined within a class that store information about an object are called attributes. They can also be thought of as an object’s properties.

Attributes are of two types:

**2.1) Instance-level attributes**

As the name suggests, these attributes are unique to each instance of a class. In other words, every time we create an object, each object gets its copy of instance-level attributes.

For example, in the `Car` example above, the variables `cost`, `color`, `speed`, and `model` are instance-level attributes.

Also, these attributes usually have different values for each instance.

In Python, we assign the instance-level attributes in the `__init__` method of a class. As the name suggests, “init” lets us initialize an object.

As `__init__` is like any other Python function, we can also pass a bunch of parameters to it.

The first parameter of any method in a Python class is always the `self` keyword. It is a special variable that serves as a reference to the calling object. The `self` parameter is followed by the other parameters, as shown below:
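For example, using the attributes from the `Car` example:

```python
class Car:
    def __init__(self, color, cost, speed, model):
        # `self` refers to the object being created; the remaining
        # parameters are assigned to instance-level attributes.
        self.color = color
        self.cost = cost
        self.speed = speed
        self.model = model
```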

The parameters specified in the `__init__` method are assigned to the corresponding instance-level attributes.

**2.2) Class-level attributes**

In contrast to the instance-level attributes, which are unique to each object, these attributes are shared by all the instances of a class.

In other words, they are associated with the class itself, but not with any specific instance of the class. Therefore, they are defined outside the `__init__` method.

For example, you can have a class-level attribute for the `number_of_wheels` of a car. This attribute would have the same value for all objects created from the car class, regardless of the `color`, `cost`, or any other instance-level attributes.

In Python, class-level attributes are defined as follows:
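For example:

```python
class Car:
    # Class-level attribute: shared by all Car objects.
    number_of_wheels = 4

    def __init__(self, color):
        # Instance-level attribute: unique to each object.
        self.color = color
```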

To access the class-level attributes, we can do the following:
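For example (re-declaring a minimal `Car` here so the snippet is self-contained):

```python
class Car:
    number_of_wheels = 4  # class-level attribute

# Access through the class itself:
print(Car.number_of_wheels)    # 4
# ...or through any instance:
print(Car().number_of_wheels)  # 4
```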

The functions defined within the scope of a class are called methods.

They operate on the attributes of an object and are typically used to manipulate or retrieve the values stored in an object’s attributes.

An independent definition created anywhere in the program is called a “function”. However, a function is called a “method” when defined inside a class.

Like the `__init__` method, class methods are also defined within the scope of the class. Also, their first parameter is always the `self` keyword.

For instance, let’s define a `change_speed` method.
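For example:

```python
class Car:
    def __init__(self, speed):
        self.speed = speed

    def change_speed(self, new_speed):
        # Assign the new value to the instance-level attribute `speed`.
        self.speed = new_speed
```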

This method accepts the `new_speed` and assigns it to the instance-level attribute `speed`. As shown above, we reference any instance-level attribute through the `self` keyword using dot notation.

Note that a class method may or may not receive any parameters other than `self`. For instance, if we were to define a `stop_car` method, it is not necessary to pass the new speed as a parameter.
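For example:

```python
class Car:
    def __init__(self, speed):
        self.speed = speed

    def stop_car(self):
        # No parameter besides `self` is needed here.
        self.speed = 0
```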

Whenever we create an instance of a class, it is called an object.

Considering the `Car` class defined above, we can create a new object as follows:
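For example (using a minimal `Car` with `color` and `speed`):

```python
class Car:
    def __init__(self, color, speed):
        self.color = color
        self.speed = speed

# Instantiating the class; this invokes __init__.
my_car = Car("red", 0)
```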

The above statement invokes the `__init__` method defined in the class. As a result, the arguments get assigned to the respective instance-level attributes defined in the class.

Also, the object can access the attributes and methods defined in the class. These can be accessed using the dot notation, as shown below:
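For example:

```python
class Car:
    def __init__(self, color, speed):
        self.color = color
        self.speed = speed

    def change_speed(self, new_speed):
        self.speed = new_speed

my_car = Car("red", 0)
print(my_car.color)       # access an attribute: prints "red"
my_car.change_speed(20)   # call a method; `self` is passed implicitly
print(my_car.speed)       # prints 20
```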

When we call a class method, we don’t pass any value for the `self` parameter. For instance, even though the definition of `change_speed()` has two parameters, `self` and `new_speed`, the `self` parameter is automatically supplied by Python from the calling object.

Therefore, while calling the method, we only passed one value (`20`), which corresponds to the `new_speed`.

Whenever we define a new class, it creates a new datatype. For instance, if we use Python’s `type()` function and pass it the object `my_car`, we get the following:
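For example:

```python
class Car:
    pass

my_car = Car()
print(type(my_car))   # e.g. <class '__main__.Car'>
```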

This indicates that the `my_car` object is of type `Car`.

Thus, in addition to making your code more organized, readable and manageable, classes also offer a mechanism to create a user-defined datatype.

Next, let’s look at some of the fundamental magic methods in Python OOP and how they are used.

In Python, magic methods are methods that have double underscores at the beginning and end of their names, such as `__init__`, which we discussed above.

Do you know one of the biggest hurdles data science and machine learning teams face?

It is transitioning their data-driven pipeline from Jupyter Notebooks to an executable, reproducible, error-free, and organized pipeline.

And this is not something data scientists are particularly fond of doing.

Yet, this is an immensely critical skill that many overlook.

Machine learning deserves the rigor of any software engineering field. Training codes should always be reusable, modular, scalable, testable, maintainable, and well-documented.

To help you develop that critical skill, I'm excited to bring you a special guest post by Damien Benveniste. He is the author of The AiEdge newsletter and was a Machine Learning Tech Lead at Meta.

In today’s machine learning deep dive, he will provide a detailed guide on structuring code for machine learning development, one of the most critical yet overlooked skills by many data scientists.

I personally learned a lot from this one and I am sure you will learn a lot too.

Let’s begin!

I have always believed that machine learning deserves the rigor of any software engineering field. Training codes should be reusable, modular, scalable, testable, maintainable, and well-documented.

Today, I want to show you my template to develop quality code for machine learning development.

More specifically, we will look at:

- What does coding mean?
- Designing:
- System design
- Deployment process
- Class diagram

- The code structure:
- Directory structure
- Setting up the virtual environment
- The code skeleton
- The applications
- Implementing the training pipeline
- Saving the model binary

- Improving the code readability:
- Docstrings
- Type hinting

- Packaging the project

I often see many Data Scientists or Machine Learning Engineers developing in Jupyter notebooks, copy-pasting their code from one place to another, which gives me nightmares!

When running ML experiments, Jupyter is prone to human errors as different cells can be run in different orders. Yet, ideally, you should be able to capture all the configurations of an experiment to ensure reproducibility.

No doubt, Jupyter can be used to call a training package or an API and manually orchestrate experiments, but fully developing in Jupyter is an extremely risky practice.

For instance, when training a model, you should ensure the data is passed through the exact feature processing pipelines at serving (inference) time. This means using the same classes, methods, and identical versions of packages and hardware (GPU vs. CPU).

Personally, I prefer prototyping in Jupyter but developing in Pycharm or VSCode.

When programming, focus on the following aspects:

**Reusability:**
- It is the capacity to reuse code in another context or project without significant modifications.
- Code reusability can be achieved in several ways, such as through libraries, frameworks, modules, and object-oriented programming techniques.
- In addition, good documentation and clear code organization also facilitate code reuse by making it easier for other developers to understand and use the code.

**Modularity:**
- It is the practice of breaking down a software system into smaller, independent modules or components that can be developed, tested, and maintained separately.

**Scalability:**
- It refers to the ability of a software development codebase to accommodate the growth and evolution of a software system over time. In other words, it refers to the ability of the codebase to adapt to changing requirements, features, and functionalities while maintaining its overall structure, quality, and performance.
- To achieve codebase scalability, it is important to establish clear coding standards and practices from the outset, such as using version control, code review, and continuous integration and deployment.
- In addition, it is important to prioritize code maintainability and readability, as well as the use of well-documented code and clear naming conventions.

**Testability:**
- It refers to the ease with which software code can be tested to ensure that it meets the requirements and specifications of the software system.
- It can be achieved by designing code with testing in mind rather than treating testing as an afterthought. This can involve writing code that is modular, well-organized, and easy to understand and maintain, as well as using tools and techniques that support automated testing and continuous integration.

**Maintainability:**
- It refers to the ease with which software code can be modified, updated, and extended over time.

**Documentation:**
- It provides a means for developers, users, and other stakeholders to understand how the software system works, its features, and how to interact with it.

In Machine Learning, like any engineering domain, no line of code should be written until a proper design is established.

Having a design means that we can translate a business problem into a machine learning solution, **provided ML is indeed the right solution to the problem!**

For simplicity, let’s assume we want to build a mobile application where a user needs machine learning predictions displayed on the screen — personalized product recommendations, for instance.

The process workflow may appear as follows:

- The mobile application requests personalized predictions from the backend server.
- The backend server fetches predictions from a database.
- We figured that daily batch predictions were the most appropriate setup for now, and the machine learning service updates the predictions daily.

This process is depicted in the image below:

Before we can understand how to develop our model, we need to understand how we will deploy it. Let’s assume that, for our purposes, an inference application will be containerized in a Docker container.

The container can be deployed in a container registry such as AWS ECR (Amazon Elastic Container Registry) or Docker Hub. We can have an orchestration system such as Airflow that spins up the inference service, pulls the container from the registry, and runs the inference application.

Now that we know what we need to build and how it will be deployed, how we need to structure our codebase is becoming much clearer.

More specifically, we shall build two applications:

- An inference application.
- A training application.

To minimize potential human errors, it is imperative that the modules used at training time are the same as the ones used at inference time.

Let’s look at the following class diagram:

**The application layer:**
- This part of the code captures the application’s logic. Think about these modules as “buttons” that start the inference or training processes.
- We will have a `run()` function for each of those applications that will serve as handles for the Docker image to start those individual processes.

**The data layer:**
- This is the abstraction layer that moves data in and out of the applications. I am calling it the “data” layer, but I am including anything that needs to go into the outside world, like the data, the model binaries, the data transformer, the training metadata, etc.
- In this batch use case, we just need a function that brings the data into the applications, `get_data()`, and another that puts predictions back into the database, `put_data()`.
- The `DataConnector` class moves data around.
- The `ObjectConnector` is responsible for transferring model binaries and data transformation pipelines using `get_object()` and `put_object()`.

**The machine learning layer:** This is the module where all the different machine learning components will live. The three components of model training are:

- *Learning the parameters of the model*: the `Model` class takes care of that with its `fit()` method. For inference, we use the `predict()` method.
- *Learning the feature transformations*: we may need to normalize features, perform Box-Cox transformations, one-hot encode, etc. The `DataProcessor` takes care of that with its `fit()` and `transform()` methods.
- *Learning the hyperparameters of the model and data pipeline*: the `CrossValidator` handles this task with its `fit()` function.

The `TrainingPipeline` handles the logic between the different components.

Now that we have a class diagram, we must map it into actual code. Let’s call the project `machine_learning_service`.

Of course, there are many ways to do it, but we will organize the project as follows:

- **The “docs” folder:** for the documents.
- **The “src” folder:** for the source code (or the actual codebase).
- **The “tests” folder:** for the unit tests.

Going ahead, we assume we will Dockerize this project at some point. Thus, controlling the Python version and packages we use locally is crucial.

To do that, we will create a virtual environment called `env` using `venv`, which ships with the Python standard library and lets us create virtual environments effortlessly.

First, within the project folder, we run the following command to create a virtual environment:
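With the standard library’s `venv` module, that command is:

```shell
python3 -m venv env
```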

Next, we activate it as follows:
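The activation step (the creation command is repeated here so the snippet is self-contained; re-running it on an existing `env` is harmless):

```shell
python3 -m venv env          # skip if already created
source env/bin/activate
```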

Once done, we should see the following directory structure:

In the current directory, let’s check which Python binary is being picked up, so that we are sure to use the virtual environment’s binaries. We do this using the `which` command, as demonstrated below:
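For example:

```shell
which python3   # inside the activated env, this points into env/bin
```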

Next, let’s make sure that the Python version is Python 3:
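For example (inside the activated environment, plain `python` resolves to the same binary):

```shell
python3 --version   # expect: Python 3.x
```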

Okay, we are good to go!

Within the source folder, let’s create the different modules we have in the class diagram:
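One possible layout, with assumed file names mapping one module per class in the diagram:

```shell
# Assumed file names -- one module per class in the class diagram
mkdir -p src
touch src/model.py src/data_processor.py src/cross_validator.py \
      src/training_pipeline.py src/data_connector.py src/object_connector.py
```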

For now, let’s have empty classes.

- The `Model` class: this will be responsible for training the model on new data and predicting on unseen data:
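A possible skeleton (the method signatures are assumptions based on the class diagram):

```python
class Model:
    """Learns the parameters of the model."""

    def fit(self, X, y):
        # To be implemented: train the model on new data.
        ...

    def predict(self, X):
        # To be implemented: predict on unseen data.
        ...
```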

- The `DataProcessor` class: this handles the processing needed before the data is fed to the ML model, such as normalization, transformation, etc.
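A matching skeleton (again, the method signatures are assumptions from the class diagram):

```python
class DataProcessor:
    """Learns and applies the feature transformations."""

    def fit(self, X):
        # To be implemented: learn the transformations from the data.
        ...

    def transform(self, X):
        # To be implemented: apply the learned transformations.
        ...
```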

In a recent article, we devised the entire principal component analysis (PCA) algorithm from scratch.

We saw how projecting the data using the eigenvectors of the covariance matrix naturally emerged from the PCA optimization step.

Moving on, we discussed some of its significant limitations.

**Let’s look at them again!**

Many use PCA as a data visualization technique. This is done by projecting the given data into two dimensions and visualizing it.

While this may appear like a fair thing to do, there’s a big problem here that often gets overlooked.

**As we discussed in the PCA article**, after applying PCA, each new feature captures a fraction of the original variance.

**This means that two-dimensional visualization will only be helpful if the first two principal components collectively capture most of the original data variance, as shown below:**

If not, the two-dimensional visualization will be highly misleading and incorrect. This is because the first two components don’t capture most of the original variance well. This is depicted below:

Thus, using PCA for 2D visualizations is only recommended if the cumulative explained variance plot suggests so. If not, one should refrain from using PCA for 2D visualization.

PCA has two main steps:

- Find the eigenvectors and eigenvalues of the covariance matrix.
- Use the eigenvectors to project the data to another space.

👉

Why eigenvectors, you might be wondering? We discussed the origin of eigenvectors in detail in the PCA article. It is recommended to read that article before reading this article.

Projecting the data using eigenvectors creates uncorrelated features.

Nonetheless, the new features created by PCA $(x_0^{'}, x_1^{'}, x_2^{'})$ are always a linear combination of the original features $(x_0, x_1, x_2)$.

This is depicted below:

As shown above, every new feature in $X_{projected}$ is a linear combination of the features in $X$.

We can also prove this experimentally.

As depicted above, we have a linearly inseparable dataset. Next, we apply PCA and reduce the dimensionality to $1$. The dataset remains linearly inseparable.

On the flip side, if we consider a linearly separable dataset, apply PCA, and reduce the dimensions, we notice that the dataset remains linearly separable:

This proves that PCA is a linear dimensionality reduction technique.

However, not all real-world datasets are linear. In such cases, PCA will underperform.

As discussed above, PCA’s primary objective is to capture the overall (or global) data variance.

In other words, PCA aims to find the orthogonal axes along which the **entire dataset** exhibits the most variability.

Thus, during this process, it does not pay much attention to the local relationships between data points.

**It inherently assumes that the global patterns are sufficient to represent the overall data variance.**

This is demonstrated below:

Because it primarily emphasizes the global structure, PCA is not ideal for visualizing complex datasets where the underlying structure might rely on local relationships or pairwise similarities.

In cases where the data is nonlinear and contains intricate clusters or groups, PCA can fall short of preserving these finer details, as shown in another illustration below:

As depicted above, data points from different clusters may overlap when projected onto the principal components.

This leads to a loss of information about intricate relationships within and between clusters.

So far, we understand what’s specifically lacking in PCA.

While the overall approach is indeed promising, its limitations make it impractical for many real-world datasets.

**t-distributed stochastic neighbor embedding (t-SNE)** is a powerful dimensionality reduction technique **mainly used to visualize high-dimensional datasets** by projecting them into a lower-dimensional space (typically 2-D).

As we will see shortly, t-SNE addresses each of the above-mentioned limitations of PCA:

- It is well-suited for visualization.
- It works well for linearly inseparable datasets.
- It focuses beyond just capturing the global data relationships.

t-SNE is an improvement to the **Stochastic Neighbor Embedding** (SNE) algorithm. It is observed that in comparison to SNE, t-SNE is much easier to optimize.

💡

t-SNE and SNE are two different techniques, both proposed by one common author — Geoffrey Hinton.

So, before getting into the technical details of t-SNE, let’s spend some time understanding the SNE algorithm instead.

To begin, we are given a high-dimensional dataset, which is difficult to visualize.

That is, its dimensionality is higher than $3$ (typically, much higher than $3$).

The objective is to project it to a lower dimension (say, $2$ or $3$), such that the lower-dimensional representation preserves as much of the local and global structure in the original dataset as possible.

Let’s understand!

Imagine this is our high-dimensional dataset:

💡

Of course, the above dataset is not a high-dimensional dataset. But for the sake of simplicity, let’s assume that it is.

Local structure, as the name suggests, refers to the arrangement of data points that are close to each other in the high-dimensional space.

Thus, preserving the local structure would mean that:

- Red points should stay closer to other red points.
- Blue points should stay closer to other blue points.
- Green points should stay closer to other green points.

So, is preserving the local structure sufficient?

**Absolutely not!**

If we were to focus solely on preserving the local structure, it may lead to a situation where blue points indeed stay closer to each other, but they overlap with the red points, as shown below:

This is not desirable.

Instead, we also want the low-dimensional projections to capture the global structure.

Thus, preserving the global structure would mean that:

- The red cluster is well separated from the other cluster.
- The blue cluster is well separated from the other cluster.
- The green cluster is well separated from the other cluster.

To summarize:

- Preserving the local structure means maintaining the relationships among nearby data points within each cluster.
- Preserving the global structure involves maintaining the broader trends and relationships that apply across all clusters.

Let’s stick to understanding this as visually as possible.

Consider the above high-dimensional dataset again.

Euclidean distance is a good measure to know if two points are close to each other or not.

For instance, in the figure above, it is easy to see that `Point A` and `Point B` are close to each other, but `Point C` is much farther from `Point A`.

Thus, the first step of the **SNE algorithm** is to convert these high-dimensional Euclidean distances between data points into conditional probabilities that represent similarities.

For better understanding, consider a specific data point in the above dataset:

Here, points that are closer to the marked point will have smaller Euclidean distances, while other points that are far away will have larger Euclidean distances.

Thus, as mentioned above, for every data point $(i)$, the **SNE algorithm** first converts these high-dimensional Euclidean distances into conditional probabilities $p_{j|i}$.

Here, $p_{j|i}$ represents the conditional probability that a point $x_i$ will pick another point $x_j$ as its neighbor.

This conditional probability is assumed to be proportional to the probability density of a Gaussian centered at $x_i$.

This makes intuitive sense as well. To elaborate further, consider a Gaussian centered at $x_i$.

It is evident from the above Gaussian distribution centered at $x_i$ that:

- For points near $x_i$, $p_{j|i}$ will be relatively high.
- For points far from $x_i$, $p_{j|i}$ will be small.

So, to summarize, for a data point $x_i$, we convert its Euclidean distances to all other points $x_j$ into conditional probabilities $p_{j|i}$.

This conditional probability is assumed to be proportional to the probability density of a Gaussian centered at $x_i$.

Also, as we are **only interested in modeling pairwise similarities**, we set the value of $p_{i|i}=0$. In other words, a point cannot be its own neighbor.

💡

A Gaussian for a point $x_i$ will be parameterized by two parameters: $(\mu_i, \sigma_i^2)$. As the x-axis of the Gaussian measures Euclidean distance, the mean $(\mu_i)$ is zero. What about $\sigma_i^2$? We’ll get back to it shortly. Until then, just assume that we have somehow figured out the ideal value $\sigma_i^2$.

Based on what we have discussed so far, our conditional probabilities $p_{j|i}$ may be calculated using a Gaussian probability density function as follows:
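Concretely, the (unnormalized) similarity is:

$$p_{j|i} \propto \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma_i^2}\right)$$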

**However, there’s a problem here.**

Yet again, let’s understand this visually.

Earlier, we considered the data points in the same cluster to be closely packed.

Thus, the resultant conditional probabilities were also high:

However, the data points of a cluster might be far from other clusters. Yet, they themselves can be a bit more scattered, as depicted below:

If we were to determine the resultant conditional probabilities $p_{j|i}$ for the marked data point $x_i$, we will get:

In this case, even though the data points belong to the same cluster, the conditional probabilities are much smaller than what we had earlier.

We need to fix this, and a common way to do this is by normalizing the individual conditional probability between $(x_i, x_j) \rightarrow p_{j|i}$ by the sum of all conditional probabilities $p_{k|i}$.

Thus, we can now estimate the final conditional probability $p_{j|i}$ as follows:
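That is:

$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$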

- The numerator is the conditional probability between $(x_i, x_j) \rightarrow p_{j|i}$.
- Each of the terms inside the summation in the denominator is the conditional probability between $(x_i, x_k) \rightarrow p_{k|i}$.

To reiterate, we are **only interested in modeling pairwise similarities**. Thus, we set the value of $p_{i|i}=0$. In other words, a point cannot be its own neighbor.
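To make this concrete, here is a minimal NumPy sketch of the normalized conditional probabilities described above (the function name and vectorization choices are my own; the per-point bandwidths $\sigma_i$ are assumed to be known for now):

```python
import numpy as np

def conditional_probabilities(X, sigma):
    """Compute p_{j|i} for every pair of points in X.

    X:     (n, d) array of high-dimensional points.
    sigma: (n,) array, one bandwidth per point (assumed known here).
    """
    # Pairwise squared Euclidean distances ||x_i - x_j||^2
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Unnormalized Gaussian densities centered at each x_i
    P = np.exp(-sq_dists / (2 * sigma[:, None] ** 2))
    np.fill_diagonal(P, 0.0)           # a point is not its own neighbor: p_{i|i} = 0
    P /= P.sum(axis=1, keepdims=True)  # normalize each row so that sum_j p_{j|i} = 1
    return P
```

Each row of the returned matrix is one conditional distribution $p_{\cdot|i}$, so each row sums to one and nearby points receive higher probability than distant ones.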

Recall the objective again.

We intend to project the given high-dimensional dataset to lower dimensions (say, $2$ or $3$).

Thus, for every data point $x_i \in \mathbb{R}^n$, we define its counterpart $y_i \in \mathbb{R}^2$:

💡

$y_i$ does not necessarily have to be in two dimensions. Here, we have defined $y_i \in \mathbb{R}^2$ just to emphasize that $n$ is larger than $2$.

Next, we use a similar notion to compute the pairwise conditional probability using a Gaussian, which we denote as $q_{j|i}$ in the low-dimensional space.

Furthermore, to simplify calculations, we fix the standard deviation of the Gaussians used for $q_{j|i}$ to $\frac{1}{\sqrt{2}}$ (i.e., variance $\frac{1}{2}$), so that $2\sigma^2 = 1$ and the exponent reduces to $-\lVert y_i - y_j \rVert^2$.

As a result, we can denote $q_{j|i}$ as follows:

Yet again, as we are **only interested in modeling pairwise similarities**, we set the value of $q_{i|i}=0$.
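A matching NumPy sketch for the low-dimensional conditionals (names are mine; the variance is fixed here so that $2\sigma^2 = 1$, which is my reading of the simplification above):

```python
import numpy as np

def low_dim_conditionals(Y):
    """Compute q_{j|i} for the low-dimensional map points Y of shape (n, 2).

    With the variance fixed (sigma^2 = 1/2 here, so 2*sigma^2 = 1),
    the exponent reduces to -||y_i - y_j||^2.
    """
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    Q = np.exp(-sq_dists)
    np.fill_diagonal(Q, 0.0)           # q_{i|i} = 0
    Q /= Q.sum(axis=1, keepdims=True)  # each row sums to 1
    return Q
```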

The objective is to have conditional probabilities in the low-dimensional space $(q_{j|i})$ identical to the conditional probabilities in the high-dimensional space $(p_{j|i})$.

Thus, if the projected data points $y_i$ and $y_j$ are to correctly model the similarity between the high-dimensional data points $x_i$ and $x_j$, the conditional probabilities $q_{j|i}$ and $p_{j|i}$ must be nearly equal.

**This hints that we must minimize the difference between the two conditional probabilities — $q_{j|i}$ and $p_{j|i}$.**

One of the most common and popular ways to quantify the difference between two probability distributions is KL Divergence.

As we have covered this previously, we won’t get into much detail:

...but here’s a quick overview of what it does.

The core idea behind KL divergence is to assess how much information is lost when one distribution is used to approximate another.

Thus, the more information is lost, the higher the KL divergence, and the greater the dissimilarity between the two distributions.

KL divergence between two probability distributions $P(x)$ and $Q(x)$ is calculated as follows:

The formula for KL divergence can be read as follows:

The KL divergence $D_{KL}(P || Q)$ between two probability distributions $P$ and $Q$ is calculated by summing the above quantity over all possible outcomes $x$. Here:

- $P(x)$ represents the probability of outcome $x$ occurring according to distribution $P$.
- $Q(x)$ represents the probability of the same outcome occurring according to distribution $Q$.

It measures how much information is lost when using distribution $Q$ to approximate distribution $P$.

Imagine this. Say $P$ and $Q$ were identical. This should result in zero loss of information. Let’s verify this from the formula above.

If the probability distributions $P$ and $Q$ are identical, it means that for every $x$, $P(x) = Q(x)$. Thus,

Simplifying, we get:

Hence proved.
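The discrete form of this formula translates directly into code. A small sketch (the helper name and the $\epsilon$ guard against $\log 0$ are my own additions):

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)) for discrete distributions."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # eps guards against taking log of zero probabilities
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))

P = np.array([0.2, 0.5, 0.3])
# Identical distributions lose no information -> divergence is zero
assert abs(kl_divergence(P, P)) < 1e-9
```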

Let’s look at another illustration below:

Here, we have an observed distribution (the histogram in blue). The objective is to quantify whether it is more like a Gaussian distribution (left) or a Gamma distribution (right).

By measuring the KL divergence, we notice that the observed distribution resembles a Gamma distribution (right).

This is because the KL divergence between:

- The observed distribution and Gamma distribution is low.
- The observed distribution and Gaussian distribution is high.

Based on this notion, KL divergence between two conditional probability distributions – $q_{j|i}$ and $p_{j|i}$ can potentially be a loss function in our problem.

The goal will be to find the optimal projected points $y_i$ (these are like model parameters), such that the KL divergence between the two conditional probability distributions – $q_{j|i}$ and $p_{j|i}$ – is minimized.

Thus, once we have formulated a KL divergence-based loss function, we can minimize it with respect to the points $y_i$. Further, we can use gradient descent to update the points $y_i$.

Let’s do it.

The cost function $C$ of the **SNE algorithm** is given by:

Here:

- $P_i$ denotes the conditional probability distribution over all other data points given datapoint $x_i$.
- $Q_i$ denotes the conditional probability distribution over all other map points given map point $y_i$.

For every pair of points $(i, j)$, we have the following notations:

- $(x_i, x_j) \rightarrow$ the high-dimensional data points $(x_i, x_j)$.
- $(y_i, y_j) \rightarrow$ the low-dimensional mapping of $(x_i, x_j)$.
- $p_{j|i} \rightarrow$ the probability density of the Euclidean distance to $x_j$ when $x_i$ is the center point.
- $q_{j|i} \rightarrow$ the probability density of the Euclidean distance to $y_j$ when $y_i$ is the center point.

In a gist, the rightmost formulation denotes the pairwise similarities we intend to capture. That is why we get two summations.

We can further simplify the above cost function $C$ as follows:

The first term will always be a constant. For a given dataset $X$, the pairwise probability densities can never change.

Also, this term is independent of $q_{j|i}$ (or $y_i \rightarrow$ the model parameters). Thus, we can safely ignore the first term.

Now, the loss function looks more like a cross-entropy loss function with an additive constant.

Finding the gradient with respect to $y_i$, we get the following:

👉

Although the above gradient expression is quite simple, it is a bit complicated to derive. For those who are interested, I have provided a derivation at the end of the article.

Assuming $\gamma$ to be the set of all mapped points in the low-dimensional space:

We can update the parameters as follows:
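As a sketch of this update, using the gradient expression shown above (plain gradient descent; the original SNE formulation also adds a momentum term, which I leave out here, and the function names are my own):

```python
import numpy as np

def sne_gradient(P, Q, Y):
    """Gradient of the SNE cost w.r.t. each map point y_i:
    dC/dy_i = 2 * sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}) (y_i - y_j).
    P and Q are (n, n) matrices with P[i, j] = p_{j|i} and Q[i, j] = q_{j|i}."""
    coeff = P - Q + P.T - Q.T                 # (n, n) mismatch terms
    diffs = Y[:, None, :] - Y[None, :, :]     # (n, n, 2): y_i - y_j
    return 2.0 * np.sum(coeff[:, :, None] * diffs, axis=1)

def gradient_step(P, Q, Y, lr=0.1):
    """One plain gradient-descent update of the map points."""
    return Y - lr * sne_gradient(P, Q, Y)
```

Note that when $Q$ already matches $P$ exactly, the gradient vanishes and the map points stop moving, exactly as we would want.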

We are yet to discuss the variance computation of the probability density function $p_{j|i}$ we defined for the high-dimensional data points:

$\sigma_i^2$ denotes the variance of the Gaussian centered over each high-dimensional datapoint $x_i$.

It is unlikely that we can specify a single value of $\sigma_i^2$ for all data points. This is evident from the figure below:

The density of the data will vary across all data points.

Thus, in dense regions, we need a small value of $\sigma_i^2$, but in sparse regions, we need a larger value of $\sigma_i^2$.

While finding the absolute best value of $\sigma_i^2$ for every data point $x_i$ appears to be a bit difficult, we can still try.

This is where we introduce a **user-specified hyperparameter** called `perplexity`.
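To sketch how perplexity pins down $\sigma_i$: perplexity is defined as $2^{H(P_i)}$, where $H(P_i)$ is the Shannon entropy of the conditional distribution $p_{\cdot|i}$, and since it increases monotonically with $\sigma_i$, a binary search recovers the matching bandwidth. A hedged sketch (function name, search bounds, and tolerances are my own):

```python
import numpy as np

def sigma_for_perplexity(sq_dists_i, target_perplexity, tol=1e-5, max_iter=50):
    """Binary-search the bandwidth sigma_i so that the perplexity of the
    conditional distribution p_{.|i}, defined as 2**H(P_i), matches the target.
    sq_dists_i: squared distances from x_i to every other point (self excluded)."""
    lo, hi = 1e-10, 1e10
    for _ in range(max_iter):
        sigma = (lo + hi) / 2
        p = np.exp(-sq_dists_i / (2 * sigma**2))
        p /= p.sum()
        entropy = -np.sum(p * np.log2(p + 1e-12))  # Shannon entropy in bits
        perp = 2.0 ** entropy
        if abs(perp - target_perplexity) < tol:
            break
        if perp > target_perplexity:
            hi = sigma  # distribution too flat -> shrink sigma
        else:
            lo = sigma  # distribution too peaked -> grow sigma
    return sigma
```

Intuitively, a larger target perplexity asks every point to treat more of its neighbors as "close," which forces a larger $\sigma_i$ in sparse regions and a smaller one in dense regions.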

**Dimensionality reduction** is crucial to gain insight into the underlying structure of high-dimensional datasets.

One technique that typically stands out in this respect is the **Principal Component Analysis (PCA)**.

In a gist, the core objective of PCA is to transform the data $X$ into another space such that new features are uncorrelated and preserve the variance of the original data.

While PCA is widely popular, I have noticed that most folks struggle to design its true end-to-end formulation from scratch.

Many try to explain PCA by relating it to the idea of eigenvectors and eigenvalues, which is true — PCA does perform transformations using eigenvectors and eigenvalues.

But why?

**There are two questions to ask here:**

- **How can we be sure that the data projection that involves eigenvectors and eigenvalues in PCA is the most obvious solution to proceed with?**
- **How can we be sure that the above transformation does preserve the data variance?**

In other words, where did this whole notion of eigenvectors and eigenvalues originate from in PCA?

I have always believed that it is not only important to be aware of techniques, but it is also crucial to know how these techniques are formulated end-to-end.

And the best way to truly understand any algorithm or methodology is by manually building it from scratch.

It is like replicating the exact thought process when someone first designed the algorithm by approaching the problem logically, with only mathematical tools.

Thus, in this dimensionality-reduction article series, we’ll dive deep into one of the most common techniques: **Principal Component Analysis (PCA)**.

We’ll look at:

- The intuition and the motivation behind dimensionality reduction.
- What are vector projections and how do they alter the mean and variance of the data?
- What is the optimization step of PCA?
- What are Lagrange Multipliers and how are they used in PCA optimization?
- What is the final solution obtained by PCA?
- How to determine the number of components in PCA?
- Why should we be careful when using PCA for visualization?
- What are the advantages and disadvantages of PCA?
- Takeaways.

Let’s begin!

Imagine someone gave you the following weight and height information about a few individuals:

It is easy to guess that the height column has more variation than weight.

Thus, even if you were to discard the **weight** column, chances are you could still identify these individuals solely based on their heights.

However, if you were to discard the **height** column, you would likely struggle to identify them.

Why?

Because their heights have more variations than their weights, and it’s clear from this example that, **typically**, if the data has more variation, it holds more information (this might not always be true, though, and we’ll learn more about this later).

**That’s the essence of many dimensionality reduction techniques.**

Simply put, we reduce dimensions by eliminating useless (or least useful) features.

Based on this explanation, one plausible and intuitive way might be to measure column-wise variance and eliminate the $k$ features with the least variance.

This is depicted below:
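This naive variance-based elimination can be sketched in a few lines of NumPy (the function name is mine; as the article discusses next, this is only a reasonable baseline when the features are uncorrelated):

```python
import numpy as np

def drop_low_variance(X, k):
    """Keep all but the k columns of X with the smallest variance."""
    variances = X.var(axis=0)
    # Indices of the (d - k) highest-variance columns, in original order
    keep = np.sort(np.argsort(variances)[k:])
    return X[:, keep]
```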

However, there’s a problem with this approach.

Almost all datasets have correlated features. Thus, in the case of high correlation, this puts a challenge on which feature we should eliminate or retain in the final dataset.

Removing one feature that is highly correlated with another may lead to an incoherent dataset, as both features can be equally important. As a result, it may lead to misleading conclusions.

**Thus, the above approach of eliminating features based on feature variance is only ideal if the features we begin with are entirely uncorrelated.**

So, can we do something to make the features in our original dataset uncorrelated?

Can we apply any transformation that does this?

Of course we can!

For instance, consider the above dummy dataset again where $x_1$ and $x_2$ were correlated:

As shown above, if we can represent the data in a new coordinate system $(x_1^{'}, x_2^{'})$, it is easy to see that there is almost no correlation between the new features.

What's more, the data varies mostly along the dimension $(x_1^{'})$.

As a result, in this new coordinate system, we can safely discard the dimension along which the original data has the least variance, i.e., $(x_2^{'})$.

Think of it this way. If we were to discard the dimension $(x_1^{'})$, we would get the following low-dimensional representation of the data:

It makes much more sense to discard $(x_2^{'})$ because the data varies less along that dimension.

So overall, these are the steps we need to follow:

**Step #1)** Develop a new coordinate system for our data such that the new features are uncorrelated.

**Step #2)** Find the variance along each of the new uncorrelated features.

**Step #3)** Discard the features with the least variance to get the final “dimensionally-reduced dataset.”

Here, it is easy to guess that steps $2$ and $3$ aren’t that big of a challenge. Once we have obtained the new uncorrelated features, the only objective is to find their variance and discard the features with the least variances.

Thus, the primary objective is to develop the new coordinate system, $(x_1^{'}, x_2^{'})$ in the above illustration.

In other words, we want to project the data to a low-dimensional space to preserve as much of the original variance as possible while effectively reducing the dimensions.

**As we’ll see shortly, this becomes more like an optimization problem.**

So, how do we figure out the best projection for our data?

As the name suggests, vector projection is about finding the component of one vector that lies in the direction of another vector.

For instance, consider a toy example:

Given a vector $X = (4, 5)$, its projections along the coordinate axes $\hat x_1$ and $\hat x_2$ are:

What we did was we projected the original vector along the direction of some other vector.

A little complicated version of this could be to project it along a vector that is different from the coordinate vectors, $u$ in the following figure:

For simplicity, let’s consider the vector $u$ to be a unit vector — one which has a unit length.

By the notion of cosine similarity, we know that the cosine of the angle between two vectors is given by:

Thus, we can write the length of the projection of $\vec{a}$ on $\vec{b}$ as:

Now, we know the magnitude of the projection. Next, we can do a scalar-vector multiplication of this magnitude with the unit vector in the desired direction ($\hat b$) to get the projection vector:

Substituting the value of $\cos(\theta)$ and simplifying, we get the final projection as:
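The whole derivation collapses into two lines of NumPy (the `project` helper is my own naming):

```python
import numpy as np

def project(a, b):
    """Project vector a onto the direction of vector b.
    proj_b(a) = (a . b_hat) * b_hat, where b_hat = b / ||b||."""
    b_hat = b / np.linalg.norm(b)
    return (a @ b_hat) * b_hat

a = np.array([4.0, 5.0])
print(project(a, np.array([1.0, 0.0])))  # component along the x1-axis -> [4. 0.]
```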

Until now, we considered just one data point for projection. But our dataset will indeed have many more data points.

The above projection formula will still hold.

In other words, we can individually project each of the vectors as depicted below:

However, the projection will alter the mean and variance of the individual features.

But it’s simple to figure out the new mean and variance after the projection.

Essentially, all data points are vectors in some high-dimensional space. Thus, the mean of the individual features will also be a vector in that same space $(\hat \mu)$.

As a result, it makes intuitive sense that the transformation to the individual data points will equally affect the mean vector as well.

With this in mind, we can get the projected mean vector as follows:

Similarly, we can also determine the variance of the individual features after projection as follows:

So, to summarize, if we are given some data points X as follows:

💡

Each data point can be thought of as a vector $x_i$.

If we want to project these vectors along some unit vector $\hat b$:

...then the projection for a specific data point $x_i$ is given by:

If the individual features had a mean vector $\vec{\mu}$, then the mean of the projections is given by:

What’s more, if the individual features had a variance vector $\vec{\sigma^2}$, then the variance of the projections is given by:

Finally, the idea of finding the variance of the projection can be extended to the covariance matrix of the projection as well.

In simple terms, while variance $\vec{\sigma^2}$ above denoted the individual feature variances, the covariance matrix holds the information about how two features vary together.

Say $\Sigma$ denotes the covariance matrix for the original features:

Then the covariance matrix of the projections $\Sigma_{proj}$ can be written as:
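We can verify this numerically. For a single unit direction $\hat b$, $\hat b^T \Sigma \hat b$ is a scalar equal to the variance of the scalar projections (the dummy data below is my own):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data via a linear mix of independent normals
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.7], [0.7, 1.0]])
b_hat = np.array([1.0, 1.0]) / np.sqrt(2)  # unit projection direction

proj = X @ b_hat                 # scalar projection of every point onto b_hat
Sigma = np.cov(X, rowvar=False)  # sample covariance matrix of the original features

# Variance of the projections equals b_hat^T Sigma b_hat
assert np.isclose(proj.var(ddof=1), b_hat @ Sigma @ b_hat)
```

The same identity applied to the mean holds exactly too: the mean of the projections equals the projected mean vector, since projection is a linear operation.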

With this, we are set to proceed with formulating PCA as an optimization problem.

In one of the earlier articles, we formulated the entire linear regression algorithm from scratch.

In that process, we saw the following:

- How the assumptions originate from the algorithmic formulation of linear regression.
- Why they should not be violated.
- What to do if they get violated.
- and much more.

Overall, the article highlighted the statistical essence of linear regression and under what assumptions it is theoretically formulated.

Linear regression models a linear relationship between the features $(X)$ and the true/observed output variable $(y)$.

The estimate $\hat y$ is written as:

Where:

- $y$ is the observed/true dependent variable.
- $\hat y$ is the modeled output.
- $X$ represents the independent variables (or features).
- $θ$ is the estimated coefficient of the model.
- $\epsilon$ is the random noise in the output variable. This accounts for the variability in the data that is not explained by the linear relationship.

The primary objective of linear regression is to estimate the values of $\theta=(\theta_1,\theta_2,⋯,\theta_n)$ that most closely estimate the observed dependent variable $y$.

Talking specifically about the $\epsilon$ – the random noise, we assume it to be drawn from a Gaussian:

Another way of formulating this is as follows:

The above says that the conditional distribution of $Y$ given $X$ is a Gaussian with:

- Mean = $θ^TX$ (dependent on X).
- Variance = $\sigma^2$ (independent of $X$, i.e., constant).

A graphical way of illustrating this is as follows:

The regression line models the mean of the Gaussian distributions, and the mean varies with $X$. Also, all Gaussians have an equal variance.

So, in a gist, with linear regression, we are trying to explain the dependent variable $Y$ as a function of $X$.

We are given $X$, and we have assumed a distribution for $Y$. $X$ will help us model the mean of the Gaussian.

**But it’s obvious to guess that the above formulation raises a limitation on the kind of data we can model with linear regression.**

More specifically, problems will arise when, just like the mean, the variance is also a function of $X$.

However, during the linear regression formulation, we explicitly assumed that the variance is constant and never depends on $X$.

We aren’t done yet.

There’s another limitation.

Linear regression assumes a very specific form for the mean. In other words, the mean should be the linear combination of features $X$, as depicted below:

Yet, there’s every chance that the above may not hold true, and instead, the mean is represented as follows:

So, to summarize, we have made three core assumptions here:

**#1)** If we consider the conditional distribution $P(Y|X)$, we assume it to be a Gaussian.

**#2)** $X$ affects only the mean, and that too in a very specific way, which is linear in the individual features $x_j$.

**#3)** Lastly, the variance is constant for the conditional distribution $P(Y|X)$ across all levels of $X$.

But nothing stops real-world datasets from violating these assumptions.

In many scenarios, the data might exhibit complex and nonlinear relationships, heteroscedasticity (varying variance), or even follow entirely different distributions altogether.

Thus, we need an approach that allows us to adapt our modeling techniques to accommodate these real-world complexities.

**Generalized linear models attempt to relax these things.**

More specifically, they consider the following:

- What if the distribution isn’t normal **but some other distribution from the exponential family?**
- What if $X$ has a more sophisticated relationship with the mean?
- What if the variance varies with $X$?

Let’s dive in!

As the name suggests, generalized linear models (GLMs) are a generalization of linear regression models.

Thus, the linear regression model is a specific case of GLMs.

They expand upon the idea of linear regression by accommodating a broader range of data distributions.

Unlike traditional linear regression, which assumes that the response variable follows a normal distribution and has a linear relationship with the predictors (as discussed above), GLMs can handle various response distributions – Binomial, Poisson, Gamma distributions, and more.

However, before proceeding ahead, it is essential to note that with GLMs, the core idea still revolves around building “**linear models**.” Thus, in GLMs, we never relax the linear formulation:

We’ll understand why we do this shortly.

In GLMs, the first thing we relax is the conditional distribution $P(Y|X)$, which we assumed to be a Gaussian earlier.

We change the normal distribution to some other distribution from the exponential family of distributions, such as:

**Why specifically the exponential family?**

Because, as we will see ahead, everything we typically do with the linear model with a Gaussian extends naturally to this family. This includes:

- The way we formulate the likelihood function
- The way we derive the maximum likelihood estimators, etc.

But the above formulation gets extremely convoluted when we go beyond these distributions.

This is because, as the name suggests, the probability density functions (PDFs) of distributions in the exponential family can be manipulated into an exponential representation.

This inherent structure simplifies the mathematical manipulations we typically do in maximum likelihood estimation (MLE):

**#1) Define the likelihood function for the entire dataset:** Here, we typically assume that the observations are independent. Thus, the likelihood function for the entire dataset is the product of the individual likelihoods.

**#2) Take the logarithm (the obtained function is called the log-likelihood):** To simplify calculations and avoid numerical issues, it is common to take the logarithm of the likelihood function. *This step gets simplified if the likelihood values have an exponential term, which the exponential family of distributions can be manipulated to possess.*

**#3) Maximize the log-likelihood:** Finally, the goal is to find the set of parameters $\theta$ that maximize the log-likelihood function.

More specifically, when we take the logarithm in the MLE process, it becomes easier to transform the product in the likelihood function into a summation. This can be further simplified with the likelihood involving exponential terms.

We also saw this in one of the earlier articles:

Lastly, the exponential family of distributions can explain many natural processes.

This family encapsulates many common data distributions encountered in real-world scenarios.

- Count data $\rightarrow$ Poisson distribution may help.
- Binary Outcomes $\rightarrow$ Bernoulli distribution.
- Continuous Positive-Valued Data $\rightarrow$ Gamma distribution.
- The time between events $\rightarrow$ try Exponential Distribution.

So overall, we alter the specification of the normal distribution to some other distribution from the exponential family.

There's another thing we change.

Earlier, we had the formulation that the mean is directly the linear combination of the features.

In GLMs, we extend it to the idea that some function of the mean is the linear combination of the features.

Here, $F$ is called the link function.

While the above notation of putting the link function onto the mean may not appear intuitive, it’s important to recognize that this approach serves a larger purpose within GLMs.

By putting the transformation on the mean instead, we keep the essence of a linear model intact while also accommodating diverse response distributions.

Understanding the purpose of the link function is also simple.

Imagine you had a Bernoulli random variable for the conditional distribution of $Y$ given $X$.

We know that its mean ($p$ or $\mu(x)$) will always lie between [0,1].

However, if we did not add a link function and instead preferred the following representation, the mean could take any real value, as $\theta^T \cdot X$ can span all real values.

What the link function does is it maps the mean ($\mu(x)$), which is constrained to some specific range, to all possible real values so that the compatibility between the linear model and the distribution’s mean is never broken.

That is why they are also called “**link**” functions. It is because they link the linear predictor $\theta^T \cdot X$ and the parameter for the probability distribution $\mu(x)$.
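As a concrete instance, the canonical link for a Bernoulli response is the logit, which maps $\mu \in (0, 1)$ to the whole real line; its inverse (the sigmoid) maps any linear predictor back into $(0, 1)$:

```python
import numpy as np

# For a Bernoulli response, the mean mu lies in (0, 1). The logit link maps it
# onto the whole real line, so it can be equated with the linear predictor theta^T x.
def logit(mu):
    return np.log(mu / (1 - mu))

def inverse_logit(eta):
    """Map any real linear predictor eta back into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-eta))

print(logit(0.5))  # 0.0 -> a 50% mean corresponds to a zero linear predictor
```

However large or small the linear predictor gets, the inverse link keeps the implied mean inside the valid range of the Bernoulli parameter.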

Thus, to summarize, there are three components in GLMs.

One of the major aspects of training any reliable ML model is avoiding **overfitting**.

In a gist, overfitting occurs when a model learns to perform exceptionally well on the training data.

This may happen because the model is trying too hard to capture all **unrelated and random noise** in our training dataset, as shown below:

In this context, noise refers to any random fluctuations or inconsistencies that might be present in the data.

While learning this noise leads to a lower train set error and lets your model capture more intricate patterns in the training set, it comes at the tremendous cost of poor generalization on unseen data.

One of the most common techniques used to avoid overfitting is **regularization**.

Simply put, the core objective of regularization is to penalize the model for its complexity.

And if you have taken any ML course or read any tutorials about this, the most common approach they teach is to add a penalty (or regularization) term to the cost function, as shown below:

In the above expressions:

- $y^{(i)}$ is the true value corresponding to sample $i$.
- $\hat y^{(i)}$ is the model’s prediction corresponding to sample $i$, which is dependent on the parameters $(\theta_1, \theta_2, \cdots, \theta_K)$.
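The regularized cost above (squared error plus an L2 penalty) can be sketched as follows (the function name and the `lam` argument for the regularization strength are my own):

```python
import numpy as np

def l2_regularized_loss(y, y_hat, theta, lam):
    """Squared-error loss plus an L2 penalty on the parameters.
    lam controls how strongly model complexity is penalized."""
    squared_error = np.sum((y - y_hat) ** 2)
    penalty = lam * np.sum(theta ** 2)  # sum of squared parameters
    return squared_error + penalty
```

Setting `lam = 0` recovers the plain unregularized loss; increasing it penalizes large parameter values more heavily.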

While we can indeed validate the effectiveness of regularization experimentally (as shown below), it’s worth questioning the origin of the regularization term **from a probabilistic perspective**.

More specifically:

- Where did this regularization term originate from?
- What does the regularization term precisely measure?
- What real-life analogy can we connect it to?
- Why do we add this regularization term to the loss?
- Why do we square the parameters (specific to L2 regularization)? Why not any other power?
- Is there any probabilistic evidence that justifies the effectiveness of regularization?

Turns out, there is a concrete probabilistic justification for regularization.

Yet, in my experience, most tutorials never bother to cover it, and readers are always expected to embrace these notions as a given.

Thus, this article intends to highlight the origin of regularization purely from a probabilistic perspective and provide a derivation that makes total logical and intuitive sense.

**Let’s begin!**

Before understanding the origin of the regularization term, it is immensely crucial to learn a common technique to model labeled data in machine learning.

It’s called the maximum likelihood estimation (MLE).

We covered this in the following article, but let’s do it again in more detail.

Essentially, whenever we model data, the model is instructed to maximize the likelihood of observing the given data $(X,y)$.

More formally, a model attempts to find a specific set of parameters $\theta$ (also called model weights), which maximizes the following function:

The above function $L$ is called the likelihood function, and in simple words, the above expression says that:

- maximize the likelihood of observing $y$
- given $X$
- when the prediction is parameterized by some parameters $\theta$ (also called weights)

When we begin modeling:

- We know $X$.
- We also know $y$.
- The only unknown is $\theta$, which we are trying to figure out.

Thus, the objective is to find the specific set of parameters $\theta$ that maximizes the likelihood of observing the data $(X,y)$.

This is commonly referred to as **maximum likelihood estimation (MLE)** in machine learning.

MLE is a method for estimating the parameters of a statistical model by maximizing the likelihood of the observed data.

It is a common approach for parameter estimation in various regression models.

Let’s understand it with the help of an example.

Imagine you walk into a kitchen and notice several broken eggshells on the floor.

Here’s a question: **“Which of these three events is more likely to have caused plenty of eggshells on the floor?”**

- Someone was experimenting with them.
- Someone was baking a cake.
- Someone was creating an art piece with them.

Which one do you think was more likely to have happened?

Let’s look at what could have led to eggshells on the floor with the highest likelihood.

The likelihood of:

- Conducting a science experiment that led to eggshells on the floor is MEDIUM.
- Baking a cake that led to eggshells on the floor is HIGH.
- Creating an art piece that led to eggshells on the floor is LOW.

Here, we’re going to go with the event with the highest likelihood, and that’s **“baking a cake.”**

Thus, we infer that the most likely thing that happened was that someone was baking a cake.

**What we did was that we maximized the conditional probability of the event, given an explanation.**

- The event is what we observed, i.e., eggshells on the floor.
- The “explanation,” as the name suggests, is the possible cause we came up with.

We did this because there’s a probability of:

- “Eggshells” given “Experiment”
- “Eggshells” given “Cake”
- “Eggshells” given “Art piece”

We picked the one with the highest conditional probability $P(event|explanation)$.

Simply put, we tried to find the scenario that most likely led to the eggshells on the floor. **This is called maximum likelihood.**

This is precisely what we do in machine learning at times. We have a bunch of data and several models that could have generated that data.

First, we estimate the probability of seeing the data given “Model 1”, “Model 2”, and so on. Next, we pick the model that most likely produced the data.

In other words, we’re maximizing the probability of the data given the model.

Next, let’s consider an example of using MLE for simple linear regression.

Consider we have the following data:

The candidate models that could have generated the above data are:

It’s easy to figure out that **Model 2** would have most likely generated the above data.

**But how does a line generate points?**

Let’s understand this specific to linear regression.

We also discussed this more from a technical perspective in the linear regression assumptions article, but let’s understand visually this time.

Consider the diagram below.

Say you have the line shown in Step 1, and you select data points on this line.

Recalling the data generation process from the above article, we discussed that linear regression generates data from a Gaussian distribution with equal variance.

Thus, in step 2, we create Gaussians with equal variance centered at those points.

Lastly, in step 3, we sample points from these Gaussians.

And that is how a line generates data points.
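The three steps above can be simulated in a few lines (the line's coefficients, noise level, and seed below are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(42)

# Step 1: pick points on a line y = theta0 + theta1 * x
theta0, theta1 = 1.0, 2.0
x = np.linspace(0, 10, 50)
mean_y = theta0 + theta1 * x

# Steps 2 & 3: center a Gaussian of equal variance at each point
# on the line and sample one observation from each
sigma = 1.5
y = rng.normal(loc=mean_y, scale=sigma)  # the observed data points
```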

Now, in reality, we never get to see the line that produced the data. Instead, it’s the data points that we observe.

**Thus, the objective boils down to finding the line that most likely produced these points.**

We can formulate this as a maximum likelihood estimation problem by maximizing the likelihood of generating the data points given a model (or line).

Let’s look at the likelihood of generating a specific point from a Gaussian:

Similarly, we can write the likelihood of generating the entire data as the product of individual likelihoods:

Substituting the likelihood values from the Gaussian distribution, we get:

Maximizing the above expression is the same as minimizing the squared sum term in the exponent.

And this is precisely the least squares error.

Linear regression finds the line that minimizes the sum of squared distances.

This proves that finding the line that most likely produces a point using maximum likelihood is exactly the same as minimizing the least square error using linear regression.
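A small sketch makes the equivalence tangible: the Gaussian log-likelihood is a constant minus the sum of squared errors scaled by $\frac{1}{2\sigma^2}$, so ranking candidate predictions by likelihood is the same as ranking them by squared error (the helper name is mine):

```python
import numpy as np

def gaussian_log_likelihood(y, y_hat, sigma=1.0):
    """Log-likelihood of the observations y under Gaussians of equal
    variance centered on the line's predictions y_hat."""
    n = len(y)
    # Constant term (independent of y_hat) minus the scaled squared error
    return (-n / 2 * np.log(2 * np.pi * sigma**2)
            - np.sum((y - y_hat) ** 2) / (2 * sigma**2))
```

Since the first term does not depend on the predictions, maximizing this quantity over candidate lines is exactly minimizing the sum of squared errors.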

So far, we have learned how we estimate the parameters using maximum likelihood estimation.

We also looked at the example of eggshells, and it helped us intuitively understand what MLE is all about.

Let’s recall it.

This time, let’s replace “Art piece” with an “Eggshell throwing contest.”

So yet again, we have three possibilities that may have led to the observed event.

- Someone was experimenting with them.
- Someone was baking a cake.
- Someone was playing an Eggshell throwing contest.

Which one do you think was more likely to have happened?

It’s obvious that baking a cake will still lead to eggshells on the floor with a high probability.

But an eggshell-throwing contest is practically certain to create eggshells on the floor.

**However, something tells us that baking a cake is still more likely to have happened and not the contest, isn’t it?**

One reason is that an eggshell-throwing contest is not very likely – it’s an improbable event.

Thus, even though it’s more likely to have generated the evidence, it’s less likely to have happened in the first place. This means we should still declare “baking a cake” as the more likely event.

But what lets us formally conclude that baking a cake is still the more likely explanation?

**It’s regularization.**

In this context, regularization lets us add an extra penalty, which quantifies the probability of the event itself.

Even though our contest will undoubtedly produce the evidence, it is not very likely to have happened in the first place.

This lets us still overweigh “baking a cake” and select it as the most likely event.

This is the intuition behind **regularization**.
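In probabilistic terms, this intuition is just Bayes’ rule: the most likely event given the evidence weighs how well the event explains the evidence against how probable the event was to begin with:

```latex
\underbrace{P(\text{event} \mid \text{evidence})}_{\text{what we want}}
\;\propto\;
\underbrace{P(\text{evidence} \mid \text{event})}_{\text{likelihood}}
\;\cdot\;
\underbrace{P(\text{event})}_{\text{prior (the extra penalty)}}
```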

Now, at this point, we haven’t yet understood the origin of the regularization term, nor are we close to understanding its mathematical formulation.

Specifically talking about L2 regularization, there are a few questions that we need to answer:

- Where did the squared term come from?
- What does this term measure probabilistically?
- How can we be sure that the term we typically add is indeed the penalty term that makes the most probabilistic sense?

The same questions apply to L1 regularization as well.

- Where did the absolute sum of parameters come from?
- What does this term measure probabilistically?
- How can we be sure that the term we typically add is indeed the penalty term that makes the most probabilistic sense?

Let’s delve into the mathematics of regularization now.

KMeans is an unsupervised clustering algorithm that groups data based on distances. It is widely recognized for its simplicity and effectiveness as a clustering algorithm.

Essentially, the core idea is to partition a dataset into distinct clusters, with each point belonging to the cluster whose centroid is closest to it.

While its simplicity often makes it the most preferred clustering algorithm, KMeans comes with its own set of limitations that can restrict its performance in certain situations.

One of the primary limitations is its assumption of spherical clusters.

One intuitive and graphical way to understand the KMeans algorithm is to place a **circle** at the center of each cluster, which encloses the points.

💡

In 3 dimensions, circles can be replaced with spheres. In higher dimensions, they can be thought of as hyperspheres.

As KMeans is all about placing circles, its results aren’t ideal when the dataset has irregular shapes or varying sizes, as shown below:

Instead, an ideal clustering should cluster the data as follows:

This rigidity of KMeans, which can only model globular clusters, often leads to misclassification and suboptimal cluster assignments.
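As a quick illustration (the dataset and parameters are chosen for demonstration), the classic two-moons dataset has two obvious clusters that are anything but globular. KMeans, whose decision boundary is the perpendicular bisector of the two centroids, cuts straight through both moons:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two interleaving half-circles: clearly two clusters, but not spherical ones.
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agreement with the true moon labels (accounting for label swap).
# KMeans' straight boundary misassigns the tips of each moon.
acc = max(np.mean(labels == y_true), np.mean(labels != y_true))
print(f"agreement with true moon labels: {acc:.2f}")
```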

This limitation is somewhat connected to the one we discussed above.

Imagine we have the following dataset:

Clearly, the blue cluster has a larger spread.

Therefore, ideally, its influence should be larger as well.

However, when assigning a new data point to a cluster, KMeans only considers the distance to the centroid.

This means that the margin it creates for assigning new data points to either cluster lies equidistant from both centroids.

But considering the area of influence of the right cluster, having this margin more to the left makes more sense.

KMeans clustering performs hard assignments.

In simple words, this means that **a specific data point can belong to only one cluster**.

Thus, it does not provide probabilistic estimates of a given data point belonging to each possible cluster.

Although this might not be a problem per se, it limits the usefulness of KMeans in uncertainty estimation and in downstream applications that require a probabilistic interpretation.
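A small sketch makes the contrast concrete (toy blobs with assumed parameters): for a point midway between two overlapping clusters, KMeans returns a single all-or-nothing label, while scikit-learn's GaussianMixture reports a probability for each cluster:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Two overlapping toy blobs; a point midway between them is genuinely ambiguous.
X, _ = make_blobs(n_samples=300, centers=[[-1.5, 0], [1.5, 0]],
                  cluster_std=1.0, random_state=0)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

point = np.array([[0.0, 0.0]])     # roughly equidistant from both centers

hard = kmeans.predict(point)       # hard assignment: one label, no nuance
soft = gmm.predict_proba(point)    # soft assignment: a probability per cluster

print("hard assignment:", hard)
print("soft assignment:", soft.round(2))
```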

These limitations often make KMeans a non-ideal choice for clustering.

Therefore, learning about other better algorithms that we can use to address these limitations is extremely important.

**Therefore, in this article, we will learn about Gaussian mixture models.**

More specifically, we shall cover:

- Shortcomings of KMeans (already covered above)
- What is the motivation behind GMMs?
- How do GMMs work?
- The intuition behind GMMs.
- Plotting some dummy multivariate Gaussian distributions to better understand GMMs.
- The entire mathematical formulation of GMMs.
- How to use Expectation-Maximization to model data using GMMs?
- **Coding a GMM from scratch (without sklearn).**
- Comparing results of GMMs with KMeans.
- How to determine the optimal number of clusters for GMMs?
- Some practical use cases of GMMs.
- Takeaways.

Let's begin!

As the name suggests, a **Gaussian mixture model** clusters a dataset that has a mixture of multiple Gaussian distributions.

They can be thought of as a generalized twin of KMeans.

**Simply put, in 2 dimensions, while KMeans can only create circular clusters, Gaussian mixture models can create oval-shaped clusters.**
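A quick sketch of what "oval" means in practice (single-cluster toy data with an assumed stretch): with `covariance_type="full"`, scikit-learn's GaussianMixture learns a full covariance matrix, so the fitted cluster can be elongated along any direction rather than forced into a circle:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# One spherical toy blob, stretched 3x along x and squeezed to 0.5x along y.
X, _ = make_blobs(n_samples=500, centers=[[0, 0]], cluster_std=1.0, random_state=0)
X = X @ np.array([[3.0, 0.0], [0.0, 0.5]])

gmm = GaussianMixture(n_components=1, covariance_type="full", random_state=0).fit(X)

cov = gmm.covariances_[0]
# The learned covariance is far from a scaled identity: the fitted cluster
# is an oval, with much more variance along x than along y.
print(cov.round(2))
```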

But how do they do that?

Let's understand in more detail.

]]>