KMeans is an unsupervised clustering algorithm that groups data based on distances. It is widely recognized for its simplicity and effectiveness.

Essentially, the core idea is to partition a dataset into distinct clusters, with each point belonging to the cluster whose centroid is closest to it.
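This assignment rule can be sketched in a few lines of NumPy (a minimal illustration with made-up points and centroids, not a full KMeans implementation):

```python
import numpy as np

def assign_to_nearest(points, centroids):
    """Label each point with the index of its nearest centroid."""
    # Pairwise squared Euclidean distances: shape (n_points, n_centroids)
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

points = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
print(assign_to_nearest(points, centroids))  # → [0 0 1]
```

The full algorithm alternates this assignment step with recomputing each centroid as the mean of its assigned points.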

While its simplicity often makes it the first choice among clustering algorithms, KMeans has many limitations that hinder its effectiveness in many scenarios.

For instance, KMeans does not account for cluster variance and shape.

In other words, one of the primary limitations of KMeans is its assumption of spherical clusters.

One intuitive and graphical way to understand the KMeans algorithm is to place a **circle** at the center of each cluster, which encloses the points.

💡

In 3 dimensions, circles can be replaced with spheres. In higher dimensions, they can be thought of as hyper-spheres.

As KMeans is all about placing circles, its results aren’t ideal when the dataset has irregular shapes or varying sizes, as shown below:

Instead, an ideal clustering should cluster the data as follows:

The same can be observed in the following dataset:

This rigidity of KMeans, which restricts it to globular clusters, often leads to misclassification and suboptimal cluster assignments.

Another significant limitation of the KMeans clustering algorithm is its inherent assumption that every data point must be assigned to a cluster.

More specifically, KMeans operates under the premise that each data point belongs to one and only one cluster, leaving no room for the representation of noise or outliers.

This characteristic can be problematic in scenarios where the dataset contains irregularities, anomalies, or noise that do not conform to clear cluster boundaries.

As a result, KMeans may inadvertently assign data points to clusters even when they do not truly belong to any discernible pattern.

This is depicted in the image below:

This tendency can lead to suboptimal clustering results, particularly in datasets with varying densities, irregular shapes, or the presence of noisy data.

In real-world datasets where the underlying structure is not well-suited to the assumptions of KMeans, the algorithm’s rigid assignment of all data points to clusters can limit its effectiveness.

We should be mindful of this limitation and consider alternative clustering approaches, such as density-based or hierarchical methods, when dealing with datasets that exhibit significant noise or outliers.

One notable constraint of using the K-Means clustering algorithm is the requirement for prior knowledge or assumptions about the exact number of clusters in the dataset.

Unlike some other clustering algorithms that can automatically determine the optimal number of clusters based on the data's inherent structure, K-Means relies on a predefined value for the number of clusters (often denoted as 'k').

Of course, there are various methods to determine a reliable value for the parameter $k$, but this constraint still poses challenges in scenarios where the true number of clusters is not known in advance, leading to potential inaccuracies and suboptimal results.

On a side note, we commonly use the Elbow curve to determine the number of clusters (`k`) for KMeans.

However, the problem with the Elbow curve is that it:

- has a subjective interpretation
- involves ambiguity in determining the Elbow point accurately
- only considers within-cluster distances, among other issues.

Silhouette score is an alternative measure used to evaluate clustering quality, and it is typically found to be more reliable than the Elbow curve.

Measuring it across a range of centroids (`k`) can reveal which clustering results are most promising.
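As a sketch (using scikit-learn; the synthetic dataset and the range of `k` values are illustrative choices), sweeping the Silhouette score over candidate values of `k` might look like this:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 well-separated clusters (illustrative choice)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [0, 6], [6, 6]],
                  cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # expected to peak at the true number of clusters (4 here)
```

Unlike eyeballing an elbow, the `k` with the highest average silhouette gives an objective selection criterion.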

The visual below compares the Elbow curve and the Silhouette plot:

It’s clear that the Elbow curve can be highly misleading and inaccurate here.

In a dataset with 25 clusters:

- The Elbow curve depicts **four** as the optimal number of clusters.
- The Silhouette curve depicts **25** as the optimal number of clusters.

We have already covered this in a couple of previous newsletter issues if you want to get into more details:

Coming back...

The above limitations of KMeans highlight why learning about other algorithms that address them is extremely important.

While KMeans comes under the centroid-based clustering algorithms, there are many different types of clustering algorithms (shown below) that can be used depending on the situation:

We have already covered distribution-based clustering before:

So the focus of this article is **Density-based clustering**, specifically DBSCAN++, which is a significant improvement on the DBSCAN clustering algorithm.

Let’s understand in detail!

But before understanding DBSCAN++, let’s understand what DBSCAN is.

DBSCAN, which stands for **Density-Based Spatial Clustering of Applications with Noise**, is a popular clustering algorithm in machine learning and data mining.

As the name suggests, the core idea behind DBSCAN is to group together data points based on “density”, i.e., points that are close to each other in a high-density region and are separated by lower-density regions.

It does a great job of seeking areas in the data with a high density of data points, versus areas of the data that are not very dense with observations.

The notion of density in DBSCAN lets it sort data into clusters of varying shapes as well, which is its substantial advantage over traditional clustering algorithms such as KMeans.

As a result, it immediately resolves each of the above-discussed limitations of the KMeans clustering algorithm.

- As it clusters based on the notion of “density”, the clusters may not necessarily have a globular shape. Instead, they may have arbitrary shapes, as depicted below:

- Unlike KMeans, DBSCAN does not allocate each data point to a cluster. This is because DBSCAN operates on a fundamentally different principle—density-based clustering. Outliers are regions of very low density, and we will see shortly that DBSCAN can quickly identify regions of low density and classify points in those regions as outliers.
- What’s more, another benefit of DBSCAN is that it doesn’t require specifying the number of clusters in advance. Instead, it identifies clusters based on the density of data points.

We will walk through a simple example to understand how the algorithm works.

Let’s say we have a dataset of points like the following:

Clearly, we have two distinct clusters and one noise point at the center. Our objective is to cluster these points into groups that are densely packed together.

Firstly, we count the number of points in the vicinity of each data point. For example, if we start with the green point, we draw a circle around it.

The radius `epsilon` of the circle is the first parameter that we must specify when using DBSCAN. In other words, this is a hyperparameter.

After drawing the circle, we count the number of data points that fall within that $\epsilon$ radius circle. For instance, for our green point, there are six close points.

Likewise, we count the number of close points for all remaining points.

After counting the number of data points in the vicinity of each data point, we classify every data point into one of the three categories:

**Core point**: A data point that has at least `minPts` data points (including the point itself) at a distance less than or equal to `epsilon`. For instance, if `minPts=5`, the green point in the figure below is a core point:

**Border point**: A data point that does not have at least `minPts` data points (including the point itself) at a distance less than or equal to `epsilon`, but is in the vicinity of a core point. For instance, in the figure below, the green point is a border point.

**Noise point**: A data point that is neither a core point nor a border point is a noise point. For instance, in the figure below, the purple data point is a noise point:

And of course, as you may have already guessed, `minPts` is another hyperparameter of DBSCAN.

After classifying all the data points as core, border, or noise points, we proceed with clustering.

The idea is simple.

- Start with any core point (let’s call it `A`) and assign it to a cluster with cluster-ID, say `1` (to begin).

- All points in the vicinity of the above core point `A` will belong to the same cluster — whose cluster-ID is `1`.

- If a data point in the vicinity of the core point `A` is also a core point, as shown below...

- ...then we include data points in its vicinity in the same cluster — whose cluster-ID is `1`.

- The above step is executed **recursively** under the same cluster-ID until we cannot find a new core point.

- At this point, we may still have some core points in the dataset, but they may not belong to the previous cluster, as shown below. We repeat the same steps as above but with a new cluster ID this time.

The algorithm completes its execution once we are left with no core points.
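The core/border/noise classification described above can be sketched from scratch in NumPy (the points, `eps`, and `min_pts` below are made up for illustration):

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' per the DBSCAN definitions."""
    # Pairwise Euclidean distances between all points
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    neighbors = d <= eps                      # includes the point itself
    is_core = neighbors.sum(axis=1) >= min_pts
    labels = []
    for i in range(len(X)):
        if is_core[i]:
            labels.append("core")
        elif neighbors[i][is_core].any():     # within eps of some core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],   # dense blob
              [2.2, 0.5],                                    # near the blob
              [10, 10]])                                     # isolated
print(classify_points(X, eps=1.5, min_pts=4))
# → ['core', 'core', 'core', 'core', 'core', 'border', 'noise']
```

The clustering step then just connects core points that are within `eps` of each other and attaches their border points.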

In DBSCAN, determining the `epsilon` parameter is often tricky. But the Elbow curve is often helpful in determining it.

To begin, as we discussed above, DBSCAN has two hyperparameters:

- `epsilon`: two points are considered neighbors if they are closer than `epsilon`.
- `minPts`: the minimum number of neighbors required for a point to be classified as a core point.

We can use the Elbow curve to find an optimal value of `epsilon`:

For every data point, plot the distance to its $k^{th}$ nearest neighbor (in increasing order), where $k$ is the `minPts` hyperparameter. This is called the k-distance plot.

The optimal value of `epsilon` is found near the elbow point.
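A minimal NumPy sketch of the k-distance computation behind this plot (the dataset is synthetic and the value of `k` is illustrative):

```python
import numpy as np

def k_distances(X, k):
    """Distance from each point to its k-th nearest neighbor, sorted ascending."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    d.sort(axis=1)              # column 0 is the distance to itself (0.0)
    return np.sort(d[:, k])     # k-th neighbor, sorted for plotting

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),   # one dense cluster
               [[5.0, 5.0]]])                       # one isolated point
dists = k_distances(X, k=4)
# The sharp jump at the end of this curve (the "elbow") comes from the
# isolated point; epsilon is typically chosen just below that jump.
print(dists[-1] > 5 * dists[-2])  # → True
```

Plotting `dists` against the point index gives the k-distance plot described above.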

Why does it work?

Recall that in this plot, we are measuring the distance to a specific ($k^{th}$) neighbor for all points. Thus, the elbow point suggests a distance to a more isolated point or a point in a different cluster.

The point where change is most pronounced hints towards an optimal epsilon. The efficacy is evident from the image above.

As depicted above, selecting the elbow value provides better clustering results than an arbitrary value.

**Effective Density-Based Clustering**:

- Quite intuitively, DBSCAN excels in separating clusters of high density from those of low density within a given dataset.
- By focusing on the density of data points, it can identify clusters with varying shapes and sizes.

**Robust Outlier Handling**:

- DBSCAN is particularly adept at handling outliers within the dataset.
- As discussed earlier, a core point is itself an indicator of high density because of the many points in its vicinity. So if a data point has no core point close to it, we can intuitively say that it is an outlier.
- In fact, in some cases, DBSCAN is primarily employed as an outlier detection technique. The algorithm's ability to designate noise points as outliers contributes to a more comprehensive understanding of the data's structure.
- We don’t even have to proceed with clustering for outlier detection. Instead, we can just determine if a point is a noise point or not using the parameters `epsilon` and `minPts`.
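With scikit-learn's `DBSCAN`, noise points are simply those labeled `-1`, so outlier detection falls out of the clustering itself. A small sketch with illustrative data and parameter values:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Two dense clusters plus one obvious outlier (illustrative data)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 6]],
                  cluster_std=0.5, random_state=0)
X = np.vstack([X, [[20.0, 20.0]]])

labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(X)

print(labels[-1])               # the isolated point gets the noise label: -1
print(len(set(labels) - {-1}))  # number of clusters found
```

No number of clusters was specified anywhere; DBSCAN discovered it from the density of the data.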

Density-based clustering algorithms have greatly impacted a wide range of areas in data analysis, including outlier detection, computer vision, and medical imaging.

In fact, in many practical applications, as data volumes rise, it can become increasingly difficult to collect labels for supervised learning.

**In such cases, non-parametric unsupervised algorithms are becoming increasingly important in understanding large datasets.**

However, one of the biggest issues with DBSCAN is its run time.

For a long time, DBSCAN was believed to have a run-time of $O(n \log n)$, until it was proven to be $O(n^2)$ in the worst case.

Thus, there is an increasing need to establish more efficient versions of DBSCAN.

Although we won’t get into much detail here, it can be proved that DBSCAN can run in $O(n \log n)$ when the dimension is at most $2$, which is rarely the case.

However, it quickly starts to exhibit quadratic behavior in high dimensions and/or when $n$ becomes large.

This can also be verified from the figure below:

In this experiment, we have a simple 2-dimensional dataset with a varying number of data points.

It is clear that DBSCAN possesses a quadratic run-time with respect to the number of data points.

The quadratic run-time for most density-based algorithms can be realized from the fact that DBSCAN implicitly must compute density estimates for each data point, which is linear time in the worst case for each query.

As discussed above, in the case of DBSCAN, such queries are proximity-based, which are computed for each pair of data points.

This is what gives it a quadratic nature of run-time performance.

DBSCAN++ is a step towards a fast and scalable DBSCAN.

Simply put, DBSCAN++ is based on the observation that...

In many practical machine learning (ML) projects, it is a common practice to consolidate the data at a central location.

Subsequently, machine learning engineers leverage this centralized data for:

- Analysis,
- Conducting feature engineering,
- And ultimately proceed with the model training, validation, scaling, deployment, and ongoing production monitoring.

This traditional method is widely accepted and employed in developing ML models.

Nevertheless, a notable challenge associated with this conventional approach is its requirement for data to be physically centralized before any subsequent processing can occur.

Let's understand the issues with this in detail!

Consider that our application has a user base of millions. It’s evident that the quantity of data to deal with can be extremely high.

This data is valuable because modern devices have access to a wealth of data that can be suitable for machine learning models.

This data can significantly improve the user experience on the device.

For instance:

- If it’s text data, then language models can improve speech recognition and text entry
- If it’s image data, then many downstream image models can be improved, and more.

However, the conventional machine learning approach, which involves aggregating all data in a central repository, presents many challenges in such situations.

More specifically, in this approach, transferring data from individual user devices to a central location is both bandwidth and time-intensive, discouraging users from participating.

Even if users were incentivized to contribute data, the redundancy of having the data on both the user's device and the central server could be logistically infeasible because of the amount of data we might be dealing with.

Moreover, the data often contains personal information such as photos, private texts, and voice notes.

Requesting users to upload such sensitive data not only jeopardizes privacy but also raises legal concerns. Storing such data in a centralized database becomes problematic, introducing feasibility issues and privacy violations.


Moving large amounts of data to a central server can be costly in terms of user bandwidth and time.

But the data is still valuable to us, isn’t it? We want to utilize it in some way.

**Federated learning** is an incredible machine learning model training technique that minimizes data transfer, making it suitable for low-bandwidth and high-latency environments.

Let’s understand!

Formally, federated learning represents a decentralized approach to machine learning, wherein the training data remains localized on individual devices, such as smartphones.

Instead of transmitting data to a central server, models are dispatched to devices, trained locally, and only the resultant model updates are gathered and sent back to the server.

In essence, this approach involves leaving the training data on individual devices while learning a shared model by aggregating locally computed gradient updates.
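One common aggregation scheme is Federated Averaging (FedAvg), where the server computes a weighted average of locally trained model weights. A minimal NumPy sketch with made-up client updates:

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Aggregate client model weights, weighted by local dataset size (FedAvg)."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Two clients trained locally; only their weight vectors leave the device.
client_weights = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
client_sizes = [100, 300]   # client 2 has 3x more local data

global_weights = federated_average(client_weights, client_sizes)
print(global_weights)  # → [2.5 3.5]
```

The server never sees any raw data point, only these weight vectors, and the new global weights are then broadcast back to the clients for the next round.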

💡

The terminology "Federated Learning" is derived from the concept that a loose **federation** of participating devices (referred to as clients) collaborates with a central server to solve the learning task.

One of its primary merits lies in enhancing privacy and security by eliminating all dependencies on centralized data collection.

This is because each client possesses a local training dataset that remains exclusively on the device and is never uploaded to the server.

Instead, clients compute updates to the global model maintained by the server, transmitting only the essential "model update."

As a result, the entire model update process occurs on the client side, providing a key advantage by decoupling model training from the necessity for direct access to raw training data.

While a degree of trust in the coordinating server is required, federated learning effectively addresses major concerns associated with the conventional centralized approach to machine learning model training.

By facilitating on-device training and minimizing the need for extensive data transfer, federated learning presents practical solutions to challenges inherent in the traditional model training paradigm.

The primary motivations for using federated learning are:

**Privacy:**

- Safeguarding user data is a top priority, especially as more and more users have started caring about their privacy.
- Centralized data repositories pose inherent privacy risks, while federated learning mitigates these concerns by allowing data to reside exclusively on user devices, minimizing exposure.

**Bandwidth and Latency:**

- As previously discussed, the resource-intensive process of transferring substantial data volumes to a central server can be both time and bandwidth-consuming.
- Federated learning strategically minimizes data transfer, proving particularly advantageous in environments characterized by low bandwidth and high latency.

**Data Ownership:**

- Users maintain control and ownership of their data within the federated learning framework.
- This not only addresses concerns related to data ownership but also ensures the preservation of data rights, offering a user-centric approach to machine learning.

**Scalability:**

- Federated learning exhibits a natural scalability that aligns seamlessly with the increasing number of devices.
- This inherent scalability renders it well-suited for applications on a large scale, spanning mobile devices, IoT devices, and edge computing scenarios.

In essence, federated learning represents a paradigm shift by bringing our models to where the data resides, as opposed to the conventional approach of moving data to the location where the model is situated.

This inversion of the traditional model training process emphasizes the adaptability and efficiency of federated learning in contemporary data-driven applications.

Certainly, at this point, the argument may arise that anonymizing data before uploading it to central servers can address privacy concerns.

Simply put, anonymizing means removing all personally identifiable information (PII) from a dataset.

This typically involves replacing or encrypting specific data elements to prevent the identification of individuals associated with the information.

However, contrary to common belief, even handling anonymized data can introduce privacy issues.

Consider a scenario with a database of cardholders — a highly sensitive dataset.

While masking card numbers is a common practice, additional details such as cardholder addresses, necessary for processing, may still be present.

Thus, anonymizing the dataset does not always guarantee the elimination of privacy concerns.

Federated learning, on the other hand, minimizes the transmission of data-specific information to centralized locations. As discussed above, the information transmitted is minimal, typically containing significantly less raw data.

In this paradigm, only **model updates** are sent to the central server, and remarkably, the aggregation algorithm on the server side does not require knowledge of the source of these updates. Thus, the source information can be entirely ignored.

This lack of dependence on the source information guarantees true anonymity by ensuring that locally generated model updates can be transmitted without revealing any other details that might compromise user privacy.

This creates a mutually beneficial scenario.

- Users are content as their experience is driven by high-quality ML models without compromising their data.
- Simultaneously, teams benefit by successfully addressing various challenges, including:

- **Privacy Concerns:** Federated learning effectively sidesteps privacy issues associated with traditional centralized approaches.
- **Reduced Model Training Cost:** The approach helps mitigate costs associated with centralized model training.
- **Minimized Data Maintenance Cost:** Federated learning significantly diminishes the burden of data maintenance costs.
- **Large Dataset Training:** Teams can train models on expansive datasets without the need for centralized storage.
- **Better User Experience:** High-quality ML models can be developed without centralizing user data.

In essence, federated learning provides a win-win solution for everyone.

In federated learning, the scope of data used for model training extends beyond what centralized data engineering may have collected and managed.

By tapping into the full spectrum of data residing on individual user devices, federated learning enables models to learn from diverse and rich datasets.

This diversity enhances the robustness of models, making them more representative of real-world scenarios.

Federated learning not only improves models through collaborative training but also extends benefits directly to users.

When a user's device participates in model training, it receives updates based on collective knowledge, enhancing the user experience.

For example, in a personalized recommendation system, a user benefits from a model trained on the preferences of a larger user base, leading to more accurate and tailored recommendations.

Unlike traditional centralized approaches that demand substantial computational resources for data processing and model training on a central server, federated learning redistributes most computation to user devices.

This shift brings several advantages:

- **Reduced Server Load**: Central servers require less computational power as they no longer need to process and train on massive amounts of data.
- **Lower Latency**: Users experience lower latency since data doesn’t need to be transmitted to a remote server for processing, thereby improving the overall user experience.
- **Energy Efficiency**: The local computation on user devices can be more energy-efficient.

Before understanding key strategies for federated learning, it is essential to understand that the applicability of federated learning is not a one-size-fits-all proposition.

Rather than adopting it everywhere, understanding the specific situations when federated learning is the optimal approach is critical.

This is because if you understand these specific types of situations and come across one someday, you will immediately know that federated learning is the way to go.

In my experience, ideal problems for federated learning have the following properties:

In my experience, most ML projects lack a dedicated experimentation management/tracking system.

As the name suggests, this helps us track:

- **Model configuration** → critical for reproducibility.
- **Model performance** → critical for comparing different models.

…across all experiments.
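To make the idea concrete, here is a bare-bones, stdlib-only sketch of the kind of record such a tracking system stores per run (all names and values here are made up; tools like DVC provide a far more complete version of this):

```python
import json
import time
import uuid
from pathlib import Path

def log_experiment(params, metrics, log_dir="experiments"):
    """Append one run's configuration and performance to a JSON-lines log."""
    Path(log_dir).mkdir(exist_ok=True)
    record = {
        "run_id": uuid.uuid4().hex[:8],
        "timestamp": time.time(),
        "params": params,    # model configuration -> reproducibility
        "metrics": metrics,  # model performance  -> comparison across runs
    }
    with open(Path(log_dir) / "runs.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

run = log_experiment({"lr": 0.01, "max_depth": 6}, {"accuracy": 0.93})
print(run["params"]["lr"])  # → 0.01
```

A real tracking system additionally hashes the pipeline's inputs so it knows which steps can be skipped on a rerun, which is exactly the gap DVC fills.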

What’s more, consider that our ML pipeline has three steps:

If we only made some changes in model training (step 3), say, we changed a hyperparameter, does it make any sense to rerun the first two steps?

No, right?

Yet, typically, most ML pipelines rerun the entire pipeline, wasting compute resources and time.

Of course, we may set some manual flags to avoid this.

But being manual, it will always be prone to mistakes.

To avoid this hassle and unnecessary friction, an ideal tracking system must be aware of:

- All changes made to an ML pipeline.
- The steps it can avoid rerunning.
- The only steps it must execute to generate the final results.

While the motivation is quite clear, this is a critical skill that most people ignore, and they continue to leverage highly inefficient and manual tracking systems — Sheets, Docs, etc.

To help you develop that critical skill, I'm excited to bring you a special guest post by Bex Tuychiev.

Bex is a Kaggle Master, he’s among the top 10 AI/ML writers on Medium, and I am a big fan of his writing.

In this machine learning deep dive, he will provide a detailed guide on where we left last week — data version control with DVC.

Make sure you have read that article before proceeding ahead:

More specifically, this article will expand on further highly useful features of DVC for machine learning projects.

The article has been divided into two parts, and by the end of this article, you will learn:

- How to efficiently track and log your ML experiments?
- How to build efficient ML pipelines?

Over to Bex!

Keeping track of machine learning experiments is like keeping FIVE dogs in a bathtub.

Without help, at least FOUR of them are bound to slip out of your hands and ruin everything.

A total disaster is what’s going to happen if you don’t have a proper experiment management system.

First, you’ll probably end up with a complete mess of code, with no idea which version of the model is the most recent or the best performing.

You’ll constantly be overwriting and losing important code, and it will be almost impossible to reproduce your results or track your progress.

On top of that, you’ll have no way of keeping track of hyperparameters, metrics, or any other important details of your experiments (unless you want to write them down). You’ll be flying blind.

**In all seriousness, a proper experiment management system is crucial for any machine learning project.**

It allows you to track and compare your experiments, reproduce results, and make informed decisions about the direction of your project.

Without it, you’re just shooting in the dark and hoping for the best.

By finishing this tutorial, you will be able to track your machine-learning experiments **by adding a single line of code to your training script**.

In the end, you will have a table of experiments, which you can sort by any metric or parameter to find the best model for your use case.

Let's begin!

In a recent deep dive into model deployment, we discussed the importance of version-controlling deployments in machine learning (ML) projects:

More specifically, we looked at techniques to version control:

- Our deployment code, and
- Our deployed model.

Moving on, we also looked at various advantages of version-controlling model deployments.

Let’s recap those.

For instance, with version control, one can precisely identify what changed, when it changed, and who changed it — which is crucial information when trying to diagnose and fix issues that arise during the deployment process or if models start underperforming post-deployment.

**Another advantage of version control is effective collaboration.**

For instance, someone in the team might be working on identifying better features for the model, and someone else might be responsible for fine-tuning hyperparameters or optimizing the deployment infrastructure.

And it is well known that with version control, teams can work on the same codebase/data and improve the same models without interfering with each other’s work.

Moreover, one can easily track changes, review each other’s work, and resolve conflicts (if any).

**Lastly, version control also helps in the reproducibility of an experiment.**

It ensures that results can be replicated and validated by others, which improves the overall credibility of our work.

Version control allows us to track the exact code version and configurations used to produce a particular result, making it easier to reproduce results in the future.

This becomes especially useful for open-source data projects that many programmers may use.

**HOWEVER!**

Let me ask you something:

Purely from a reproducibility perspective, do you think model and code versioning are sufficient?

In other words, are these the only requirements to ensure model reproducibility?

See, when we want to reproduce a model:

- First, we need the exact version of the code that was used to train the model.
    - We have access to the code through code versioning.
- Next, we need the exact configuration the model was trained with:
    - This may include the random seed used in the model, the learning rate, the optimizer, etc.
    - Typically, configurations are a part of the code, so we know the configuration as well through code versioning.
- Finally, we need the trained model to compare its performance with the reproduced model.
    - We have access to the model as well through model versioning.

But how would we train the model without having access to the **exact dataset** that was originally used?

In fact, across model updates, our data can largely vary as well.

Thus, it becomes important to track the exact version of the dataset that was used in a specific stage of model development and deployment.

This is precisely what **data version control** is all about.

The motivation for maintaining a data version control system is quite intuitive and straightforward.

Typically, all real-world machine learning models are trained using large datasets.

Before training a model and even during its development, many transformations are regularly applied to the dataset.

This may include:

- Preprocessing
- Transformation
- Feature engineering
- Tokenization, and many more.

As the number of model updates (or iterations) increases, it can get quite difficult to track which specific version of the dataset was used to train the machine learning model.

If that is clear, then the motivation for having a data version control system in place becomes quite intuitive and simple, as it addresses many data-specific challenges:

With a data version control system, we can precisely reproduce the same training dataset for any given version of our machine learning model.

By tracking and linking each dataset version to a specific model version, we can easily recreate the exact conditions under which a model was trained, making it possible to replicate results.
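One simple way to realize this linking is content-addressing: hash the dataset file and store the digest alongside the model version (this is, roughly, what data version control tools do under the hood). A minimal stdlib sketch with a made-up file:

```python
import hashlib
from pathlib import Path

def dataset_version(path):
    """Hash a dataset file; the digest uniquely identifies this data version."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

Path("train.csv").write_text("x,y\n1,2\n")
v1 = dataset_version("train.csv")

Path("train.csv").write_text("x,y\n1,2\n3,4\n")   # the data changed
v2 = dataset_version("train.csv")

print(v1 != v2)  # → True: any change to the data yields a new version id
```

Recording the digest next to each trained model is enough to tell, later, exactly which data that model was trained on.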

Like codebase traceability, data version control enables traceability by documenting the history of dataset changes.

It offers a clear lineage of how the dataset has evolved over time, including information about who made changes, when those changes occurred, and the reasons behind those modifications.

This traceability is crucial for understanding the data's quality and history, ensuring transparency and accountability in our ML pipeline.

In collaborative machine learning projects, team members need to work on the same data, apply transformations, and access shared dataset versions.

Data version control simplifies data sharing and collaboration by providing a centralized repository for datasets.

Team members can easily access and sync with the latest data versions, ensuring that everyone is on the same page.

As we would see ahead in the practical demo, data version control systems help in optimizing storage and bandwidth usage by employing techniques such as:

- data deduplication
- data caching

This ensures that we do not waste resources on storing redundant copies of large datasets, making the storage and transfer of data more efficient.

It’s common for datasets to undergo quality checks and transformations during their preparation.

Data version control can help identify issues or discrepancies in the dataset as it evolves over time.

By comparing different dataset versions, we can spot unexpected changes and revert to previous versions if necessary.

Many industries and organizations have stringent data governance and compliance requirements.

Data version control supports these needs by providing a well-documented history of data changes, which can be crucial for audits and regulatory compliance.

In situations where a newer model version performs worse or encounters unexpected issues in production, having access to previous dataset versions allows for easy rollback to a more reliable dataset.

This can be a lifesaver in many situations.

**Here, you might be wondering why data rollback is of any relevance to us when it’s the model that must be rolled back.**

You are right.

But there are situations where data rollback is useful.

For instance, assume that your deployed machine learning model is a k-nearest neighbor (kNN) model.

**The thing is that we never train a kNN.**

In fact, there are no explicitly trained weights in the case of a kNN. Instead, there’s only the training dataset that is used for inference purposes.

The model is effectively the training data, and predictions are made based on the similarity of new data points to the stored instances.

If anything goes wrong with this dataset during production, having a data version control system in place can help us quickly roll back to a previous reliable dataset.

What could go wrong, you may wonder?

- Maybe the data engineering team has stopped collecting a specific feature due to a compliance issue.
- Maybe there were some human errors, like accidental data deletions, rendering the kNN unusable.
- Maybe there was data corruption — data files or records became corrupted due to a hardware failure, network issues, or software bugs — again rendering the kNN unusable.

In any of these cases, the data version control system lets us quickly roll back to a previous reliable dataset.

In the meantime, we can work in the development environment to investigate the issue and decide on the next steps.

Now that we have understood the motivation for having a data version control system, let’s look at some considerations for a **data version control** system.

We know that Git can manage the versioning of any type of file in a Git repository. These can be code, models, datasets, config files, etc.

These can be hosted on remote repositories like GitHub with a few commands.

So for someone using GitHub to host codebases, they might be tempted to extend the same to version datasets:

- They may manage different versions of the dataset with Git locally.
- They may host the repository on GitHub (or other services) for collaboration.

The rationale behind this idea could be that Git can version-control data as elegantly as it version-controls codebases.

Sounds like a fair thing to do, right?

Well, it’s not!

Recall our objective again.

In the previous section, we discussed using data version control for production systems.

As production systems are much more complex and involve collaboration, the codebase is typically hosted in a remote repository, like on GitHub.

But the problem is that GitHub repositories (and other similar hosting tools like GitLab) always have an upper limit on the file size we can push to these remote repositories.

In other words, GitHub is designed for lightweight code scripts. It is not well-suited for version controlling large files: GitHub blocks pushes of files larger than 100 MB.

Typically, in machine learning projects, the dataset size can be in the order of GBs. Thus, it is impractical to execute data version control with GitHub.

In fact, we can also verify this experimentally.

Consider we have the following project directory:

As shown above, the `data.csv` file takes about 200 MB of space.

Let’s create a local Git repository first before pushing the files to a remote repository hosted on GitHub.

To create a local Git repository:

- First, we shall initialize a git repository with `git init`.
- Next, we will add the files to the staging area with `git add`.
- Finally, we will use `git commit` to commit to the local repo.

This is demonstrated below:
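Concretely, the three steps map to the following commands (file names match the project directory shown earlier):

```shell
# Initialize a local repository in the project directory
git init

# Stage the dataset and the script
git add data.csv project.py

# Commit the staged files to the local repository
git commit -m "Add dataset and training script"

# Review the commit history
git log --oneline
```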

👉

This is not shown here but we will do the same for the project.py file as well.

All good so far. The changes have been committed successfully to the local git repo.

We can verify this by reviewing the commit history of the local git repo with the `git log` command.

Next, let’s try to push these changes to a remote GitHub repository. To do this, we have created a new repository on GitHub: `dvc_project`.

Before pushing the files to the remote repository, let’s also verify whether the local git repository has any uncommitted changes using the `git status` command:

💡

The `git status` command is used to check the status of your git repository. It shows the state of your working directory and helps you see all the files that are untracked by Git, staged, or unstaged.

The output says that the working tree is clean.

Thus, we can push these changes to the remote GitHub repository created above as follows:

As discussed earlier, GitHub repositories always have an upper limit on the file size we can push to these remote repositories. This is precisely what the above error message says.

For this reason, we can only use typical GitHub repositories for data version control as long as every file in the dataset stays below that size limit, which is rarely the case.

Thus, we need a better version control system, especially for large files, that does not have the above limitations.

💡

Before we proceed, a critical point to note here is that as long as we are **only working locally** and wish to maintain data version control in our personal machine learning projects, using the usual Git-based functionalities is not a big problem: you can use Git as long as you keep everything on your local computer. However, versioning large files is often time-consuming with Git, so it is recommended to use tools that are specifically built for this purpose.

An ideal data version control system must fulfill the following requirements:

**It should allow us to track all data changes like Git does with files.**

- As soon as we make any change (adding, deleting, or altering files) in a git-initialized repository, git can always identify those changes.
- The same should be true for our data version control system.

**It should not be limited to tracking datasets.**

- As we discussed above, GitHub sets an upper limit on the file size we can push to these remote repositories.
- But large files are not necessarily limited to datasets. In fact, models can also be large, and pushing model pickles to GitHub repositories can be difficult.

**It must have support for branching and committing.**

- Like Git, this data version control system must provide support for creating branches and commits.

**Its syntax must be similar to Git.**

- Of course, this is optional but good to have.
- Having a data version control system with a syntax similar to Git simplifies its learning curve.

**It must be compatible with Git.**

- In any data-driven project, data, code, and models always work together.
- After adopting the data version control system, it must not happen that we track code with Git and then manage models/datasets with an entirely incompatible tool.
- They must integrate seamlessly to avoid any unnecessary friction.

**It must have collaborative functionalities like Git.**

- Just as Git makes it extremely simple for teams to collaborate by sharing their code, the data version control system must also promote collaboration.

So what can we do here?

Random forest is a pretty powerful and robust model, which is a combination of many different decision trees.

What makes them so powerful over a traditional decision tree model is Bagging:

Anyone who has ever heard of Random Forest has surely heard of Bagging and **how** it works.

This is because, in my experience, there are plenty of resources that neatly describe:

- How Bagging algorithmically works in random forests.
- Experimental demo on how Bagging reduces the overall variance (or overfitting).

However, these resources often struggle to provide an intuition on:

- **Why** Bagging is so effective.
- **Why** do we sample rows from the training dataset **with replacement**.
- The mathematical demonstration that verifies variance reduction.

Thus, in this article, let me address all of these above questions and provide you with a clear and intuitive reasoning on:

- Why bagging makes the random forest algorithm so effective at variance reduction.
- Why does bagging involve sampling with replacement?
- How do we prove variance reduction mathematically?

👉

The code for this article and its practice exercise notebook has been provided towards the end of the article.

Let’s begin!

Decision trees are popular for their interpretability and simplicity.

Yet, unknown to many, they are pretty infamous when it comes to overfitting any data they are given.

This happens because a standard decision tree algorithm greedily selects the best split at each node, making its nodes more and more pure as we traverse down the tree.

Unless we restrict its growth, nothing can stop a decision tree from 100% overfitting the training dataset.

For instance, consider that we have the following dummy data, and we **intentionally** want to **100% overfit** it with, say, a linear regression model.

This task will demand some serious effort by the engineer.

In other words, we can’t just run `linear_model.fit(X, y)` in this case to directly overfit the dataset.

Instead, as mentioned above, this will require some serious feature engineering effort to entirely overfit the given dataset.

For instance, to intentionally overfit this dummy dataset, we would have to explicitly create relevant features, which, in this case, would mostly be higher-degree polynomial features.

This is shown below:
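As a minimal sketch of this idea (the dummy data and degrees are illustrative), we can watch the training fit improve as we hand linear regression higher-degree polynomial features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Dummy 1-D dataset: a noisy sine wave
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(30, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=30)

def train_r2(degree):
    """Fit a polynomial regression of the given degree; return training R^2."""
    X_poly = PolynomialFeatures(degree=degree).fit_transform(X)
    return LinearRegression().fit(X_poly, y).score(X_poly, y)

print(train_r2(1), train_r2(15))  # training fit climbs toward 1.0 with degree
```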

As shown above, as we increase the degree of our feature $x$ in our polynomial regression, the model starts to overfit the dataset more and more.

With a polynomial degree of $40$, the model entirely overfits the dataset.

The point is that overfitting this dataset (on any dataset, for that matter) with linear regression typically demands some engineering effort.

While the above dataset was easy to overfit, a complex dataset with all sorts of feature types may require serious effort to intentionally overfit the data.

**However**,** this is NEVER the case with a decision tree model.**

In fact, overfitting **any dataset** with a decision tree demands no effort from the engineer.

In other words, we can simply run `dtree_model.fit(X, y)` to overfit any dataset, regression or classification.

This happens because a standard decision tree always continues to add new levels to its tree until all leaf nodes are pure.

As a result, it always $100\%$ overfits the dataset by default, as shown below:
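We can confirm this behavior in a few lines (the dataset here is a hypothetical noisy sample):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy dummy regression data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.5, size=100)

# No growth restrictions: the tree splits until every leaf is pure
tree = DecisionTreeRegressor(random_state=0).fit(X, y)
print(tree.score(X, y))  # → 1.0, i.e., a 100% fit on the training data
```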

The same problem is observed in classification datasets as well.

For instance, consider the following dummy binary classification dataset.

It’s clear that there is some serious overlap between the two classes.

Yet, a decision tree does not care about that.

The model will still meticulously create its decision boundary such that it classifies the dataset with 100% accuracy.

This is depicted below:
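A minimal reproduction of the same effect on classification data (class overlap is simulated here via label noise):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Binary data with deliberately overlapping classes (20% flipped labels)
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.2, random_state=0)

# An unrestricted tree still carves a boundary around every noisy point
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.score(X, y))  # → 1.0 despite the heavy class overlap
```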

It is important to address this problem.

Of course, there are many ways to prevent this, such as pruning and ensembling.

👉

The main focus of this article is ensembling, specifically bagging, so we won’t get into much detail about pruning.

Pruning is commonly used in tree-based models, where it involves removing branches (or nodes) to simplify the model.

For instance, we can intentionally restrict the decision tree from growing after a certain depth. In sklearn’s implementation, we can do this by specifying the `max_depth` parameter.

Pruning is also possible by specifying the minimum number of samples required to split an internal node.

Another pruning technique is called the **cost-complexity-pruning (CCP)**.

CCP considers a combination of two factors for pruning a decision tree:

- Cost: the number of misclassifications
- Complexity: the number of nodes

Of course, dropping nodes will result in a drop in the model’s accuracy.

Thus, in the case of decision trees, the core idea is to iteratively drop sub-trees, which, after removal, leads to:

- a minimal increase in classification cost
- a maximum reduction of complexity (or nodes)

This is depicted below:

In the image above, both sub-trees result in the same increase in cost. However, it makes more sense to remove the sub-tree with more nodes to reduce computational complexity.

In sklearn, we can control cost-complexity-pruning using the `ccp_alpha` parameter:

- a large value of `ccp_alpha` → results in underfitting
- a small value of `ccp_alpha` → results in overfitting

The objective is to determine the optimal value of `ccp_alpha`, which gives a better model.

The effectiveness of cost-complexity-pruning is evident from the image below:

- Training the decision tree without any cost-complexity-pruning results in a complex decision region plot, and the model exhibits 100% accuracy.
- However, by tuning the `ccp_alpha` parameter, we prevented overfitting while improving the test set accuracy.
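One way to tune it (a sketch; in practice you would cross-validate rather than peek at a single test split) is to scan the candidate alphas suggested by sklearn’s `cost_complexity_pruning_path`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Candidate alphas come from the pruning path of the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# Score a pruned tree for each candidate alpha on held-out data
scores = {alpha: DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
                 .fit(X_tr, y_tr).score(X_te, y_te)
          for alpha in path.ccp_alphas}

best_alpha = max(scores, key=scores.get)
print(best_alpha, scores[best_alpha])
```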

Another widely used technique to prevent overfitting is ensemble learning.

In a gist, an ensemble combines multiple models to build a more powerful model.

Whenever I wish to intuitively illustrate their immense power, I use the following image:

They are fundamentally built on the idea that by aggregating the predictions of multiple models, the weaknesses of individual models can be mitigated. Combining models is expected to provide better overall performance.

Ensembles are primarily built using two different strategies:

- Bagging
- Boosting

Here’s how it works:

- Bagging creates different subsets of data with replacement (this is called bootstrapping).
- Next, we train one model per subset.
- Finally, we aggregate all predictions to get the final prediction.
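These three steps can be sketched from scratch in a few lines (a regressor version on a dummy noisy sine; the number of models is arbitrary):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.5, size=200)

# 1) Bootstrap: draw row indices WITH replacement, one subset per model
# 2) Train one unrestricted tree per subset
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# 3) Aggregate: average the individual predictions
def bagged_predict(X_new):
    return np.mean([t.predict(X_new) for t in trees], axis=0)

print(bagged_predict(X[:5]))
```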

Some common models that leverage Bagging are:

- Random Forests
- Extra Trees

Here’s how it works:

- Boosting is an iterative training process.
- The subsequent model puts more focus on misclassified samples from the previous model
- The final prediction is a weighted combination of all predictions

Some common models that leverage Boosting are:

- XGBoost,
- AdaBoost, etc.

Overall, ensemble models significantly boost the predictive performance compared to using a single model. They tend to be more robust, generalize better to unseen data, and are less prone to overfitting.

As mentioned above, the focus of this article is specifically **Bagging**.

In my experience, there are plenty of resources that neatly describe:

- How Bagging algorithmically works in random forests.
- Experimental demo on how Bagging reduces the overall variance (or overfitting).

For instance, we can indeed verify variance reduction ourselves experimentally.

The following diagram shows the decision region plot obtained from a decision tree and random forest model:

It’s pretty clear that a random forest does not exhibit as high variance (overfitting) as the decision tree model does.
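This comparison is easy to reproduce (synthetic two-moons data here; the exact numbers will vary):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.35, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# The lone tree memorizes its training data; the forest generalizes better
print("tree:  ", tree.score(X_tr, y_tr), tree.score(X_te, y_te))
print("forest:", forest.score(X_tr, y_tr), forest.score(X_te, y_te))
```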

Typically, these resources explain the idea of Bagging as follows:

Instead of training one decision tree, train plenty of them, each on a different subset of the dataset generated with replacement. Once trained, average the predictions of all individual decision tree models to obtain the final prediction. This reduces the overall variance and increases the model’s generalization.

However, these resources often struggle to provide an intuition on:

- **Why** Bagging is so effective.
- **Why** do we sample rows from the training dataset **with replacement**.
- The mathematical demonstration that verifies variance reduction.

Thus, in this article, let me address all of these above questions and provide you with a clear and intuitive reasoning on:

- Why bagging makes the random forest algorithm so effective at variance reduction.
- Why does bagging involve sampling with replacement?
- How do we prove variance reduction mathematically?

Towards the end, we shall also build an intuition towards the Extra trees algorithm and how it further contributes towards the variance reduction step.

Once we understand the objective bagging tries to solve, we shall also formulate new strategies to build our own bagging algorithms.

Let’s begin!

As shown in an earlier diagram, the core idea in a random forest model is to train multiple decision tree models, each on a different sample of the training dataset.

During inference, we take the average of all predictions to get the final prediction:

And as we saw earlier, training multiple decision trees reduces the model's overall variance.

But why?

**Let’s dive into the mathematics that will explain this.**
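Here is the standard identity we will build toward (a textbook result, stated up front: $\sigma^2$ is the variance of a single tree's prediction and $\rho$ the pairwise correlation between trees):

$$\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x)\right)=\rho\,\sigma^2+\frac{1-\rho}{B}\,\sigma^2$$

As the number of trees $B$ grows, the second term vanishes, so the averaged model's variance falls toward $\rho\sigma^2$. This also hints at why the sampling scheme matters: bootstrapping makes the trees less correlated, shrinking $\rho$ and hence the floor itself.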

The scikit-learn project has possibly been one of the most significant contributions to the data science and machine learning community for building traditional machine learning (ML) models.

Personally speaking, it’s hard to imagine a world without sklearn.

However, things get pretty concerning if we intend to deploy sklearn-driven models in real-world systems.

Let’s understand why.

Scikit-learn models are primarily built on top of NumPy, which, of course, is a fantastic and high-utility library for numerical computations in Python.

Yet, contrary to common belief, NumPy isn’t as optimized as one may hope to have in real-world ML systems.

One substantial reason for this is that **NumPy can only run on a single core of a CPU**.

This provides massive room for improvement as there is no parallelization support in NumPy (yet), and it naturally becomes a **big concern** for data teams to let NumPy drive their production systems.

Traditional ML models do perform well on tabular datasets. But as we discussed in a recent blog on model compression: *“Typically, when we deploy any model to production, the specific model that gets shipped to production is NOT solely determined based on performance. Instead, we must consider several operational metrics that are not ML-related.”*

Another major limitation is that scikit-learn models cannot natively run on Graphics Processing Units (GPUs).

Having GPU support in deployment matters because real-world systems often demand lightning-fast predictions and processing.

However, as discussed above, sklearn models are primarily driven by NumPy, which, disappointingly, can only run on a single core of a CPU. In this context, it is unlikely to have GPU support anytime soon.

In fact, this is also mentioned on Sklearn’s FAQ page:

**Question:** *Will you add GPU support?*

**Answer:** *No, or at least not in the near future. The main reason is that GPU support will introduce many software dependencies and introduce platform-specific issues. scikit-learn is designed to be easy to install on a wide variety of platforms.*

Further, they mention that “*Outside of neural networks, GPUs don’t play a large role in machine learning today, and much larger gains in speed can often be achieved by a careful choice of algorithms.*”

**I don’t entirely agree with this specific statement.**

Consider the enterprise space. Here, the data is primarily tabular. Classical ML techniques such as linear models and tree-based ensemble methods are frequently used to model the tabular data.

In fact, when you have tons of data to model, there’s absolutely no reason to avoid experimenting with traditional ML models first.

Yet, in the current landscape, one is often compelled to train and deploy deep learning-based models just because they offer optimized matrix operations using **tensors**.

We see a clear gap here.

Thus, in this article, let’s learn a couple of techniques today:

- How do we run traditional ML models on large datasets?
- How do we integrate GPU support with traditional ML models in deployment systems?
- While there is no direct way to do this, we must (somehow) compile our machine-learning model to tensor operations, which can be loaded on a GPU for acceleration. We’ll discuss this in the article shortly.

But before that, we must understand a few things.

More specifically:

- What are tensors?
- How are tensors different from a traditional NumPy array?
- Why are tensor computations faster than NumPy operations, and why are tensor operations desired?

Let’s begin!

Many often interpret tensors as a complicated and advanced concept in deep learning.

However, it isn’t.

The only thing that is ever there to understand about Tensors is that, like any NumPy array, Tensors are just another data structure to store multidimensional data.

- When we use NumPy to store numerical data, we create a **NumPy array** — NumPy’s built-in data structure.
- When we use PyTorch (for instance) to store numerical data, we create a **Tensor** — PyTorch’s built-in data structure.

That’s it.

**Tensor, like NumPy array, is just another data structure.**
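To make this concrete (assuming PyTorch is installed), both structures hold the same multidimensional data and behave almost identically:

```python
import numpy as np
import torch

arr = np.array([[1.0, 2.0], [3.0, 4.0]])      # NumPy's data structure
ten = torch.tensor([[1.0, 2.0], [3.0, 4.0]])  # PyTorch's data structure

# Same data, same shape, same element-wise semantics
print(arr.shape, tuple(ten.shape))             # (2, 2) (2, 2)
print((arr * 2).sum(), float((ten * 2).sum())) # 20.0 20.0

# Converting between the two is straightforward
ten_from_arr = torch.from_numpy(arr)           # shares memory on CPU
arr_from_ten = ten.numpy()
```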

Now, an obvious question at this point is:

Why create another data structure when NumPy arrays do the exact same thing of storing multidimensional data, and they are very well integrated with other scientific Python libraries?

There are multiple reasons why PyTorch decided to develop a new data structure.

NumPy undoubtedly offers:

- extremely fast, and
- optimized operations.

This happens through its vectorized operations.

Simply put, vectorization offers run-time optimization:

- when dealing with a batch of data together…
- …by avoiding native Python for-loops (which are slow).
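A quick illustration of the gap between a native Python loop and the vectorized equivalent (exact timings depend on your machine):

```python
import time
import numpy as np

x = np.random.rand(1_000_000)

t0 = time.perf_counter()
total_loop = 0.0
for v in x:                      # native Python for-loop: one element at a time
    total_loop += v * v
t1 = time.perf_counter()

total_vec = float(np.dot(x, x))  # vectorized: one call over the whole batch
t2 = time.perf_counter()

print(f"loop: {t1 - t0:.3f}s, vectorized: {t2 - t1:.3f}s")
```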

But as discussed earlier in this article, NumPy **DOES NOT support parallelism.**

Thus, even though its operations are vectorized, every operation is executed in a single core of the processing unit.

This provides further scope for run-time improvement.

Of course, there are open-source libraries like Numexpr that address this limitation by providing a fast evaluator for NumPy expressions using:

**Multi-threading:**

- It is a parallel computing technique that allows a program to execute multiple threads (smaller units of a process) concurrently.
- In the context of NumPy and libraries like Numexpr, multi-threading accelerates mathematical and numerical operations by dividing the computation across multiple CPU cores.
- This approach is particularly effective when you have a multi-core CPU, as it leverages the available cores for parallelism, leading to faster computation.

**Just-in-time (JIT) compilation:**

- JIT compilation is a technique used to improve the run-time performance of code by compiling it at run-time, just before execution.
- In the context of Numexpr (and similar libraries), JIT compilation involves taking a NumPy expression or mathematical operation and dynamically generating machine code specific to the operation.
- As a result, JIT-compiled code can run much faster than equivalent pure Python code because it is optimized for the specific operation and can make use of low-level hardware features.

The speedup offered by Numexpr is evident from the image below.

According to Numexpr’s documentation, depending upon the complexity of the expression, the speed-ups can range from 0.95x to 20x.

Nonetheless, the biggest problem is that Numexpr can only speed up element-wise operations on NumPy arrays.

This includes:

- Element-wise sum/multiplication, etc.
- Element-wise transformations like `sin`, `log`, etc.
- and more.

**But Numexpr has no parallelization support for matrix multiplications, which, as you may already know, are the backbone of deep learning models.**

This problem gets resolved in PyTorch tensors as they offer parallelized operations.

**An important point to note:**

**GPU parallelization:**

- If you're working with PyTorch tensors on a GPU (using CUDA), the matrix multiplication operation is highly parallelized across the numerous cores of the GPU.
- Modern GPUs consist of thousands of cores designed for parallel computation.
- When you perform a matrix multiplication on a GPU, these cores work together to compute the result much faster than a CPU could.

**CPU parallelization:**

- The extent of parallelization on a CPU may depend on the CPU’s architecture.
- All CPUs these days have multiple cores, and PyTorch is optimized to utilize these cores efficiently for matrix operations.
- While it may not be as parallel as a GPU, you can still expect significant speed improvements over performing the operation in pure Python.

We can also verify experimentally:

- On the left, we create two random NumPy arrays and perform matrix multiplication using the `np.matmul()` method.
- On the right, we create two random PyTorch tensors and perform matrix multiplication using the `torch.matmul()` method.

As depicted above, **PyTorch is over three times faster than NumPy**, which is a massive speed-up.

This proves that PyTorch provides highly optimized vector operations, which the neural network can benefit from, **not only during forward pass but backpropagation as well**.
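A minimal version of that experiment is below; note that the measured gap will vary with your BLAS build, thread count, and hardware:

```python
import time
import numpy as np
import torch

n = 1024
a_np = np.random.rand(n, n)
b_np = np.random.rand(n, n)
a_t = torch.from_numpy(a_np)   # zero-copy view of the same data
b_t = torch.from_numpy(b_np)

t0 = time.perf_counter(); c_np = np.matmul(a_np, b_np); t1 = time.perf_counter()
t2 = time.perf_counter(); c_t = torch.matmul(a_t, b_t); t3 = time.perf_counter()

print(f"NumPy: {t1 - t0:.4f}s, PyTorch: {t3 - t2:.4f}s")
```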

In fact, here’s another reason why tensor operations in PyTorch are faster.

See, as we all know, **NumPy is a general-purpose computing framework** that is designed to handle a wide range of numerical computations across various domains, not limited to just deep learning or machine learning.

In fact, NumPy is not just used by data science and machine learning practitioners, but it is also widely used in various scientific and engineering fields for tasks such as signal processing, image analysis, and simulations in physics, chemistry, and biology.

It’s so popular in Biological use cases that a bunch of folks extended the NumPy package to create BioNumPy:

While its versatility is a key strength, it also means that NumPy’s internal optimizations are geared toward a broad spectrum of use cases.

On the other hand, **PyTorch is purposely built for deep learning and tensor operations**.

This specialized focus allows PyTorch to implement highly tuned and domain-specific optimizations for tensor computations, including matrix multiplications, convolutions, and more.

These optimizations are finely tuned to the needs of deep learning, where large-scale matrix operations are fundamental.

Being niched down to a specific set of users allowed PyTorch developers to optimize tensor operations, including matrix multiplication to a specific application — deep learning.

If you want another motivating example, we discussed this in the newsletter here:

In a gist, the core idea is that the more specific we get, the better we can do compared to a generalized solution.

Deep learning is all about a series of matrix operations applied layer after layer to generate the final output.

For instance, consider the following neural network for regression:

- First, the input received at the input layer $(x_1, x_2, \cdots, x_m)$ is transformed by a set of weights $(W_A)$ and an activation function to get the output of the hidden layer.
- Next, the output of the hidden layer is further transformed by a set of weights $(W_B)$ to get the final output (t).

If we were to use NumPy to represent input, weights, and layer outputs, it would be impossible to tell how a specific array was computed.

For instance, consider the two NumPy arrays `arr1` and `arr2` below:

The NumPy array `arr3` holds no information about how it was computed.

In other words, as long as we don’t manually dig into the code, we can never tell:

- What were the operands?
- What was the operator?

But why do we even care about that information?

See, as long as we are only doing a forward pass in a neural network, we don’t care which specific operation and which arrays generated a particular layer output. We only care about the output in that case.

But that’s not how neural networks are trained, is it?

To train a neural network, we must run backpropagation.

To run backpropagation, we must compute gradients to update the weights.

And to compute gradients of a layer’s weights, we must know the specific arrays that were involved in that computation.

For instance, consider the above neural network again:

To update the weights $W_B$, we must compute the gradient $\Large \frac{\delta L}{\delta W_B}$.

The above gradient depends on the loss value $L$, which in turn depends on $\hat y$.

Thus, we must know the specific vectors that were involved in the computation of $\hat y$.

While this is clear from the above network:

- What if we add another layer?
- What if we change the activation function?
- What if we add more neurons to the layer?
- What if we were to compute the gradient of weight in an earlier layer?

All this can get pretty tedious to manage manually.

However, if (somehow) we can keep track of how each tensor was computed, what operands were involved, and what the operator was, we can simplify gradient computation.

A computational graph helps us achieve this.

Simply put, a computational graph is a directed acyclic graph representing the sequence of mathematical operations that led to the creation of a particular tensor.

💡

A directed acyclic graph is a directed graph with no directed cycles.

During network training, PyTorch forms this computational graph during the forward pass.

For instance, the computational graph for a dummy neural network is shown below:

💡

Here, we have represented the operation as **matmul** for simplicity. In reality, however, PyTorch stores the backward function of that operation in its computational graph.

- First, we perform a matrix multiplication between the input $X$ and the weights $W_A$ to get the output activations $Z_B$ (we are ignoring any activation functions for now).
- Next, we perform a matrix multiplication between the output activations $Z_B$ and the weights $W_B$ to get the network output $Z_C$.

During backpropagation, PyTorch starts scanning the computational graph backward, i.e., from the output node, iteratively computes the gradients, and updates all the weights.

The program that performs all the gradient computation is called PyTorch Autograd.
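Here is Autograd in action on the toy two-layer network above (shapes are illustrative; activations are omitted, as in the text):

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 3)                        # a tiny batch of inputs
W_A = torch.randn(3, 5, requires_grad=True)  # hidden-layer weights
W_B = torch.randn(5, 1, requires_grad=True)  # output-layer weights

Z_B = X @ W_A          # the forward pass builds the computational graph...
Z_C = Z_B @ W_B        # ...each tensor remembers the op that produced it
loss = Z_C.pow(2).mean()

loss.backward()        # walk the graph backward, filling in .grad
print(W_A.grad.shape, W_B.grad.shape)
```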

Large deep learning models demand plenty of computational resources for speeding up model training.

However, NumPy operations are primarily designed to run on the Central Processing Unit (CPU), which is the general-purpose processor in most computers.

While CPUs are versatile and suitable for many tasks, they do not provide the speed and parallel processing capabilities needed for large-scale numerical computations, especially in the context of modern deep learning and scientific computing.

On a side note, if you genuinely want to run NumPy-like computation on a GPU, CuPy is an open-source NumPy alternative that you may try.

It’s a NumPy-compatible array library for GPU-accelerated computing.

The syntax of CuPy is quite compatible with NumPy. To use the GPU, you typically just need to replace `import numpy as np` with `import cupy as cp` (and the `np.` calls with `cp.`) in your code.

Nonetheless, the issue of not being able to track how each array was computed still exists with CuPy.

Thus, even if we wanted to, we could not use CuPy as an alternative to NumPy.

In fact, CuPy, like NumPy, is also a general-purpose scientific computation library. So any deep learning-specific optimizations are still not up to the mark.

These limitations prompted PyTorch developers to create a new data structure, which addressed these limitations.

This also suggests that by somehow compiling machine learning models to tensor computations, we can leverage immense inference speedups.

Before getting into those details, let’s understand how we can train sklearn models on large datasets on a CPU.

So far, we have spent plenty of time understanding the motivation for building traditional ML models on large datasets.

As sklearn can only utilize the CPU, using it for large datasets is still challenging.

Yet, there’s a way.

We know that sklearn provides a standard API across each of its machine learning model implementations.

- Train the model using `model.fit()`.
- Predict the output using `model.predict()`.
- Compute the accuracy using `model.score()`.
- and more.

However, the problem with training a model this way is that the sklearn API expects the entire training data at once.

This means that the entire dataset **must be** available in memory to train the model.

But what if the dataset itself is too large to load in memory? These are called out-of-memory datasets.

In fact, even if we can somehow barely load the dataset in memory, it might still be difficult to train the model because training itself requires additional computations, which, of course, consume memory.

Thus, there’s a high possibility that the program (or Jupyter kernel) may crash.

Nonetheless, there’s a solution to this problem.

In situations where it’s not possible to load the entire data into the memory at once, we can load the data in chunks and fit the training model for each chunk of data.

This is also called incremental learning, and sklearn conveniently provides the flexibility to do so.

More specifically, sklearn implements the `partial_fit()` API for various algorithms, which offers incremental learning.

As the name suggests, the model can learn incrementally from a mini-batch of instances. This sidesteps memory constraints, as only a few instances are loaded in memory at once.

What’s more, by loading and training on a few instances at a time, we can possibly speed up the training of sklearn models.

Why?

Usually, when we use the `model.fit(X, y)` method to train a model in sklearn, the training process is vectorized but on the entire dataset.

While vectorization provides magical run-time improvements when we have a bunch of data, it is observed that the performance may degrade after a certain point.

Thus, by loading fewer training instances at a time into memory and applying vectorization, we can get a better training run-time.

Let’s see this in action!

First, let’s create a dummy classification dataset with:

- 20 Million training instances
- 5 features
- 2 classes

We will use the `make_classification()` method from sklearn to do so:
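A sketch of the dataset creation (scaled down here so it runs quickly; the article's actual run uses 20 million rows, and the column names are illustrative):

```python
import pandas as pd
from sklearn.datasets import make_classification

# Article settings use n_samples=20_000_000; scaled down here for illustration.
X, y = make_classification(
    n_samples=100_000,
    n_features=5,
    n_informative=4,
    n_redundant=1,
    n_classes=2,
    random_state=42,
)

df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])
df["label"] = y
df.to_csv("large_dataset.csv", index=False)  # filename referenced later in the article
```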

After creating a Pandas DataFrame and exporting it to a CSV, the dataset occupies roughly 4 GBs of local storage space:

We train an `SGDClassifier` model using sklearn on the entire dataset as follows:
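A sketch of the full-dataset training (data is generated inline and scaled down so the snippet is self-contained; `model_full` is the name used in the later comparison):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Small inline dataset; the article trains on the full 20M-row CSV instead.
X, y = make_classification(n_samples=10_000, n_features=5, random_state=42)

model_full = SGDClassifier(random_state=42)
model_full.fit(X, y)           # the entire dataset must be in memory here

print(model_full.score(X, y))  # training accuracy
```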

The above training takes about $251$ seconds.

Next, let’s train the same model using the `partial_fit()` API of sklearn.

Here’s what we shall do:

- Load data from the CSV file `large_dataset.csv` in chunks.
    - We can do this by specifying the `chunksize` parameter in the `pd.read_csv()` method.
    - Say `chunksize=400000`; this would mean that Pandas will only load 400,000 rows at a time in memory.
- After loading a specific chunk, we will invoke the `partial_fit()` API on the `SGDClassifier` model.

This is implemented below:
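A self-contained sketch of the chunked approach (a small stand-in CSV is generated inline; the article uses `chunksize=400000` on the 4 GB file):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Create a small stand-in for large_dataset.csv so the sketch runs end to end.
X, y = make_classification(n_samples=50_000, n_features=5, random_state=42)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])
df["label"] = y
df.to_csv("large_dataset.csv", index=False)

model_chunk = SGDClassifier(random_state=42)
all_classes = np.array([0, 1])  # partial_fit needs the full class list up front

# chunksize=400_000 in the article; smaller here to match the toy dataset.
for chunk in pd.read_csv("large_dataset.csv", chunksize=10_000):
    X_chunk = chunk.drop(columns=["label"]).to_numpy()
    y_chunk = chunk["label"].to_numpy()
    model_chunk.partial_fit(X_chunk, y_chunk, classes=all_classes)

print(model_chunk.score(X, y))
```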

**The training time is reduced by ~8 times**, which is massive.

💡

In this case, the `classes` parameter is used to specify all the classes in the training dataset. When using the `partial_fit()` API, a mini-batch may not have instances of all classes (especially the first mini-batch). Thus, the model will be unable to cope with new/unseen classes in subsequent mini-batches. Therefore, we must pass a list of all possible classes in the `classes` parameter.

This validates what we discussed earlier:

While vectorization provides magical run-time improvements when we have a bunch of data, it is observed that the performance may degrade after a certain point. Thus, by loading fewer training instances at a time into memory and applying vectorization, we can get a better training run-time.

Of course, we must also compare the model coefficients and prediction accuracy of the two models.

The following visual depicts the comparison between `model_full` and `model_chunk`:

Both models have similar coefficients and similar performance.

Having said that, it is also worth noting that not all sklearn estimators implement the `partial_fit()` API.

Here's the list of models that do:

Once we have trained our sklearn model (either on a small dataset or large), we may want to deploy it.

However, as discussed earlier, sklearn models are backed by NumPy computations, so they **can only run on a single core of a CPU**.

Thus, in a deployment scenario, this can lead to suboptimal run-time performance.

Nonetheless, it is possible to compile many sklearn models to tensor operations, which can be loaded on a GPU to gain immense speedups.

Let’s understand how.

In recent deep dives, we’ve primarily focused on cultivating skills that can help us develop large machine learning (ML) projects.

For instance, in the most recent deep dive on “**Model Compression**”, we learned many techniques to drastically reduce the size of a model to make it more production-friendly.

In the above article, we saw how these techniques allow us to reduce both the latency and size of the original model, which directly helps in:

- Lowering computation costs.
- Reducing model footprint.
- Improving user experience due to low latency…

**…all of which are critical metrics for businesses.**

However, learning about model compression techniques isn’t sufficient.

In most cases, we would only proceed with model compression when the model is intended to serve an end-user.

And that is only possible when we know how to deploy and manage machine learning in production.

Thus, after learning about model compression techniques, we are set to learn the next critical skill — **deployment**.

In my opinion, many think about deployment as just “deployment” — host the model somewhere, obtain an API endpoint, integrate it into the application, and you are done!

**But that is almost NEVER the case.**

This is because, in reality, plenty of things must be done post-deployment to ensure the model’s reliability and performance.

What are they? Let’s understand!

Deployment is a pivotal stage in the ML project lifecycle. It’s that stage where a real user will rely on your model’s predictions.

Yet, it’s important to recognize that deployment is not the final destination.

After deploying a model, several critical considerations must be addressed to ensure its reliability and performance.

Let’s understand them.

Version control is critical to all development processes. It allows developers to track software changes (code, configurations, data, etc.) over time.

In the context of data teams, version control can be especially crucial when deploying models.

For instance, with version control, one can precisely identify what changed, when it changed, and who changed it — which is crucial information when trying to diagnose and fix issues that arise during the deployment process or if models start underperforming post-deployment.

This goes back to what we discussed in a recent deep dive — “*Machine learning deserves the rigor of any software engineering field.*”

If the model starts underperforming, git-based functionality allows us to quickly roll back to previous versions of the model.

There are many other benefits too.

Effective collaboration becomes increasingly important as data science projects get bigger and bigger.

Someone in the team might be working on identifying better features for the model, and someone else might be responsible for fine-tuning hyperparameters or optimizing the deployment infrastructure.

And it is well known that with version control, teams can work on the same codebase/data and improve the same models without interfering with each other’s work.

Moreover, one can easily track changes, review each other’s work, and resolve conflicts (if any).

Reproducibility is one of the critical aspects of building reliable machine learning.

Imagine this: something that works on one system but does not work on another reflects bad reproducibility practices.

Why is it important, you may wonder?

It ensures that results can be replicated and validated by others, which improves the overall credibility of our work.

Version control allows us to track the exact code version and configurations used to produce a particular result, making it easier to reproduce results in the future.

This becomes especially useful for open-source data projects that many may use.

CI/CD enables teams to build, test, and deploy code quickly and efficiently.

In machine learning, Continuous Integration (CI) may involve automatically building and testing changes to ML models as soon as they are committed to a code repository.

In Continuous Deployment (CD), the objective is to release model changes once they have passed testing.

Consequently, it should seamlessly update the changes to production, making the latest version of the model available to end users.

Model logging is another crucial aspect of post-deployment ML operations.

As the name suggests, logging involves capturing and storing relevant information about model performance, resource utilization, predictions, input data, latency, etc.

There are various reasons why model logging is important and why it’s something that should NEVER be overlooked.

To understand better, imagine you have already deployed a model, and it is serving end-users.

Once deployed, it is very unlikely that nothing will go wrong in production, **especially on the data front**!

Let’s understand in detail.

**Concept drift** happens when the statistical properties of the target variable or the input features of the model change over time.

In simpler terms, the relationship between your model's inputs and outputs evolves, making your model less accurate over time if not addressed.

Concept drift can occur due to various reasons, such as:

- Changes in user behavior
- Shifts in the data source
- Alterations in the underlying data-generating process.

For instance, imagine you are building a spam email classifier. You train the model on a dataset collected over several months.

Initially, the model performs well and accurately classifies spam and non-spam emails.

However, over time, email spamming techniques evolve.

New types of spam emails emerge with different keywords, structures, and techniques.

This change in the underlying concept of “spam” represents concept drift.

That is why it is important to have periodic retraining or continuous training strategies in place.

If your model isn't regularly retrained with up-to-date data, it may start misclassifying the new types of spam emails, leading to decreased performance.

💡

The term ‘Covariates’ refers to the features of your model.

**Covariate shift** is a specific type of distribution shift that occurs when the distribution of the input features (covariates) in your data changes over time, but the true relationship between the target variable and the input remains the same.

In other words, the true (or natural) relationships between the input features and the target variable stay constant, but the distribution of the input features shifts.

For instance, consider this is the true relationship, which is non-linear:

Based on the observed training data, we ended up learning a linear relationship:

However, at the time of inference post-deployment, the distribution of input samples was different from that of the observed distribution:

It leads to poor model performance because the model was trained on one distribution of the data, but now, it is being tested or deployed on a different distribution.

Methods for addressing covariate shifts include reweighting the training data or using domain adaptation techniques to align the source and target distributions.
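A toy sketch of the reweighting idea (the 1-D Gaussians standing in for the training and deployment input distributions are hypothetical; in practice the density ratio must be estimated from data):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x_train = rng.normal(0.0, 1.0, size=1_000)  # training inputs ~ p(x) = N(0, 1)

# Suppose deployment inputs follow q(x) = N(1, 1). Weight each training
# point by the density ratio q(x) / p(x) to emphasize the shifted region.
weights = norm.pdf(x_train, loc=1.0) / norm.pdf(x_train, loc=0.0)

# These weights can then be passed to most sklearn estimators, e.g.:
# model.fit(X, y, sample_weight=weights)
print(weights.mean())  # close to 1 in expectation
```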

💡

To an extent, batch normalization is a remedy for covariate shifts in neural networks.

For instance, suppose you are building a weather forecasting model. You train the model using historical weather data from a specific region, and the training data includes features like temperature, humidity, and wind speed.

However, when you deploy the model to a different region with a distinct climate, the distribution of these features can shift significantly.

For instance, temperature ranges and humidity levels in the new region might be quite different from those in the training data. This covariate shift can cause your model to make inaccurate predictions in the new environment.

When building statistical models, we typically assume that the samples are identically distributed.

**Non-stationarity** refers to the situation where the **probability distribution** of the samples evolves over time in a non-systematic or unpredictable manner.

This can encompass various aspects, including changes in data distributions, trends, seasonality, or other patterns.

**Non-stationarity can be challenging for machine learning models, as they are typically trained while assuming that the data distribution remains constant.**

For instance, assume you are building some wealth predictor. A raw currency amount will typically not serve as a good feature because currency values are affected by inflation.

Models deployed in non-stationary environments may need regular updates or adaptive learning strategies to cope with changing data patterns.

**Unrepresentative training data** is a situation where the data used to train a machine learning model does not adequately represent the real-world conditions or the diversity of scenarios that the model will encounter in production.

When training data is not representative, the model may perform well on the training data but poorly on new, unseen data.

This issue can lead to bias and poor generalization.

For instance, suppose you are building a speech recognition system for a voice assistant.

You collect training data primarily from young adults with clear accents and no speech impairments.

However, in real-world usage, the voice assistant will be used by people of all ages who speak various languages and have different speech patterns.

If your training data is unrepresentative and biased towards a specific demographic, the model may struggle to understand and accurately transcribe speech from a more diverse user base, leading to poor performance in production.

The above four problems that we typically face in production systems highlight the importance of model logging.

As discussed above, addressing these issues often involves continuous monitoring of model performance in the deployment environment, collecting and labeling new data when necessary, and retraining the model to adapt to changing conditions.

Traditionally, in industry, advanced techniques like adaptive learning are employed to update the model with the new data and potentially mitigate the impact of concept drift, covariate shift, non-stationarity, and unrepresentative training data.

👉

Below, we shall discuss a bit about Adaptive learning (mostly conceptual ideas), but we will do a full practical article on this soon.

Traditionally, ML models are trained on some gathered fixed/static dataset and then used to make predictions on unseen data.

While this is how machine learning has been typically (and successfully) approached so far, it becomes infeasible to train a new model from scratch every time we get some new data.

This makes intuitive sense as well.

Adaptive learning is a remedy to this problem.

Adaptive models can adapt and improve their performance as they are exposed to more data, leading to better accuracy and utility.

In situations where the data distribution is constantly changing, adaptive models can adapt and continue to perform well, while non-adaptive models may struggle.

A major advantage of adaptive learning is that the model isn’t trained from scratch: if the previously trained model seeds the next retraining iteration, the additional computational cost of every update is small.

In the realm of solving real-life problems by deploying machine learning models, it is inevitable that, with time, data distribution will change.

As a result, the models trained on old data will likely provide little value going forward.

Adaptive models are a great solution in such situations, as they consistently adapt to the incoming data streams.

In almost all ML use cases, the algorithm is never coded from scratch.

Instead, one uses open-source implementations offered by libraries like PyTorch, Sklearn, and many more.

To ensure reproducibility in production, the production environment should be consistent with the environment in which it was trained.

This involves installing similar versions of libraries used, software dependencies, OS configurations, and many more.

Of course, achieving this consistency need not be a painstaking process; at its core, all you should do is maintain an environment configuration.

Yet, it does require careful environment configuration and management.

This involves documenting and tracking the versions of all software components, libraries, and dependencies used during model development and deployment.

To address these consistency challenges, organizations often use containerization technologies like Docker.

Containers encapsulate the entire environment, including software dependencies, libraries, and configurations, ensuring that the same environment is replicated in both the development and production stages.

ML engineers may not have experience with deployment. They may not have the necessary expertise in areas such as software engineering, MLOps, and infrastructure management.

This can make it difficult for them to effectively deploy and scale models in production environments.

In such cases, organizations hire specialized talents.

However, engineers hired specifically for deployment may not have an in-depth understanding of ML algorithms and techniques.

This makes it difficult for them to understand the code and make the necessary optimizations, leading to issues with scaling, performance, and reliability, which can ultimately impact the effectiveness of the model in production.

The above pain points, along with the data challenges we discussed above, highlight the **necessity for a data scientist to have the necessary deployment expertise**.

Traditional hosting services like Google Cloud, AWS, and Heroku have been go-to options for deploying machine learning models.

However, the process can be challenging and time-consuming, requiring specialized expertise in infrastructure and DevOps.

For data scientists without these skills, deploying models to production can be a significant pain point.

There are several challenges associated with traditional hosting services.

First, data scientists often switch between different tools and environments to manage deployments. This means leaving the comfort of their Jupyter notebooks, where they spend most of their time developing and refining models.

The process can be jarring, and the need to learn new tools and interfaces can slow down productivity.

Second, deploying machine learning models to production environments demands plenty of configuration and management of infrastructure resources, including servers, networking, and security.

This is a specialized area that many data scientists may need to become more familiar with, and it can take a lot of time and effort to get right.

The above pain points highlight a need for a simple and elegant way to deploy machine learning models that doesn’t require specialized expertise and can be done entirely from a Jupyter Notebook.

Modelbit is a deployment service that specifically addresses all these challenges, allowing data scientists to deploy models with just a single command from their notebooks.

With Modelbit, there’s no need to worry about infrastructure, security, or server management — the service takes care of everything, allowing data scientists to focus on what they are supposed to do — building and improving models.

Thus, in this article, let’s understand how to use Modelbit for machine learning model deployment.

Let’s begin!

🗒️

The core objective behind model deployment is to obtain an API endpoint for our deployed model, which can be later used for inference purposes:

Modelbit lets us seamlessly deploy ML models directly from our Python notebooks (or Git, as we would see ahead in this article) and obtain a REST API.

Since Modelbit is a relatively new service, let’s understand the general workflow to generate an API endpoint when deploying a model with Modelbit.

The image below depicts the steps involved in deploying models with Modelbit:

- Step 1) We connect the Jupyter kernel to Modelbit.
- Step 2) Next, we train the ML model.
- Step 3) We define the inference function. Simply put, this function contains the code that will be executed at inference. Thus, it will be responsible for returning the prediction.
- Step 4) [OPTIONAL] Here, we specify the version of Python and other open-source libraries we used while training the model.
- Step 5) Lastly, we send it for deployment.

Once done, Modelbit returns the API endpoint, which we can integrate into any of the applications and serve end-users with.

Let’s implement this!

Training machine learning (ML) models is frequently driven by a relentless pursuit of achieving higher and higher accuracies.

Many create increasingly complex deep learning models, which, without a doubt, do incredibly well “performance-wise.”

However, the complexity severely impacts their real-world utility.

**For years, the primary objective in model development has been to achieve the best performance metrics.**

👉

This, unfortunately, is also a practice that many leaderboard-based competitions promote. Nothing wrong with that, but in my opinion, it overshadows the importance of focusing on the real-world applicability of the solution.

However, it is important to note that when it comes to deploying these models in production (or user-facing) systems, the focus shifts from raw accuracy to considerations such as efficiency, speed, and resource consumption.

Thus, typically, when we deploy any model to production, the specific model that gets shipped to production is NOT solely determined based on performance.

Instead, we must consider several operational metrics that are not ML-related.

**What are they? Let’s understand!**

When a model is deployed into production, certain requirements must be met.

Typically, these “requirements” are not considered during the prototyping phase of the model.

For instance, it is fair to assume that a user-facing model may have to handle plenty of requests from a product/service the model is integrated with.

And, of course, we can never ask users to wait for, say, a minute for the model to run and generate predictions.

Thus, along with “model performance,” we would want to optimize for several other operational metrics:

It’s the time it takes for a model to process a single input and generate a prediction.

It measures the delay between sending a request to the model and receiving the response.

Striving for a low inference latency is crucial for all real-time or interactive applications, as users expect a quick response.

High latency, as you may have guessed, will lead to a poor user experience and will not be suitable for many applications like:

- Chatbots
- Real-time speech-to-text transcription
- Gaming, and many more.
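A rough sketch of how per-request latency can be measured (the `dummy_model` stand-in is hypothetical; in practice you would time `model.predict()` calls):

```python
import time
from statistics import median

def dummy_model(x):  # hypothetical stand-in for model.predict()
    return sum(x) / len(x)

latencies_ms = []
for _ in range(1_000):
    start = time.perf_counter()
    dummy_model([1.0, 2.0, 3.0])
    latencies_ms.append((time.perf_counter() - start) * 1_000)

# Report the median (robust to outliers); p95/p99 are also common in production.
print(f"median latency: {median(latencies_ms):.4f} ms")
```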

Throughput is the number of inference requests a model can handle in a given time period.

It estimates the model’s ability to process multiple requests simultaneously.

Yet again, as you may have guessed, high throughput is essential for applications with a high volume of incoming requests.

These include e-commerce websites, recommendation systems, social media platforms, etc. High throughput ensures that the model can serve many users concurrently without significant delays.

This refers to the amount of memory a model occupies when loaded for inference purposes.

It quantifies the memory footprint required to store all the parameters, configurations, and related data necessary for the model to make predictions or generate real-time outputs.
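A back-of-the-envelope sketch of this footprint is parameter count × bytes per parameter (the 110M figure is BERT-base’s approximate size, used purely as an example):

```python
n_params = 110_000_000  # e.g., BERT-base has roughly 110M parameters
bytes_per_param = 4     # float32

size_mb = n_params * bytes_per_param / 1024**2
print(f"~{size_mb:.0f} MB just for the weights")  # ~420 MB
```

The real footprint is larger still, since activations, buffers, and framework overhead also occupy memory at inference time.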

The significance of model size becomes particularly apparent when deploying models in resource-constrained environments.

Many production environments, such as mobile devices, edge devices, or IoT devices, have limited memory capacity.

It is obvious to guess that in such cases, the model’s size will directly impact whether it can be deployed at all.

Large models may not fit within the available memory, making them impractical for these resource-constrained settings.

This is a famous story.

In 2006, Netflix launched the “Netflix Prize,” a machine learning competition that encouraged ML engineers to build the best algorithm to predict user ratings for films.

**The grand prize was USD 1,000,000.**

After the competition concluded, Netflix awarded a $1 million prize to a developer team in 2009 for an algorithm that increased the accuracy of the company's recommendation engine by **10 percent**.

That’s a lot!

Yet, Netflix never used that solution because it was overly complex.

Here’s what Netflix said:

The increase in accuracy of the winning improvements did not seem to justify the engineering effort needed to bring them into a production environment.

The complexity and resource demands of the developed model made it impractical for real-world deployment. Netflix faced several challenges:

- **Scalability:** The model was not easily scalable to handle the vast number of users and movies on the Netflix platform. It would have required significant computational resources to make real-time recommendations for the millions of users they had.
- **Maintenance:** Managing and updating such a complex model in a production environment would have been a logistical nightmare. Frequent updates and changes to the model would be challenging to implement and maintain.
- **Latency:** The ensemble model's inference latency was far from ideal for a streaming service. Users expect near-instantaneous recommendations, but the complexity of the model made achieving low latency difficult.

You can read more about this story here: Netflix Prize story.

Consequently, Netflix never integrated the winning solution into its production recommendation system. Instead, they continued to use a simplified version of their existing algorithm, which was more practical for real-time recommendations.

This real-life instance from the Netflix Prize was a reminder that we must strive for a delicate balance between model complexity and practical utility.

**While highly complex models may excel in research and competition settings, they may not be suitable for real-world deployment due to scalability, maintenance, and latency concerns.**

In practice, simpler and more efficient models often are a better choice for delivering a seamless user experience in production environments.

Let me ask you this. Which of the following two models would you prefer to integrate into a user-facing product?

I strongly prefer Model B.

If you understand this, you resonate with the idea of keeping things simple in production.

Fortunately, there are various techniques that can help us reduce the size of the model, thereby increasing the speed of model inference.

These techniques are called **Model Compression** methods.

Using these techniques, you can reduce both the latency and size of the original model.

As the name suggests, model compression is a set of techniques used to reduce the size and computational complexity of a model while preserving or even improving its performance.

They aim to make the model smaller — that is why the name “**model compression**.”

Typically, it is expected that a smaller model will:

- Have a lower inference latency as smaller models can deliver quicker predictions, making them well-suited for real-time or low-latency applications.
- Be easy to scale due to their reduced computational demands.
- Have a smaller memory footprint.

In this article, we’ll look at four techniques that help us achieve this:

- Knowledge Distillation
- Pruning
- Low-rank Factorization
- Quantization

As we will see shortly, these techniques attempt to strike a balance between model size and accuracy, making it relatively easier to deploy models in user-facing products.

👉

The Jupyter notebook of this entire article has been provided at the bottom of the article.

Let’s understand them one by one!

This is one of the most common, effective, and reliable techniques to reduce model size, and one of my personal favorites.

Essentially, knowledge distillation involves training a smaller, simpler model (referred to as the “student” model) to mimic the behavior of a larger, more complex model (known as the “teacher” model).

The term can be broken down as follows:

**Knowledge:** Refers to the understanding, insights, or information that a machine learning model has acquired during training. This “knowledge” can be typically represented by the model’s parameters, learned patterns, and its ability to make predictions.

**Distillation:** In this context, distillation means transferring or condensing knowledge from one model to another. It involves training the student model to mimic the behavior of the teacher model, effectively transferring the teacher's knowledge.

This is a two-step process:

- Train the large model as you typically would. This is called the “teacher” model.
- Train a smaller model, which is intended to mimic the behavior of the larger model. This is also called the “student” model.

The primary objective of knowledge distillation is to transfer the knowledge, or the learned insights, from the teacher to the student model.

This allows the student model to achieve comparable performance with fewer parameters and reduced computational complexity.

The technique makes intuitive sense as well.

Of course, as in a real-world teacher-student scenario in an academic setting, the student model may never perform quite as well as the teacher model.

But with consistent training, we can create a smaller model that is **almost** as good as the larger one.

This goes back to the objective we discussed above:

Strike a balance between model size and accuracy, such that it is relatively easier to deploy models in user-facing products.

A classic example of a model developed in this way is DistilBERT. It is a student model of BERT.

We also discussed this in the newsletter here:

DistilBERT is approximately $40\%$ smaller than BERT, which is a massive difference in size.

Still, it retains approximately $97\%$ of the natural language understanding (NLU) capabilities of BERT.

**What’s more, DistilBERT is roughly 60% faster in inference.**

This is something I have personally experienced and verified in one of my research studies on Transformer models:

As shown above, on one of the studied datasets (SensEval-2), BERT achieved the best accuracy of $76.81$. With DistilBERT, it was $75.64$.

On another task (SensEval-3), BERT achieved the best accuracy of $80.96$. With DistilBERT, it was $80.23$.

Of course, DistilBERT isn’t as good as BERT. Yet, the performance difference is small.

Given the run-time performance benefits, it makes more sense to proceed with DistilBERT instead of BERT in a production environment.

💡

If you are interested in learning more about my research study, you can read it here: **A Comparative Study of Transformers on Word Sense Disambiguation**.

One of the biggest downsides of knowledge distillation is that one must still train a larger teacher model first to train the student model.

However, in a resource-constrained environment, it may not be feasible to train a large teacher model.

Assuming we are not resource-constrained at least in the development environment, one of the most common techniques for Knowledge Distillation is **Response-based Knowledge Distillation**.

As the name suggests, in **response-based knowledge distillation**, the focus is on matching the output responses (predictions) of the teacher model and the student model.

In a classification use case, this technique transfers the **probability distributions** of class predictions from the teacher to the student.

It involves training the student to produce predictions that are not only accurate but also mimic the soft predictions (probability scores) of the teacher model.

As we are trying to mimic the **probability distribution of the class predictions of the teacher model**, one ideal candidate for the loss function is KL divergence.

We discussed this in detail in one of the previous articles on t-SNE.

Yet, here’s a quick recap:

The core idea behind KL divergence is to assess how much information is lost when one distribution is used to approximate another.

Thus, the more information is lost, the higher the KL divergence, and the greater the dissimilarity between the two distributions.

KL divergence between two probability distributions $P(x)$ and $Q(x)$ is calculated as follows:
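In symbols:

$$D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$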

The formula for KL divergence can be read as follows:

The KL divergence $D_{KL}(P \| Q)$ between two probability distributions $P$ and $Q$ is calculated by summing the above quantity over all possible outcomes $x$. Here:

- $P(x)$ represents the probability of outcome $x$ occurring according to distribution $P$.
- $Q(x)$ represents the probability of the same outcome occurring according to distribution $Q$.

It measures how much information is lost when using distribution $Q$ to approximate distribution $P$.
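As a quick sanity check, here is a plain-Python sketch of discrete KL divergence (the helper `kl_divergence` is illustrative, not from the article):

```python
import math

# Discrete KL divergence D_KL(P || Q) for two probability lists.
# Terms with P(x) = 0 contribute nothing, so they are skipped.
def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print(kl_divergence(p, p))  # identical distributions -> 0.0
print(kl_divergence(p, q))  # > 0: information is lost approximating P with Q
```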

Imagine this. Say $P$ and $Q$ were identical. This should result in zero loss of information. Let’s verify this from the formula above.

If the probability distributions $P$ and $Q$ are identical, it means that for every $x$, $P(x) = Q(x)$. Thus,
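Substituting $Q(x) = P(x)$:

$$D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{P(x)} = \sum_x P(x) \log 1$$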

Simplifying, we get:
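Since $\log 1 = 0$:

$$D_{KL}(P \| Q) = 0$$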

This is precisely what we intend to achieve in response-based knowledge distillation.

Simply put, we want the probability distribution of the class predictions of the student model to be identical to the probability distribution of the class predictions of the teacher model.

- First, we can train the teacher model as we typically would.
- Next, we can instruct the student model to mimic the probability distribution of the class predictions of the teacher model.

Let’s see how we can practically use response-based knowledge distillation using PyTorch.

More specifically, we shall train a slightly complex neural network on the MNIST dataset. Then, we will build a simpler neural network using the response-based knowledge distillation technique.

First, we import the required packages from PyTorch:
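A minimal set of imports for this walkthrough could look like this (`torchvision` would additionally be needed for the MNIST dataloaders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
```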

Next, we load the MNIST dataset (train and test) and create their respective PyTorch dataloaders.

Now, we shall define a simple CNN-based neural network architecture. This is demonstrated below:
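Here is one possible sketch of such a CNN-based teacher for 28×28 MNIST inputs (the layer sizes and the name `TeacherNet` are assumptions, not a prescribed architecture):

```python
import torch
import torch.nn as nn

class TeacherNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 14x14 -> 7x7
        )
        self.fc = nn.Linear(32 * 7 * 7, 10)        # 10 MNIST classes

    def forward(self, x):
        x = self.conv(x)
        return self.fc(x.flatten(1))               # logits, shape (N, 10)
```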

Moving on, we shall initialize the teacher model and define the loss function to train it, the `CrossEntropyLoss`.

Now, we will train the teacher model.

With this, we are done with the Teacher model.

Next, we must train the Student Model.

We defined the Teacher model as a CNN-based neural network architecture. Let’s define the Student model as a simple feed-forward neural network without any CNN layers:
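A possible sketch of the student network together with a response-based distillation loss (the names `StudentNet` and `distillation_loss` are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentNet(nn.Module):
    def __init__(self):
        super().__init__()
        # A plain feed-forward network: no convolutional layers.
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128), nn.ReLU(),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        return self.net(x)

def distillation_loss(student_logits, teacher_logits):
    # Convert both outputs to (log-)probabilities with softmax,
    # then compute the KL divergence between the two distributions.
    student_log_probs = F.log_softmax(student_logits, dim=1)
    teacher_probs = F.softmax(teacher_logits, dim=1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```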

The above method accepts two parameters:

- The output of the student model (`student_logits`).
- The output of the teacher model (`teacher_logits`).

We convert both outputs to probabilities using the softmax function.

Finally, we find the KL divergence between them and return it.

Moving on, we shall initialize the student model and define the optimizer as we did before.

Finally, it’s time to train the Student model.
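The training loop could then look roughly like this; a synthetic batch stands in for the MNIST dataloader, and the tiny teacher/student networks here are stand-ins for the ones defined earlier:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
teacher = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
student = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-2)

images = torch.randn(64, 1, 28, 28)     # stand-in batch
labels = torch.randint(0, 10, (64,))

for step in range(50):
    optimizer.zero_grad()
    with torch.no_grad():
        teacher_logits = teacher(images)  # teacher is frozen
    student_logits = student(images)
    # Blend the hard-label loss with the distillation (KL) loss.
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits, dim=1),
                  F.softmax(teacher_logits, dim=1),
                  reduction="batchmean")
    loss = ce + kd
    loss.backward()
    optimizer.step()
```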

To recap, the teacher model was a CNN-based neural network architecture. The student model, however, was a simple feed-forward neural network.

The following visual compares the performance of the teacher and the student model:

Before we understand what OOP is, let’s look at the backstory: what led us to formulate “OOP”?

Why did programmers think it was essential to have such a thing? What were the pain points in the traditional way of programming?

Traditional programming (or procedural programming) focuses on breaking a problem into steps or procedures executed one after the other.

While this approach was suitable for small-scale projects or experimentation, it eventually became problematic when projects transformed into large, complex systems.

In other words, it became difficult to manage, understand, and maintain the code as it grew in size and complexity.

One major challenge with traditional programming was the lack of structure and organization.

Before OOP, programs were often written in a linear fashion. This made it difficult to manage large amounts of code and understand the relationships between different parts of the program.

As a result...👇

...traditional programming resulted in code that was harder to maintain and debug. Consequently, it was prone to more bugs and errors.

This resulted in longer development cycles and reduced software quality.

The list of challenges with traditional programming is endless.

To overcome these challenges, some computer scientists envisioned a new way of programming that would allow developers to create more flexible and adaptable software programs.

The thought process behind the new programming paradigm was inspired by real-world entities. They have certain properties and behaviors, and can interact with one another.

Thus, scientists believed that programming could be modeled in a similar fashion.

And this is how OOP was born in the late 70s.

As the name suggests, Object-Oriented Programming (OOP) is a programming paradigm/technique based on the concepts of “objects.” That is why the name “object-oriented.”

This is in contrast to traditional programming where methods are executed in sequence.

The core idea of OOP is to model real-world objects and their interactions in a program.

Thus, each object is treated as a separate entity with different values for the properties.

For example, you can model a `car` object with properties such as `color`, `speed`, and `model`, and methods such as `start`, `stop`, and `accelerate`.

In response to the challenges posed by traditional programming, OOP was designed as a new programming paradigm.

Its primary aim was to improve code organization, reusability, maintainability, and scalability.

OOP introduced objects and classes (discussed below), which made it easier to organize code into manageable and reusable components.

With this organization, it was easier to understand the structure of a program, shorten the development lifecycle, and eliminate bugs and errors.

In addition to this, OOP also introduced some crucial concepts, such as inheritance. This provided a way to create relationships between classes.

Moreover, OOP allowed programmers to easily extend existing components and build complex systems. This drastically reduced the amount of code needed to be written again from scratch.

Next, let’s discuss some basic terminologies around classes and how they are defined in Python.

OOP is defined around some core terminologies and concepts. Let’s understand them below:

A class is a blueprint (or template) for instantiating objects. Recall the `Car` example discussed above. In that context, we can call `Car` a class.

Objects created from the same class will have the same properties and behaviors. However, the values for the properties may differ.

For instance, all car objects created from the `Car` class will have the same property names, such as `cost`, `color`, `speed`, `model`, etc. However, the values for these properties may differ, as shown below:

In Python, we define a class using the `class` keyword, followed by its name:
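For example:

```python
# Defining a (for now empty) class with the `class` keyword:
class Car:
    pass
```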

Inside a class, we define the attributes and methods that its object will have.

The variables defined within a class that store information about an object are called attributes. They can also be thought of as an object’s properties.

Attributes are of two types:

**2.1) Instance-level attributes**

As the name suggests, these attributes are unique to each instance of a class. In other words, every time we create an object, each object gets its copy of instance-level attributes.

For example, in the `Car` example above, the variables `cost`, `color`, `speed`, and `model` are instance-level attributes.

Also, these attributes usually have different values for each instance.

In Python, we assign the instance-level attributes in the `__init__` method of a class. As the name suggests, “init” lets us initialize an object.

As `__init__` is like any other Python function, we can also pass a bunch of parameters to it.

The first parameter of any method in a Python class is always the `self` keyword. It is a special variable that serves as a reference to the calling object. The `self` parameter is followed by the other parameters, as shown below:
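For example, using the attributes from the `Car` example:

```python
class Car:
    def __init__(self, color, cost, speed, model):
        # `self` refers to the object being created; the remaining
        # parameters are assigned to instance-level attributes.
        self.color = color
        self.cost = cost
        self.speed = speed
        self.model = model
```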

The parameters specified in the `__init__` method are assigned to the corresponding instance-level attributes.

**2.2) Class-level attributes**

In contrast to the instance-level attributes, which are unique to each object, these attributes are shared by all the instances of a class.

In other words, they are associated with the class itself, but not with any specific instance of the class. Therefore, they are defined outside the `__init__` method.

For example, you can have a class-level attribute for the `number_of_wheels` of a car. This attribute would have the same value for all objects created from the car class, regardless of the `color`, `cost`, or any other instance-level attributes.

In Python, class-level attributes are defined as follows:
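For example:

```python
class Car:
    # Class-level attribute: shared by all Car objects.
    number_of_wheels = 4

    def __init__(self, color):
        # Instance-level attribute: unique to each object.
        self.color = color
```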

To access the class-level attributes, we can do the following:
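For example (re-declaring a minimal `Car` here so the snippet is self-contained):

```python
class Car:
    number_of_wheels = 4  # class-level attribute

# Access through the class itself:
print(Car.number_of_wheels)    # 4
# ...or through any instance:
print(Car().number_of_wheels)  # 4
```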

The functions defined within the scope of a class are called methods.

They operate on the attributes of an object and are typically used to manipulate or retrieve the values stored in an object’s attributes.

An independent definition created anywhere in the program is called a “function”. However, a function is called a “method” when defined inside a class.

Like the `__init__` method, class methods are also defined within the scope of the class. Also, their first parameter is always the `self` keyword.

For instance, let’s define a `change_speed` method.
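For example:

```python
class Car:
    def __init__(self, speed):
        self.speed = speed

    def change_speed(self, new_speed):
        # Assign the new value to the instance-level attribute `speed`.
        self.speed = new_speed
```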

This method accepts the `new_speed` and assigns it to the instance-level attribute `speed`. As shown above, we reference any instance-level attribute through the `self` keyword using dot notation.

Note that a class method may or may not receive any parameters other than `self`. For instance, if we were to define a `stop_car` method, it is not necessary to pass the new speed as a parameter.
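For example:

```python
class Car:
    def __init__(self, speed):
        self.speed = speed

    def stop_car(self):
        # No parameter besides `self` is needed here.
        self.speed = 0
```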

Whenever we create an instance of a class, it is called an object.

Considering the `Car` class defined above, we can create a new object as follows:
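For example (using a minimal `Car` with `color` and `speed`):

```python
class Car:
    def __init__(self, color, speed):
        self.color = color
        self.speed = speed

# Instantiating the class; this invokes __init__.
my_car = Car("red", 0)
```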

The above statement invokes the `__init__` method defined in the class. As a result, the arguments get assigned to the respective instance-level attributes defined in the class.

Also, the object can access the attributes and methods defined in the class. These can be accessed using the dot notation, as shown below:
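For example:

```python
class Car:
    def __init__(self, color, speed):
        self.color = color
        self.speed = speed

    def change_speed(self, new_speed):
        self.speed = new_speed

my_car = Car("red", 0)
print(my_car.color)       # access an attribute: prints "red"
my_car.change_speed(20)   # call a method; `self` is passed implicitly
print(my_car.speed)       # prints 20
```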

When we call a class method, we don’t pass any value for the `self` parameter. For instance, even though the definition of `change_speed()` has two parameters, `self` and `new_speed`, the `self` parameter is automatically supplied by Python from the calling object.

Therefore, while calling the method, we only passed one value (`20`), which corresponds to the `new_speed`.

Whenever we define a new class, it creates a new datatype. For instance, if we use Python’s `type()` function and pass it the object `my_car`, we get the following:
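For example:

```python
class Car:
    pass

my_car = Car()
print(type(my_car))   # e.g. <class '__main__.Car'>
```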

This indicates that the `my_car` object is of type `Car`.

Thus, in addition to making your code more organized, readable and manageable, classes also offer a mechanism to create a user-defined datatype.

Next, let’s look at some of the fundamental magic methods in Python OOP and how they are used.

In Python, magic methods are methods that have double underscores at the beginning and end of their names, such as `__init__`, which we discussed above.

Do you know one of the biggest hurdles data science and machine learning teams face?

It is transitioning their data-driven pipeline from Jupyter Notebooks to an executable, reproducible, error-free, and organized pipeline.

And this is not something data scientists are particularly fond of doing.

Yet, this is an immensely critical skill that many overlook.

Machine learning deserves the rigor of any software engineering field. Training codes should always be reusable, modular, scalable, testable, maintainable, and well-documented.

To help you develop that critical skill, I'm excited to bring you a special guest post by Damien Benveniste. He is the author of The AiEdge newsletter and was a Machine Learning Tech Lead at Meta.

In today’s machine learning deep dive, he will provide a detailed guide on structuring code for machine learning development, one of the most critical yet overlooked skills by many data scientists.

I personally learned a lot from this one and I am sure you will learn a lot too.

Let’s begin!

I have always believed that machine learning deserves the rigor of any software engineering field. Training codes should be reusable, modular, scalable, testable, maintainable, and well-documented.

Today, I want to show you my template to develop quality code for machine learning development.

More specifically, we will look at:

- What does coding mean?
- Designing:
- System design
- Deployment process
- Class diagram

- The code structure:
- Directory structure
- Setting up the virtual environment
- The code skeleton
- The applications
- Implementing the training pipeline
- Saving the model binary

- Improving the code readability:
- Docstrings
- Type hinting

- Packaging the project

I often see many Data Scientists or Machine Learning Engineers developing in Jupyter notebooks, copy-pasting their code from one place to another, which gives me nightmares!

When running ML experiments, Jupyter is prone to human errors as different cells can be run in different orders. Yet, ideally, you should be able to capture all the configurations of an experiment to ensure reproducibility.

No doubt, Jupyter can be used to call a training package or an API and manually orchestrate experiments, but fully developing in Jupyter is an extremely risky practice.

For instance, when training a model, you should ensure the data is passed through the exact feature processing pipelines at serving (inference) time. This means using the same classes, methods, and identical versions of packages and hardware (GPU vs. CPU).

Personally, I prefer prototyping in Jupyter but developing in Pycharm or VSCode.

When programming, focus on the following aspects:

**Reusability:**
- It is the capacity to reuse code in another context or project without significant modifications.
- Code reusability can be achieved in several ways, such as through libraries, frameworks, modules, and object-oriented programming techniques.
- In addition, good documentation and clear code organization also facilitate code reuse by making it easier for other developers to understand and use the code.

**Modularity:**
- It is the practice of breaking down a software system into smaller, independent modules or components that can be developed, tested, and maintained separately.

**Scalability:**
- It refers to the ability of a software development codebase to accommodate the growth and evolution of a software system over time. In other words, it refers to the ability of the codebase to adapt to changing requirements, features, and functionalities while maintaining its overall structure, quality, and performance.
- To achieve codebase scalability, it is important to establish clear coding standards and practices from the outset, such as using version control, code review, and continuous integration and deployment.
- In addition, it is important to prioritize code maintainability and readability, as well as the use of well-documented code and clear naming conventions.

**Testability:**
- It refers to the ease with which software code can be tested to ensure that it meets the requirements and specifications of the software system.
- It can be achieved by designing code with testing in mind rather than treating testing as an afterthought. This can involve writing code that is modular, well-organized, and easy to understand and maintain, as well as using tools and techniques that support automated testing and continuous integration.

**Maintainability:**
- It refers to the ease with which software code can be modified, updated, and extended over time.

**Documentation:**
- It provides a means for developers, users, and other stakeholders to understand how the software system works, its features, and how to interact with it.

In Machine Learning, like any engineering domain, no line of code should be written until a proper design is established.

Having a design means that we can translate a business problem into a machine learning solution, **provided ML is indeed the right solution to the problem!**

For simplicity, let’s assume we want to build a mobile application where a user needs machine learning predictions displayed on the screen — personalized product recommendations, for instance.

The process workflow may appear as follows:

- The mobile application requests personalized predictions from the backend server.
- The backend server fetches predictions from a database.
- We figured that daily batch predictions were the most appropriate setup for now, and the machine learning service updates the predictions daily.

This process is depicted in the image below:

Before we can understand how to develop our model, we need to understand how we will deploy it. Let’s assume that, for our purposes, an inference application will be containerized in a Docker container.

The container can be deployed in a container registry such as AWS ECR (Amazon Elastic Container Registry) or Docker Hub. We can have an orchestration system such as Airflow that spins up the inference service, pulls the container from the registry, and runs the inference application.

Now that we know what we need to build and how it will be deployed, how we need to structure our codebase is becoming much clearer.

More specifically, we shall build two applications:

- An inference application.
- A training application.

To minimize potential human errors, it is imperative that the modules used at training time are the same as the ones used at inference time.

Let’s look at the following class diagram:

**The application layer:**
- This part of the code captures the application’s logic. Think about these modules as “buttons” that start the inference or training processes.
- We will have a `run()` function for each of those applications that will serve as handles for the Docker image to start those individual processes.

**The data layer:**
- This is the abstraction layer that moves data in and out of the applications. I am calling it the “data” layer, but I am including anything that needs to go into the outside world, like the data, the model binaries, the data transformer, the training metadata, etc.
- In this batch use case, we just need a function that brings the data into the applications, `get_data()`, and another that puts predictions back into the database, `put_data()`.
- The `DataConnector` class moves data around.
- The `ObjectConnector` is responsible for transferring model binaries and data transformation pipelines using `get_object()` and `put_object()`.

**The machine learning layer:** This is the module where all the different machine learning components will live. The three components of model training are:

- *Learning the parameters of the model*: the `Model` class takes care of that with its `fit()` method. For inference, we use the `predict()` method.
- *Learning the feature transformations*: we may need to normalize features, perform Box-Cox transformations, one-hot encode, etc. The `DataProcessor` takes care of that with its `fit()` and `transform()` methods.
- *Learning the hyperparameters of the model and data pipeline*: the `CrossValidator` handles this task with its `fit()` function.

The `TrainingPipeline` handles the logic between the different components.

Now that we have a class diagram, we must map it into actual code. Let’s call the project `machine_learning_service`.

Of course, there are many ways to do it, but we will organize the project as follows:

- **The “docs” folder:** for the documents.
- **The “src” folder:** for the source code (or the actual codebase).
- **The “tests” folder:** for the unit tests.

Going ahead, we assume we will Dockerize this project at some point. Thus, controlling the Python version and packages we use locally is crucial.

To do that, we will create a virtual environment called `env` using `venv`, which ships with the Python standard library and lets us create virtual environments effortlessly.

First, within the project folder, we run the following command to create a virtual environment:
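With the standard library’s `venv` module, that command is:

```shell
python3 -m venv env
```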

Next, we activate it as follows:
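The activation step (the creation command is repeated here so the snippet is self-contained; re-running it on an existing `env` is harmless):

```shell
python3 -m venv env          # skip if already created
source env/bin/activate
```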

Once done, we should see the following directory structure:

In the current directory, let’s check which Python binary is being picked up, so that we are sure to use the virtual environment’s binaries. We do this using the `which` command, as demonstrated below:
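For example:

```shell
which python3   # inside the activated env, this points into env/bin
```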

Next, let’s make sure that the Python version is Python 3:
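For example (inside the activated environment, plain `python` resolves to the same binary):

```shell
python3 --version   # expect: Python 3.x
```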

Okay, we are good to go!

Within the source folder, let’s create the different modules we have in the class diagram:
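One possible layout, with assumed file names mapping one module per class in the diagram:

```shell
# Assumed file names -- one module per class in the class diagram
mkdir -p src
touch src/model.py src/data_processor.py src/cross_validator.py \
      src/training_pipeline.py src/data_connector.py src/object_connector.py
```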

For now, let’s have empty classes.

- The `Model` class: this will be responsible for training the model on new data and predicting on unseen data:
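A possible skeleton (the method signatures are assumptions based on the class diagram):

```python
class Model:
    """Learns the parameters of the model."""

    def fit(self, X, y):
        # To be implemented: train the model on new data.
        ...

    def predict(self, X):
        # To be implemented: predict on unseen data.
        ...
```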

- The `DataProcessor` class: this handles the processing needed before the data is fed to the ML model, such as normalization, transformation, etc.
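A matching skeleton (again, the method signatures are assumptions from the class diagram):

```python
class DataProcessor:
    """Learns and applies the feature transformations."""

    def fit(self, X):
        # To be implemented: learn the transformations from the data.
        ...

    def transform(self, X):
        # To be implemented: apply the learned transformations.
        ...
```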

In a recent article, we devised the entire principal component analysis (PCA) algorithm from scratch.

We saw how projecting the data using the eigenvectors of the covariance matrix naturally emerged from the PCA optimization step.

Moving on, we discussed some of its significant limitations.

**Let’s look at them again!**

Many use PCA as a data visualization technique. This is done by projecting the given data into two dimensions and visualizing it.

While this may appear like a fair thing to do, there’s a big problem here that often gets overlooked.

**As we discussed in the PCA article**, after applying PCA, each new feature captures a fraction of the original variance.

**This means that two-dimensional visualization will only be helpful if the first two principal components collectively capture most of the original data variance, as shown below:**

If not, the two-dimensional visualization will be highly misleading and incorrect. This is because the first two components don’t capture most of the original variance well. This is depicted below:

Thus, using PCA for 2D visualizations is only recommended if the cumulative explained variance plot suggests so. If not, one should refrain from using PCA for 2D visualization.

PCA has two main steps:

- Find the eigenvectors and eigenvalues of the covariance matrix.
- Use the eigenvectors to project the data to another space.

👉

Why eigenvectors, you might be wondering? We discussed the origin of eigenvectors in detail in the PCA article. It is recommended to read that article before reading this article.

Projecting the data using eigenvectors creates uncorrelated features.

Nonetheless, the new features created by PCA $(x_0^{'}, x_1^{'}, x_2^{'})$ are always a linear combination of the original features $(x_0, x_1, x_2)$.

This is depicted below:

As shown above, every new feature in $X_{projected}$ is a linear combination of the features in $X$.

We can also prove this experimentally.

As depicted above, we have a linearly inseparable dataset. Next, we apply PCA and reduce the dimensionality to $1$. The dataset remains linearly inseparable.

On the flip side, if we consider a linearly separable dataset, apply PCA, and reduce the dimensions, we notice that the dataset remains linearly separable:

This proves that PCA is a linear dimensionality reduction technique.

However, not all real-world datasets are linear. In such cases, PCA will underperform.

As discussed above, PCA’s primary objective is to capture the overall (or global) data variance.

In other words, PCA aims to find the orthogonal axes along which the **entire dataset** exhibits the most variability.

Thus, during this process, it does not pay much attention to the local relationships between data points.

**It inherently assumes that the global patterns are sufficient to represent the overall data variance.**

This is demonstrated below:

Because it primarily emphasizes the global structure, PCA is not ideal for visualizing complex datasets where the underlying structure might rely on local relationships or pairwise similarities.

In cases where the data is nonlinear and contains intricate clusters or groups, PCA can fall short of preserving these finer details, as shown in another illustration below:

As depicted above, data points from different clusters may overlap when projected onto the principal components.

This leads to a loss of information about intricate relationships within and between clusters.

So far, we understand what’s specifically lacking in PCA.

While the overall approach is indeed promising, its limitations make it impractical for many real-world datasets.

**t-distributed stochastic neighbor embedding (t-SNE)** is a powerful dimensionality reduction technique **mainly used to visualize high-dimensional datasets** by projecting them into a lower-dimensional space (typically 2-D).

As we will see shortly, t-SNE addresses each of the above-mentioned limitations of PCA:

- It is well-suited for visualization.
- It works well for linearly inseparable datasets.
- It focuses beyond just capturing the global data relationships.

t-SNE is an improvement to the **Stochastic Neighbor Embedding** (SNE) algorithm. It is observed that in comparison to SNE, t-SNE is much easier to optimize.

💡

t-SNE and SNE are two different techniques, both proposed by one common author — Geoffrey Hinton.

So, before getting into the technical details of t-SNE, let’s spend some time understanding the SNE algorithm instead.

To begin, we are given a high-dimensional dataset, which is difficult to visualize.

That is, its dimensionality is higher than $3$ (typically, much higher than $3$).

The objective is to project it to a lower dimension (say, $2$ or $3$), such that the lower-dimensional representation preserves as much of the local and global structure in the original dataset as possible.

Let’s understand!

Imagine this is our high-dimensional dataset:

💡

Of course, the above dataset is not a high-dimensional dataset. But for the sake of simplicity, let’s assume that it is.

Local structure, as the name suggests, refers to the arrangement of data points that are close to each other in the high-dimensional space.

Thus, preserving the local structure would mean that:

- Red points should stay closer to other red points.
- Blue points should stay closer to other blue points.
- Green points should stay closer to other green points.

So, is preserving the local structure sufficient?

**Absolutely not!**

If we were to focus solely on preserving the local structure, it may lead to a situation where blue points indeed stay closer to each other, but they overlap with the red points, as shown below:

This is not desirable.

Instead, we also want the low-dimensional projections to capture the global structure.

Thus, preserving the global structure would mean that:

- The red cluster is well separated from the other cluster.
- The blue cluster is well separated from the other cluster.
- The green cluster is well separated from the other cluster.

To summarize:

- Preserving the local structure means maintaining the relationships among nearby data points within each cluster.
- Preserving the global structure involves maintaining the broader trends and relationships that apply across all clusters.

Let’s stick to understanding this as visually as possible.

Consider the above high-dimensional dataset again.

Euclidean distance is a good measure to know if two points are close to each other or not.

For instance, in the figure above, it is easy to see that `Point A` and `Point B` are close to each other, but `Point C` is much farther from `Point A`.

Thus, the first step of the **SNE algorithm** is to convert these high-dimensional Euclidean distances between data points into conditional probabilities that represent similarities.

For better understanding, consider a specific data point in the above dataset:

Here, points that are closer to the marked point will have smaller Euclidean distances, while other points that are far away will have larger Euclidean distances.

Thus, as mentioned above, for every data point $(i)$, the **SNE algorithm** first converts these high-dimensional Euclidean distances into conditional probabilities $p_{j|i}$.

Here, $p_{j|i}$ represents the conditional probability that a point $x_i$ will pick another point $x_j$ as its neighbor.

This conditional probability is assumed to be proportional to the probability density of a Gaussian centered at $x_i$.

This makes intuitive sense as well. To elaborate further, consider a Gaussian centered at $x_i$.

It is evident from the above Gaussian distribution centered at $x_i$ that:

- For points near $x_i$, $p_{j|i}$ will be relatively high.
- For points far from $x_i$, $p_{j|i}$ will be small.

So, to summarize, for a data point $x_i$, we convert its Euclidean distances to all other points $x_j$ into conditional probabilities $p_{j|i}$.

This conditional probability is assumed to be proportional to the probability density of a Gaussian centered at $x_i$.

Also, as we are **only interested in modeling pairwise similarities**, we set the value of $p_{i|i}=0$. In other words, a point cannot be its own neighbor.

💡

A Gaussian for a point $x_i$ will be parameterized by two parameters: $(\mu_i, \sigma_i^2)$. As the x-axis of the Gaussian measures Euclidean distance, the mean $(\mu_i)$ is zero. What about $\sigma_i^2$? We’ll get back to it shortly. Until then, just assume that we have somehow figured out the ideal value $\sigma_i^2$.

Based on what we have discussed so far, our conditional probabilities $p_{j|i}$ may be calculated using a Gaussian probability density function as follows:
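Concretely, the (unnormalized) similarity is:

$$p_{j|i} \propto \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma_i^2}\right)$$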

**However, there’s a problem here.**

Yet again, let’s understand this visually.

Earlier, we considered the data points in the same cluster to be closely packed.

Thus, the resultant conditional probabilities were also high:

However, the data points of a cluster might be far from other clusters. Yet, they themselves can be a bit more scattered, as depicted below:

If we were to determine the resultant conditional probabilities $p_{j|i}$ for the marked data point $x_i$, we will get:

In this case, even though the data points belong to the same cluster, the conditional probabilities are much smaller than what we had earlier.

We need to fix this, and a common way to do this is by normalizing the individual conditional probability between $(x_i, x_j) \rightarrow p_{j|i}$ by the sum of all conditional probabilities $p_{k|i}$.

Thus, we can now estimate the final conditional probability $p_{j|i}$ as follows:
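That is:

$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$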

- The numerator is the conditional probability between $(x_i, x_j) \rightarrow p_{j|i}$.
- Each of the terms inside the summation in the denominator is the conditional probability between $(x_i, x_k) \rightarrow p_{k|i}$.

To reiterate, we are **only interested in modeling pairwise similarities**. Thus, we set the value of $p_{i|i}=0$. In other words, a point cannot be its own neighbor.
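To make this concrete, here is a minimal NumPy sketch of the normalized conditional probabilities described above (the function name and vectorization choices are my own; the per-point bandwidths $\sigma_i$ are assumed to be known for now):

```python
import numpy as np

def conditional_probabilities(X, sigma):
    """Compute p_{j|i} for every pair of points in X.

    X:     (n, d) array of high-dimensional points.
    sigma: (n,) array, one bandwidth per point (assumed known here).
    """
    # Pairwise squared Euclidean distances ||x_i - x_j||^2
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Unnormalized Gaussian densities centered at each x_i
    P = np.exp(-sq_dists / (2 * sigma[:, None] ** 2))
    np.fill_diagonal(P, 0.0)           # a point is not its own neighbor: p_{i|i} = 0
    P /= P.sum(axis=1, keepdims=True)  # normalize each row so that sum_j p_{j|i} = 1
    return P
```

Each row of the returned matrix is one conditional distribution $p_{\cdot|i}$, so each row sums to one and nearby points receive higher probability than distant ones.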

Recall the objective again.

We intend to project the given high-dimensional dataset to lower dimensions (say, $2$ or $3$).

Thus, for every data point $x_i \in \mathbb{R}^n$, we define its counterpart $y_i \in \mathbb{R}^2$:

💡

$y_i$ does not necessarily have to be in two dimensions. Here, we have defined $y_i \in \mathbb{R}^2$ just to emphasize that $n$ is larger than $2$.

Next, we use a similar notion to compute the pairwise conditional probability using a Gaussian, which we denote as $q_{j|i}$ in the low-dimensional space.

Furthermore, to simplify calculations, we fix the standard deviation of the Gaussians used for $q_{j|i}$ to $\frac{1}{\sqrt{2}}$ (i.e., variance $\frac{1}{2}$), so that $2\sigma^2 = 1$ and the exponent reduces to $-\lVert y_i - y_j \rVert^2$.

As a result, we can denote $q_{j|i}$ as follows:

Yet again, as we are **only interested in modeling pairwise similarities**, we set the value of $q_{i|i}=0$.
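A matching NumPy sketch for the low-dimensional conditionals (names are mine; the variance is fixed here so that $2\sigma^2 = 1$, which is my reading of the simplification above):

```python
import numpy as np

def low_dim_conditionals(Y):
    """Compute q_{j|i} for the low-dimensional map points Y of shape (n, 2).

    With the variance fixed (sigma^2 = 1/2 here, so 2*sigma^2 = 1),
    the exponent reduces to -||y_i - y_j||^2.
    """
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    Q = np.exp(-sq_dists)
    np.fill_diagonal(Q, 0.0)           # q_{i|i} = 0
    Q /= Q.sum(axis=1, keepdims=True)  # each row sums to 1
    return Q
```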

The objective is to have conditional probabilities in the low-dimensional space $(q_{j|i})$ identical to the conditional probabilities in the high-dimensional space $(p_{j|i})$.

Thus, if the projected data points $y_i$ and $y_j$ are to correctly model the similarity between the high-dimensional data points $x_i$ and $x_j$, the conditional probabilities $q_{j|i}$ and $p_{j|i}$ must be nearly equal.

**This hints that we must minimize the difference between the two conditional probabilities — $q_{j|i}$ and $p_{j|i}$.**

One of the most common and popular ways to quantify the difference between two probability distributions is KL Divergence.

As we have covered this previously, we won’t get into much detail:

...but here’s a quick overview of what it does.

The core idea behind KL divergence is to assess how much information is lost when one distribution is used to approximate another.

Thus, the more information is lost, the higher the KL divergence, and the greater the dissimilarity between the two distributions.

KL divergence between two probability distributions $P(x)$ and $Q(x)$ is calculated as follows:

The formula for KL divergence can be read as follows:

The KL divergence $D_{KL}(P || Q)$ between two probability distributions $P$ and $Q$ is calculated by summing the above quantity over all possible outcomes $x$. Here:

- $P(x)$ represents the probability of outcome $x$ occurring according to distribution $P$.
- $Q(x)$ represents the probability of the same outcome occurring according to distribution $Q$.

It measures how much information is lost when using distribution $Q$ to approximate distribution $P$.

Imagine this. Say $P$ and $Q$ were identical. This should result in zero loss of information. Let’s verify this from the formula above.

If the probability distributions $P$ and $Q$ are identical, it means that for every $x$, $P(x) = Q(x)$. Thus,

Simplifying, we get:

Hence proved.
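The discrete form of this formula translates directly into code. A small sketch (the helper name and the $\epsilon$ guard against $\log 0$ are my own additions):

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)) for discrete distributions."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # eps guards against taking log of zero probabilities
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))

P = np.array([0.2, 0.5, 0.3])
# Identical distributions lose no information -> divergence is zero
assert abs(kl_divergence(P, P)) < 1e-9
```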

Let’s look at another illustration below:

Here, we have an observed distribution (the histogram in blue). The objective is to quantify whether it is more like a Gaussian distribution (left) or a Gamma distribution (right).

By measuring the KL divergence, we notice that the observed distribution resembles a Gamma distribution (right).

This is because the KL divergence between:

- The observed distribution and Gamma distribution is low.
- The observed distribution and Gaussian distribution is high.

Based on this notion, KL divergence between two conditional probability distributions – $q_{j|i}$ and $p_{j|i}$ can potentially be a loss function in our problem.

The goal will be to find the optimal projected points $y_i$ (these are like model parameters), such that the KL divergence between the two conditional probability distributions – $q_{j|i}$ and $p_{j|i}$ – is minimized.

Thus, once we have formulated a KL divergence-based loss function, we can minimize it with respect to the points $y_i$. Further, we can use gradient descent to update the points $y_i$.

Let’s do it.

The cost function $C$ of the **SNE algorithm** is given by:

Here:

- $P_i$ denotes the conditional probability distribution over all other data points given datapoint $x_i$.
- $Q_i$ denotes the conditional probability distribution over all other map points given map point $y_i$.

For every pair of points $(i, j)$, we have the following notations:

- $(x_i, x_j) \rightarrow$ the high-dimensional data points $(x_i, x_j)$.
- $(y_i, y_j) \rightarrow$ the low-dimensional mapping of $(x_i, x_j)$.
- $p_{j|i} \rightarrow$ the probability density of the Euclidean distance to $x_j$ when $x_i$ is the center point.
- $q_{j|i} \rightarrow$ the probability density of the Euclidean distance to $y_j$ when $y_i$ is the center point.

In a gist, the rightmost formulation denotes the pairwise similarities we intend to capture. That is why we get two summations.

We can further simplify the above cost function $C$ as follows:

The first term will always be a constant. For a given dataset $X$, the pairwise probability densities can never change.

Also, this term is independent of $q_{j|i}$ (or $y_i \rightarrow$ the model parameters). Thus, we can safely ignore the first term.

Now, the loss function looks more like a cross-entropy loss function with an additive constant.

Finding the gradient with respect to $y_i$, we get the following:

👉

Although the above gradient expression is quite simple, it is a bit complicated to derive. For those who are interested, I have provided a derivation at the end of the article.

Assuming $\gamma$ to be the set of all mapped points in the low-dimensional space:

We can update the parameters as follows:
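As a sketch of this update, using the gradient expression shown above (plain gradient descent; the original SNE formulation also adds a momentum term, which I leave out here, and the function names are my own):

```python
import numpy as np

def sne_gradient(P, Q, Y):
    """Gradient of the SNE cost w.r.t. each map point y_i:
    dC/dy_i = 2 * sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}) (y_i - y_j).
    P and Q are (n, n) matrices with P[i, j] = p_{j|i} and Q[i, j] = q_{j|i}."""
    coeff = P - Q + P.T - Q.T                 # (n, n) mismatch terms
    diffs = Y[:, None, :] - Y[None, :, :]     # (n, n, 2): y_i - y_j
    return 2.0 * np.sum(coeff[:, :, None] * diffs, axis=1)

def gradient_step(P, Q, Y, lr=0.1):
    """One plain gradient-descent update of the map points."""
    return Y - lr * sne_gradient(P, Q, Y)
```

Note that when $Q$ already matches $P$ exactly, the gradient vanishes and the map points stop moving, exactly as we would want.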

We are yet to discuss the variance computation of the probability density function $p_{j|i}$ we defined for the high-dimensional data points:

$\sigma_i^2$ denotes the variance of the Gaussian centered over each high-dimensional datapoint $x_i$.

It is unlikely that we can specify a single value of $\sigma_i^2$ for all data points. This is evident from the figure below:

The density of the data will vary across all data points.

Thus, in dense regions, we need a small value of $\sigma_i^2$, but in sparse regions, we need a larger value of $\sigma_i^2$.

While finding the absolute best value of $\sigma_i^2$ for every data point $x_i$ appears to be a bit difficult, we can still try.

This is where we introduce a **user-specified hyperparameter** called `perplexity`.
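To sketch how perplexity pins down $\sigma_i$: perplexity is defined as $2^{H(P_i)}$, where $H(P_i)$ is the Shannon entropy of the conditional distribution $p_{\cdot|i}$, and since it increases monotonically with $\sigma_i$, a binary search recovers the matching bandwidth. A hedged sketch (function name, search bounds, and tolerances are my own):

```python
import numpy as np

def sigma_for_perplexity(sq_dists_i, target_perplexity, tol=1e-5, max_iter=50):
    """Binary-search the bandwidth sigma_i so that the perplexity of the
    conditional distribution p_{.|i}, defined as 2**H(P_i), matches the target.
    sq_dists_i: squared distances from x_i to every other point (self excluded)."""
    lo, hi = 1e-10, 1e10
    for _ in range(max_iter):
        sigma = (lo + hi) / 2
        p = np.exp(-sq_dists_i / (2 * sigma**2))
        p /= p.sum()
        entropy = -np.sum(p * np.log2(p + 1e-12))  # Shannon entropy in bits
        perp = 2.0 ** entropy
        if abs(perp - target_perplexity) < tol:
            break
        if perp > target_perplexity:
            hi = sigma  # distribution too flat -> shrink sigma
        else:
            lo = sigma  # distribution too peaked -> grow sigma
    return sigma
```

Intuitively, a larger target perplexity asks every point to treat more of its neighbors as "close," which forces a larger $\sigma_i$ in sparse regions and a smaller one in dense regions.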

**Dimensionality reduction** is crucial to gain insight into the underlying structure of high-dimensional datasets.

One technique that typically stands out in this respect is the **Principal Component Analysis (PCA)**.

In a gist, the core objective of PCA is to transform the data $X$ into another space such that new features are uncorrelated and preserve the variance of the original data.

While PCA is widely popular, I have noticed that most folks struggle to design its true end-to-end formulation from scratch.

Many try to explain PCA by relating it to the idea of eigenvectors and eigenvalues, which is true — PCA does perform transformations using eigenvectors and eigenvalues.

But why?

**There are two questions to ask here:**

- **How can we be sure that the data projection that involves eigenvectors and eigenvalues in PCA is the most obvious solution to proceed with?**
- **How can we be sure that the above transformation does preserve the data variance?**

In other words, where did this whole notion of eigenvectors and eigenvalues originate from in PCA?

I have always believed that it is not only important to be aware of techniques, but it is also crucial to know how these techniques are formulated end-to-end.

And the best way to truly understand any algorithm or methodology is by manually building it from scratch.

It is like replicating the exact thought process when someone first designed the algorithm by approaching the problem logically, with only mathematical tools.

Thus, in this dimensionality-reduction article series, we’ll dive deep into one of the most common techniques: **Principal Component Analysis (PCA)**.

We’ll look at:

- The intuition and the motivation behind dimensionality reduction.
- What are vector projections and how do they alter the mean and variance of the data?
- What is the optimization step of PCA?
- What are Lagrange Multipliers and how are they used in PCA optimization?
- What is the final solution obtained by PCA?
- How to determine the number of components in PCA?
- Why should we be careful when using PCA for visualization?
- What are the advantages and disadvantages of PCA?
- Takeaways.

Let’s begin!

Imagine someone gave you the following weight and height information about a few individuals:

It is easy to guess that the height column has more variation than weight.

Thus, even if you were to discard the **weight** column, chances are you could still identify these individuals solely based on their heights.

However, if you were to discard the **height** column, you would likely struggle to identify them.

Why?

Because their heights have more variations than their weights, and it’s clear from this example that, **typically**, if the data has more variation, it holds more information (this might not always be true, though, and we’ll learn more about this later).

**That’s the essence of many dimensionality reduction techniques.**

Simply put, we reduce dimensions by eliminating useless (or least useful) features.

Based on this explanation, one plausible and intuitive way might be to measure column-wise variance and eliminate the $k$ features with the least variance.

This is depicted below:
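This naive variance-based elimination can be sketched in a few lines of NumPy (the function name is mine; as the article discusses next, this is only a reasonable baseline when the features are uncorrelated):

```python
import numpy as np

def drop_low_variance(X, k):
    """Keep all but the k columns of X with the smallest variance."""
    variances = X.var(axis=0)
    # Indices of the (d - k) highest-variance columns, in original order
    keep = np.sort(np.argsort(variances)[k:])
    return X[:, keep]
```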

However, there’s a problem with this approach.

Almost all datasets have correlated features. Thus, in the case of high correlation, this puts a challenge on which feature we should eliminate or retain in the final dataset.

Removing one feature that is highly correlated with another may lead to an incoherent dataset, as both features can be equally important. As a result, it may lead to misleading conclusions.

**Thus, the above approach of eliminating features based on feature variance is only ideal if the features we begin with are entirely uncorrelated.**

So, can we do something to make the features in our original dataset uncorrelated?

Can we apply any transformation that does this?

Of course we can!

For instance, consider the above dummy dataset again where $x_1$ and $x_2$ were correlated:

As shown above, if we can represent the data in a new coordinate system $(x_1^{'}, x_2^{'})$, it is easy to see that there is almost no correlation between the new features.

What's more, the data varies mostly along the dimension $(x_1^{'})$.

As a result, in this new coordinate system, we can safely discard the dimension along which the original data has the least variance, i.e., $(x_2^{'})$.

Think of it this way. If we were to discard the dimension $(x_1^{'})$, we would get the following low-dimensional representation of the data:

It makes much more sense to discard $(x_2^{'})$ because the data varies less along that dimension.

So overall, these are the steps we need to follow:

**Step #1)** Develop a new coordinate system for our data such that the new features are uncorrelated.

**Step #2)** Find the variance along each of the new uncorrelated features.

**Step #3)** Discard the features with the least variance to get the final “dimensionally-reduced dataset.”

Here, it is easy to guess that steps $2$ and $3$ aren’t that big of a challenge. Once we have obtained the new uncorrelated features, the only objective is to find their variance and discard the features with the least variances.

Thus, the primary objective is to develop the new coordinate system, $(x_1^{'}, x_2^{'})$ in the above illustration.

In other words, we want to project the data to a low-dimensional space to preserve as much of the original variance as possible while effectively reducing the dimensions.

**As we’ll see shortly, this becomes more like an optimization problem.**

So, how do we figure out the best projection for our data?

As the name suggests, vector projection is about finding the component of one vector that lies in the direction of another vector.

For instance, consider a toy example:

Given a vector $X = (4, 5)$, its projections along the coordinate axes $\hat x_1$ and $\hat x_2$ are:

What we did was we projected the original vector along the direction of some other vector.

A little complicated version of this could be to project it along a vector that is different from the coordinate vectors, $u$ in the following figure:

For simplicity, let’s consider the vector $u$ to be a unit vector — one which has a unit length.

By the notion of cosine similarity, we know that the cosine of the angle between two vectors is given by:

Thus, we can write the length of the projection of $\vec{a}$ on $\vec{b}$ as:

Now, we know the magnitude of the projection. Next, we can do a scalar-vector multiplication of this magnitude with the unit vector in the desired direction ($\hat b$) to get the projection vector:

Substituting the value of $\cos(\theta)$ and simplifying, we get the final projection as:
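The whole derivation collapses into two lines of NumPy (the `project` helper is my own naming):

```python
import numpy as np

def project(a, b):
    """Project vector a onto the direction of vector b.
    proj_b(a) = (a . b_hat) * b_hat, where b_hat = b / ||b||."""
    b_hat = b / np.linalg.norm(b)
    return (a @ b_hat) * b_hat

a = np.array([4.0, 5.0])
print(project(a, np.array([1.0, 0.0])))  # component along the x1-axis -> [4. 0.]
```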

Until now, we considered just one data point for projection. But our dataset will indeed have many more data points.

The above projection formula will still hold.

In other words, we can individually project each of the vectors as depicted below:

However, the projection will alter the mean and variance of the individual features.

But it’s simple to figure out the new mean and variance after the projection.

Essentially, all data points are vectors in some high-dimensional space. Thus, the mean of the individual features will also be a vector in that same space $(\hat \mu)$.

As a result, it makes intuitive sense that the transformation to the individual data points will equally affect the mean vector as well.

With this in mind, we can get the projected mean vector as follows:

Similarly, we can also determine the variance of the individual features after projection as follows:

So, to summarize, if we are given some data points X as follows:

💡

Each data point can be thought of as a vector $x_i$.

If we want to project these vectors along some unit vector $\hat b$:

...then the projection for a specific data point $x_i$ is given by:

If the individual features had a mean vector $\vec{\mu}$, then the mean of the projections is given by:

What’s more, if the individual features had a variance vector $\vec{\sigma^2}$, then the variance of the projections is given by:

Finally, the idea of finding the variance of the projection can be extended to the covariance matrix of the projection as well.

In simple terms, while variance $\vec{\sigma^2}$ above denoted the individual feature variances, the covariance matrix holds the information about how two features vary together.

Say $\Sigma$ denotes the covariance matrix for the original features:

Then the covariance matrix of the projections $\Sigma_{proj}$ can be written as:
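We can verify this numerically. For a single unit direction $\hat b$, $\hat b^T \Sigma \hat b$ is a scalar equal to the variance of the scalar projections (the dummy data below is my own):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data via a linear mix of independent normals
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.7], [0.7, 1.0]])
b_hat = np.array([1.0, 1.0]) / np.sqrt(2)  # unit projection direction

proj = X @ b_hat                 # scalar projection of every point onto b_hat
Sigma = np.cov(X, rowvar=False)  # sample covariance matrix of the original features

# Variance of the projections equals b_hat^T Sigma b_hat
assert np.isclose(proj.var(ddof=1), b_hat @ Sigma @ b_hat)
```

The same identity applied to the mean holds exactly too: the mean of the projections equals the projected mean vector, since projection is a linear operation.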

With this, we are set to proceed with formulating PCA as an optimization problem.

In one of the earlier articles, we formulated the entire linear regression algorithm from scratch.

In that process, we saw the following:

- How the assumptions originate from the algorithmic formulation of linear regression.
- Why they should not be violated.
- What to do if they get violated.
- and much more.

Overall, the article highlighted the statistical essence of linear regression and under what assumptions it is theoretically formulated.

Linear regression models a linear relationship between the features $(X)$ and the true/observed output variable $(y)$.

The estimate $\hat y$ is written as:

Where:

- $y$ is the observed/true dependent variable.
- $\hat y$ is the modeled output.
- $X$ represents the independent variables (or features).
- $θ$ is the estimated coefficient of the model.
- $\epsilon$ is the random noise in the output variable. This accounts for the variability in the data that is not explained by the linear relationship.

The primary objective of linear regression is to estimate the values of $\theta=(\theta_1,\theta_2,⋯,\theta_n)$ that most closely estimate the observed dependent variable $y$.

Talking specifically about the $\epsilon$ – the random noise, we assume it to be drawn from a Gaussian:

Another way of formulating this is as follows:

The above says that the conditional distribution of $Y$ given $X$ is a Gaussian with:

- Mean = $θ^TX$ (dependent on X).
- Variance = $\sigma^2$ (independent of $X$, i.e., constant).

A graphical way of illustrating this is as follows:

The regression line models the mean of the Gaussian distributions, and the mean varies with $X$. Also, all Gaussians have an equal variance.

So, in a gist, with linear regression, we are trying to explain the dependent variable $Y$ as a function of $X$.

We are given $X$, and we have assumed a distribution for $Y$. $X$ will help us model the mean of the Gaussian.

**But it’s obvious to guess that the above formulation raises a limitation on the kind of data we can model with linear regression.**

More specifically, problems will arise when, just like the mean, the variance is also a function of $X$.

However, during the linear regression formulation, we explicitly assumed that the variance is constant and never depends on $X$.

We aren’t done yet.

There’s another limitation.

Linear regression assumes a very specific form for the mean. In other words, the mean should be the linear combination of features $X$, as depicted below:

Yet, there’s every chance that the above may not hold true, and instead, the mean is represented as follows:

So, to summarize, we have made three core assumptions here:

**#1)** If we consider the conditional distribution $P(Y|X)$, we assume it to be a Gaussian.

**#2)** $X$ affects only the mean, and that too in a very specific way, which is linear in the individual features $x_j$.

**#3)** Lastly, the variance is constant for the conditional distribution $P(Y|X)$ across all levels of $X$.

But nothing stops real-world datasets from violating these assumptions.

In many scenarios, the data might exhibit complex and nonlinear relationships, heteroscedasticity (varying variance), or even follow entirely different distributions altogether.

Thus, we need an approach that allows us to adapt our modeling techniques to accommodate these real-world complexities.

**Generalized linear models attempt to relax these things.**

More specifically, they consider the following:

- What if the distribution isn’t normal **but some other distribution from the exponential family?**
- What if $X$ has a more sophisticated relationship with the mean?
- What if the variance varies with $X$?

Let’s dive in!

As the name suggests, generalized linear models (GLMs) are a generalization of linear regression models.

Thus, the linear regression model is a specific case of GLMs.

They expand upon the idea of linear regression by accommodating a broader range of data distributions.

Unlike traditional linear regression, which assumes that the response variable follows a normal distribution and has a linear relationship with the predictors (as discussed above), GLMs can handle various response distributions – Binomial, Poisson, Gamma distributions, and more.

However, before proceeding ahead, it is essential to note that with GLMs, the core idea still revolves around building “**linear models**.” Thus, in GLMs, we never relax the linear formulation:

We’ll understand why we do this shortly.

In GLMs, the first thing we relax is the conditional distribution $P(Y|X)$, which we assumed to be a Gaussian earlier.

We change the normal distribution to some other distribution from the exponential family of distributions, such as:

**Why specifically the exponential family?**

Because, as we will see ahead, everything we typically do with the linear model with a Gaussian extends naturally to this family. This includes:

- The way we formulate the likelihood function
- The way we derive the maximum likelihood estimators, etc.

But the above formulation gets extremely convoluted when we go beyond these distributions.

This is because, as the name suggests, the probability density functions (PDFs) of distributions in the exponential family can be manipulated into an exponential representation.

This inherent structure simplifies the mathematical manipulations we typically do in maximum likelihood estimation (MLE):

**#1) Define the likelihood function for the entire dataset:** Here, we typically assume that the observations are independent. Thus, the likelihood function for the entire dataset is the product of the individual likelihoods.

**#2) Take the logarithm (the obtained function is called the log-likelihood):** To simplify calculations and avoid numerical issues, it is common to take the logarithm of the likelihood function. *This step gets simplified if the likelihood values have an exponential term, which the exponential family of distributions can be manipulated to possess.*

**#3) Maximize the log-likelihood:** Finally, the goal is to find the set of parameters $\theta$ that maximize the log-likelihood function.

More specifically, when we take the logarithm in the MLE process, it becomes easier to transform the product in the likelihood function into a summation. This can be further simplified with the likelihood involving exponential terms.

We also saw this in one of the earlier articles:

Lastly, the exponential family of distributions can explain many natural processes.

This family encapsulates many common data distributions encountered in real-world scenarios.

- Count data $\rightarrow$ Poisson distribution may help.
- Binary Outcomes $\rightarrow$ Bernoulli distribution.
- Continuous Positive-Valued Data $\rightarrow$ Gamma distribution.
- The time between events $\rightarrow$ try Exponential Distribution.

So overall, we alter the specification of the normal distribution to some other distribution from the exponential family.

There's another thing we change.

Earlier, we had the formulation that the mean is directly the linear combination of the features.

In GLMs, we extend it to the idea that some function of the mean is the linear combination of the features.

Here, $F$ is called the link function.

While the above notation of putting the link function onto the mean may not appear intuitive, it’s important to recognize that this approach serves a larger purpose within GLMs.

By putting the transformation on the mean instead, we keep the essence of a linear model intact while also accommodating diverse response distributions.

Understanding the purpose of the link function is also simple.

Imagine you had a Bernoulli random variable for the conditional distribution of $Y$ given $X$.

We know that its mean ($p$ or $\mu(x)$) will always lie between [0,1].

However, if we did not add a link function and instead preferred the following representation, the mean could take any real value, as $\theta^T \cdot X$ can span all real values.

What the link function does is it maps the mean ($\mu(x)$), which is constrained to some specific range, to all possible real values so that the compatibility between the linear model and the distribution’s mean is never broken.

That is why they are also called “**link**” functions. It is because they link the linear predictor $\theta^T \cdot X$ and the parameter for the probability distribution $\mu(x)$.
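As a concrete instance, the canonical link for a Bernoulli response is the logit, which maps $\mu \in (0, 1)$ to the whole real line; its inverse (the sigmoid) maps any linear predictor back into $(0, 1)$:

```python
import numpy as np

# For a Bernoulli response, the mean mu lies in (0, 1). The logit link maps it
# onto the whole real line, so it can be equated with the linear predictor theta^T x.
def logit(mu):
    return np.log(mu / (1 - mu))

def inverse_logit(eta):
    """Map any real linear predictor eta back into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-eta))

print(logit(0.5))  # 0.0 -> a 50% mean corresponds to a zero linear predictor
```

However large or small the linear predictor gets, the inverse link keeps the implied mean inside the valid range of the Bernoulli parameter.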

Thus, to summarize, there are three components in GLMs.

One of the major aspects of training any reliable ML model is avoiding **overfitting**.

In a gist, overfitting occurs when a model learns to perform exceptionally well on the training data.

This may happen because the model is trying too hard to capture all **unrelated and random noise** in our training dataset, as shown below:

In this context, noise refers to any random fluctuations or inconsistencies that might be present in the data.

While learning this noise leads to a lower train set error and lets your model capture more intricate patterns in the training set, it comes at the tremendous cost of poor generalization on unseen data.

One of the most common techniques used to avoid overfitting is **regularization**.

Simply put, the core objective of regularization is to penalize the model for its complexity.

And if you have taken any ML course or read any tutorials about this, the most common approach they teach is to add a penalty (or regularization) term to the cost function, as shown below:

In the above expressions:

- $y^{(i)}$ is the true value corresponding to sample $i$.
- $\hat y^{(i)}$ is the model’s prediction corresponding to sample $i$, which is dependent on the parameters $(\theta_1, \theta_2, \cdots, \theta_K)$.
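The regularized cost above (squared error plus an L2 penalty) can be sketched as follows (the function name and the `lam` argument for the regularization strength are my own):

```python
import numpy as np

def l2_regularized_loss(y, y_hat, theta, lam):
    """Squared-error loss plus an L2 penalty on the parameters.
    lam controls how strongly model complexity is penalized."""
    squared_error = np.sum((y - y_hat) ** 2)
    penalty = lam * np.sum(theta ** 2)  # sum of squared parameters
    return squared_error + penalty
```

Setting `lam = 0` recovers the plain unregularized loss; increasing it penalizes large parameter values more heavily.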

While we can indeed validate the effectiveness of regularization experimentally (as shown below), it’s worth questioning the origin of the regularization term **from a probabilistic perspective**.

More specifically:

- Where did this regularization term originate from?
- What does the regularization term precisely measure?
- What real-life analogy can we connect it to?
- Why do we add this regularization term to the loss?
- Why do we square the parameters (specific to L2 regularization)? Why not any other power?
- Is there any probabilistic evidence that justifies the effectiveness of regularization?

Turns out, there is a concrete probabilistic justification for regularization.

Yet, in my experience, most tutorials never bother to cover it, and readers are always expected to embrace these notions as a given.

Thus, this article intends to highlight the origin of regularization purely from a probabilistic perspective and provide a derivation that makes total logical and intuitive sense.

**Let’s begin!**

Before understanding the origin of the regularization term, it is immensely crucial to learn a common technique to model labeled data in machine learning.

It’s called the maximum likelihood estimation (MLE).

We covered this in the following article, but let’s do it again in more detail.

Essentially, whenever we model data, the model is instructed to maximize the likelihood of observing the given data $(X,y)$.

More formally, a model attempts to find a specific set of parameters $\theta$ (also called model weights), which maximizes the following function:

The above function $L$ is called the likelihood function, and in simple words, the above expression says that:

- maximize the likelihood of observing $y$
- given $X$
- when the prediction is parameterized by some parameters $\theta$ (also called weights)

When we begin modeling:

- We know $X$.
- We also know $y$.
- The only unknown is $\theta$, which we are trying to figure out.

Thus, the objective is to find the specific set of parameters $\theta$ that maximizes the likelihood of observing the data $(X,y)$.

This is commonly referred to as **maximum likelihood estimation (MLE)** in machine learning.

MLE is a method for estimating the parameters of a statistical model by maximizing the likelihood of the observed data.

It is a common approach for parameter estimation in various regression models.

Let’s understand it with the help of an example.

Imagine you walk into a kitchen and notice several broken eggshells on the floor.

Here’s a question: **“Which of these three events is more likely to have caused plenty of eggshells on the floor?”**

- Someone was experimenting with them.
- Someone was baking a cake.
- Someone was creating an art piece with them.

Which one do you think was more likely to have happened?

Let’s look at what could have led to eggshells on the floor with the highest likelihood.

The likelihood of:

- Conducting a science experiment that led to eggshells on the floor is MEDIUM.
- Baking a cake that led to eggshells on the floor is HIGH.
- Creating an art piece that led to eggshells on the floor is LOW.

Here, we’re going to go with the event with the highest likelihood, and that’s **“baking a cake.”**

Thus, we infer that the most likely thing that happened was that someone was baking a cake.

**What we did was that we maximized the conditional probability of the event, given an explanation.**

- The event is what we observed, i.e., eggshells on the floor.
- The “explanation,” as the name suggests, is the possible cause we came up with.

We did this because there’s a probability of:

- “Eggshells” given “Experiment”
- “Eggshells” given “Cake”
- “Eggshells” given “Art piece”

We picked the one with the highest conditional probability $P(event|explanation)$.

Simply put, we tried to find the scenario that most likely led to the eggshells on the floor. **This is called maximum likelihood.**

This is precisely what we do in machine learning at times. We have a bunch of data and several models that could have generated that data.

First, we estimate the probability of seeing the data given “Model 1”, “Model 2”, and so on. Next, we pick the model that most likely produced the data.

In other words, we’re maximizing the probability of the data given the model.

Next, let’s consider an example of using MLE for simple linear regression.

Consider we have the following data:

The candidate models that could have generated the above data are:

It’s easy to figure out that **Model 2** would have most likely generated the above data.

**But how does a line generate points?**

Let’s understand this specific to linear regression.

We also discussed this more from a technical perspective in the linear regression assumptions article, but let’s understand visually this time.

Consider the diagram below.

Say you have the line shown in Step 1, and you select data points on this line.

Recalling the data generation process from the above article, we discussed that linear regression generates data from a Gaussian distribution with equal variance.

Thus, in step 2, we create Gaussians with equal variance centered at those points.

Lastly, in step 3, we sample points from these Gaussians.

And that is how a line generates data points.
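The three steps above can be simulated in a few lines (the line's coefficients, noise level, and seed below are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(42)

# Step 1: pick points on a line y = theta0 + theta1 * x
theta0, theta1 = 1.0, 2.0
x = np.linspace(0, 10, 50)
mean_y = theta0 + theta1 * x

# Steps 2 & 3: center a Gaussian of equal variance at each point
# on the line and sample one observation from each
sigma = 1.5
y = rng.normal(loc=mean_y, scale=sigma)  # the observed data points
```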

Now, in reality, we never get to see the line that produced the data. Instead, it’s the data points that we observe.

**Thus, the objective boils down to finding the line that most likely produced these points.**

We can formulate this as a maximum likelihood estimation problem by maximizing the likelihood of generating the data points given a model (or line).

Let’s look at the likelihood of generating a specific point from a Gaussian:

Similarly, we can write the likelihood of generating the entire data as the product of individual likelihoods:

Substituting the likelihood values from the Gaussian distribution, we get:

Maximizing the above expression is the same as minimizing the squared sum term in the exponent.

And this is precisely the least squares error.

Linear regression finds the line that minimizes the sum of squared distances.

This proves that finding the line that most likely produces a point using maximum likelihood is exactly the same as minimizing the least square error using linear regression.
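A small sketch makes the equivalence tangible: the Gaussian log-likelihood is a constant minus the sum of squared errors scaled by $\frac{1}{2\sigma^2}$, so ranking candidate predictions by likelihood is the same as ranking them by squared error (the helper name is mine):

```python
import numpy as np

def gaussian_log_likelihood(y, y_hat, sigma=1.0):
    """Log-likelihood of the observations y under Gaussians of equal
    variance centered on the line's predictions y_hat."""
    n = len(y)
    # Constant term (independent of y_hat) minus the scaled squared error
    return (-n / 2 * np.log(2 * np.pi * sigma**2)
            - np.sum((y - y_hat) ** 2) / (2 * sigma**2))
```

Since the first term does not depend on the predictions, maximizing this quantity over candidate lines is exactly minimizing the sum of squared errors.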

So far, we have learned how we estimate the parameters using maximum likelihood estimation.

We also looked at the example of eggshells, and it helped us intuitively understand what MLE is all about.

Let’s recall it.

This time, let’s replace “Art piece” with an “Eggshell throwing contest.”

So yet again, we have three possibilities that may have led to the observed event.

- Someone was experimenting with them.
- Someone was baking a cake.
- Someone was playing an Eggshell throwing contest.

Which one do you think was more likely to have happened?

It’s obvious that baking a cake will still lead to eggshells on the floor with a high probability.

But an eggshell-throwing contest is practically certain to create eggshells on the floor.

**However, something tells us that baking a cake is still more likely to have happened and not the contest, isn’t it?**

One reason is that an eggshell-throwing contest is not very likely – it’s an improbable event.

Thus, even though it’s more likely to have generated the evidence, it’s less likely to have happened in the first place. This means we should still declare “baking a cake” as the more likely event.

But what lets us formally conclude that baking a cake is still the more likely explanation?

**It’s regularization.**

In this context, regularization lets us add an extra penalty, which quantifies the probability of the event itself.

Even though our contest will undoubtedly produce the evidence, it is not very likely to have happened in the first place.

This lets us still overweigh “baking a cake” and select it as the most likely event.

This is the intuition behind **regularization**.
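In probabilistic terms, this intuition is just Bayes’ rule: the most likely event given the evidence weighs how well the event explains the evidence against how probable the event was to begin with:

```latex
\underbrace{P(\text{event} \mid \text{evidence})}_{\text{what we want}}
\;\propto\;
\underbrace{P(\text{evidence} \mid \text{event})}_{\text{likelihood}}
\;\cdot\;
\underbrace{P(\text{event})}_{\text{prior (the extra penalty)}}
```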

Now, at this point, we haven’t yet understood the origin of the regularization term, nor are we close to understanding its mathematical formulation.

Specifically talking about L2 regularization, there are a few questions that we need to answer:

- Where did the squared term come from?
- What does this term measure probabilistically?
- How can we be sure that the term we typically add is indeed the penalty term that makes the most probabilistic sense?

The same questions apply to L1 regularization as well.

- Where did the absolute sum of parameters come from?
- What does this term measure probabilistically?
- How can we be sure that the term we typically add is indeed the penalty term that makes the most probabilistic sense?

Let’s delve into the mathematics of regularization now.

KMeans is an unsupervised clustering algorithm that groups data based on distances. It is widely recognized for its simplicity and effectiveness as a clustering algorithm.

Essentially, the core idea is to partition a dataset into distinct clusters, with each point belonging to the cluster whose centroid is closest to it.

While its simplicity often makes it the most preferred clustering algorithm, KMeans comes with its own set of limitations that can restrict its performance in certain situations.

One of the primary limitations is its assumption of spherical clusters.

One intuitive and graphical way to understand the KMeans algorithm is to place a **circle** at the center of each cluster, which encloses the points.

💡

In 3 dimensions, circles can be replaced with spheres. In higher dimensions, they can be thought of as hyperspheres.

As KMeans is all about placing circles, its results aren’t ideal when the dataset has irregular shapes or varying sizes, as shown below:

Instead, an ideal clustering should cluster the data as follows:

This rigidity of KMeans, which can only model globular clusters, often leads to misclassification and suboptimal cluster assignments.
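As a quick illustration (the dataset and parameters are chosen for demonstration), the classic two-moons dataset has two obvious clusters that are anything but globular. KMeans, whose decision boundary is the perpendicular bisector of the two centroids, cuts straight through both moons:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two interleaving half-circles: clearly two clusters, but not spherical ones.
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agreement with the true moon labels (accounting for label swap).
# KMeans' straight boundary misassigns the tips of each moon.
acc = max(np.mean(labels == y_true), np.mean(labels != y_true))
print(f"agreement with true moon labels: {acc:.2f}")
```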

This limitation is somewhat connected to the one we discussed above.

Imagine we have the following dataset:

Clearly, the blue cluster has a larger spread.

Therefore, ideally, its influence should be larger as well.

However, when assigning a new data point to a cluster, KMeans only considers the distance to the centroid.

This means that the margin it creates for assigning new data points to either cluster lies equidistant from both centroids.

But considering the area of influence of the right cluster, having this margin more to the left makes more sense.

KMeans clustering performs hard assignments.

In simple words, this means that **a specific data point can belong to only one cluster**.

Thus, it does not provide probabilistic estimates of a given data point belonging to each possible cluster.

Although this might not be a problem per se, it limits the usefulness of KMeans in uncertainty estimation and in downstream applications that require a probabilistic interpretation.
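A small sketch makes the contrast concrete (toy blobs with assumed parameters): for a point midway between two overlapping clusters, KMeans returns a single all-or-nothing label, while scikit-learn's GaussianMixture reports a probability for each cluster:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Two overlapping toy blobs; a point midway between them is genuinely ambiguous.
X, _ = make_blobs(n_samples=300, centers=[[-1.5, 0], [1.5, 0]],
                  cluster_std=1.0, random_state=0)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

point = np.array([[0.0, 0.0]])     # roughly equidistant from both centers

hard = kmeans.predict(point)       # hard assignment: one label, no nuance
soft = gmm.predict_proba(point)    # soft assignment: a probability per cluster

print("hard assignment:", hard)
print("soft assignment:", soft.round(2))
```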

These limitations often make KMeans a non-ideal choice for clustering.

Therefore, learning about other better algorithms that we can use to address these limitations is extremely important.

**Therefore, in this article, we will learn about Gaussian mixture models.**

More specifically, we shall cover:

- Shortcomings of KMeans (already covered above)
- What is the motivation behind GMMs?
- How do GMMs work?
- The intuition behind GMMs.
- Plotting some dummy multivariate Gaussian distributions to better understand GMMs.
- The entire mathematical formulation of GMMs.
- How to use Expectation-Maximization to model data using GMMs?
- **Coding a GMM from scratch (without sklearn).**
- Comparing results of GMMs with KMeans.
- How to determine the optimal number of clusters for GMMs?
- Some practical use cases of GMMs.
- Takeaways.

Let's begin!

As the name suggests, a **Gaussian mixture model** clusters a dataset that has a mixture of multiple Gaussian distributions.

They can be thought of as a generalized twin of KMeans.

**Simply put, in 2 dimensions, while KMeans can only create circular clusters, Gaussian mixture models can create oval-shaped clusters.**
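A quick sketch of what "oval" means in practice (single-cluster toy data with an assumed stretch): with `covariance_type="full"`, scikit-learn's GaussianMixture learns a full covariance matrix, so the fitted cluster can be elongated along any direction rather than forced into a circle:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# One spherical toy blob, stretched 3x along x and squeezed to 0.5x along y.
X, _ = make_blobs(n_samples=500, centers=[[0, 0]], cluster_std=1.0, random_state=0)
X = X @ np.array([[3.0, 0.0], [0.0, 0.5]])

gmm = GaussianMixture(n_components=1, covariance_type="full", random_state=0).fit(X)

cov = gmm.covariances_[0]
# The learned covariance is far from a scaled identity: the fitted cluster
# is an oval, with much more variance along x than along y.
print(cov.round(2))
```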

But how do they do that?

Let's understand in more detail.

]]>