The scikit-learn project has arguably been one of the most significant contributions to the data science and machine learning community for building traditional machine learning (ML) models.

Personally speaking, it’s hard to imagine a world without sklearn.

However, things get pretty concerning if we intend to deploy sklearn-driven models in real-world systems.

Let’s understand why.

Scikit-learn models are primarily built on top of NumPy, which, of course, is a fantastic and high-utility library for numerical computations in Python.

Yet, contrary to common belief, NumPy isn’t as optimized as one might hope for in real-world ML systems.

One substantial reason for this is that **NumPy can only run on a single core of a CPU**.

This provides massive room for improvement as there is no parallelization support in NumPy (yet), and it naturally becomes a **big concern** for data teams to let NumPy drive their production systems.

Traditional ML models do perform well on tabular datasets. But as we discussed in a recent blog on model compression: *“Typically, when we deploy any model to production, the specific model that gets shipped to production is NOT solely determined based on performance. Instead, we must consider several operational metrics that are not ML-related.”*

Another major limitation is that scikit-learn models cannot natively run on Graphics Processing Units (GPUs).

Having GPU support in deployment matters because real-world systems often demand lightning-fast predictions and processing.

However, as discussed above, sklearn models are primarily driven by NumPy, which, disappointingly, can only run on a single core of a CPU. In this context, sklearn is unlikely to gain GPU support anytime soon.

In fact, this is also mentioned on Sklearn’s FAQ page:

**Question:** *Will you add GPU support?*

**Answer:** *No, or at least not in the near future. The main reason is that GPU support will introduce many software dependencies and introduce platform-specific issues. scikit-learn is designed to be easy to install on a wide variety of platforms.*

Further, they mention that “*Outside of neural networks, GPUs don’t play a large role in machine learning today, and much larger gains in speed can often be achieved by a careful choice of algorithms.*”

**I don’t entirely agree with this specific statement.**

Consider the enterprise space. Here, the data is primarily tabular. Classical ML techniques such as linear models and tree-based ensemble methods are frequently used to model the tabular data.

In fact, when you have tons of data to model, there’s absolutely no reason to avoid experimenting with traditional ML models first.

Yet, in the current landscape, one is often compelled to train and deploy deep learning-based models just because they offer optimized matrix operations using **tensors**.

We see a clear gap here.

Thus, in this article, let’s learn a couple of techniques today:

- How do we run traditional ML models on large datasets?
- How do we integrate GPU support with traditional ML models in deployment systems?

While there is no direct way to do the latter, we must (somehow) compile our machine learning model to tensor operations, which can be loaded on a GPU for acceleration. We’ll discuss this in the article shortly.

But before that, we must understand a few things.

More specifically:

- What are tensors?
- How are tensors different from a traditional NumPy array?
- Why are tensor computations faster than NumPy operations, and why are tensor operations desired?

Let’s begin!

Many often interpret tensors as a complicated and advanced concept in deep learning.

However, it isn’t.

The only thing to understand about tensors is that, like NumPy arrays, they are just another data structure for storing multidimensional data.

- When we use NumPy to store numerical data, we create a **NumPy array**, NumPy’s built-in data structure.
- When we use PyTorch (for instance) to store numerical data, we create a **Tensor**, PyTorch’s built-in data structure.

That’s it.

**A tensor, like a NumPy array, is just another data structure.**
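To make this concrete, here’s a minimal sketch showing the same data stored in both data structures (the PyTorch lines are commented out, since PyTorch must be installed separately via `pip install torch`):

```python
import numpy as np

# NumPy's built-in data structure: the ndarray
arr = np.array([[1.0, 2.0], [3.0, 4.0]])
print(type(arr))  # <class 'numpy.ndarray'>

# PyTorch's built-in data structure: the Tensor
# (uncomment if PyTorch is installed)
# import torch
# t = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
# print(type(t))  # <class 'torch.Tensor'>
```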

Now, an obvious question at this point is:

Why create another data structure when NumPy arrays do the exact same thing of storing multidimensional data, and they are very well integrated with other scientific Python libraries?

There are multiple reasons why PyTorch decided to develop a new data structure.

NumPy undoubtedly offers extremely fast and optimized operations, and it achieves this through vectorization.

Simply put, vectorization offers run-time optimization when dealing with a batch of data together, by avoiding native Python for-loops (which are slow).
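As a quick illustration (exact timings vary by machine, but the gap is always dramatic), compare a native Python loop against the equivalent vectorized NumPy operation:

```python
import time
import numpy as np

x = np.random.rand(1_000_000)

# Native Python for-loop: one element at a time
t0 = time.perf_counter()
total = 0.0
for v in x:
    total += v * v
loop_time = time.perf_counter() - t0

# Vectorized NumPy: the whole batch at once
t0 = time.perf_counter()
total_vec = float(np.sum(x * x))
vec_time = time.perf_counter() - t0

print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.4f}s")
```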

But as discussed earlier in this article, **NumPy DOES NOT support parallelism.**

Thus, even though its operations are vectorized, every operation is executed on a single core of the processing unit.

This provides further scope for run-time improvement.

Of course, there are open-source libraries like Numexpr that address this limitation by providing a fast evaluator for NumPy expressions using:

**Multi-threading:**

- Multi-threading is a parallel computing technique that allows a program to execute multiple threads (smaller units of a process) concurrently.
- In the context of NumPy and libraries like Numexpr, multi-threading accelerates mathematical and numerical operations by dividing the computation across multiple CPU cores.
- This approach is particularly effective when you have a multi-core CPU, as it leverages the available cores for parallelism, leading to faster computation.

**Just-in-time (JIT) compilation:**

- JIT compilation is a technique used to improve the run-time performance of code by compiling it at run-time, just before execution.
- In the context of Numexpr (and similar libraries), JIT compilation involves taking a NumPy expression or mathematical operation and dynamically generating machine code specific to that operation.
- As a result, JIT-compiled code can run much faster than equivalent pure Python code because it is optimized for the specific operation and can make use of low-level hardware features.

The speedup offered by Numexpr is evident from the image below.

According to Numexpr’s documentation, depending upon the complexity of the expression, the speed-ups can range from 0.95x to 20x.
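Here’s a rough sketch of how Numexpr is used (it’s a separate `pip install numexpr`; the `try/except` below exists only to keep the snippet runnable when it isn’t installed):

```python
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# Plain NumPy: each sub-expression is evaluated separately
expected = 2 * a + 3 * b

# Numexpr: the whole expression is compiled and evaluated in one
# multi-threaded pass over the arrays
try:
    import numexpr as ne
    result = ne.evaluate("2*a + 3*b")
except ImportError:  # fall back so the sketch still runs without numexpr
    result = expected

assert np.allclose(result, expected)
```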

Nonetheless, the biggest problem is that Numexpr can only speed up element-wise operations on NumPy arrays.

This includes:

- Element-wise sum, multiplication, etc.
- Element-wise transformations like `sin`, `log`, etc.
- and more.

**But Numexpr has no parallelization support for matrix multiplications, which, as you may already know, are the backbone of deep learning models.**

This problem gets resolved in PyTorch tensors as they offer parallelized operations.

**An important point to note:**

**GPU parallelization:**

- If you’re working with PyTorch tensors on a GPU (using CUDA), the matrix multiplication operation is highly parallelized across the numerous cores of the GPU.
- Modern GPUs consist of thousands of cores designed for parallel computation.
- When you perform a matrix multiplication on a GPU, these cores work together to compute the result much faster than a CPU could.

**CPU parallelization:**

- The extent of parallelization on a CPU may depend on the CPU’s architecture.
- All CPUs these days have multiple cores, and PyTorch is optimized to utilize these cores efficiently for matrix operations.
- While it may not be as parallel as a GPU, you can still expect significant speed improvements over performing the operation in pure Python.

We can also verify this experimentally:

- On the left, we create two random NumPy arrays and perform matrix multiplication using the `np.matmul()` method.
- On the right, we create two random PyTorch tensors and perform matrix multiplication using the `torch.matmul()` method.

As depicted above, **PyTorch is over three times faster than NumPy**, which is a massive speed-up.
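Here is a sketch of that experiment (the array size is an arbitrary choice here, and the exact speedup depends heavily on hardware and build configuration, so treat any ratio you observe as indicative only; the PyTorch half is guarded so the snippet runs even without torch installed):

```python
import time
import numpy as np

n = 500
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

# NumPy matrix multiplication
t0 = time.perf_counter()
c_np = np.matmul(a, b)
numpy_time = time.perf_counter() - t0
print(f"NumPy matmul: {numpy_time:.4f}s")

# PyTorch matrix multiplication, if PyTorch is available
try:
    import torch
    ta, tb = torch.from_numpy(a), torch.from_numpy(b)
    t0 = time.perf_counter()
    c_t = torch.matmul(ta, tb)
    torch_time = time.perf_counter() - t0
    print(f"PyTorch matmul: {torch_time:.4f}s")
except ImportError:
    pass
```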

This proves that PyTorch provides highly optimized vector operations, which the neural network can benefit from, **not only during forward pass but backpropagation as well**.

In fact, here’s another reason why tensor operations in PyTorch are faster.

See, as we all know, **NumPy is a general-purpose computing framework** that is designed to handle a wide range of numerical computations across various domains, not limited to just deep learning or machine learning.

In fact, NumPy is not just used by data science and machine learning practitioners, but it is also widely used in various scientific and engineering fields for tasks such as signal processing, image analysis, and simulations in physics, chemistry, and biology.

It’s so popular in biological use cases that a bunch of folks extended the NumPy package to create BioNumPy:

While its versatility is a key strength, it also means that NumPy’s internal optimizations are geared toward a broad spectrum of use cases.

On the other hand, **PyTorch is purposely built for deep learning and tensor operations**.

This specialized focus allows PyTorch to implement highly tuned and domain-specific optimizations for tensor computations, including matrix multiplications, convolutions, and more.

These optimizations are finely tuned to the needs of deep learning, where large-scale matrix operations are fundamental.

Being niched down to a specific set of users allowed PyTorch developers to optimize tensor operations, including matrix multiplication, for one specific application: deep learning.

If you want another motivating example, we discussed this in the newsletter here:

In a gist, the core idea is that the more specific we get, the better we can do compared to a generalized solution.

Deep learning is all about a series of matrix operations applied layer after layer to generate the final output.

For instance, consider the following neural network for regression:

- First, the input received at the input layer $(x_1, x_2, \cdots, x_m)$ is transformed by a set of weights $(W_A)$ and an activation function to get the output of the hidden layer.
- Next, the output of the hidden layer is further transformed by a set of weights $(W_B)$ to get the final output (t).

If we were to use NumPy to represent input, weights, and layer outputs, it would be impossible to tell how a specific array was computed.

For instance, consider the two NumPy arrays `arr1` and `arr2` below:

The NumPy array `arr3` holds no information about how it was computed.

In other words, as long as we don’t manually dig into the code, we can never tell:

- What were the operands?
- What was the operator?

But why do we even care about that information?

See, as long as we are only doing a forward pass in a neural network, we don’t care which specific operation and which arrays generated a particular layer output. We only care about the output in that case.

But that’s not how neural networks are trained, is it?

To train a neural network, we must run backpropagation.

To run backpropagation, we must compute gradients to update the weights.

And to compute gradients of a layer’s weights, we must know the specific arrays that were involved in that computation.

For instance, consider the above neural network again:

To update the weights $W_B$, we must compute the gradient $\Large \frac{\delta L}{\delta W_B}$.

The above gradient depends on the loss value $L$, which in turn depends on $\hat y$.

Thus, we must know the specific vectors that were involved in the computation of $\hat y$.

While this is clear from the above network:

- What if we add another layer?
- What if we change the activation function?
- What if we add more neurons to the layer?
- What if we were to compute the gradient of weight in an earlier layer?

All this can get pretty tedious to manage manually.

However, if (somehow) we can keep track of how each tensor was computed, what operands were involved, and what the operator was, we can simplify gradient computation.

A computational graph helps us achieve this.

Simply put, a computational graph is a directed acyclic graph representing the sequence of mathematical operations that led to the creation of a particular tensor.

💡

A directed acyclic graph is a directed graph with no directed cycles.

During network training, PyTorch forms this computational graph during the forward pass.

For instance, the computational graph for a dummy neural network is shown below:

💡

Here, we have represented the operation as **matmul** for simplicity. In reality, however, PyTorch stores the backward function of that operation in its computational graph.

- First, we perform a matrix multiplication between the input $X$ and the weights $W_A$ to get the output activations $Z_B$ (we are ignoring any activation functions for now).
- Next, we perform a matrix multiplication between the output activations $Z_B$ and the weights $W_B$ to get the network output $Z_C$.

During backpropagation, PyTorch starts scanning the computational graph backward, i.e., from the output node, iteratively computes the gradients, and updates all the weights.

The program that performs all the gradient computation is called PyTorch Autograd.
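To make the idea concrete, here is a tiny, illustrative scalar autograd sketch (nothing like PyTorch’s real implementation; all names are invented for illustration). Each node records its operands and a backward function, and `backward()` replays the graph in reverse, applying the chain rule:

```python
class Scalar:
    """A toy scalar 'tensor' that records how it was computed."""

    def __init__(self, value, parents=(), backward_fn=None):
        self.value = value
        self.grad = 0.0
        self._parents = parents          # operands that produced this node
        self._backward_fn = backward_fn  # how to push gradients to them

    def __mul__(self, other):
        out = Scalar(self.value * other.value, (self, other))
        def backward_fn():
            self.grad += other.value * out.grad   # d(out)/d(self) = other
            other.grad += self.value * out.grad   # d(out)/d(other) = self
        out._backward_fn = backward_fn
        return out

    def __add__(self, other):
        out = Scalar(self.value + other.value, (self, other))
        def backward_fn():
            self.grad += out.grad                 # d(out)/d(self) = 1
            other.grad += out.grad                # d(out)/d(other) = 1
        out._backward_fn = backward_fn
        return out

    def backward(self):
        # Topologically sort the graph, then walk it from the output node
        self.grad = 1.0
        visited, order = set(), []
        def topo(node):
            if node not in visited:
                visited.add(node)
                for p in node._parents:
                    topo(p)
                order.append(node)
        topo(self)
        for node in reversed(order):
            if node._backward_fn:
                node._backward_fn()

# y = w*x + b  ->  dy/dw = x, dy/dx = w, dy/db = 1
x, w, b = Scalar(2.0), Scalar(3.0), Scalar(1.0)
y = w * x + b
y.backward()
print(w.grad, x.grad, b.grad)  # 2.0 3.0 1.0
```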

Large deep learning models demand plenty of computational resources for speeding up model training.

However, NumPy operations are primarily designed to run on the Central Processing Unit (CPU), which is the general-purpose processor in most computers.

While CPUs are versatile and suitable for many tasks, they do not provide the speed and parallel processing capabilities needed for large-scale numerical computations, especially in the context of modern deep learning and scientific computing.

On a side note, if you genuinely want to run NumPy-like computation on a GPU, CuPy is an open-source NumPy alternative that you may try.

It’s a NumPy-compatible array library for GPU-accelerated computing.

The syntax of CuPy is quite compatible with NumPy. To use the GPU, you just need to replace the NumPy import in your code:
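For many scripts, the swap is essentially just the import (the CuPy line is commented out here, since it requires a CUDA GPU and `pip install cupy`):

```python
import numpy as np          # CPU version
# import cupy as np         # GPU version: a near drop-in swap for many scripts

x = np.arange(6).reshape(2, 3)
print(x.sum())
```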

Nonetheless, the issue of not being able to track how each array was computed still exists with CuPy.

Thus, even if we wanted to, we could not use CuPy as a drop-in alternative to NumPy for training neural networks.

In fact, CuPy, like NumPy, is also a general-purpose scientific computation library. So any deep learning-specific optimizations are still not up to the mark.

These limitations prompted PyTorch developers to create a new data structure, which addressed these limitations.

This also suggests that by somehow compiling machine learning models to tensor computations, we can leverage immense inference speedups.

Before getting into those details, let’s understand how we can train sklearn models on large datasets on a CPU.

So far, we have spent plenty of time understanding the motivation for building traditional ML models on large datasets.

As sklearn can only utilize CPU, using it for large datasets is still challenging.

Yet, there’s a way.

We know that sklearn provides a standard API across each of its machine learning model implementations:

- Train the model using `model.fit()`.
- Predict the output using `model.predict()`.
- Compute the accuracy using `model.score()`.
- and more.

However, the problem with training a model this way is that the sklearn API expects the entire training data at once.

This means that the entire dataset **must be** available in memory to train the model.

But what if the dataset itself is too large to load in memory? Such datasets are called out-of-memory datasets.

In fact, even if we can somehow barely load the dataset in memory, it might be difficult to train the model because every model requires some amount of computations, which, of course, will consume memory.

Thus, there’s a high possibility that the program (or Jupyter kernel) may crash.

Nonetheless, there’s a solution to this problem.

In situations where it’s not possible to load the entire data into the memory at once, we can load the data in chunks and fit the training model for each chunk of data.

This is also called incremental learning, and sklearn, considerately, provides the flexibility to do so.

More specifically, sklearn implements the `partial_fit()` API for various algorithms, which offers incremental learning.

As the name suggests, the model can learn incrementally from a mini-batch of instances. This avoids memory constraints, as only a few instances are loaded in memory at once.

What’s more, by loading and training on a few instances at a time, we can possibly speed up the training of sklearn models.

Why?

Usually, when we use the `model.fit(X, y)` method to train a model in sklearn, the training process is vectorized, but over the entire dataset.

While vectorization provides magical run-time improvements when we have a bunch of data, it is observed that the performance may degrade after a certain point.

Thus, by loading fewer training instances at a time into memory and applying vectorization, we can get a better training run-time.

Let’s see this in action!

First, let’s create a dummy classification dataset with:

- 20 Million training instances
- 5 features
- 2 classes

We will use the `make_classification()` method from sklearn to do so:
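Here’s a sketch of that step (scaled down to 20,000 rows so it runs quickly; swap in the full sample count to reproduce the article’s experiment):

```python
import pandas as pd
from sklearn.datasets import make_classification

# Create the dummy classification dataset
X, y = make_classification(
    n_samples=20_000,   # 20_000_000 in the full experiment
    n_features=5,
    n_classes=2,
    random_state=42,
)

# Export to CSV so it can later be read back in chunks
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])
df["target"] = y
df.to_csv("large_dataset.csv", index=False)
```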

After creating a Pandas DataFrame and exporting it to a CSV, the dataset occupies roughly 4 GB of local storage space:

We train an `SGDClassifier` model using sklearn on the entire dataset as follows:

The above training takes about $251$ seconds.

Next, let’s train the same model using the `partial_fit()` API of sklearn.

Here’s what we shall do:

- Load data from the CSV file `large_dataset.csv` in chunks.
    - We can do this by specifying the `chunksize` parameter in the `pd.read_csv()` method.
    - Say `chunksize=400000`; this means Pandas will only load 400,000 rows at a time in memory.
- After loading a specific chunk, we will invoke the `partial_fit()` API on the `SGDClassifier` model.

This is implemented below:
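Here’s a self-contained sketch of that loop (the dataset is generated inline and scaled down so the snippet runs standalone; with the real 4 GB CSV, only `chunksize` rows ever sit in memory at once):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Small stand-in for the article's 4 GB CSV, so the sketch is self-contained
X, y = make_classification(n_samples=10_000, n_features=5,
                           n_classes=2, random_state=42)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])
df["target"] = y
df.to_csv("large_dataset.csv", index=False)

model_chunk = SGDClassifier(random_state=42)
all_classes = np.unique(y)  # a mini-batch may not contain every class

# Load the CSV in chunks and fit the model incrementally on each one
for chunk in pd.read_csv("large_dataset.csv", chunksize=2_000):
    X_chunk = chunk.drop(columns=["target"]).values
    y_chunk = chunk["target"].values
    model_chunk.partial_fit(X_chunk, y_chunk, classes=all_classes)
```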

**The training time is reduced by ~8 times**, which is massive.

💡

In this case, the `classes` parameter is used to specify all the classes in the training dataset. When using the `partial_fit()` API, a mini-batch may not have instances of all classes (especially the first mini-batch). Thus, the model will be unable to cope with new/unseen classes in subsequent mini-batches. Therefore, we must pass a list of all possible classes in the `classes` parameter.

This validates what we discussed earlier:

While vectorization provides magical run-time improvements when we have a bunch of data, it is observed that the performance may degrade after a certain point. Thus, by loading fewer training instances at a time into memory and applying vectorization, we can get a better training run-time.

Of course, we must also compare the model coefficients and prediction accuracy of the two models.

The following visual depicts the comparison between `model_full` and `model_chunk`:

Both models have similar coefficients and similar performance.

Having said that, it is also worth noting that not all sklearn estimators implement the `partial_fit()` API.

Here's the list of models that do:

Once we have trained our sklearn model (either on a small dataset or large), we may want to deploy it.

However, as discussed earlier, sklearn models are backed by NumPy computations, so they **can only run on a single core of a CPU**.

Thus, in a deployment scenario, this can lead to suboptimal run-time performance.

Nonetheless, it is possible to compile many sklearn models to tensor operations, which can be loaded on a GPU to gain immense speedups.

Let’s understand how.

In recent deep dives, we’ve primarily focused on cultivating skills that can help us develop large machine learning (ML) projects.

For instance, in the most recent deep dive on “**Model Compression**”, we learned many techniques to drastically reduce the size of a model to make it more production-friendly.

In the above article, we saw how these techniques allow us to reduce both the latency and size of the original model, which directly helps in:

- Lowering computation costs.
- Reducing model footprint.
- Improving user experience due to low latency…

**…all of which are critical metrics for businesses.**

However, learning about model compression techniques isn’t sufficient.

In most cases, we would only proceed with model compression when the model is intended to serve an end-user.

And that is only possible when we know how to deploy and manage machine learning in production.

Thus, after learning about model compression techniques, we are set to learn the next critical skill: **deployment**.

In my opinion, many think about deployment as just “deployment” — host the model somewhere, obtain an API endpoint, integrate it into the application, and you are done!

**But that is almost NEVER the case.**

This is because, in reality, plenty of things must be done post-deployment to ensure the model’s reliability and performance.

What are they? Let’s understand!

Deployment is a pivotal stage in the ML project lifecycle. It’s that stage where a real user will rely on your model’s predictions.

Yet, it’s important to recognize that deployment is not the final destination.

After deploying a model, several critical considerations must be addressed to ensure its reliability and performance.

Let’s understand them.

Version control is critical to all development processes. It allows developers to track software changes (code, configurations, data, etc.) over time.

In the context of data teams, version control can be especially crucial when deploying models.

For instance, with version control, one can precisely identify what changed, when it changed, and who changed it — which is crucial information when trying to diagnose and fix issues that arise during the deployment process or if models start underperforming post-deployment.

This goes back to what we discussed in a recent deep dive — “*Machine learning deserves the rigor of any software engineering field.*”

If the model starts underperforming, git-based functionality allows us to quickly roll back to previous versions of the model.

There are many other benefits too.

Effective collaboration becomes increasingly important as data science projects get bigger and bigger.

Someone in the team might be working on identifying better features for the model, and someone else might be responsible for fine-tuning hyperparameters or optimizing the deployment infrastructure.

And it is well known that with version control, teams can work on the same codebase/data and improve the same models without interfering with each other’s work.

Moreover, one can easily track changes, review each other’s work, and resolve conflicts (if any).

Reproducibility is one of the critical aspects of building reliable machine learning.

Imagine this: something that works on one system but does not work on another reflects bad reproducibility practices.

Why is it important, you may wonder?

It ensures that results can be replicated and validated by others, which improves the overall credibility of our work.

Version control allows us to track the exact code version and configurations used to produce a particular result, making it easier to reproduce results in the future.

This becomes especially useful for open-source data projects that many may use.

CI/CD enables teams to build, test, and deploy code quickly and efficiently.

In machine learning, Continuous Integration (CI) may involve building and testing changes automatically to ML models as soon as they are committed to a code repository.

In Continuous Deployment (CD), the objective can be to reflect new changes to the model once they have passed testing.

Consequently, it should seamlessly update the changes to production, making the latest version of the model available to end users.

Model logging is another crucial aspect of post-deployment ML operations.

As the name suggests, logging involves capturing and storing relevant information about model performance, resource utilization, predictions, input data, latency, etc.

There are various reasons why model logging is important and why it’s something that should NEVER be overlooked.

To understand better, imagine you have already deployed a model, and it is serving end-users.

Once deployed, there’s every chance that something will go wrong in production, **especially on the data front**!

Let’s understand in detail.

**Concept drift** happens when the statistical properties of the target variable or of the input features fed to the model change over time.

In simpler terms, the relationship between your model's inputs and outputs evolves, making your model less accurate over time if not addressed.

Concept drift can occur due to various reasons, such as:

- Changes in user behavior
- Shifts in the data source
- Alterations in the underlying data-generating process.

For instance, imagine you are building a spam email classifier. You train the model on a dataset collected over several months.

Initially, the model performs well and accurately classifies spam and non-spam emails.

However, over time, email spamming techniques evolve.

New types of spam emails emerge with different keywords, structures, and techniques.

This change in the underlying concept of “spam” represents concept drift.

That is why it is important to have periodic retraining or continuous training strategies in place.

If your model isn't regularly retrained with up-to-date data, it may start misclassifying the new types of spam emails, leading to decreased performance.

💡

The term ‘Covariates’ refers to the features of your model.

**Covariate shift** is a specific type of concept drift that occurs when the distribution of the input features (covariates) in your data changes over time, but the true relationship between the target variable and the inputs remains the same.

In other words, the true (or natural) relationships between the input features and the target variable stay constant, but the distribution of the input features shifts.

For instance, consider this is the true relationship, which is non-linear:

Based on the observed training data, we ended up learning a linear relationship:

However, at the time of inference post-deployment, the distribution of input samples was different from that of the observed distribution:

It leads to poor model performance because the model was trained on one distribution of the data, but now, it is being tested or deployed on a different distribution.
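The failure mode described above can be reproduced in a few lines (the quadratic ground truth, feature ranges, and sample sizes here are all illustrative choices, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # The true input-output relationship is non-linear
    return x ** 2

# Training inputs come from one region of the feature space...
x_train = rng.uniform(0.0, 1.0, 200)
# ...so a straight line fits them reasonably well
w, b = np.polyfit(x_train, true_fn(x_train), deg=1)

# At inference time, the input distribution has shifted
x_test = rng.uniform(2.0, 3.0, 200)

mse_train = np.mean((w * x_train + b - true_fn(x_train)) ** 2)
mse_test = np.mean((w * x_test + b - true_fn(x_test)) ** 2)
print(f"train MSE: {mse_train:.4f}, shifted-test MSE: {mse_test:.2f}")
```

The model’s error explodes on the shifted inputs even though the underlying relationship never changed, which is exactly what covariate shift looks like in practice.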

Methods for addressing covariate shifts include reweighting the training data or using domain adaptation techniques to align the source and target distributions.

💡

To an extent, batch normalization is a remedy for covariate shifts in neural networks.

For instance, suppose you are building a weather forecasting model. You train the model using historical weather data from a specific region, and the training data includes features like temperature, humidity, and wind speed.

However, when you deploy the model to a different region with a distinct climate, the distribution of these features can shift significantly.

For instance, temperature ranges and humidity levels in the new region might be quite different from those in the training data. This covariate shift can cause your model to make inaccurate predictions in the new environment.

When building statistical models, we typically assume that the samples are identically distributed.

**Non-stationarity** refers to the situation where the **probability distribution** of the samples evolves over time in a non-systematic or unpredictable manner.

This can encompass various aspects, including changes in data distributions, trends, seasonality, or other patterns.

**Non-stationarity can be challenging for machine learning models, as they are typically trained while assuming that the data distribution remains constant.**

For instance, assume you are building some wealth predictor. A raw currency amount will typically not serve as a good feature because currency values are affected by inflation.

Models deployed in non-stationary environments may need regular updates or adaptive learning strategies to cope with changing data patterns.

**Unrepresentative training data** is a situation where the data used to train a machine learning model does not adequately represent the real-world conditions or the diversity of scenarios that the model will encounter in production.

When training data is not representative, the model may perform well on the training data but poorly on new, unseen data.

This issue can lead to bias and poor generalization.

For instance, suppose you are building a speech recognition system for a voice assistant.

You collect training data primarily from young adults with clear accents and no speech impairments.

However, in real-world usage, the voice assistant will be used by people of all ages who speak various languages and have different speech patterns.

If your training data is unrepresentative and biased towards a specific demographic, the model may struggle to understand and accurately transcribe speech from a more diverse user base, leading to poor performance in production.

The above four problems that we typically face in production systems highlight the importance of model logging.

As discussed above, addressing these issues often involves continuous monitoring of model performance in the deployment environment, collecting and labeling new data when necessary, and retraining the model to adapt to changing conditions.

Traditionally, in industry, advanced techniques like adaptive learning are employed to update the model with the new data and potentially mitigate the impact of concept drift, covariate shift, non-stationarity, and unrepresentative training data.

👉

Below, we shall discuss a bit about Adaptive learning (mostly conceptual ideas), but we will do a full practical article on this soon.

Traditionally, ML models are trained on some gathered fixed/static dataset and then used to make predictions on unseen data.

While this is how machine learning has been typically (and successfully) approached so far, it becomes infeasible to train a new model from scratch every time we get new data.

This makes intuitive sense as well.

Adaptive learning is a remedy to this problem.

Adaptive models can adapt and improve their performance as they are exposed to more data, leading to better accuracy and utility.

In situations where the data distribution is constantly changing, adaptive models can adapt and continue to perform well, while non-adaptive models may struggle.

A major advantage of adaptive learning is that the additional computational cost of each update is small, since the previously trained model is used as the starting point for the next retraining iteration.

In the realm of solving real-life problems by deploying machine learning models, it is inevitable that, with time, data distribution will change.

As a result, the models trained on old data will likely provide little value going forward.

Adaptive models are a great solution in such situations, as they consistently adapt to the incoming data streams.

In almost all ML use cases, the algorithm is never coded from scratch.

Instead, one uses open-source implementations offered by libraries like PyTorch, Sklearn, and many more.

To ensure reproducibility in production, the production environment should be consistent with the environment in which it was trained.

This involves installing similar versions of libraries used, software dependencies, OS configurations, and many more.

Of course, achieving this consistency is not necessarily a painstaking process.

Yet, it does require careful environment configuration and management.

This involves documenting and tracking the versions of all software components, libraries, and dependencies used during model development and deployment.
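For example, a minimal way to capture this information with the standard library (the specific package names below are illustrative):

```python
import sys
from importlib import metadata

# Record the exact runtime used during training so the production
# environment can be matched against it later.
environment = {"python": sys.version.split()[0]}
for package in ("numpy", "scikit-learn"):
    try:
        environment[package] = metadata.version(package)
    except metadata.PackageNotFoundError:
        environment[package] = "not installed"

print(environment)
```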

To address these consistency challenges, organizations often use containerization technologies like Docker.

Containers encapsulate the entire environment, including software dependencies, libraries, and configurations, ensuring that the same environment is replicated in both the development and production stages.

ML engineers may not have experience with deployment. They may not have the necessary expertise in areas such as software engineering, MLOps, and infrastructure management.

This can make it difficult for them to effectively deploy and scale models in production environments.

In such cases, organizations hire specialized talent.

However, engineers hired specifically for deployment may not have an in-depth understanding of ML algorithms and techniques.

This makes it difficult for them to understand the code and make the necessary optimizations, leading to issues with scaling, performance, and reliability, which can ultimately impact the effectiveness of the model in production.

The above pain points, along with the data challenges discussed earlier, highlight the **necessity for a data scientist to have deployment expertise**.

Traditional hosting services like Google Cloud, AWS, and Heroku have been go-to options for deploying machine learning models.

However, the process can be challenging and time-consuming, requiring specialized expertise in infrastructure and DevOps.

For data scientists without these skills, deploying models to production can be a significant pain point.

There are several challenges associated with traditional hosting services.

First, data scientists often switch between different tools and environments to manage deployments. This means leaving the comfort of their Jupyter notebooks, where they spend most of their time developing and refining models.

The process can be jarring, and the need to learn new tools and interfaces can slow down productivity.

Second, deploying machine learning models to production environments demands plenty of configuration and management of infrastructure resources, including servers, networking, and security.

This is a specialized area that many data scientists may not be familiar with, and it can take a lot of time and effort to get right.

The above pain points highlight a need for a simple and elegant way to deploy machine learning models that doesn’t require specialized expertise and can be done entirely from a Jupyter Notebook.

Modelbit is a deployment service that specifically addresses all these challenges, allowing data scientists to deploy models with just a single command from their notebooks.

With Modelbit, there’s no need to worry about infrastructure, security, or server management — the service takes care of everything, allowing data scientists to focus on what they are supposed to do — building and improving models.

Thus, in this article, let’s understand how to use Modelbit for machine learning model deployment.

Let’s begin!

🗒️

The core objective behind model deployment is to obtain an API endpoint for our deployed model, which can be later used for inference purposes:

Modelbit lets us seamlessly deploy ML models directly from our Python notebooks (or from Git, as we will see later in this article) and obtain a REST API.

Since Modelbit is a relatively new service, let’s understand the general workflow to generate an API endpoint when deploying a model with Modelbit.

The image below depicts the steps involved in deploying models with Modelbit:

- Step 1) We connect the Jupyter kernel to Modelbit.
- Step 2) Next, we train the ML model.
- Step 3) We define the inference function. Simply put, this function contains the code that will be executed at inference. Thus, it will be responsible for returning the prediction.
- Step 4) [OPTIONAL] Here, we specify the version of Python and other open-source libraries we used while training the model.
- Step 5) Lastly, we send it for deployment.

Once done, Modelbit returns the API endpoint, which we can integrate into any application to serve end users.

Let’s implement this!
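As a sketch of steps 2–5, here is what the flow can look like. The toy model and the `predict_value` function are illustrative, and the Modelbit calls (which require an account, so they are shown commented out) follow the library's documented `login`/`deploy` flow:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Step 2) Train the ML model (a toy regression here)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
model = LinearRegression().fit(X, y)

# Step 3) Define the inference function: the code executed per request
def predict_value(feature: float) -> float:
    return float(model.predict([[feature]])[0])

# Steps 1, 4, 5) Connect the kernel, pin dependencies, and deploy:
# import modelbit
# mb = modelbit.login()
# mb.deploy(predict_value, python_packages=["scikit-learn"])

print(predict_value(5.0))  # ~10.0 for this perfectly linear toy data
```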

Training machine learning (ML) models is frequently driven by a relentless pursuit of higher and higher accuracy.

Many create increasingly complex deep learning models, which, without a doubt, do incredibly well “performance-wise.”

However, the complexity severely impacts their real-world utility.

**For years, the primary objective in model development has been to achieve the best performance metrics.**

👉

This, unfortunately, is also a practice that many leaderboard-based competitions promote. There is nothing wrong with that, but in my opinion, it overshadows the importance of focusing on the real-world applicability of the solution.

However, it is important to note that when it comes to deploying these models in production (or user-facing) systems, the focus shifts from raw accuracy to considerations such as efficiency, speed, and resource consumption.

Thus, typically, when we deploy any model to production, the specific model that gets shipped to production is NOT solely determined based on performance.

Instead, we must consider several operational metrics that are not ML-related.

**What are they? Let’s understand!**

When a model is deployed into production, certain requirements must be met.

Typically, these “requirements” are not considered during the prototyping phase of the model.

For instance, it is fair to assume that a user-facing model may have to handle plenty of requests from a product/service the model is integrated with.

And, of course, we can never ask users to wait for, say, a minute for the model to run and generate predictions.

Thus, along with “model performance,” we would want to optimize for several other operational metrics:

Inference latency is the time it takes for a model to process a single input and generate a prediction.

It measures the delay between sending a request to the model and receiving the response.

Striving for a low inference latency is crucial for all real-time or interactive applications, as users expect a quick response.

High latency, as you may have guessed, will lead to a poor user experience and will not be suitable for many applications like:

- Chatbots
- Real-time speech-to-text transcription
- Gaming, and many more.

Throughput is the number of inference requests a model can handle in a given time period.

It estimates the model’s ability to process multiple requests simultaneously.

Yet again, as you may have guessed, high throughput is essential for applications with a high volume of incoming requests.

These include e-commerce websites, recommendation systems, social media platforms, etc. High throughput ensures that the model can serve many users concurrently without significant delays.
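Both latency and throughput are easy to estimate empirically. A minimal sketch, where the `model_predict` stub stands in for a real model:

```python
import time
import statistics

def model_predict(x):
    # Stub standing in for real model inference
    return 0.5 * x + 1.0

# Inference latency: time individual requests
latencies = []
for i in range(1_000):
    start = time.perf_counter()
    model_predict(float(i))
    latencies.append(time.perf_counter() - start)

p50 = statistics.median(latencies)
p95 = statistics.quantiles(latencies, n=20)[-1]  # ~95th percentile

# Throughput: requests completed per second
n = 10_000
start = time.perf_counter()
for i in range(n):
    model_predict(float(i))
throughput = n / (time.perf_counter() - start)

print(f"p50={p50:.2e}s  p95={p95:.2e}s  throughput={throughput:,.0f} req/s")
```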

Model size refers to the amount of memory a model occupies when loaded for inference purposes.

It quantifies the memory footprint required to store all the parameters, configurations, and related data necessary for the model to make predictions or generate real-time outputs.

The significance of model size becomes particularly apparent when deploying models in resource-constrained environments.

Many production environments, such as mobile devices, edge devices, or IoT devices, have limited memory capacity.

In such cases, the model’s size directly determines whether it can be deployed at all.

Large models may not fit within the available memory, making them impractical for these resource-constrained settings.
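As a rough illustration, the memory footprint of a model's parameters can be estimated directly (the parameter shapes below are made up):

```python
import pickle
import numpy as np

# Stand-in for a trained model: one million float32 parameters
weights = {"layer_1": np.zeros((1_000, 1_000), dtype=np.float32)}

raw_bytes = sum(w.nbytes for w in weights.values())
pickled_bytes = len(pickle.dumps(weights))

print(f"raw parameters: {raw_bytes / 1e6:.1f} MB")   # 1e6 params * 4 bytes = 4.0 MB
print(f"pickled on disk: {pickled_bytes / 1e6:.1f} MB")
```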

This is a famous story.

In 2006, Netflix launched the “Netflix Prize,” a machine learning competition that encouraged ML engineers to build the best algorithm to predict user ratings for films.

**The grand prize was USD 1,000,000.**

After the competition concluded, Netflix awarded a $1 million prize to a developer team in 2009 for an algorithm that increased the accuracy of the company's recommendation engine by **10 percent**.

That’s a lot!

Yet, Netflix never used that solution because it was overly complex.

Here’s what Netflix said:

The increase in accuracy of the winning improvements did not seem to justify the engineering effort needed to bring them into a production environment.

The complexity and resource demands of the developed model made it impractical for real-world deployment. Netflix faced several challenges:

- **Scalability:** The model was not easily scalable to handle the vast number of users and movies on the Netflix platform. It would have required significant computational resources to make real-time recommendations for millions of users they had.
- **Maintenance:** Managing and updating such a complex model in a production environment would have been a logistical nightmare. Frequent updates and changes to the model would be challenging to implement and maintain.
- **Latency:** The ensemble model's inference latency was far from ideal for a streaming service. Users expect near-instantaneous recommendations, but the complexity of the model made achieving low latency difficult.

You can read more about this story here: Netflix Prize story.

Consequently, Netflix never integrated the winning solution into its production recommendation system. Instead, they continued to use a simplified version of their existing algorithm, which was more practical for real-time recommendations.

This real-life instance from the Netflix Prize was a reminder that we must strive for a delicate balance between model complexity and practical utility.

**While highly complex models may excel in research and competition settings, they may not be suitable for real-world deployment due to scalability, maintenance, and latency concerns.**

In practice, simpler and more efficient models often are a better choice for delivering a seamless user experience in production environments.

Let me ask you this. Which of the following two models would you prefer to integrate into a user-facing product?

I strongly prefer Model B.

If you understand this, you resonate with the idea of keeping things simple in production.

Fortunately, there are various techniques that can help us reduce the size of the model, thereby increasing the speed of model inference.

These techniques are called **Model Compression** methods.

Using these techniques, you can reduce both the latency and size of the original model.

As the name suggests, model compression is a set of techniques used to reduce the size and computational complexity of a model while preserving or even improving its performance.

They aim to make the model smaller — that is why the name “**model compression**.”

Typically, it is expected that a smaller model will:

- Have a lower inference latency as smaller models can deliver quicker predictions, making them well-suited for real-time or low-latency applications.
- Be easy to scale due to their reduced computational demands.
- Have a smaller memory footprint.

In this article, we’ll look at four techniques that help us achieve this:

- Knowledge Distillation
- Pruning
- Low-rank Factorization
- Quantization

As we will see shortly, these techniques attempt to strike a balance between model size and accuracy, making it relatively easier to deploy models in user-facing products.

👉

The Jupyter notebook of this entire article has been provided at the bottom of the article.

Let’s understand them one by one!

This is one of the most common, effective, and reliable techniques for reducing model size, and one of my personal favorites.

Essentially, knowledge distillation involves training a smaller, simpler model (referred to as the “student” model) to mimic the behavior of a larger, more complex model (known as the “teacher” model).

The term can be broken down as follows:

**Knowledge:** Refers to the understanding, insights, or information that a machine learning model has acquired during training. This “knowledge” is typically represented by the model’s parameters, learned patterns, and its ability to make predictions.

**Distillation:** In this context, distillation means transferring or condensing knowledge from one model to another. It involves training the student model to mimic the behavior of the teacher model, effectively transferring the teacher's knowledge.

This is a two-step process:

- Train the large model as you typically would. This is called the “teacher” model.
- Train a smaller model, which is intended to mimic the behavior of the larger model. This is also called the “student” model.

The primary objective of knowledge distillation is to transfer the knowledge, or the learned insights, from the teacher to the student model.

This allows the student model to achieve comparable performance with fewer parameters and reduced computational complexity.

The technique makes intuitive sense as well.

Of course, as in a real-world teacher-student scenario in an academic setting, the student model may never perform quite as well as the teacher model.

But with consistent training, we can create a smaller model that is **almost** as good as the larger one.

This goes back to the objective we discussed above:

Strike a balance between model size and accuracy, such that it is relatively easier to deploy models in user-facing products.

A classic example of a model developed in this way is DistilBERT. It is a student model of BERT.

We also discussed this in the newsletter here:

DistilBERT is approximately 40% smaller than BERT, which is a massive difference in size.

Still, it retains approximately 97% of the natural language understanding (NLU) capabilities of BERT.

**What’s more, DistilBERT is roughly 60% faster in inference.**

This is something I have personally experienced and verified in one of my research studies on Transformer models:

As shown above, on one of the studied datasets (SensEval-2), BERT achieved the best accuracy of 76.81. With DistilBERT, it was 75.64.

On another task (SensEval-3), BERT achieved the best accuracy of 80.96. With DistilBERT, it was 80.23.

Of course, DistilBERT isn’t as good as BERT. Yet, the performance difference is small.

Given the run-time performance benefits, it makes more sense to proceed with DistilBERT instead of BERT in a production environment.

💡

If you are interested in learning more about my research study, you can read it here: **A Comparative Study of Transformers on Word Sense Disambiguation**.

One of the biggest downsides of knowledge distillation is that one must still train a larger teacher model first to train the student model.

However, in a resource-constrained environment, it may not be feasible to train a large teacher model.

Assuming we are not resource-constrained at least in the development environment, one of the most common techniques for Knowledge Distillation is **Response-based Knowledge Distillation**.

As the name suggests, in **response-based knowledge distillation**, the focus is on matching the output responses (predictions) of the teacher model and the student model.

Talking about a classification use case, this technique transfers the **probability distributions** of class predictions from the teacher to the student.

It involves training the student to produce predictions that are not only accurate but also mimic the soft predictions (probability scores) of the teacher model.

As we are trying to mimic the **probability distribution of the class predictions of the teacher model**, one ideal candidate for the loss function is KL divergence.

We discussed this in detail in one of the previous articles on t-SNE.

Yet, here’s a quick recap:

The core idea behind KL divergence is to assess how much information is lost when one distribution is used to approximate another.

Thus, the more information is lost, the higher the KL divergence, and hence the greater the dissimilarity between the two distributions.

KL divergence between two probability distributions $P(x)$ and $Q(x)$ is calculated as follows:

$$D_{KL}(P \,||\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$

The formula for KL divergence can be read as follows:

The KL divergence $D_{KL}(P \,||\, Q)$ between two probability distributions $P$ and $Q$ is calculated by summing the above quantity over all possible outcomes $x$. Here:

- $P(x)$ represents the probability of outcome $x$ occurring according to distribution $P$.
- $Q(x)$ represents the probability of the same outcome occurring according to distribution $Q$.

It measures how much information is lost when using distribution $Q$ to approximate distribution $P$.

Imagine this. Say $P$ and $Q$ were identical. This should result in zero loss of information. Let’s verify this from the formula above.

If the probability distributions $P$ and $Q$ are identical, it means that for every $x$, $P(x) = Q(x)$. Thus, each ratio $\frac{P(x)}{Q(x)}$ equals $1$, and $\log 1 = 0$.

Simplifying, we get:

$$D_{KL}(P \,||\, Q) = \sum_x P(x) \cdot 0 = 0$$

This is precisely what we intend to achieve in response-based knowledge distillation.

Simply put, we want the probability distribution of the class predictions of the student model to be identical to the probability distribution of the class predictions of the teacher model.
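To make the zero-divergence property concrete, here is a quick numeric check in plain Python (the example distributions are arbitrary):

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum over x of P(x) * log(P(x) / Q(x))."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

teacher = [0.7, 0.2, 0.1]   # teacher's class probabilities
same    = [0.7, 0.2, 0.1]   # identical student distribution
diff    = [0.3, 0.4, 0.3]   # a different student distribution

print(kl_divergence(teacher, same))  # 0.0 — identical distributions lose no information
print(kl_divergence(teacher, diff))  # > 0 — the distributions differ
```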

- First, we can train the teacher model as we typically would.
- Next, we can instruct the student model to mimic the probability distribution of the class predictions of the teacher model.

Let’s see how we can practically use response-based knowledge distillation using PyTorch.

More specifically, we shall train a slightly complex neural network on the MNIST dataset. Then, we will build a simpler neural network using the response-based knowledge distillation technique.

First, we import the required packages from PyTorch:
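The original import cell isn't shown, so here is a plausible reconstruction (the exact contents are an assumption; `torchvision` would additionally be needed for the MNIST data):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader

# For loading MNIST, the article would additionally rely on:
# from torchvision import datasets, transforms
```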

Next, we load the MNIST dataset (train and test) and create their respective PyTorch dataloaders.

Now, we shall define a simple CNN-based neural network architecture. This is demonstrated below:
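The exact architecture isn't shown, so the CNN below is an illustrative stand-in sized for 28×28 grayscale MNIST images:

```python
import torch
import torch.nn as nn

class TeacherNet(nn.Module):
    """A small CNN: two conv blocks followed by a linear classifier."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),   # 28x28 -> 14x14
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),   # 14x14 -> 7x7
        )
        self.classifier = nn.Linear(64 * 7 * 7, 10)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

# Sanity check: a batch of 4 MNIST-shaped images -> 4 x 10 logits
logits = TeacherNet()(torch.randn(4, 1, 28, 28))
print(logits.shape)  # torch.Size([4, 10])
```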

Moving on, we shall initialize the teacher model and define the loss function to train it — the `CrossEntropyLoss`.

Now, we will train the teacher model.

With this, we are done with the Teacher model.

Next, we must train the Student Model.

We defined the Teacher model as a CNN-based neural network architecture. Let’s define the Student model as a simple feed-forward neural network without any CNN layers:
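A sketch of what the stripped cells may have contained: a plain feed-forward student network, followed by the response-based distillation loss described next (the class and function names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentNet(nn.Module):
    """A simple feed-forward network with no convolutional layers."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128), nn.ReLU(),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        return self.net(x)

def distillation_loss(student_logits, teacher_logits):
    """KL divergence between the two softmax class distributions."""
    student_log_probs = F.log_softmax(student_logits, dim=1)
    teacher_probs = F.softmax(teacher_logits, dim=1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Identical logits should give (near-)zero divergence
t = torch.randn(4, 10)
print(distillation_loss(t, t).item())
```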

The above method accepts two parameters:

- The output of the student model (`student_logits`).
- The output of the teacher model (`teacher_logits`).

We convert both outputs to probabilities using the softmax function.

Finally, we find the KL divergence between them and return it.

Moving on, we shall initialize the student model and define the optimizer as we did before.

Finally, it’s time to train the Student model.
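A condensed sketch of the distillation training loop. Synthetic random batches stand in for the MNIST dataloader, tiny linear models stand in for the real teacher and student, and the `alpha` weighting between the hard-label loss and the distillation loss is a common (but here assumed) choice:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

teacher = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # stands in for the trained CNN
student = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()
alpha = 0.5  # weight between hard-label loss and distillation loss

teacher.eval()
for step in range(20):  # synthetic batches stand in for the MNIST dataloader
    x = torch.randn(32, 1, 28, 28)
    y = torch.randint(0, 10, (32,))

    with torch.no_grad():
        teacher_logits = teacher(x)
    student_logits = student(x)

    kd = F.kl_div(F.log_softmax(student_logits, dim=1),
                  F.softmax(teacher_logits, dim=1), reduction="batchmean")
    loss = alpha * ce(student_logits, y) + (1 - alpha) * kd

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(loss.item())  # finite, non-negative scalar once training has run
```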

To recap, the teacher model was a CNN-based neural network architecture. The student model, however, was a simple feed-forward neural network.

The following visual compares the performance of the teacher and the student model:

Before we understand what OOP is, let’s go over the backstory: what led us to formulate “OOP”?

Why did programmers think it was essential to have such a thing? What were the pain points of the traditional way of programming?

Traditional programming (or procedural programming) focuses on breaking a problem into steps or procedures executed one after the other.

While this approach was suitable for small-scale projects or experimentation, it eventually became problematic when projects transformed into large, complex systems.

In other words, it became difficult to manage, understand, and maintain the code as it grew in size and complexity.

One major challenge with traditional programming was the lack of structure and organization.

Before OOP, programs were often written in a linear fashion. This made it difficult to manage large amounts of code and understand the relationships between different parts of the program.

As a result...👇

...traditional programming resulted in code that was harder to maintain and debug. Consequently, it was more prone to bugs and errors.

This resulted in longer development cycles and reduced software quality.

The list of challenges with traditional programming is endless.

To overcome these challenges, some computer scientists envisioned a new way of programming that would allow developers to create more flexible and adaptable software programs.

The thought process behind the new programming paradigm was inspired by real-world entities. They have certain properties and behaviors, and can interact with one another.

Thus, scientists believed that programming could be modeled in a similar fashion.

And this is how OOP was born in the late 70s.

As the name suggests, Object-Oriented Programming (OOP) is a programming paradigm/technique based on the concepts of “objects.” That is why the name “object-oriented.”

This is in contrast to traditional programming, where procedures are executed in sequence.

The core idea of OOP is to model real-world objects and their interactions in a program.

Thus, each object is treated as a separate entity with different values for the properties.

For example, you can model a `car` object with properties such as `color`, `speed`, and `model`, and methods such as `start`, `stop`, and `accelerate`.

In response to the challenges posed by traditional programming, OOP was designed as a new programming paradigm.

Its primary aim was to improve code organization, reusability, maintainability, and scalability.

OOP introduced objects and classes (discussed below), which made it easier to organize code into manageable and reusable components.

Being organized, it was easier to understand the structure of a program, shorten the development lifecycle, and eliminate bugs and errors.

In addition to this, OOP also introduced some crucial concepts, such as inheritance. This provided a way to create relationships between classes.

Moreover, OOP allowed programmers to easily extend existing components and build complex systems. This drastically reduced the amount of code needed to be written again from scratch.

Next, let’s discuss some basic terminologies around classes and how they are defined in Python.

OOP is defined around some core terminologies and concepts. Let’s understand them below:

A class is a blueprint (or template) for instantiating objects. Recall the `Car` example discussed above. In that context, we can call `Car` a class.

Objects created from the same class will have the same properties and behaviors. However, the values for the properties may differ.

For instance, all car objects created from the `Car` class will have the same property names, such as `cost`, `color`, `speed`, `model`, etc. However, the values for these properties may differ, as shown below:

In Python, we define a class using the `class` keyword, followed by its name:
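For instance, an (empty for now) `Car` class:

```python
# The `class` keyword followed by the class name defines a new class
class Car:
    pass
```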

Inside a class, we define the attributes and methods that its object will have.

The variables defined within a class that store information about an object are called attributes. They can also be thought of as an object’s properties.

Attributes are of two types:

**2.1) Instance-level attributes**

As the name suggests, these attributes are unique to each instance of a class. In other words, every time we create an object, each object gets its copy of instance-level attributes.

For example, in the `Car` example above, the variables `cost`, `color`, `speed`, and `model` are instance-level attributes.

Also, these attributes usually have different values for each instance.

In Python, we assign the instance-level attributes in the `__init__` method of a class. As the name suggests, “init” lets us initialize an object.

As `__init__` is like any other Python function, we can also pass a bunch of parameters to the class method.

The first parameter in a Python class method is always the `self` keyword. It is a special variable used as a reference to the calling object. The `self` keyword is followed by other parameters, as shown below:
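A sketch of how that looks for the `Car` class (the specific attribute names follow the running example):

```python
class Car:
    def __init__(self, color, speed, model):
        # `self` refers to the object being created; assigning through it
        # creates instance-level attributes
        self.color = color
        self.speed = speed
        self.model = model
```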

The parameters specified in the `__init__` method are assigned to the corresponding instance-level attributes.

**2.2) Class-level attributes**

In contrast to the instance-level attributes, which are unique to each object, these attributes are shared by all the instances of a class.

In other words, they are associated with the class itself, but not with any specific instance of the class. Therefore, they are defined outside the `__init__` method.

For example, you can have a class-level attribute for the `number_of_wheels` of a car. This attribute would have the same value for all objects created from the car class, regardless of the `color`, `cost`, or any other instance-level attributes.

In Python, class-level attributes are defined as follows:

To access the class-level attributes, we can do the following:
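Putting both together for the `number_of_wheels` example:

```python
class Car:
    # Class-level attribute: defined outside __init__, shared by all objects
    number_of_wheels = 4

    def __init__(self, color):
        self.color = color  # instance-level attribute

# Access via the class itself, or through any instance
print(Car.number_of_wheels)          # 4
print(Car("red").number_of_wheels)   # 4
```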

The functions defined within the scope of a class are called methods.

They operate on the attributes of an object and are typically used to manipulate or retrieve the values stored in an object’s attributes.

An independent definition created anywhere in the program is called a “function”. However, a function is called a “method” when defined inside a class.

Like the `__init__` method, class methods are also defined within the scope of the class. Also, the first parameter is always the `self` keyword.

For instance, let’s define a `change_speed` method.
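A sketch of the `change_speed` method inside the `Car` class (restating the class so the snippet is self-contained):

```python
class Car:
    def __init__(self, color, speed, model):
        self.color = color
        self.speed = speed
        self.model = model

    def change_speed(self, new_speed):
        # `self.speed` references the instance-level attribute
        self.speed = new_speed
```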

This method accepts the `new_speed` and assigns it to the instance-level attribute `speed`. Note that we reference any instance-level attribute through the `self` keyword, using the dot notation.

Note that a class method may or may not receive any parameters other than `self`. For instance, if we were to define a `stop_car` method, it is not necessary to pass the new speed as a parameter.

Whenever we create an instance of a class, it is called an object.

Considering the `Car` class defined above, we can create a new object as follows:
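For example (the class is restated so the snippet runs on its own; the argument values are illustrative):

```python
class Car:
    def __init__(self, color, speed, model):
        self.color = color
        self.speed = speed
        self.model = model

# Instantiating the class: this invokes Car.__init__
my_car = Car("red", 10, "Model-X")
```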

The above statement invokes the `__init__` method defined in the class. As a result, the arguments get assigned to the respective instance-level attributes defined in the class.

Also, the object can access the attributes and methods defined in the class. These can be accessed using the dot notation, as shown below:
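For instance (restating the class for self-containment):

```python
class Car:
    def __init__(self, color, speed, model):
        self.color = color
        self.speed = speed
        self.model = model

    def change_speed(self, new_speed):
        self.speed = new_speed

my_car = Car("red", 10, "Model-X")

print(my_car.color)      # access an attribute -> red
my_car.change_speed(20)  # call a method
print(my_car.speed)      # 20
```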

When we call a class method, we don’t pass any value for the `self` parameter. For instance, even though the definition of `change_speed()` has two parameters, `self` and `new_speed`, the `self` parameter is automatically supplied by Python from the calling object.

Therefore, while calling the method, we only passed one value (`20`), which corresponds to the `new_speed`.

Whenever we define a new class, it creates a new datatype. For instance, if we use Python's `type()` function and pass the object `my_car`, we get the following:
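For instance (with a minimal `Car` class, when run as a script):

```python
class Car:
    pass

my_car = Car()
print(type(my_car))  # <class '__main__.Car'>
```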

This indicates that the `my_car` object is of type `Car`.

Thus, in addition to making your code more organized, readable and manageable, classes also offer a mechanism to create a user-defined datatype.

Next, let’s look at some of the fundamental magic methods in Python OOP and how they are used.

In Python, magic methods are methods that have double underscores at the beginning and end of their names, such as `__init__`, which we discussed above.

Do you know one of the biggest hurdles data science and machine learning teams face?

It is transitioning their data-driven pipeline from Jupyter Notebooks to an executable, reproducible, error-free, and organized pipeline.

And this is not something data scientists are particularly fond of doing.

Yet, this is an immensely critical skill that many overlook.

Machine learning deserves the rigor of any software engineering field. Training codes should always be reusable, modular, scalable, testable, maintainable, and well-documented.

To help you develop that critical skill, I'm excited to bring you a special guest post by Damien Benveniste. He is the author of The AiEdge newsletter and was a Machine Learning Tech Lead at Meta.

In today’s machine learning deep dive, he will provide a detailed guide on structuring code for machine learning development, one of the most critical yet overlooked skills by many data scientists.

I personally learned a lot from this one and I am sure you will learn a lot too.

Let’s begin!

I have always believed that machine learning deserves the rigor of any software engineering field. Training codes should be reusable, modular, scalable, testable, maintainable, and well-documented.

Today, I want to show you my template to develop quality code for machine learning development.

More specifically, we will look at:

- What does coding mean?
- Designing:
- System design
- Deployment process
- Class diagram

- The code structure:
- Directory structure
- Setting up the virtual environment
- The code skeleton
- The applications
- Implementing the training pipeline
- Saving the model binary

- Improving the code readability:
- Docstrings
- Type hinting

- Packaging the project

I often see many Data Scientists or Machine Learning Engineers developing in Jupyter notebooks, copy-pasting their codes from one place to another, which gives me nightmares!

When running ML experiments, Jupyter is prone to human errors as different cells can be run in different orders. Yet, ideally, you should be able to capture all the configurations of an experiment to ensure reproducibility.

No doubt, Jupyter can be used to call a training package or an API and manually orchestrate experiments, but fully developing in Jupyter is an extremely risky practice.

For instance, when training a model, you should ensure the data is passed through the exact feature processing pipelines at serving (inference) time. This means using the same classes, methods, and identical versions of packages and hardware (GPU vs. CPU).

Personally, I prefer prototyping in Jupyter but developing in Pycharm or VSCode.

When programming, focus on the following aspects:

**Reusability:**

- It is the capacity to reuse code in another context or project without significant modifications.
- Code reusability can be achieved in several ways, such as through libraries, frameworks, modules, and object-oriented programming techniques.
- In addition, good documentation and clear code organization also facilitate code reuse by making it easier for other developers to understand and use the code.

**Modularity:**

- It is the practice of breaking down a software system into smaller, independent modules or components that can be developed, tested, and maintained separately.

**Scalability:**

- It refers to the ability of a software development codebase to accommodate the growth and evolution of a software system over time. In other words, it refers to the ability of the codebase to adapt to changing requirements, features, and functionalities while maintaining its overall structure, quality, and performance.
- To achieve codebase scalability, it is important to establish clear coding standards and practices from the outset, such as using version control, code review, and continuous integration and deployment.
- In addition, it is important to prioritize code maintainability and readability, as well as the use of well-documented code and clear naming conventions.

**Testability:**

- It refers to the ease with which software code can be tested to ensure that it meets the requirements and specifications of the software system.
- It can be achieved by designing code with testing in mind rather than treating testing as an afterthought. This can involve writing code that is modular, well-organized, and easy to understand and maintain, as well as using tools and techniques that support automated testing and continuous integration.

**Maintainability:**

- It refers to the ease with which software code can be modified, updated, and extended over time.

**Documentation:**

- It provides a means for developers, users, and other stakeholders to understand how the software system works, its features, and how to interact with it.

In Machine Learning, like any engineering domain, no line of code should be written until a proper design is established.

Having a design means that we can translate a business problem into a machine learning solution, **provided ML is indeed the right solution to the problem!**

For simplicity, let’s assume we want to build a mobile application where a user needs machine learning predictions displayed on the screen — personalized product recommendations, for instance.

The process workflow may appear as follows:

- The mobile application requests personalized predictions from the backend server.
- The backend server fetches predictions from a database.
- We figured that daily batch predictions were the most appropriate setup for now, and the machine learning service updates the predictions daily.

This process is depicted in the image below:

Before we can understand how to develop our model, we need to understand how we will deploy it. Let’s assume that, for our purposes, an inference application will be containerized in a Docker container.

The container image can be stored in a container registry such as Amazon ECR (Elastic Container Registry) or Docker Hub. We can then have an orchestration system such as Airflow spin up the inference service, pull the image from the registry, and run the inference application.

Now that we know what we need to build and how it will be deployed, how we need to structure our codebase is becoming much clearer.

More specifically, we shall build two applications:

- An inference application.
- A training application.

To minimize potential human errors, it is imperative that the modules used at training time are the same as the ones used at inference time.

Let’s look at the following class diagram:

**The application layer:**

- This part of the code captures the application’s logic. Think about these modules as “buttons” that start the inference or training processes.
- We will have a `run()` function for each of those applications that will serve as the handle for the Docker image to start those individual processes.
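As a quick sketch (the function names here are hypothetical, chosen only to mirror the two applications described above), each application exposes a `run()` entry point that the Docker image can invoke:

```python
# Hypothetical entry points for the two applications.
# The Docker image's CMD would invoke one of these run() functions.

def run_training():
    """Entry point for the training application."""
    print("Starting training pipeline...")
    # 1) pull data, 2) fit the pipeline, 3) push model binaries


def run_inference():
    """Entry point for the inference application."""
    print("Starting batch inference...")
    # 1) pull data and model binaries, 2) predict, 3) push predictions


if __name__ == "__main__":
    run_inference()
```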

**The data layer:**

- This is the abstraction layer that moves data in and out of the applications. I am calling it the “data” layer, but I am including anything that needs to go into the outside world, like the data, the model binaries, the data transformer, the training metadata, etc.
- In this batch use case, we are just going to need a function that brings the data into the applications, `get_data()`, and another that puts predictions back into the database, `put_data()`.
- The `DataConnector` class moves data around.
- The `ObjectConnector` is the actor responsible for transferring model binaries and data transformation pipelines using `get_object()` and `put_object()`.
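A minimal sketch of these two classes (the interfaces come from the class diagram; the storage backends are placeholders, so the method bodies are stubs):

```python
# Hypothetical sketch of the data-layer classes from the class diagram.

class DataConnector:
    """Moves data in and out of the applications."""

    def get_data(self, query):
        """Fetch input data (e.g., from a database)."""
        raise NotImplementedError

    def put_data(self, predictions):
        """Write predictions back (e.g., to a database)."""
        raise NotImplementedError


class ObjectConnector:
    """Transfers model binaries and data-transformation pipelines."""

    def get_object(self, object_name):
        """Load a serialized object (e.g., from object storage)."""
        raise NotImplementedError

    def put_object(self, obj, object_name):
        """Persist a serialized object."""
        raise NotImplementedError
```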

**The machine learning layer:**

This is the module where all the different machine learning components will live. The three components of model training are:

- *Learning the parameters of the model*: the `Model` class will take care of that with the `fit()` method. For inference, we use the `predict()` method.
- *Learning the feature transformations*: We may need to normalize features, perform Box-Cox transformations, one-hot encode, etc. The `DataProcessor` will take care of that with the `fit()` and `transform()` methods.
- *Learning the hyperparameters of the model and data pipeline*: the `CrossValidator` will handle this task with its `fit()` method.

The `TrainingPipeline` will handle the logic between the different components.

Now that we have a class diagram, we must map it into actual code. Let’s call the project `machine_learning_service`.

Of course, there are many ways to do it, but we will organize the project as follows:

- **The “docs” folder:** for the documents.
- **The “src” folder:** for the source code (or the actual codebase).
- **The “tests” folder:** for the unit tests.

Going ahead, we assume we will Dockerize this project at some point. Thus, controlling the Python version and packages we use locally is crucial.

To do that, we will create a virtual environment called `env` using venv, a module in the Python standard library that allows us to create virtual environments effortlessly.

First, within the project folder, we run the following command to create a virtual environment:

Next, we activate it as follows:
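Assuming the standard library’s venv module, the two commands (creation and activation) would typically be:

```shell
# Create a virtual environment named "env" in the project folder
python3 -m venv env

# Activate it (on Linux/macOS; on Windows, use env\Scripts\activate)
. env/bin/activate
```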

Once done, we should see the following directory structure:

In the current directory, let’s check which Python binary is running so that we can confirm we are using the Python binaries of the virtual environment. We do this using the `which` command, as demonstrated below:

Next, let’s make sure that the Python version is Python 3:

Okay, we are good to go!

Within the source folder, let’s create the different modules we have in the class diagram:

For now, let’s have empty classes.

- The `Model` class: This will be responsible for training the model on new data and predicting on unseen data:
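A minimal sketch consistent with the class diagram (the method bodies are intentionally left as stubs for now) could look like:

```python
# Sketch of the Model class from the class diagram.
# Method bodies are stubs; a concrete estimator would be wired in later.

class Model:
    """Trains on new data and predicts on unseen data."""

    def __init__(self, model_params=None):
        self.model_params = model_params or {}

    def fit(self, X, y):
        """Learn the model parameters from the training data."""
        raise NotImplementedError

    def predict(self, X):
        """Infer predictions for unseen data."""
        raise NotImplementedError
```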

- The `DataProcessor` class: This handles the processing needed before the data is fed to the ML model, such as normalization, transformation, etc.
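Again as a stub-only sketch (the interface follows the class diagram; the actual transformations come later):

```python
# Sketch of the DataProcessor class from the class diagram.

class DataProcessor:
    """Learns and applies feature transformations (normalization, encoding, ...)."""

    def fit(self, X):
        """Learn the transformation parameters (e.g., means, scales) from X."""
        raise NotImplementedError

    def transform(self, X):
        """Apply the learned transformations to X."""
        raise NotImplementedError
```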

In a recent article, we devised the entire principal component analysis (PCA) algorithm from scratch.

We saw how projecting the data using the eigenvectors of the covariance matrix naturally emerged from the PCA optimization step.

Moving on, we discussed some of its significant limitations.

**Let’s look at them again!**

Many use PCA as a data visualization technique. This is done by projecting the given data into two dimensions and visualizing it.

While this may appear like a fair thing to do, there’s a big problem here that often gets overlooked.

**As we discussed in the PCA article**, after applying PCA, each new feature captures a fraction of the original variance.

**This means that two-dimensional visualization will only be helpful if the first two principal components collectively capture most of the original data variance, as shown below:**

If not, the two-dimensional visualization will be highly misleading, because the first two components fail to capture a sufficiently large fraction of the original variance. This is depicted below:

Thus, using PCA for 2D visualizations is only recommended if the cumulative explained variance plot suggests so. If not, one should refrain from using PCA for 2D visualization.
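As an illustration (a synthetic dataset and scikit-learn’s PCA, so treat this as a sketch rather than the article’s own code), we can read the cumulative explained variance off `explained_variance_ratio_` before trusting a 2-D plot:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: two high-variance features and one near-constant feature.
X = np.column_stack([
    rng.normal(0, 10.0, 500),
    rng.normal(0, 5.0, 500),
    rng.normal(0, 0.1, 500),
])

pca = PCA(n_components=3).fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Only trust a 2-D PCA visualization if the first two components
# collectively capture most of the variance.
print("cumulative explained variance:", cum_var)
safe_for_2d = cum_var[1] > 0.95
```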

PCA has two main steps:

- Find the eigenvectors and eigenvalues of the covariance matrix.
- Use the eigenvectors to project the data to another space.

👉

Why eigenvectors, you might be wondering? We discussed the origin of eigenvectors in detail in the PCA article. It is recommended to read that article before reading this article.

Projecting the data using eigenvectors creates uncorrelated features.

Nonetheless, the new features created by PCA $(x_0^{'}, x_1^{'}, x_2^{'})$ are always a linear combination of the original features $(x_0, x_1, x_2)$.

This is depicted below:

As shown above, every new feature in $X_{projected}$ is a linear combination of the features in $X$.

We can also prove this experimentally.

As depicted above, we have a linearly inseparable dataset. Next, we apply PCA and reduce the dimensionality to $1$. The dataset remains linearly inseparable.

On the flip side, if we consider a linearly separable dataset, apply PCA, and reduce the dimensions, we notice that the dataset remains linearly separable:

This proves that PCA is a linear dimensionality reduction technique.
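The experiment can be reproduced with a short sketch (using scikit-learn’s `make_circles` and `make_blobs` as stand-ins for the datasets in the figures): a linear classifier stays near chance level on the 1-D projection of concentric circles, but keeps near-perfect accuracy on well-separated blobs.

```python
from sklearn.datasets import make_circles, make_blobs
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def accuracy_after_pca(X, y):
    """Reduce to 1-D with PCA, then score a linear classifier."""
    X1 = PCA(n_components=1).fit_transform(X)
    clf = LogisticRegression().fit(X1, y)
    return clf.score(X1, y)

# Linearly inseparable data (concentric circles) stays inseparable after PCA.
X_c, y_c = make_circles(n_samples=400, factor=0.5, noise=0.03, random_state=0)
print("circles:", accuracy_after_pca(X_c, y_c))  # near chance level

# Linearly separable data (well-separated blobs) stays separable after PCA.
X_b, y_b = make_blobs(n_samples=400, centers=[[-5, 0], [5, 0]],
                      cluster_std=1.0, random_state=0)
print("blobs:  ", accuracy_after_pca(X_b, y_b))  # near perfect
```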

However, not all real-world datasets are linear. In such cases, PCA will underperform.

As discussed above, PCA’s primary objective is to capture the overall (or global) data variance.

In other words, PCA aims to find the orthogonal axes along which the **entire dataset** exhibits the most variability.

Thus, during this process, it does not pay much attention to the local relationships between data points.

**It inherently assumes that the global patterns are sufficient to represent the overall data variance.**

This is demonstrated below:

Because it primarily emphasizes the global structure, PCA is not ideal for visualizing complex datasets where the underlying structure might rely on local relationships or pairwise similarities.

In cases where the data is nonlinear and contains intricate clusters or groups, PCA can fall short of preserving these finer details, as shown in another illustration below:

As depicted above, data points from different clusters may overlap when projected onto the principal components.

This leads to a loss of information about intricate relationships within and between clusters.

So far, we understand what’s specifically lacking in PCA.

While the overall approach is indeed promising, its limitations make it impractical for many real-world datasets.

**t-distributed stochastic neighbor embedding (t-SNE)** is a powerful dimensionality reduction technique **mainly used to visualize high-dimensional datasets** by projecting them into a lower-dimensional space (typically 2-D).

As we will see shortly, t-SNE addresses each of the above-mentioned limitations of PCA:

- It is well-suited for visualization.
- It works well for linearly inseparable datasets.
- It focuses beyond just capturing the global data relationships.

t-SNE is an improvement to the **Stochastic Neighbor Embedding** (SNE) algorithm. It is observed that in comparison to SNE, t-SNE is much easier to optimize.

💡

t-SNE and SNE are two different techniques, both proposed by one common author — Geoffrey Hinton.

So, before getting into the technical details of t-SNE, let’s spend some time understanding the SNE algorithm instead.

To begin, we are given a high-dimensional dataset, which is difficult to visualize.

That is, the dimensionality is higher than $3$ (typically, it’s much higher than $3$).

The objective is to project it to a lower dimension (say, $2$ or $3$), such that the lower-dimensional representation preserves as much of the local and global structure in the original dataset as possible.

Let’s understand!

Imagine this is our high-dimensional dataset:

💡

Of course, the above dataset is not a high-dimensional dataset. But for the sake of simplicity, let’s assume that it is.

Local structure, as the name suggests, refers to the arrangement of data points that are close to each other in the high-dimensional space.

Thus, preserving the local structure would mean that:

- Red points should stay closer to other red points.
- Blue points should stay closer to other blue points.
- Green points should stay closer to other green points.

So, is preserving the local structure sufficient?

**Absolutely not!**

If we were to focus solely on preserving the local structure, it may lead to a situation where blue points indeed stay closer to each other, but they overlap with the red points, as shown below:

This is not desirable.

Instead, we also want the low-dimensional projections to capture the global structure.

Thus, preserving the global structure would mean that:

- The red cluster is well separated from the other cluster.
- The blue cluster is well separated from the other cluster.
- The green cluster is well separated from the other cluster.

To summarize:

- Preserving the local structure means maintaining the relationships among nearby data points within each cluster.
- Preserving the global structure involves maintaining the broader trends and relationships that apply across all clusters.

Let’s stick to understanding this as visually as possible.

Consider the above high-dimensional dataset again.

Euclidean distance is a good measure of whether two points are close to each other.

For instance, in the figure above, it is easy to see that `Point A` and `Point B` are close to each other, but `Point C` is relatively distant from `Point A`.

Thus, the first step of the **SNE algorithm** is to convert these high-dimensional Euclidean distances between data points into conditional probabilities that represent similarities.

For better understanding, consider a specific data point in the above dataset:

Here, points that are closer to the marked point will have smaller Euclidean distances, while other points that are far away will have larger Euclidean distances.

Thus, as mentioned above, for every data point $(i)$, the **SNE algorithm** first converts these high-dimensional Euclidean distances into conditional probabilities $p_{j|i}$.

Here, $p_{j|i}$ represents the conditional probability that a point $x_i$ will pick another point $x_j$ as its neighbor.

This conditional probability is assumed to be proportional to the probability density of a Gaussian centered at $x_i$.

This makes intuitive sense as well. To elaborate further, consider a Gaussian centered at $x_i$.

It is evident from the above Gaussian distribution centered at $x_i$ that:

- For points near $x_i$, $p_{j|i}$ will be relatively high.
- For points far from $x_i$, $p_{j|i}$ will be small.

So, to summarize, for a data point $x_i$, we convert its Euclidean distances to all other points $x_j$ into conditional probabilities $p_{j|i}$.

This conditional probability is assumed to be proportional to the probability density of a Gaussian centered at $x_i$.

Also, as we are **only interested in modeling pairwise similarities**, we set the value of $p_{i|i}=0$. In other words, a point cannot be its own neighbor.

💡

A Gaussian for a point $x_i$ will be parameterized by two parameters: $(\mu_i, \sigma_i^2)$. As the x-axis of the Gaussian measures Euclidean distance, the mean $(\mu_i)$ is zero. What about $\sigma_i^2$? We’ll get back to it shortly. Until then, just assume that we have somehow figured out the ideal value of $\sigma_i^2$.

Based on what we have discussed so far, our conditional probabilities $p_{j|i}$ may be calculated using a Gaussian probability density function as follows:
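In symbols (reconstructing the standard SNE expression, with $||x_i - x_j||$ denoting the Euclidean distance), the unnormalized similarity is:

$$p_{j|i} \propto \exp\left(-\frac{||x_i - x_j||^2}{2\sigma_i^2}\right)$$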

**However, there’s a problem here.**

Yet again, let’s understand this visually.

Earlier, we considered the data points in the same cluster to be closely packed.

Thus, the resultant conditional probabilities were also high:

However, the data points of a cluster might be far from other clusters. Yet, they themselves can be a bit more scattered, as depicted below:

If we were to determine the resultant conditional probabilities $p_{j|i}$ for the marked data point $x_i$, we will get:

In this case, even though the data points belong to the same cluster, the conditional probabilities are much smaller than what we had earlier.

We need to fix this, and a common way to do this is by normalizing the individual conditional probability between $(x_i, x_j) \rightarrow p_{j|i}$ by the sum of all conditional probabilities $p_{k|i}$.

Thus, we can now estimate the final conditional probability $p_{j|i}$ as follows:
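For reference, the standard normalized form (as in the original SNE paper) is:

$$p_{j|i} = \frac{\exp\left(-||x_i - x_j||^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-||x_i - x_k||^2 / 2\sigma_i^2\right)}$$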

- The numerator is the conditional probability between $(x_i, x_j) \rightarrow p_{j|i}$.
- Each of the terms inside the summation in the denominator is the conditional probability between $(x_i, x_k) \rightarrow p_{k|i}$.

To reiterate, we are **only interested in modeling pairwise similarities**. Thus, we set the value of $p_{i|i}=0$. In other words, a point cannot be its own neighbor.

Recall the objective again.

We intend to project the given high-dimensional dataset to lower dimensions (say, $2$ or $3$).

Thus, for every data point $x_i \in \mathbb{R}^n$, we define its counterpart $y_i \in \mathbb{R}^2$:

💡

$y_i$ does not necessarily have to be in two dimensions. Here, we have defined $y_i \in \mathbb{R}^2$ just to emphasize that $n$ is larger than $2$.

Next, we use a similar notion to compute the pairwise conditional probability using a Gaussian, which we denote as $q_{j|i}$ in the low-dimensional space.

Furthermore, to simplify calculations, we set the variance of the Gaussian used for the conditional probabilities $q_{j|i}$ to $\frac{1}{2}$ (i.e., $\sigma = \frac{1}{\sqrt{2}}$).

As a result, we can denote $q_{j|i}$ as follows:
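With that choice of variance, the standard expression is:

$$q_{j|i} = \frac{\exp\left(-||y_i - y_j||^2\right)}{\sum_{k \neq i} \exp\left(-||y_i - y_k||^2\right)}$$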

Yet again, as we are **only interested in modeling pairwise similarities**, we set the value of $q_{i|i}=0$.

The objective is to have conditional probabilities in the low-dimensional space $(q_{j|i})$ identical to the conditional probabilities in the high-dimensional space $(p_{j|i})$.

Thus, if the projected data points $y_i$ and $y_j$ are to correctly model the similarity between the high-dimensional data points $x_i$ and $x_j$, the conditional probabilities $q_{j|i}$ and $p_{j|i}$ must be nearly equal.

**This hints that we must minimize the difference between the two conditional probabilities — $q_{j|i}$ and $p_{j|i}$.**

One of the most common and popular ways to quantify the difference between two probability distributions is KL Divergence.

As we have covered this previously, we won’t get into much detail:

...but here’s a quick overview of what it does.

The core idea behind KL divergence is to assess how much information is lost when one distribution is used to approximate another.

Thus, the more information is lost, the higher the KL divergence, and the greater the dissimilarity between the two distributions.

KL divergence between two probability distributions $P(x)$ and $Q(x)$ is calculated as follows:
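The standard definition is:

$$D_{KL}(P \,||\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$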

The formula for KL divergence can be read as follows:

The KL divergence $D_{KL}(P \,||\, Q)$ between two probability distributions $P$ and $Q$ is calculated by summing the above quantity over all possible outcomes $x$. Here:

- $P(x)$ represents the probability of outcome $x$ occurring according to distribution $P$.
- $Q(x)$ represents the probability of the same outcome occurring according to distribution $Q$.

It measures how much information is lost when using distribution $Q$ to approximate distribution $P$.

Imagine this. Say $P$ and $Q$ were identical. This should result in zero loss of information. Let’s verify this from the formula above.

If the probability distributions $P$ and $Q$ are identical, it means that for every $x$, $P(x) = Q(x)$. Thus,

Simplifying, we get:
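Substituting $Q(x) = P(x)$ into the definition:

$$D_{KL}(P \,||\, Q) = \sum_x P(x) \log \frac{P(x)}{P(x)} = \sum_x P(x) \log 1 = 0$$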

Hence proved.

Let’s look at another illustration below:

Here, we have observed distribution (the histogram in blue). The objective is to quantify whether it is more like a Gaussian distribution (left) or a Gamma distribution (right).

By measuring the KL divergence, we notice that the observed distribution resembles a Gamma distribution (right).

This is because the KL divergence between:

- The observed distribution and Gamma distribution is low.
- The observed distribution and Gaussian distribution is high.

Based on this notion, KL divergence between two conditional probability distributions – $q_{j|i}$ and $p_{j|i}$ can potentially be a loss function in our problem.

The goal will be to find the optimal projected points $y_i$ (these act like model parameters), such that the KL divergence between the two conditional probability distributions, $q_{j|i}$ and $p_{j|i}$, is minimized.

Thus, once we have formulated a KL divergence-based loss function, we can minimize it with respect to the points $y_i$. Further, we can use gradient descent to update the points $y_i$.

Let’s do it.

The cost function $C$ of the **SNE algorithm** is given by:
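Written out (the standard SNE objective):

$$C = \sum_i D_{KL}(P_i \,||\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$$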

Here:

- $P_i$ denotes the conditional probability distribution over all other data points given datapoint $x_i$.
- $Q_i$ denotes the conditional probability distribution over all other map points given map point $y_i$.

For every pair of points $(i, j)$, we have the following notations:

- $(x_i, x_j) \rightarrow$ the high-dimensional data points $(x_i, x_j)$.
- $(y_i, y_j) \rightarrow$ the low-dimensional mapping of $(x_i, x_j)$.
- $p_{j|i} \rightarrow$ the probability density of the Euclidean distance to $x_j$ when $x_i$ is the center point.
- $q_{j|i} \rightarrow$ the probability density of the Euclidean distance to $y_j$ when $y_i$ is the center point.

In a gist, the rightmost formulation denotes the pairwise similarities we intend to capture. That is why we get two summations.

We can further simplify the above cost function $C$ as follows:

The first term will always be a constant. For a given dataset $X$, the pairwise probability densities can never change.

Also, this term is independent of $q_{j|i}$ (or $y_i \rightarrow$ the model parameters). Thus, we can safely ignore the first term.

Now, the loss function looks more like a cross-entropy loss function with an additive constant $a$.

Finding the gradient with respect to $y_i$, we get the following:
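As given in the original SNE paper, the gradient takes the form:

$$\frac{\partial C}{\partial y_i} = 2 \sum_j \left(p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}\right) \left(y_i - y_j\right)$$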

👉

Although the above gradient expression is quite simple, it is a bit complicated to derive. For those who are interested, I have provided a derivation at the end of the article.

Assuming $\gamma$ to be the set of all mapped points in the low-dimensional space:

We can update the parameters as follows:
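A plain gradient-descent update with learning rate $\eta$ (the original paper additionally includes a momentum term) looks like:

$$y_i \leftarrow y_i - \eta \, \frac{\partial C}{\partial y_i}$$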

We are yet to discuss the variance computation of the probability density function $p_{j|i}$ we defined for the high-dimensional data points:

$\sigma_i^2$ denotes the variance of the Gaussian centered over each high-dimensional datapoint $x_i$.

It is unlikely that we can specify a single value of $\sigma_i^2$ for all data points. This is evident from the figure below:

The density of the data will vary across all data points.

Thus, in dense regions, we need a small value of $\sigma_i^2$, but in sparse regions, we need a larger value of $\sigma_i^2$.

While finding the absolute best value of $\sigma_i^2$ for every data point $x_i$ appears to be a bit difficult, we can still try.

This is where we introduce a **user-specified hyperparameter** called `perplexity`.

**Dimensionality reduction** is crucial to gain insight into the underlying structure of high-dimensional datasets.

One technique that typically stands out in this respect is the **Principal Component Analysis (PCA)**.

In a gist, the core objective of PCA is to transform the data $X$ into another space such that new features are uncorrelated and preserve the variance of the original data.

While PCA is widely popular, I have noticed that most folks struggle to design its true end-to-end formulation from scratch.

Many try to explain PCA by relating it to the idea of eigenvectors and eigenvalues, which is true — PCA does perform transformations using eigenvectors and eigenvalues.

But why?

**There are two questions to ask here:**

- **How can we be sure that the data projection involving eigenvectors and eigenvalues is the most natural solution for PCA?**
- **How can we be sure that this transformation preserves the data variance?**

In other words, where did this whole notion of eigenvectors and eigenvalues originate from in PCA?

I have always believed that it is not only important to be aware of techniques, but it is also crucial to know how these techniques are formulated end-to-end.

And the best way to truly understand any algorithm or methodology is by manually building it from scratch.

It is like replicating the exact thought process when someone first designed the algorithm by approaching the problem logically, with only mathematical tools.

Thus, in this dimensionality-reduction article series, we’ll dive deep into one of the most common techniques: **Principal Component Analysis (PCA)**.

We’ll look at:

- The intuition and the motivation behind dimensionality reduction.
- What are vector projections, and how do they alter the mean and variance of the data?
- **What is the optimization step of PCA?**
- What are Lagrange multipliers, and how are they used in PCA optimization?
- What is the final solution obtained by PCA?
- How to determine the number of components in PCA?
- Why should we be careful when using PCA for visualization?
- What are the advantages and disadvantages of PCA?
- Takeaways.

Let’s begin!

Imagine someone gave you the following weight and height information about a few individuals:

It is easy to guess that the height column has more variation than weight.

Thus, even if you were to discard the **weight** column, chances are you could still identify these individuals solely based on their heights.

However, if you were to discard the **height** column, you would likely struggle to identify them.

Why?

Because their heights have more variations than their weights, and it’s clear from this example that, **typically**, if the data has more variation, it holds more information (this might not always be true, though, and we’ll learn more about this later).

**That’s the essence of many dimensionality reduction techniques.**

Simply put, we reduce dimensions by eliminating useless (or least useful) features.

Based on this explanation, one plausible and intuitive approach might be to measure the column-wise variance and eliminate the $k$ features with the least variance.

This is depicted below:

However, there’s a problem with this approach.

Almost all datasets have correlated features. In the case of high correlation, it becomes challenging to decide which feature we should eliminate and which we should retain in the final dataset.

Removing one feature that is highly correlated with another may lead to an incoherent dataset, as both features can be equally important. As a result, it may lead to misleading conclusions.

**Thus, the above approach of eliminating features based on feature variance is only ideal if the features we begin with are entirely uncorrelated.**

So, can we do something to make the features in our original dataset uncorrelated?

Can we apply any transformation that does this?

Of course we can!

For instance, consider the above dummy dataset again where $x_1$ and $x_2$ were correlated:

As shown above, if we can represent the data in a new coordinate system $(x_1^{'}, x_2^{'})$, it is easy to see that there is almost no correlation between the new features.

What's more, the data varies mostly along the dimension $(x_1^{'})$.

As a result, in this new coordinate system, we can safely discard the dimension along which the original data has the least variance, i.e., $(x_2^{'})$.

Think of it this way. If we were to discard the dimension $(x_1^{'})$, we would get the following low-dimensional representation of the data:

It makes much more sense to discard $(x_2^{'})$ because the data varies less along that dimension.

So overall, these are the steps we need to follow:

**Step #1)** Develop a new coordinate system for our data such that the new features are uncorrelated.

**Step #2)** Find the variance along each of the new uncorrelated features.

**Step #3)** Discard the features with the least variance to get the final “dimensionally-reduced dataset.”

Here, it is easy to guess that steps $2$ and $3$ aren’t that big of a challenge. Once we have obtained the new uncorrelated features, the only objective is to find their variance and discard the features with the least variances.

Thus, the primary objective is to develop the new coordinate system, $(x_1^{'}, x_2^{'})$ in the above illustration.

In other words, we want to project the data to a low-dimensional space to preserve as much of the original variance as possible while effectively reducing the dimensions.

**As we’ll see shortly, this becomes more like an optimization problem.**

So, how do we figure out the best projection for our data?

As the name suggests, vector projection is about finding the component of one vector that lies in the direction of another vector.

For instance, consider a toy example:

Given a vector $X = (4, 5)$, its projections along the coordinate axes $\hat x_1$ and $\hat x_2$ are:
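That is:

$$\text{proj}_{\hat{x}_1} X = 4\,\hat{x}_1, \qquad \text{proj}_{\hat{x}_2} X = 5\,\hat{x}_2$$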

What we did was we projected the original vector along the direction of some other vector.

A little complicated version of this could be to project it along a vector that is different from the coordinate vectors, $u$ in the following figure:

For simplicity, let’s consider the vector $u$ to be a unit vector — one which has a unit length.

By the notion of cosine similarity, we know that the cosine of the angle between two vectors is given by:

Thus, we can write the length of the projection of $\vec{a}$ on $\vec{b}$ as:

Now that we know the magnitude of the projection, we can perform a scalar-vector multiplication of this magnitude with the unit vector in the desired direction ($\hat b$) to get the projection vector:

Substituting the value of $\cos(\theta)$ and simplifying, we get the final projection as:
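Putting the pieces together (standard vector algebra, with $\hat{b} = \vec{b}/||\vec{b}||$):

$$\cos(\theta) = \frac{\vec{a} \cdot \vec{b}}{||\vec{a}||\,||\vec{b}||}, \qquad ||\vec{a}|| \cos(\theta) = \frac{\vec{a} \cdot \vec{b}}{||\vec{b}||}, \qquad \text{proj}_{\vec{b}}\, \vec{a} = \left(\vec{a} \cdot \hat{b}\right) \hat{b}$$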

Until now, we considered just one data point for projection. But our dataset will indeed have many more data points.

The above projection formula will still hold.

In other words, we can individually project each of the vectors as depicted below:

However, the projection will alter the mean and variance of the individual features.

But it’s simple to figure out the new mean and variance after the projection.

Essentially, all data points are vectors in some high-dimensional space. Thus, the mean of the individual features will also be a vector in that same space $(\hat \mu)$.

As a result, it makes intuitive sense that the transformation to the individual data points will equally affect the mean vector as well.

With this in mind, we can get the projected mean vector as follows:

Similarly, we can also determine the variance of the individual features after projection as follows:

So, to summarize, if we are given some data points X as follows:

💡

Each data point can be thought of as a vector $x_i$.

If we want to project these vectors along some unit vector $\hat b$:

...then the projection for a specific data point $x_i$ is given by:

If the individual features had a mean vector $\vec{\mu}$, then the mean of the projections is given by:

What’s more, if the individual features had a variance vector $\vec{\sigma^2}$, then the variance of the projections is given by:
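In symbols, with $z_i = \hat{b}^T x_i$ denoting the projected values (the variance expression assumes uncorrelated features; the correlated case uses the covariance matrix, discussed next):

$$\mu_{proj} = \hat{b}^T \vec{\mu}, \qquad \sigma^2_{proj} = \sum_j \hat{b}_j^2\, \sigma_j^2$$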

Finally, the idea of finding the variance of the projection can be extended to the covariance matrix of the projection as well.

In simple terms, while variance $\vec{\sigma^2}$ above denoted the individual feature variances, the covariance matrix holds the information about how two features vary together.

Say $\Sigma$ denotes the covariance matrix for the original features:

Then the covariance matrix of the projections $\Sigma_{proj}$ can be written as:
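For a single unit direction $\hat{b}$, this reduces to the scalar variance $\hat{b}^T \Sigma\, \hat{b}$; for a matrix $B$ whose columns are the projection directions, the standard result is:

$$\Sigma_{proj} = B^T\, \Sigma\, B$$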

With this, we are set to proceed with formulating PCA as an optimization problem.

In one of the earlier articles, we formulated the entire linear regression algorithm from scratch.

In that process, we saw the following:

- How the assumptions originate from the algorithmic formulation of linear regression.
- Why they should not be violated.
- What to do if they get violated.
- and much more.

Overall, the article highlighted the statistical essence of linear regression and under what assumptions it is theoretically formulated.

Linear regression models a linear relationship between the features $(X)$ and the true/observed output variable $(y)$.

The estimate $\hat y$ is written as:

Where:

- $y$ is the observed/true dependent variable.
- $\hat y$ is the modeled output.
- $X$ represents the independent variables (or features).
- $\theta$ represents the estimated coefficients of the model.
- $\epsilon$ is the random noise in the output variable. This accounts for the variability in the data that is not explained by the linear relationship.

The primary objective of linear regression is to estimate the values of $\theta=(\theta_1,\theta_2,⋯,\theta_n)$ that most closely model the observed dependent variable $y$.

Talking specifically about the $\epsilon$ – the random noise, we assume it to be drawn from a Gaussian:
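That is:

$$\epsilon \sim \mathcal{N}(0, \sigma^2)$$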

Another way of formulating this is as follows:
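Namely:

$$Y \mid X \sim \mathcal{N}\left(\theta^T X,\; \sigma^2\right)$$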

The above says that the conditional distribution of $Y$ given $X$ is a Gaussian with:

- Mean: $\theta^TX$ (dependent on $X$).
- Variance: $\sigma^2$ (independent of $X$, i.e., constant).

A graphical way of illustrating this is as follows:

The regression line models the mean of the Gaussian distributions, and the mean varies with $X$. Also, all Gaussians have an equal variance.

So, in a gist, with linear regression, we are trying to explain the dependent variable $Y$ as a function of $X$.

We are given $X$, and we have assumed a distribution for $Y$. $X$ will help us model the mean of the Gaussian.

**But it’s obvious to guess that the above formulation raises a limitation on the kind of data we can model with linear regression.**

More specifically, problems will arise when, just like the mean, the variance is also a function of $X$.

However, during the linear regression formulation, we explicitly assumed that the variance is constant and never depends on $X$.

We aren’t done yet.

There’s another limitation.

Linear regression assumes a very specific form for the mean. In other words, the mean should be a linear combination of the features $X$, as depicted below:

$$\mu(X) = \theta^T X$$

Yet, there’s every chance that the above may not hold true, and instead, the mean is represented as follows:

So, to summarize, we have made three core assumptions here:

**#1)**If we consider the conditional distribution $P(Y|X)$, we assume it to be a Gaussian.

**#2)**$X$ affects only the mean, that too in a very specific way, which is linear in the individual features $x_j$.

**#3)**Lastly, the variance is constant for the conditional distribution $P(Y|X)$ across all levels of $X$.

But nothing stops real-world datasets from violating these assumptions.

In many scenarios, the data might exhibit complex and nonlinear relationships, heteroscedasticity (varying variance), or even follow entirely different distributions altogether.

Thus, we need an approach that allows us to adapt our modeling techniques to accommodate these real-world complexities.

**Generalized linear models attempt to relax these assumptions.**

More specifically, they consider the following:

- What if the distribution isn’t normal **but some other distribution from the exponential family?**
- What if $X$ has a more sophisticated relationship with the mean?
- What if the variance varies with $X$?

Let’s dive in!

As the name suggests, generalized linear models (GLMs) are a generalization of linear regression models.

Thus, the linear regression model is a specific case of GLMs.

They expand upon the idea of linear regression by accommodating a broader range of data distributions.

Unlike traditional linear regression, which assumes that the response variable follows a normal distribution and has a linear relationship with the predictors (as discussed above), GLMs can handle various response distributions – Binomial, Poisson, Gamma distributions, and more.

However, before proceeding ahead, it is essential to note that with GLMs, the core idea remains around building “**linear models**.” Thus, in GLMs, we never relax the linear formulation:

We’ll understand why we do this shortly.

In GLMs, the first thing we relax is the conditional distribution $P(Y|X)$, which we assumed to be a Gaussian earlier.

We change the normal distribution to some other distribution from the exponential family of distributions, such as the Bernoulli, Binomial, Poisson, Gamma, or Exponential distribution.

**Why specifically the exponential family?**

Because as we would see ahead, everything we typically do with the linear model with Gaussian extends naturally to this family. This includes:

- The way we formulate the likelihood function
- The way we derive the maximum likelihood estimators, etc.

But the above formulation gets extremely convoluted when we go beyond these distributions.

This is because, as the name suggests, the probability density functions (PDFs) of distributions in the exponential family can be manipulated into an exponential representation.

This inherent structure simplifies the mathematical manipulations we typically do in maximum likelihood estimation (MLE):

**#1) Define the likelihood function for the entire dataset:** Here, we typically assume that the observations are independent. Thus, the likelihood function for the entire dataset is the product of the individual likelihoods.

**#2) Take the logarithm (the obtained function is called log-likelihood):** To simplify calculations and avoid numerical issues, it is common to take the logarithm of the likelihood function. *This step gets simplified if the likelihood values have an exponential term, which the exponential family of distributions can be manipulated to possess.*

**#3) Maximize the log-likelihood:**Finally, the goal is to find the set of parameters $\theta$ that maximize the log-likelihood function.

More specifically, when we take the logarithm in the MLE process, it becomes easier to transform the product in the likelihood function into a summation. This can be further simplified with the likelihood involving exponential terms.

We also saw this in one of the earlier articles:

Lastly, the exponential family of distributions can explain many natural processes.

This family encapsulates many common data distributions encountered in real-world scenarios.

- Count data $\rightarrow$ Poisson distribution may help.
- Binary Outcomes $\rightarrow$ Bernoulli distribution.
- Continuous Positive-Valued Data $\rightarrow$ Gamma distribution.
- The time between events $\rightarrow$ try Exponential Distribution.

So overall, we alter the specification of the normal distribution to some other distribution from the exponential family.

There's another thing we change.

Earlier, we had the formulation that the mean is directly the linear combination of the features.

In GLMs, we extend it to the idea that some function of the mean is the linear combination of the features:

$$F(\mu(X)) = \theta^T X$$

Here, $F$ is called the link function.

While the above notation of putting the link function onto the mean may not appear intuitive, it’s important to recognize that this approach serves a larger purpose within GLMs.

By putting the transformation on mean instead, we maintain the essence of a linear model intact while also accommodating diverse response distributions.

Understanding the purpose of the link function is also simple.

Imagine you had the Bernoulli random variable for the conditional distribution of $Y$ given $X$.

We know that its mean ($p$ or $\mu(x)$) will always lie between [0,1].

However, if we did not add a link function and instead modeled the mean directly as $\mu(x) = \theta^T X$, the mean could take any real value, since $\theta^T X$ can span all real values.

What the link function does is it maps the mean ($\mu(x)$), which is constrained to some specific range, to all possible real values so that the compatibility between the linear model and the distribution’s mean is never broken.

That is why they are also called “**link**” functions. It is because they link the linear predictor $\theta^T \cdot X$ and the parameter for the probability distribution $\mu(x)$.
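For the Bernoulli case above, one standard choice of link is the logit, which maps a mean constrained to $(0, 1)$ onto the whole real line. A minimal NumPy sketch:

```python
import numpy as np

def logit(mu):
    # link: maps a mean in (0, 1) to an unconstrained real value
    return np.log(mu / (1.0 - mu))

def inv_logit(eta):
    # inverse link: maps the linear predictor back into (0, 1)
    return 1.0 / (1.0 + np.exp(-eta))

mu = np.array([0.1, 0.5, 0.9])   # valid Bernoulli means
eta = logit(mu)                   # unconstrained, linear-predictor scale
```

Because `logit` and `inv_logit` are inverses of each other, the linear predictor $\theta^T X$ can roam over all reals while the implied mean stays a valid Bernoulli parameter.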

Thus, to summarize, there are three components in GLMs.

One of the major aspects of training any reliable ML model is avoiding **overfitting**.

In a gist, overfitting occurs when a model learns to perform exceptionally well on the training data.

This may happen because the model is trying too hard to capture all **unrelated and random noise** in our training dataset, as shown below:

In this context, noise refers to any random fluctuations or inconsistencies that might be present in the data.

While learning this noise leads to a lower train set error and lets your model capture more intricate patterns in the training set, it comes at the tremendous cost of poor generalization on unseen data.

One of the most common techniques used to avoid overfitting is **regularization**.

Simply put, the core objective of regularization is to penalize the model for its complexity.

And if you have taken any ML course or read any tutorials about this, the most common approach they teach is to add a penalty (or regularization) term to the cost function, as shown below:

$$J_{L2}(\theta) = \sum_{i} \left(y^{(i)} - \hat y^{(i)}\right)^2 + \lambda \sum_{k=1}^{K} \theta_k^2 \qquad \qquad J_{L1}(\theta) = \sum_{i} \left(y^{(i)} - \hat y^{(i)}\right)^2 + \lambda \sum_{k=1}^{K} |\theta_k|$$

In the above expressions:

- $y^{(i)}$ is the true value corresponding to sample $i$.
- $\hat y^{(i)}$ is the model’s prediction corresponding to sample $i$, which is dependent on the parameters $(\theta_1, \theta_2, \cdots, \theta_K)$.
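To make the penalized objective concrete, here is a minimal NumPy sketch of an L2-regularized squared-error cost; the data, `theta`, and the regularization strength `lam` are made up for illustration:

```python
import numpy as np

def ridge_cost(theta, X, y, lam):
    # squared-error loss over all samples ...
    residuals = y - X @ theta
    loss = np.sum(residuals ** 2)
    # ... plus the L2 penalty: lambda times the sum of squared parameters
    penalty = lam * np.sum(theta ** 2)
    return loss + penalty

# toy example
X = np.eye(2)
y = np.array([1.0, 0.0])
theta = np.array([1.0, 1.0])
cost = ridge_cost(theta, X, y, lam=0.5)
```

Setting `lam=0` recovers the plain least-squares loss; increasing it penalizes large parameters more heavily.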

While we can indeed validate the effectiveness of regularization experimentally (as shown below), it’s worth questioning the origin of the regularization term **from a probabilistic perspective**.

More specifically:

- Where did this regularization term originate from?
- What does the regularization term precisely measure?
- What real-life analogy can we connect it to?
- Why do we add this regularization term to the loss?
- Why do we square the parameters (specific to L2 regularization)? Why not any other power?
- Is there any probabilistic evidence that justifies the effectiveness of regularization?

Turns out, there is a concrete probabilistic justification for regularization.

Yet, in my experience, most tutorials never bother to cover it, and readers are always expected to embrace these notions as a given.

Thus, this article intends to highlight the origin of regularization purely from a probabilistic perspective and provide a derivation that makes total logical and intuitive sense.

**Let’s begin!**

Before understanding the origin of the regularization term, it is immensely crucial to learn a common technique to model labeled data in machine learning.

It’s called the maximum likelihood estimation (MLE).

We covered this in the following article, but let’s do it again in more detail.

Essentially, whenever we model data, the model is instructed to maximize the likelihood of observing the given data $(X,y)$.

More formally, a model attempts to find a specific set of parameters $\theta$ (also called model weights), which maximizes the following function:

The above function $L$ is called the likelihood function, and in simple words, the above expression says that:

- maximize the likelihood of observing $y$
- given $X$
- when the prediction is parameterized by some parameters $\theta$ (also called weights)

When we begin modeling:

- We know $X$.
- We also know $y$.
- The only unknown is $\theta$, which we are trying to figure out.

Thus, the objective is to find the specific set of parameters $\theta$ that maximizes the likelihood of observing the data $(X,y)$.

This is commonly referred to as **maximum likelihood estimation (MLE)** in machine learning.

MLE is a method for estimating the parameters of a statistical model by maximizing the likelihood of the observed data.

It is a common approach for parameter estimation in various regression models.

Let’s understand it with the help of an example.

Imagine you walk into a kitchen and notice several broken eggshells on the floor.

Here’s a question: **“Which of these three events is more likely to have caused plenty of eggshells on the floor?”**

- Someone was experimenting with them.
- Someone was baking a cake.
- Someone was creating an art piece with them.

Which one do you think was more likely to have happened?

Let’s look at what could have led to eggshells on the floor with the highest likelihood.

The likelihood of:

- Conducting a science experiment that led to eggshells on the floor is MEDIUM.
- Baking a cake that led to eggshells on the floor is HIGH.
- Creating an art piece that led to eggshells on the floor is LOW.

Here, we’re going to go with the event with the highest likelihood, and that’s **“baking a cake.”**

Thus, we infer that the most likely thing that happened was that someone was baking a cake.

**What we did was that we maximized the conditional probability of the event, given an explanation.**

- The event is what we observed, i.e., eggshells on the floor.
- The “explanation,” as the name suggests, is the possible cause we came up with.

We did this because there’s a probability of:

- “Eggshells” given “Experiment”
- “Eggshells” given “Cake”
- “Eggshells” given “Art piece”

We picked the one with the highest conditional probability $P(event|explanation)$.

Simply put, we tried to find the scenario that most likely led to the eggshells on the floor. **This is called maximum likelihood.**

This is precisely what we do in machine learning at times. We have a bunch of data and several models that could have generated that data.

First, we estimate the probability of seeing the data given “Model 1”, “Model 2”, and so on. Next, we pick the model that most likely produced the data.

In other words, we’re maximizing the probability of the data given model.

Next, let’s consider an example of using MLE for simple linear regression.

Consider we have the following data:

The candidate models that could have generated the above data are:

It’s easy to figure out that **Model 2** would have most likely generated the above data.

**But how does a line generate points?**

Let’s understand this specific to linear regression.

We also discussed this more from a technical perspective in the linear regression assumptions article, but let’s understand visually this time.

Consider the diagram below.

Say you have the line shown in Step 1, and you select data points on this line.

Recalling the data generation process from the above article, we discussed that linear regression generates data from a Gaussian distribution with equal variance.

Thus, in step 2, we create Gaussians with equal variance centered at those points.

Lastly, in step 3, we sample points from these Gaussians.

And that is how a line generates data points.

Now, in reality, we never get to see the line that produced the data. Instead, it’s the data points that we observe.

**Thus, the objective boils down to finding the line that most likely produced these points.**

We can formulate this as a maximum likelihood estimation problem by maximizing the likelihood of generating the data points given a model (or line).

Let’s look at the likelihood of generating a specific point from a Gaussian:

Similarly, we can write the likelihood of generating the entire data as the product of individual likelihoods:

Substituting the likelihood values from the Gaussian distribution, we get:

Maximizing the above expression is the same as minimizing the squared sum term in the exponent.

And this is precisely the least squares error.

Linear regression finds the line that minimizes the sum of squared distances.

This proves that finding the line that most likely produces a point using maximum likelihood is exactly the same as minimizing the least square error using linear regression.
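We can check this equivalence numerically: minimizing the full Gaussian negative log-likelihood over the slope lands on the closed-form least-squares estimate. A sketch with made-up data (for simplicity, no intercept and $\sigma = 1$):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y = 2.5 * x + rng.normal(0.0, 1.0, 200)   # a line plus Gaussian noise

# closed-form least-squares slope (no intercept)
ols_slope = np.sum(x * y) / np.sum(x * x)

def neg_log_likelihood(theta, sigma=1.0):
    # full Gaussian NLL; the constant term drops out of the optimization
    r = y - theta * x
    n = len(y)
    return 0.5 * n * np.log(2 * np.pi * sigma ** 2) + np.sum(r ** 2) / (2 * sigma ** 2)

mle_slope = minimize_scalar(neg_log_likelihood, bounds=(0.0, 5.0), method="bounded").x
```

The two estimates agree up to optimizer tolerance, mirroring the derivation above: the only $\theta$-dependent term in the NLL is the sum of squared residuals.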

So far, we have learned how we estimate the parameters using maximum likelihood estimation.

We also looked at the example of eggshells, and it helped us intuitively understand what MLE is all about.

Let’s recall it.

This time, let’s replace “Art piece” with an “Eggshell throwing contest.”

So yet again, we have three possibilities that may have led to the observed event.

- Someone was experimenting with them.
- Someone was baking a cake.
- Someone was playing an Eggshell throwing contest.

Which one do you think was more likely to have happened?

It’s obvious that baking a cake will still lead to eggshells with a high probability.

But an Eggshell throwing contest will create eggshells on the floor with a very high probability.

It’s certain that an eggshell-throwing contest will lead to eggshells on the floor.

**However, something tells us that baking a cake is still more likely to have happened and not the contest, isn’t it?**

One reason is that an eggshell-throwing contest is not very likely – it’s an improbable event.

Thus, even though it’s more likely to have generated the evidence, it’s less likely to have happened in the first place. This means we should still declare “baking a cake” as the more likely event.

But what makes us believe that baking a cake is still more likely to have produced the evidence?

**It’s regularization.**

In this context, regularization lets us add an extra penalty, which quantifies the probability of the event itself.

Even though our contest will undoubtedly produce the evidence, it is not very likely to have happened in the first place.

This lets us still overweigh “baking a cake” and select it as the most likely event.

This is the intuition behind **regularization**.

Now, at this point, we have built an intuition for regularization. However, we haven’t yet understood the origin of the regularization term or its precise mathematical formulation.

Specifically talking about L2 regularization, there are a few questions that we need to answer:

- Where did the squared term come from?
- What does this term measure probabilistically?
- How can we be sure that the term we typically add is indeed the penalty term that makes the most probabilistic sense?

The same can be translated to L1 regularization as well.

- Where did the absolute sum of parameters come from?
- What does this term measure probabilistically?
- How can we be sure that the term we typically add is indeed the penalty term that makes the most probabilistic sense?

Let’s delve into the mathematics of regularization now.

KMeans is an unsupervised clustering algorithm that groups data based on distances. It is widely recognized for its simplicity and effectiveness as a clustering algorithm.

Essentially, the core idea is to partition a dataset into distinct clusters, with each point belonging to the cluster whose centroid is closest to it.

While its simplicity often makes it the most preferred clustering algorithm, KMeans has several limitations that restrict its effectiveness in many scenarios.

One of the primary limitations is its assumption of spherical clusters.

One intuitive and graphical way to understand the KMeans algorithm is to place a **circle** at the center of each cluster, which encloses the points.

💡

In 3 dimensions, circles can be replaced with spheres. In higher dimensions, they can be thought of as hyperspheres.

As KMeans is all about placing circles, its results aren’t ideal when the dataset has irregular shapes or varying sizes, as shown below:

Instead, an ideal clustering should cluster the data as follows:

This rigidity of KMeans to only cluster globular clusters often leads to misclassification and suboptimal cluster assignments.

This limitation is somewhat connected to the one we discussed above.

Imagine we have the following dataset:

Clearly, the blue cluster has a larger spread.

Therefore, ideally, its influence should be larger as well.

However, when assigning a new data point to a cluster, KMeans only considers the distance to the centroid.

This means that it creates a margin for assigning a new data point to a cluster that is equidistant from both centroids.

But considering the area of influence of the right cluster, having this margin more to the left makes more sense.

KMeans clustering performs hard assignments.

In simple words, this means that **a specific data point can belong to only one cluster**.

Thus, it does not provide probabilistic estimates of a given data point belonging to each possible cluster.

While this might not seem like a problem per se, it limits the algorithm’s usefulness in uncertainty estimation and downstream applications that require a probabilistic interpretation.

These limitations often make KMeans a non-ideal choice for clustering.

Therefore, learning about other better algorithms that we can use to address these limitations is extremely important.

**Therefore, in this article, we will learn about Gaussian mixture models.**

More specifically, we shall cover:

- Shortcomings of KMeans (already covered above)
- What is the motivation behind GMMs?
- How do GMMs work?
- The intuition behind GMMs.
- Plotting some dummy multivariate Gaussian distributions to better understand GMMs.
- The entire mathematical formulation of GMMs.
- How to use Expectation-Maximization to model data using GMMs?
- **Coding a GMM from scratch (without sklearn).**
- Comparing results of GMMs with KMeans.
- How to determine the optimal number of clusters for GMMs?
- Some practical use cases of GMMs.
- Takeaways.

Let's begin!

As the name suggests, a **Gaussian mixture model** clusters a dataset that has a mixture of multiple Gaussian distributions.

They can be thought of as a generalized twin of KMeans.

**Simply put, in 2 dimensions, while KMeans only creates circular clusters, Gaussian mixture models can create oval-shaped clusters.**
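A quick way to see the difference is to fit both models on elongated (oval) synthetic clusters with scikit-learn; the data and settings below are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# two oval clusters: isotropic Gaussians stretched along the x-axis
base = rng.normal(size=(200, 2))
stretch = np.array([[3.0, 0.0], [0.0, 0.5]])
X = np.vstack([base[:100] @ stretch, base[100:] @ stretch + [0.0, 4.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

hard_labels = gmm.predict(X)        # hard assignments, like KMeans
soft_labels = gmm.predict_proba(X)  # per-cluster probabilities (soft assignments)
```

Unlike `kmeans.labels_`, `predict_proba` returns a probability for every cluster, which directly addresses the hard-assignment limitation discussed earlier.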

But how do they do that?

Let's understand in more detail.

Hyperparameter tuning is a tedious and time-consuming task in training machine learning (ML) models.

Typically, the most common way to determine the optimal set of hyperparameters is through repeated experimentation.

Essentially, we define a set of hyperparameter configurations (mainly through random guessing) and run them through our model. Finally, we choose the model which returns the most optimal performance metric.

This is called hyperparameter tuning.

While this approach is feasible for training small-scale ML models, it is practically infeasible when training bigger models.

To get some perspective, imagine if it takes $1.5$ hours to train a model and you have set $20$ different hyperparameter configurations to run through your model.

That's more than a day of training.

Therefore, learning about optimized hyperparameter tuning strategies and utilizing them is extremely important to build large ML models.

**Bayesian optimization** is a highly underappreciated yet immensely powerful approach for tuning hyperparameters.

It revolves around the idea of using **Bayesian statistics** to estimate the distribution of the best hyperparameters for a model.

While iterating over different hyperparameter configurations, the Bayesian optimization algorithm constantly updates its beliefs about the distribution of hyperparameters by observing model performance corresponding to each hyperparameter.

This allows it to take informed steps to select the next set of hyperparameters, gradually converging to an optimal set of hyperparameters.
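The loop described above can be sketched in miniature. The snippet below is a deliberately simplified, hypothetical version: a single 1-D hyperparameter, a toy error function standing in for "train the model and measure validation error," and a crude quadratic fit standing in for the Gaussian-process surrogate that real implementations use:

```python
import numpy as np

rng = np.random.default_rng(0)

def validation_error(h):
    # toy stand-in for "train the model, measure the error"
    return (h - 0.3) ** 2

candidates = np.linspace(0.0, 1.0, 101)   # possible hyperparameter values

# start from a few random trials
H = list(rng.choice(candidates, size=3, replace=False))
E = [validation_error(h) for h in H]

for _ in range(15):
    # surrogate belief: quadratic fit to everything observed so far
    coeffs = np.polyfit(H, E, deg=2)
    predicted = np.polyval(coeffs, candidates)
    # exploration bonus: prefer regions far from past observations
    distance = np.min(np.abs(candidates[:, None] - np.array(H)[None, :]), axis=1)
    acquisition = predicted - 0.5 * distance
    h_next = candidates[int(np.argmin(acquisition))]
    H.append(h_next)
    E.append(validation_error(h_next))

best_h = H[int(np.argmin(E))]
```

The key property is that each new trial is chosen using everything observed so far, which is exactly what grid search and random search fail to do.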

Thus, in this blog, let's explore how Bayesian optimization works and why it is so powerful.

More specifically, we shall cover:

- Issues with traditional hyperparameter tuning approaches.
- What is the motivation for Bayesian optimization?
- How does Bayesian optimization work?
- The intuition behind Bayesian optimization.
- Results from the research paper which proposed Bayesian optimization for hyperparameter tuning.
- A hands-on Bayesian optimization experiment.
- Comparing Bayesian optimization with grid search and random search.
- Analyzing the results of Bayesian optimization.
- Best practices for using Bayesian optimization.

Let's begin!

Before understanding the importance of using faster tuning strategies, it is essential to understand the problems with existing techniques.

The three most common methods of hyperparameter tuning are:

**#1) Manual search**: This involves human intuition and expertise to manually adjust hyperparameters based on domain knowledge and trial-and-error.

**#2) Grid search:** Grid search explores all possible combinations of hyperparameters specified within a range.

**#3) Random search:** Random search is similar to grid search, but instead of searching all configurations specified in grid search, it tests only randomly selected configurations.

👉

This is mainly applicable to grid search.

It is practically infeasible to train a model (especially those that take plenty of time) for every possible combination of hyperparameters with grid search.

For instance, if there are $N$ different hyperparameters $(H_{1}, H_{2}, \cdots, H_{N})$, each with $(K_{1}, K_{2}, \cdots, K_{N})$ prespecified values, the total number of models becomes:

$$K_{1} \times K_{2} \times \cdots \times K_{N}$$
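For example, with three hypothetical hyperparameters, the grid multiplies out quickly:

```python
from itertools import product

# hypothetical search space, purely for illustration
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [32, 64, 128, 256],
    "dropout": [0.1, 0.3, 0.5],
}

# every combination is one full training run
configs = list(product(*grid.values()))
n_models = len(configs)   # 3 * 4 * 3 = 36
```

At $1.5$ hours per run, even this tiny grid costs over two days of compute.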

👉

This is applicable to both grid and random search.

In both techniques, we specify a desired range of hyperparameter values to search over.

Yet, the ideal hyperparameter may exist outside that range.

This inherently introduces some bias in hyperparameter tuning.

Also, expanding the hyperparameter search space will drastically increase the overall run time.

👉

This is applicable to grid search and random search.

Hyperparameters can be categorized into two categories based on the values they can take:

- **Continuous**: learning rate, regularization rate, dropout rate, etc.
- **Discrete**: batch size, kernel type in support vector machines (SVMs), etc.

Yet, for continuous hyperparameters, we can only set discrete values in grid search and random search.

Thus, it is possible that we may miss the optimal value for a continuous hyperparameter, as illustrated in the image above.

By now, we are convinced that traditional hyperparameter tuning techniques have numerous limitations.

Essentially, it's obvious to understand that the primary objective of hyperparameter tuning is to find a specific set of hyperparameters that optimize some score function.

Considering the score function $(f)$ to be an error function, we can express hyperparameter search as follows:

$$H^{*} = \underset{H}{\text{arg min}} \ f(H)$$

The objective is to minimize the error function $f$ for all possible values of hyperparameter $H$.

However, another problem with traditional methods like grid search (or random search) is that each hyperparameter trial is independent of other trials.

In other words, they do not consider the previous results to select the subsequent hyperparameters.

As a result, they are entirely uninformed by past evaluations and spend a significant amount of time evaluating “**ideally non-optimal hyperparameters.**”

For instance, consider the grid search diagram we discussed earlier.

The hyperparameter configurations with low accuracy are marked in red, and those with high accuracy are marked in green.

After the $21$ model runs ($9$ green and $12$ red) depicted above, grid search will not make any informed guesses on which hyperparameter it should evaluate next.

Logistic regression is a **regression model** that returns the probability of a binary outcome ($0$ or $1$).

We all know logistic regression does this using the **sigmoid function**.

The sigmoid function maps any real-valued number $x \in \mathbb{R}$ to a value between $0$ and $1$. This ensures that output falls within a valid probability range.
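Concretely, a minimal NumPy sketch of this mapping:

```python
import numpy as np

def sigmoid(x):
    # squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
probs = sigmoid(x)
```

`sigmoid(0)` is exactly $0.5$, and the output approaches $0$ and $1$ (without ever reaching them) as $x$ goes to $-\infty$ and $+\infty$, respectively.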

While this may sound intuitive and clear, the notion of using Sigmoid raises a couple of critical questions.

An infinite number of functions can monotonically map a real input $(x)$ to the range $[0,1]$.

**But what is so special about Sigmoid?**

For instance, the figure below shows six possible alternative functions.

**Note:** While the output for the above functions is from $[-1, 1]$, we can squish it to $[0,1]$ by scaling and shifting, as shown below:

Yet, out of infinitely many choices, **Why Sigmoid?**

The output of a logistic regression model is interpreted as a probability.

However, this seemingly straightforward notion raises an essential question: “Can we confidently treat the output of sigmoid as a genuine probability?”

See, it is important to consider that not every numerical value lying within the interval of $[0, 1]$ guarantees that it represents a legitimate probability.

In other words, just outputting a number between $[0, 1]$ isn't sufficient for us to start interpreting it as a probability.

Instead, the interpretation must stem from the formulation of logistic regression and its assumptions.

Thus, in this article, we'll answer the above two questions and understand the origin of Sigmoid in logistic regression.

Going ahead, we'll also understand the common categorization of classification approaches and takeaways.

Let's begin!

Before understanding the origin of the sigmoid function in logistic regression, let's explore some of the common explanations found on the web.

The most profound explanation is providing no explanation at all. In other words, most online resources don't care to discuss it.

In fact, while drafting this article, I did a quick Google search, and disappointingly, NONE of the top-ranked results discussed this.

The sigmoid function is thrown out of thin air along with its plot.

And its usage is justified by elaborating that Sigmoid squishes the output of linear regression between $[0,1]$, which depicts a probability.

A couple of things are wrong here:

- It conveys a misleading idea that logistic regression is just linear regression with a sigmoid. Although the definition does fit in pretty well mathematically, this is not true.
- They didn't elaborate on “why sigmoid function?” Even if we consider the flawed reasoning of squishing the output of linear regression to be correct for a second, why not squish it with any other function instead?

Another inaccurate answer was generated from the search query above.

Let's look at what's wrong with this explanation.

- The answer says, “*It is differentiable and continuous*.”
**Counterpoint:** There are infinitely many functions that are differentiable and continuous. Why Sigmoid?

- The answer says, “*During training time, the sigmoid function solves the problem of exploding or vanishing gradients depending on your input and initialization*.”
**Counterpoint:** The rationale of vanishing and exploding gradients only makes sense in neural networks. This has nothing to do with logistic regression.

- The answer says, “*Logistic regression models use the sigmoid function to estimate the probability of a binary event, such as dead vs. alive, sick vs. well, fraudulent vs. honest transaction, etc.*”
**Counterpoint:** How do we prove that it models a probability? There are infinitely many functions that can potentially map the output to $[0,1]$ and could possibly be interpreted as a probability. What makes Sigmoid special?

Some answers derive the sigmoid equation from the fact that logistic regression is used to model “**log odds**.”

Let's understand “log odds” first.

In probability, “odds” is defined as the ratio of the probability of the occurrence of an event $(P(event))$ and the probability of non-occurrence $(1-P(event))$.

“Log odds,” as the name suggests, is the logarithm of odds, as shown below:

These answers convey that logistic regression is used to model log odds as a linear combination of the features.

Assuming $X = (x_{1}, x_{2}, \cdots, x_{n})$ as the features and $W = (w_{1}, w_{2}, \cdots, w_{n})$ as the weights of the model, we get:

$$\log \left( \frac{p}{1-p} \right) = W^T X$$

Exponentiating both sides and solving for $p$, we get:

$$p = \frac{1}{1 + e^{-W^T X}}$$

This is precisely the sigmoid function.

While this may sound convincing, there are some major flaws here.

- How do we know that logistic regression models log odds?
- In reality, it is only when we derive the sigmoid function that we get to know about log odds. Thus, the “log odds” formulation is mathematically derived from the sigmoid representation of logistic regression, not the other way around. But this answer took the opposite approach and assumed that logistic regression is used to model log odds.

By now, you must be convinced that there are some fundamental misalignments in proving the utility and origin of the sigmoid function in the above explanations.

There are some more incorrect explanations here:

Most individuals are introduced to the Sigmoid function using one of the above explanations.

But as discussed, none of them are correct.

In my opinion, an ideal answer should derive logistic regression in terms of Sigmoid **from scratch and with valid reasoning**.

Thus, one should not know anything about it beforehand.

It is like replicating the exact thought process when a statistician first formulated the algorithm by approaching the problem logically, with only probability and statistics knowledge.

Ready?

Let's dive in!

For simplicity, consider the following binary classification data in one dimension:

Both classes have an equal number of data points ($n$). Also, let's assume that they were sampled from two different normal distributions with the same variance $\sigma=1$ and different mean ($-2$ and $3$).

Plotting the class-wise distribution (also called the **class-conditional distribution**), we get:

Being 1-dimensional data, it is easy to visually fit a vertical decision boundary that separates the two classes as much as possible.

Let's see if we can represent this classification decision mathematically.

Essentially, the objective is to predict the probability of each class given an input.

Once we do that, we can classify a point to either of two classes based on whichever one is higher.

Now, what information do we have so far:

- We know the class conditionals $p(X|y=1)$ and $p(X|y=0)$, which are Gaussians. These are the individual distributions of each class.
- We know the priors $p(y=1)$ and $p(y=0)$. We assumed that both classes have equal data points ($n$). Thus, prior is $\frac{1}{2}$ for both classes.

With this info, we can leverage a generative approach for classification.

As a result, we can use Bayes' rule to model the posterior $p(y=0|X)$ as follows.

💡

$p(y=0|X)$ is read as follows: The probability that the output label is 0, given an input X. This is also called posterior probability.

And $p(y=1|X)$ as follows:

Plotting both posteriors over a continuous range of $X$, we get:

Alternatively, we can merge them into a single plot.

In the above plot:

- There's a clear decision boundary.
- The light pink region indicates that the posterior for "Class A" is higher. This means that $p(y=0|X) > p(y=1|X)$. Thus, the sample is classified as "Class A."
- The blue region indicates that the posterior for "Class B" is higher. This means that $p(y=1|X) > p(y=0|X)$. Thus, the sample is classified as "Class B."
- Both posteriors always add to $1$. This is because the Bayes theorem normalizes them.
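The four observations above are easy to verify numerically. Below is a small sketch using the setup assumed so far (class conditionals $\mathcal{N}(-2, 1)$ and $\mathcal{N}(3, 1)$, equal priors):

```python
import numpy as np

def gauss_pdf(x, mu, sigma=1.0):
    # density of a normal distribution N(mu, sigma^2)
    return np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

def posteriors(x, prior_a=0.5, prior_b=0.5):
    # Bayes' rule: posterior = prior * class-conditional / evidence
    evidence = prior_a * gauss_pdf(x, -2) + prior_b * gauss_pdf(x, 3)
    p_a = prior_a * gauss_pdf(x, -2) / evidence   # p(y=0 | X)
    p_b = prior_b * gauss_pdf(x, 3) / evidence    # p(y=1 | X)
    return p_a, p_b

x = np.linspace(-6, 7, 500)
p_a, p_b = posteriors(x)
```

Points far to the left get classified as "Class A" ($p(y=0|X)$ dominates), points far to the right as "Class B", and the two posteriors sum to $1$ everywhere because the evidence term normalizes them.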

💡

On a side note, if we consider the posterior probability curve of the Blue class ("Class B") which is represented by the S shape, it does look like the sigmoid shape we are looking for. Yet, let's not assume it is a Sigmoid and instead, derive it.

Let’s derive the posterior for “Class B” by substituting the relevant values:

We distribute the numerator to the denominator:

Now, we assumed the priors to be equal:

Thus, we get:

Also, we assumed the class conditionals to be Gaussians with the same variance ($\sigma=1$) and different means ($-2$ and $3$):

Substituting back in the posterior for “Class B,” we get:

Simplifying further, we get the following:

Assuming $z = 5x-\frac{5}{2}$, we get:

**And if you notice closely, this is precisely the sigmoid function!**

Once we have Sigmoid, we can obtain its inverse by solving for $z$:

And that’s how we get to know that logistic regression models log odds.
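We can sanity-check the whole derivation numerically. The sketch below (assuming the same two Gaussians with means $-2$ and $3$, $\sigma=1$, and equal priors) confirms that the Bayes posterior coincides with $\sigma(z)$ for $z = 5x - \frac{5}{2}$, and that inverting the sigmoid recovers the log odds:

```python
import numpy as np

def gauss_pdf(x, mu, sigma=1.0):
    return np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

def bayes_posterior_b(x):
    # equal priors cancel in the ratio, so they are omitted
    return gauss_pdf(x, 3) / (gauss_pdf(x, 3) + gauss_pdf(x, -2))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.linspace(-4, 5, 100)
z = 5 * x - 2.5                     # z = 5x - 5/2 from the derivation
matches_sigmoid = np.allclose(bayes_posterior_b(x), sigmoid(z))

p = sigmoid(z)
log_odds = np.log(p / (1 - p))      # inverse of the sigmoid
recovers_z = np.allclose(log_odds, z, atol=1e-4)
```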

Now recall all the incorrect reasonings we discussed above and how misleading they were.

This derivation also answers the two questions we intended to answer:

- **"Of all possible choices, why sigmoid?"**: Because it emerges naturally from Bayes' rule for two classes, where the target variable has a Bernoulli distribution.
- **"How can we be sure that the output is a probability?"**: Because we used Bayes' rule to model the posterior, which, by definition, gives the probability of an event.

The above formulation of Sigmoid also lets us translate a generative model to a discriminative model, both of which are widely used in classification problems.

If you don’t know what they are, here’s a quick refresher.

In general, classification models can be grouped into two categories: **discriminative models** and **generative models**.

Discriminative models learn the decision boundaries between the classes, while generative models learn the input distributions for each class.

In other words, discriminative models directly learn the function $f(x)$ that maps an input vector ($x$) to a label ($y$).

This may happen in two ways:

- They may try to determine $f(x)$ without using any probability distribution. They assign a label for each sample directly, without providing any probability estimates, such as decision trees and SVM.
- They may learn the **posterior class probabilities** $P(y = k|x)$ first from the training data. Based on the probabilities, they assign a new sample $x$ to the class with the highest posterior probability, such as logistic regression and neural networks.

In contrast, generative models first estimate the **class-conditional densities** $P(x|y = k)$ and the **prior class probabilities** $P(y = k)$ from the training data.

Then, they estimate the posterior class probabilities using Bayes’ theorem:

Examples of generative models are:

- Naive Bayes
- Gaussian mixture models (GMMs)
- Hidden Markov models (HMMs)
- Linear discriminant analysis (LDA), etc.
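Both families are available in scikit-learn, so the contrast is easy to see in code. Here's a minimal sketch on the running 1-D example (synthetic data with the assumed means $-2$ and $3$; `GaussianNB` stands in for the generative approach):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB            # generative
from sklearn.linear_model import LogisticRegression   # discriminative

rng = np.random.default_rng(0)
# two 1-D Gaussian classes, mirroring the running example
X = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)]).reshape(-1, 1)
y = np.concatenate([np.zeros(500), np.ones(500)])

gen = GaussianNB().fit(X, y)            # estimates p(x|y) and p(y), applies Bayes' rule
disc = LogisticRegression().fit(X, y)   # models p(y|x) directly

acc_gen = gen.score(X, y)
acc_disc = disc.score(X, y)
```

On such well-separated data, both approaches recover essentially the same decision boundary.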

If that is clear, let's dive back into the Sigmoid derivation.

The derivation of the posterior estimation above gave us an expression in the **linear form** of $x$ before applying Sigmoid.

This happened because we assumed two Gaussians with:

- Same priors ($p(y=1)$ and $p(y=0)$)
- Same variance $(\sigma)$

...and this allowed us to cancel the quadratic term of x that would have appeared otherwise, as shown below.

What would happen if the distributions had:

- Different priors
- Different variance

For instance, consider two dummy 2D classification datasets.

On the left, each class has the same number of samples **(i.e., same priors)** and is generated using a normal distribution of **equal variance**.

On the right, each class has a different number of samples **(i.e., different priors)** and is generated using a normal distribution of **unequal variance**.

What information do we have:

- We know the class conditionals $p(X|y=1)$ and $p(X|y=0)$, which are Gaussians.
- Although unequal, we know the priors $p(y=1)$ and $p(y=0)$.

As a result, we can still leverage a **generative approach** for classification and use Bayes' rule to model the posterior $p(y=1|X)$ as follows.

Plotting the decision boundary obtained using Bayes’ rule, we get:

The decision boundary is linear if the two Gaussians have the same variance (for multi-dimensional data, this is described by the **covariance matrix**) and equal priors.

However, if they have different covariance matrices, the decision boundary is non-linear (quadratic), which also makes sense.

It is expected because the normal distribution that generated the blue data points has a higher variance.

This means that its data points are more spread out, and thus, it has a larger area of influence than the other class.

The essence is that with unequal variance and different priors, the squared term will not cancel out. The priors won't cancel out either.

Thus, the final posterior probability estimation must look like this:

Here, $m$ is the ratio of the priors, which is **always positive**. Thus, it can be written as the exponential of some other constant $M$.

Simplifying, we get:

Now, let’s try to model these two dummy classification datasets with logistic regression using its 2D features (i.e., a discriminative approach).

Did you notice something?

In contrast to the generative approach, both decision regions are linearly separated, which is not entirely correct.

This was expected because we modeled the data without any polynomial features.

In other words, while modeling the data with logistic regression, we did this:

But due to unequal variance and different priors in this 2D dataset, we saw from the generative approach that polynomial features were expected:

Had we engineered the features precisely, it would have resulted in a similar decision boundary as obtained with the generative approach.
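As a rough sketch of this point, we can compare plain logistic regression against one with quadratic features on synthetic 2-D data with unequal priors and variances (the dataset parameters below are illustrative, not the ones behind the article's figures):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# unequal priors (300 vs. 900 samples) and unequal variances (1.0 vs. 2.5)
X = np.vstack([rng.normal(0, 1.0, (300, 2)), rng.normal(3, 2.5, (900, 2))])
y = np.concatenate([np.zeros(300), np.ones(900)])

linear = LogisticRegression(max_iter=1000).fit(X, y)
quadratic = make_pipeline(PolynomialFeatures(degree=2),
                          LogisticRegression(max_iter=1000)).fit(X, y)

acc_linear = linear.score(X, y)
acc_quad = quadratic.score(X, y)
```

With degree-2 features, logistic regression can express the quadratic boundary that the generative analysis predicts, so it should fit this data at least as well as the purely linear model.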

Nonetheless, we also had an advantage when using logistic regression.

In this case, we modeled the data without prior information about the data generation process, which was necessary for the generative approach.

**This indicates that if we model the posterior directly with a discriminative approach or understand the class conditionals first to use a generative approach, both have some pros and cons.**

Let’s understand!

In classification tasks, the choice between generative and discriminative models involves trade-offs.

In the generative framework, one crucial prerequisite is the prior knowledge of the class conditionals and priors.

For instance, when we used the generative approach above, we had prior info about them.

Here, while the priors are directly evident from the data (just count the occurrences of each label), the class conditionals may, at times, demand considerable effort and understanding of the data.

**But once the generative process is perfectly understood, accurate modeling automatically follows.**

In contrast, discriminative approaches focus on modeling the posterior probabilities $(p(y=1|X)$) directly.

But the effort of estimating the class conditionals in a generative approach translates to feature engineering in the discriminative framework.

Thus, understanding the trade-offs between these approaches is essential in choosing an appropriate model for a given classification problem.

**Takeaways**

- Discriminative models learn the decision boundary between classes. Generative models understand the underlying data distribution.
- Discriminative models are often faster to train. Yet, they may not perform as well on tasks where the underlying data distribution is complex or uncertain. This requires extensive feature engineering.

**In general:**

- Generative models impose a much stronger assumption on the class conditionals (or require prior info about the distribution). But when we know the distribution, it requires less training data than discriminative models.
- When the assumption of class conditionals is not certain, discriminative models do better. This is because they don’t need to model the distribution of the features. However, this does require extensive feature engineering.

In my experience, in most courses/books/tutorials, solutions are always introduced cleverly without giving proper reasoning and justification.

The same is the case with introducing the Sigmoid function in logistic regression.

While these resources teach the “how” stuff, it is only when we get to know the “why” stuff that we can confidently apply those learnings to real-life applications.

Thus, a solid understanding of the underlying concepts is foundational to validate their practical applicability.

It also benefits us in developing new approaches by addressing the limitations of existing ones.

But that is only possible when the foundational knowledge is learned perfectly.

This is precisely what we intend to distill in these weekly deep dives, which you might find interesting to read next:

As always, thanks for reading :)

**Become a full member so that you never miss an article:**

Any questions?

Feel free to post them in the comments.

Or

If you wish to connect privately, feel free to initiate a chat here:

One of the biggest hurdles data science teams face is transitioning their data-driven pipeline from Jupyter notebooks to executable, reproducible, and organized files comprising functions and classes.

This involves building an error-free, well-tested pipeline that can be executed reliably.

For instance, imagine a situation where we wrote a function in our data pipeline that divides two numbers:

While working in a Jupyter Notebook, we, as the programmer, know what should go as an input and what to expect as an output.

But transitioning that directly to a data pipeline **without testing** **and error handling** can easily lead to pipeline failures.

Therefore, to alleviate errors in the data pipeline and ensure it works as expected, testing them against different example inputs (or data situations) becomes super important.

In other words, this boils down to creating a reliable automation framework for the pipeline.

For starters, an **automation framework** is an organized approach to writing test cases, executing them, and reporting the results of testing automation.

We should always test our data pipeline because it:

- Ensures that the code works as expected end-to-end — improving reliability.
- Helps us in detecting edge cases where different methods might fail.
- Allows other team members to understand our code with the help of our tests — reducing code complexity.
- Assists in swapping an existing code with an optimized/improved version without interfering with other functions.
- Helps generate actionable insights about the code, such as run-time, expected input/output, etc.

Thankfully, Python provides a handful of open-source libraries such as unittest, nose, behave, etc.

But the one that stands out due to its ease of use and flexibility is Pytest.

Therefore, in this article, we’ll do a detailed walkthrough on leveraging Pytest to build a testing suite for our Python project.

Let’s begin 🚀!

To begin, Pytest is a testing framework for writing test suites, executing them on a pipeline, and generating test reports.

The best thing about Pytest is its intuitive and uncomplicated process of integrating it with the data pipeline.

Testing is already a job that data scientists don’t look forward to with much interest.

Considering this, Pytest makes it easy to write test suites, which in turn, immensely helps in determining the reliability of a data science project.

To install Pytest, execute the following command:
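Using pip, for instance:

```shell
pip install pytest
```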

Next, we import Pytest to leverage its core testing functionalities.

With this, we are ready to write test cases and execute them.

Let’s define a `divide_two_numbers()` method in `project.py`.

Next, let’s create a separate file for tests (`test_file.py`).

For the method we defined above (`divide_two_numbers()`), let’s write a test function. This will check the output data type returned by the method against different inputs.

Here, the second test case is kept intentionally to raise an error.

To execute this test function, open the command line in the current working directory and run the command `pytest`:

Pytest gives a comprehensive testing summary of our pipeline.

It shows the files it collected to run tests from, which is `test_file.py` in this case.

Next, all test failures are listed separately with their respective errors.

Lastly, we also get a test summary.

As expected, the method failed on the second assertion, in which the denominator was zero.

This allows us to update our function to handle such cases.

All this is fine, but now we should ask two questions:

- In the above `pytest` command, we never specified the name of the test file. However, in the results, we see a mention of `test_file.py` only. How did Pytest infer that `test_file.py` is the specific file we have listed all our test cases in?
- Next, how did Pytest find the methods it needs to execute? In other words, how did Pytest differentiate `divide_two_numbers()` from `test_divide_two_numbers_method()` and figure out that the test cases are listed in the latter?

The answer lies in the technique of “Test Searching” (also called “Test Discovery”) in Pytest.

Let’s understand.

Pytest implements the following test discovery method to find test functions, files, and Python classes:

The first algorithm that every machine learning course teaches is linear regression.

It’s undoubtedly one of the most fundamental and widely used algorithms in statistics and machine learning.

This is primarily because of its simplicity and ease of interpretation.

However, despite its prevalence, there is often a surprising oversight – understanding the assumptions that ensure the effectiveness of linear regression.

In other words, linear regression is only effective when certain assumptions are validated. We must revisit the model (or data) if the assumptions are violated.

In my experience, when someone first learns linear regression, either of the two things happens:

- The teaching material does not highlight the critical assumptions that go into linear regression, which guarantees it to work effectively.
- Even if it does, in most cases, folks are never taught about the origin of those assumptions.

But it is worth questioning the underlying assumptions of using a specific learning algorithm and why those assumptions should not be violated.

In other words, have you ever wondered what is the origin of linear regression's assumptions?

They can't just appear from thin air, can they?

There should be some concrete reasoning to support them, right?

Thus, in this blog, let me walk you through the origin of each of the assumptions of linear regression in detail.

We will learn the following:

- A quick overview of Linear Regression.
- Why we use Mean Squared Error in linear regression – from a purely statistical perspective.
- The assumed data generation process of linear regression.
- The critical assumptions of Linear Regression.
- Why these assumptions are essential.
- How these assumptions are derived.
- How to validate them.
- Measures we can take if the assumptions are violated.

Let's begin!

Linear regression models a linear relationship between the features $(X)$ and the true/observed output variable $(y)$.

The estimate $\hat y$ is written as:

$$y = \theta^{T}X + \epsilon; \qquad \hat y = \theta^{T}X$$

Where:

- $y$ is the observed/true dependent variable.
- $\hat y$ is the modeled output.
- $X$ represents the independent variables (or features).
- $\theta$ is the estimated coefficient of the model.
- $\epsilon$ is the random noise in the output variable. This accounts for the variability in the data that is not explained by the linear relationship.

The primary objective of linear regression is to estimate the values of $\theta = (\theta_{1}, \theta_{2}, \cdots, \theta_{n})$ that most closely estimate the observed dependent variable $y$.

Once we have formulated the model, it's time to train it. But how do we learn the parameters?

Everyone knows we use the **mean squared loss function**.

But why this function? Why not any other function?

Let's understand.

As discussed above, we predict our target variable $y$ using the inputs $X$ as follows (subscript $i$ is the index of the data point):

Here, $\epsilon_{i}$ is an error term that captures the unmodeled noise for a specific data point ($i$).

We assume the noise is drawn from a Gaussian distribution with zero mean based on the central limit theorem (there's more on why we assume Gaussian noise soon).

Thus, the probability of observing the error term can be written as:

We call this the distribution of $y$ given $x$ parametrized by $\theta$. We are not conditioning on $\theta$ because it’s not a random variable.

Instead, for a specific set of parameters, the above tells us the probability of observing the given data point $i$.

Next, we define the likelihood as:

The likelihood is a function of $\theta$. It means that by varying $\theta$, we can fit a distribution to the observed data.

Finding the best fit is called **maximum likelihood estimation (MLE)**.

We further write it as a product for individual data points in the following form because we assume independent observations.

Thus, we get:

Since the log transformation is monotonic, we use the log-likelihood below to optimize MLE.

Simplifying, we get:

To find the best Gaussian that describes the true underlying model generating our data (in other words, the best $\theta$), we need to find the parameters that maximize the log-likelihood.

Maximizing the above expression is equivalent to minimizing the second term.

**And if you notice closely, it's precisely the least squares.**

This is the origin of least squares in linear regression, and it is precisely why we fit a linear regression model using least squares.

Now recall the faulty reasoning we commonly hear about minimizing least-squares.

- Some justify it by saying that squared error is differentiable, which is why we use it. **WRONG.**
- Some compare it to absolute loss and say that squared error penalizes large errors more. **WRONG.**

Each of these explanations is incorrect and flawed.

Instead, there's a clear reasoning behind why using least-squares gives us the optimal solution.

To summarize, linear regression tries to find the best model in the form of a linear predictor plus a Gaussian noise term that maximizes the probability of drawing our data from it.

And for the probability of observing the data to be maximum, least squares should be minimum:
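To make this concrete, here is a small numerical sketch: maximizing the Gaussian log-likelihood (i.e., minimizing the sum of squared residuals) lands on the same parameters as the closed-form least-squares fit. The data below is synthetic, generated purely for illustration:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(0, 1, x.size)   # linear signal + Gaussian noise

def neg_log_likelihood(theta):
    # up to additive/multiplicative constants, the Gaussian negative
    # log-likelihood reduces to the sum of squared residuals
    residuals = y - (theta[0] * x + theta[1])
    return 0.5 * np.sum(residuals**2)

# closed-form least-squares solution
A = np.column_stack([x, np.ones_like(x)])
theta_ls, *_ = np.linalg.lstsq(A, y, rcond=None)

# direct likelihood maximization finds (numerically) the same parameters
theta_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x
```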

Linear regression has many variations, as summarized below:

We have already covered them in detail below:

We'll derive linear regression's assumptions pretty soon, and we'll do so from scratch, i.e., without any prior information that these assumptions even exist. But for now, let's learn what they are.

This is the most apparent assumption in linear regression.

If you model the dependent variable $(y)$ as a linear combination of independent variables $(X)$, you inherently assume it is true and viable.

The second assumption of linear regression is that the error terms $\epsilon$ follow a normal distribution.

When the residuals are normally distributed, it enables more robust statistical inference, hypothesis testing, and constructing more reliable confidence intervals.

Departure from normality can lead to misleading conclusions and affect the validity of statistical tests.

The third assumption is homoscedasticity, which refers to the constant variance of the error term $\epsilon$ across all levels of the independent variable(s).

In other words, the spread of the residuals should be consistent along the regression line.

If the assumption is violated and the error term's variability changes with the predicted values, it is known as **Heteroscedasticity**, which can lead to biased and inefficient estimates.

The fourth assumption of linear regression is the independence of errors.

Simply put, the absence of autocorrelation implies that each data point's error terms (residuals) are not correlated with the error terms of other data points in the dataset.

Violating this assumption leads to autocorrelation, where the errors are correlated across observations, affecting the model's accuracy and reliability.

Although not listed in the initial assumptions, Multicollinearity is another important consideration in multiple linear regression, where two or more independent variables are highly correlated.

Multicollinearity can also arise if an independent variable can be approximated as a linear combination of some other independent variables.

High Multicollinearity can cause numerical instability and make it challenging to interpret the individual effects of each independent variable.
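As a rough sketch of how one might validate two of these assumptions in practice (the fitted model, data, and thresholds below are illustrative choices, not a complete diagnostic suite):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 300)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, x.size)   # synthetic, well-behaved data

# fit a line and compute the residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# 1) Normality of errors: Shapiro-Wilk test
#    (a large p-value means no evidence against normality)
_, p_normal = stats.shapiro(residuals)

# 2) Homoscedasticity (crude check): residual spread should be
#    similar in the lower and upper halves of x
spread_low = residuals[: x.size // 2].std()
spread_high = residuals[x.size // 2:].std()
```

On this well-behaved data, the normality test raises no flag and the residual spread is roughly constant; on real data, a tiny p-value or a systematically changing spread would signal a violated assumption.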

Now that we know the assumptions of linear regression, let's understand where these assumptions come from.

**But from here, I want you to forget that these assumptions even exist.**

See, any real answer should help you design the algorithm and derive these assumptions from scratch without knowing anything about them beforehand.

It is like replicating the exact thought process when a statistician first formulated the algorithm.

To understand this point better, ask yourself a couple of questions.

At the time of formulating the algorithm:

- Were they aware of the algorithm? **NO.**
- Were they aware of the assumptions that would eventually emerge from the algorithm? **NO.**

All they had was some observed data, which they wanted to model.

Thus, we should imitate that thought process too.

Essentially, whenever we model a dataset, we think about the process that could have generated the observed data.

This is called the data generation process.

The data generation process for linear regression involves simulating a linear relationship between the dependent variable and one or more independent variables.

This process can be broken down into the following steps:
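A minimal simulation of such a generation process might look as follows (the true coefficients and noise level are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# assumed generation process: pick true parameters, sample features,
# compute the linear signal, then add zero-mean Gaussian noise
theta = np.array([1.5, -0.7])            # true coefficients (illustrative)
X = rng.uniform(-5, 5, size=(1000, 2))   # independent variables
noise = rng.normal(0, 1.0, size=1000)    # epsilon ~ N(0, sigma^2)
y = X @ theta + noise                    # observed dependent variable

# recovering theta with least squares lands close to the truth
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```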

The true strength of a neural network comes from activation functions. They allow the model to learn non-linear complex relationships between inputs and outputs.

That is why they are considered a core component of all neural networks.

There are many activation functions one can use to learn non-linear patterns. These include Sigmoid, Tanh, etc.

But among all popular choices, many folks struggle to intuitively understand ReLU's power.

With its seemingly linear shape, calling it a non-linear activation function isn't intuitive.

An obvious question is: "**How does ReLU allow a neural network to capture non-linearity?**"

If you have been in that situation, let me help.

By the end of this blog, you will have an intuitive understanding of ReLU, how it works, and why it is so effective.

Let's begin!

Before understanding ReLU's effectiveness, let's understand the purpose of using activation functions.

Imagine a neural network **without any activation functions**.

The input $x$ is multiplied by a set of weights $W_{1}$ and a bias $b_{1}$ is added.

$$ Z_{1} = W_{1} \cdot x + b_{1} $$

The above output is then passed to the next layer for transformation using a new set of weights $W_{2}$ and biases $b_{2}$.

$$ Z_{2} = W_{2} \cdot Z_{1} + b_{2} $$

This process goes on and on until the output layer.

In such a scenario, the neural network would essentially be a series of linear transformations and translations stacked together:

- First, the weight matrix $W$ applies a linear transformation.
- Next, the bias term $b$ translates the obtained output.

Thus, without activation functions, the network never captures any non-linear patterns.

To understand better, consider the following 2D dataset:

Clearly, the data is linearly inseparable.

Next, we will replicate the transformation done by a neural network. For simplicity and to create a visualization, we'll do these in 2D.

Let the weight matrix and bias be:

$$ W = \begin{bmatrix} 1 & -2 \\ -2 & 2 \\ \end{bmatrix} \ ; \ b = \begin{bmatrix} 2 \\ -1 \\ \end{bmatrix} $$

The transformation is visualized below:

As shown, the entire transformation does nothing but scale, rotate, and shift the data. However, the linear inseparability remains unaffected.

The same is true for a neural network without activations.

No matter how many layers we add, the output will always be a linear transformation of the input.

In fact, one can squish all the linear transformations into a single weight matrix $W$, as shown below:

$$ W_{k} \cdot (W_{k-1} (\dots (W_{2} \cdot (W_{1}\cdot x)))) \rightarrow Wx$$
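This collapse is easy to check numerically; here's a small sketch with three bias-free layers:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 100))   # a batch of 100 four-dimensional inputs

# three "layers" without activations (biases omitted for brevity)
W1, W2, W3 = (rng.normal(size=(4, 4)) for _ in range(3))
deep_out = W3 @ (W2 @ (W1 @ x))

# the whole stack collapses into a single matrix W
W = W3 @ W2 @ W1
single_out = W @ x
```

The two outputs are identical: the "deep" activation-free network is just one linear map.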

The absence of an activation function severely limits the expressiveness and modeling power of the neural network. As a result, the network can only learn linear relationships.

We can also verify this experimentally:

Despite having sufficient layers, each with many neurons, the decision boundary stays linear at every epoch.

However, by adding an activation function (Tanh, in this case), the network progressively learns a non-linear decision boundary:

This explains the importance of activation functions.

To reiterate...

To model non-linear problems, there must be some non-linearity in the networks. Activation functions do precisely that!

Before understanding how ReLU adds non-linearity to a neural network, let's consider why we typically prefer ReLU over other activation functions.

In other words, let's look at some advantages of using ReLU over its contenders.

👉

Feel free to skip this part if you are well-versed with the advantages of ReLU.

ReLU involves a simple mathematical operation free from complex transformations such as exponentials, sinusoids, etc.

Therefore, it can be computed quickly and easily during a forward pass.

Moreover, gradient computation performed during the backward pass is equally simple.

Thus, computational efficiency makes ReLU a preferred choice, especially for large neural networks with many parameters.

Overfitting is a common problem in neural networks. Dropout is a regularization technique that randomly drops (or zeros-out) units from a neural network during training.

This prevents units from co-adapting excessively and encourages the network to learn more diverse and robust features.

As depicted in the above equation, by using ReLU, a neuron with a negative activation is turned off.

Therefore, ReLU, in a way, contributes to Dropout and adds a bit more regularization to the network.

There's a common problem with activation functions such as Sigmoid (or Tanh).

At times, the gradients may become very small.

For instance, for large positive and negative inputs, the gradient of the Sigmoid curve is small.

Thus, gradients may become increasingly smaller during backpropagation, leading to insignificant parameter updates.

Vanishing gradients make it difficult for the network to learn, especially the initial layers of the network.

ReLU, on the other hand, produces a constant gradient for positive inputs, thereby avoiding vanishing gradients.

This results in faster learning.
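A quick numerical comparison of the two gradients illustrates this point:

```python
import numpy as np

def sigmoid_grad(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)              # derivative of the sigmoid

def relu_grad(z):
    return (z > 0).astype(float)    # 1 for positive inputs, 0 otherwise

z = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
sig_g = sigmoid_grad(z)             # shrinks toward zero at both extremes
relu_g = relu_grad(z)               # constant 1 for all positive inputs
```

The sigmoid's gradient peaks at $0.25$ (at $z=0$) and vanishes for large $|z|$, whereas ReLU's gradient stays at $1$ for every positive input, no matter how large.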

Now that we understand what makes ReLU advantageous over other activation functions, we'll see how ReLU acts as a non-linear activation function.
