TODAY'S ISSUE
TODAY’S DAILY DOSE OF DATA SCIENCE
Mixed Precision Training
Typical deep learning libraries are really conservative when it comes to assigning data types.
The data type assigned by default is usually 64-bit or 32-bit, even when 16-bit precision would suffice. This is also evident from the code below:
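For instance, a minimal check in PyTorch (the same defaults apply to most tensor-creation functions):

```python
import torch
import numpy as np

# PyTorch tensors default to 32-bit floats
x = torch.randn(3, 3)
print(x.dtype)  # torch.float32

# NumPy defaults even higher, to 64-bit floats
print(np.random.randn(3, 3).dtype)  # float64
```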
As a result, memory is not allocated as efficiently as it could be.
Of course, this is done to ensure better precision in representing information.
However, this precision always comes at the cost of additional memory utilization, which may not be desired in all situations.
In fact, it is also observed that many tensor operations, especially matrix multiplication, are much faster when we operate under smaller precision data types than larger ones, as demonstrated below:
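Here is a rough micro-benchmark sketch; exact timings depend on your hardware, and the float16 advantage is largest on GPUs with tensor cores (a CUDA device is assumed):

```python
import time
import torch

# Time a large matrix multiplication in float32 vs. float16
for dtype in (torch.float32, torch.float16):
    a = torch.randn(4096, 4096, device="cuda", dtype=dtype)
    b = torch.randn(4096, 4096, device="cuda", dtype=dtype)
    torch.cuda.synchronize()           # wait for pending GPU work
    start = time.perf_counter()
    for _ in range(10):
        _ = a @ b
    torch.cuda.synchronize()           # ensure the matmuls finished
    print(f"{dtype}: {time.perf_counter() - start:.4f} s")
```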
Moreover, since float16 is only half the size of float32, its usage reduces the memory required to train the network.
This also allows us to train larger models, train on larger mini-batches (resulting in even more speedup), etc.
Mixed precision training is a pretty reliable and widely adopted technique in the industry to achieve this.
As the name suggests, the idea is to employ lower precision float16 (wherever feasible, like in convolutions and matrix multiplications) along with float32, which is why the name “mixed precision.”
This is a list of some models I found that were trained using mixed precision:
It’s pretty clear that mixed precision training is widely used, even though we don’t get to hear about it often.
Before we get into the technical details…
From the above discussion, it must be clear that as we use a low-precision data type (float16), we might unknowingly introduce some numerical inconsistencies and inaccuracies.
To avoid them, there are some best practices for mixed precision training that I want to talk about next, along with the code.
Mixed precision training in PyTorch and Best Practices
Leveraging mixed precision training in PyTorch requires a few modifications to the existing network training implementation.
Consider that this is our current PyTorch model training implementation:
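Here’s a minimal sketch of such a loop; the linear model, dummy data, and hyperparameters are illustrative stand-ins for your actual setup:

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for a real model and dataset
model = nn.Linear(128, 10).cuda()
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

for step in range(100):
    x = torch.randn(64, 128, device="cuda")         # dummy inputs
    y = torch.randint(0, 10, (64,), device="cuda")  # dummy labels

    out = model(x)          # forward pass (float32 by default)
    loss = loss_fn(out, y)  # compute the loss
    loss.backward()         # backward pass
    opt.step()              # update weights
    opt.zero_grad()         # reset gradients
```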
The first thing we introduce here is a scaler object that will scale the loss value:
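In PyTorch, this comes from the AMP utilities; a one-line addition to the setup above:

```python
# Scales the loss to prevent float16 gradient underflow
scaler = torch.cuda.amp.GradScaler()
```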
We do this because, at times, the original loss value can be so small that its gradients cannot be represented in float16 with full precision.
Such situations may not produce any update to the model’s weights.
Scaling the loss to a higher numerical range ensures that even small gradients can contribute to the weight updates.
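A quick way to see this underflow (the scale factor here is illustrative; in practice, GradScaler adjusts it dynamically):

```python
import torch

g = torch.tensor(1e-8)                # a tiny float32 gradient value
print(g.to(torch.float16))            # tensor(0., dtype=torch.float16): underflow
print((g * 2**16).to(torch.float16))  # after scaling, the value survives
```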
But these minute gradients can only be accommodated into the weight matrix when the weight matrix itself is represented in high precision, i.e., float32.
Thus, as a conservative measure, we tend to keep the weights in float32.
That said, the loss scaling step is not entirely necessary because, in my experience, these little updates typically appear towards the end stages of the model training.
Thus, it can be fair to assume that small updates may not drastically impact the model performance.
But don’t take this as a definite conclusion; it’s something I want you to validate when you use mixed precision training.
Moving on, as the weights (which are matrices) are represented in float32, we cannot expect the float16 speedup if they remain this way:
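We can verify this on the illustrative model from before:

```python
# The master weights stay in full precision by default
for p in model.parameters():
    print(p.dtype)  # torch.float32
```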
To leverage these float16-based speedups, here are the steps we follow:
1. We make a float16 copy of the weights during the forward pass.
2. Next, we compute the loss value in float32 and scale it to have more precision in the gradients, which are computed in float16.
   - The reason we compute gradients in float16 is that, like the forward pass, gradient computations also involve matrix multiplications.
   - Thus, keeping them in float16 can provide additional speedup.
3. Once we have computed the gradients in float16, the heavy matrix multiplication operations have been completed. Now, all we need to do is update the original weight matrix, which is in float32.
4. Thus, we make a float32 copy of the above gradients, remove the scale we applied in Step 2, and update the float32 weights.
5. Done!
The mixed-precision settings in the forward pass are carried out by the torch.autocast() context manager:
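A minimal sketch of the forward pass under autocast, reusing the illustrative model and batch from before:

```python
with torch.autocast(device_type="cuda", dtype=torch.float16):
    out = model(x)          # matmuls and convolutions run in float16
    loss = loss_fn(out, y)  # numerically sensitive ops are kept in float32
```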
Now, it’s time to handle the backward pass.
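Putting both halves together, here is a minimal sketch of the full mixed precision loop (same illustrative setup as before):

```python
scaler = torch.cuda.amp.GradScaler()

for step in range(100):
    x = torch.randn(64, 128, device="cuda")         # dummy inputs
    y = torch.randint(0, 10, (64,), device="cuda")  # dummy labels

    with torch.autocast(device_type="cuda", dtype=torch.float16):
        out = model(x)
        loss = loss_fn(out, y)

    scaler.scale(loss).backward()  # scale the loss, then compute gradients
    scaler.step(opt)               # unscale gradients and update weights
    scaler.update()                # update the scale for the next iteration
    opt.zero_grad()                # zero the gradients
```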
- scaler.scale(loss).backward(): the scaler object scales the loss value, and backward() is called to compute the gradients.
- scaler.step(opt): unscales the gradients and updates the weights.
- scaler.update(): updates the scale for the next iteration.
- opt.zero_grad(): zeroes the gradients.
Done!
The efficacy of mixed precision training over conventional training is evident from the image below:
Mixed precision training is over 2.5x faster than conventional training.
Isn’t that impressive?
👉 Over to you: What are some other reliable ways to speed up machine learning model training?
EXTENDED PIECE #1
Beyond grid and random search
There are many issues with grid search and random search.
- They are computationally expensive due to exhaustive search.
- The search is restricted to the specified hyperparameter range. But what if the ideal hyperparameter exists outside that range?
- They can ONLY perform discrete searches, even if the hyperparameter is continuous.
Bayesian optimization solves this.
It’s fast, informed, and performant, as depicted below:
Learning about optimized hyperparameter tuning and utilizing it will be extremely helpful to you if you wish to build large ML models quickly.
EXTENDED PIECE #2
Beyond linear regression
Linear regression makes some strict assumptions about the type of data it can model, as depicted below.
Can you be sure that these assumptions will never break?
Nothing stops real-world datasets from violating these assumptions.
That is why being aware of linear regression’s extensions is immensely important.
Generalized linear models (GLMs) precisely do that.
They relax the assumptions of linear regression to make linear models more adaptable to real-world datasets.
THAT'S A WRAP
No-Fluff Industry ML resources to succeed in DS/ML roles
At the end of the day, all businesses care about impact. That’s it!
- Can you reduce costs?
- Drive revenue?
- Can you scale ML models?
- Predict trends before they happen?
We have discussed several other topics (with implementations) in the past that align with these goals.
Here are some of them:
- Learn sophisticated graph architectures and how to train them on graph data in this crash course.
- So many real-world NLP systems rely on pairwise context scoring. Learn scalable approaches here.
- Run large models on small devices using Quantization techniques.
- Learn how to generate prediction intervals or sets with strong statistical guarantees for increasing trust using Conformal Predictions.
- Learn how to identify causal relationships and answer business questions using causal inference in this crash course.
- Learn how to scale and implement ML model training in this practical guide.
- Learn 5 techniques with implementation to reliably test ML models in production.
- Learn how to build and implement privacy-first ML systems using Federated Learning.
- Learn 6 techniques with implementation to compress ML models.
All these resources will help you cultivate key skills that businesses and companies care about the most.
SPONSOR US
Advertise to 450k+ data professionals
Our newsletter puts your products and services directly in front of an audience that matters — thousands of leaders, senior data scientists, machine learning engineers, data analysts, etc., around the world.