Model Compression: A Step Towards Efficient Machine Learning
Model accuracy alone (or an equivalent performance metric) rarely determines which model will be deployed.
Much of the engineering effort goes into making the model production-friendly.
Contrary to what many assume, the model that gets shipped is almost never chosen on raw performance alone.
Instead, we also consider several operational and feasibility metrics, such as:
- Inference Latency: Time taken by the model to return a prediction.
- Model size: The memory occupied by the model.
- Ease of scalability, and so on (a quick sketch of measuring the first two metrics follows this list).
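To make these metrics concrete, here is a minimal sketch of how model size and inference latency are typically measured. It assumes PyTorch, a small hypothetical model, and an illustrative file name (`model.pt`); none of these come from the article itself.

```python
import os
import time
import torch
import torch.nn as nn

# Hypothetical toy model, purely for illustration.
model = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10))
model.eval()

# Model size: serialize the weights and check the file size on disk.
torch.save(model.state_dict(), "model.pt")
size_mb = os.path.getsize("model.pt") / 1e6
print(f"Model size: {size_mb:.2f} MB")

# Inference latency: average wall-clock time over repeated forward passes.
x = torch.randn(1, 784)
with torch.no_grad():
    for _ in range(10):                      # warm-up runs
        model(x)
    start = time.perf_counter()
    for _ in range(100):
        model(x)
    latency_ms = (time.perf_counter() - start) / 100 * 1e3
print(f"Average inference latency: {latency_ms:.3f} ms")
```

In practice, you would measure latency on the target deployment hardware and with realistic batch sizes, since both strongly affect the numbers.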
For instance, consider the image below. It compares the accuracy and size of a large neural network we developed in the article with its pruned (or reduced/compressed) versions:
Looking at these results, wouldn’t you prefer to deploy the model that is 72% smaller yet still (almost) as accurate as the large model?
Of course, this depends on the task, but in most cases it makes little sense to deploy the large model when one of its heavily pruned versions performs equally well.
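For a sense of what "pruning" means here, below is a minimal sketch of magnitude-based unstructured pruning using PyTorch's `torch.nn.utils.prune`. The toy model and the 0.72 sparsity level (chosen to mirror the "72% smaller" figure above) are illustrative assumptions, not the exact setup behind the plot.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Hypothetical toy model, purely for illustration.
model = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10))

# Zero out the 72% smallest-magnitude weights in every Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.72)
        prune.remove(module, "weight")  # make the pruning permanent

# Fraction of weights that are now exactly zero.
total = sum(p.numel() for p in model.parameters())
zeros = sum((p == 0).sum().item() for p in model.parameters())
print(f"Sparsity: {zeros / total:.1%}")
```

Note that unstructured pruning only zeroes weights; the tensors remain dense, so the memory and disk savings are realized once the weights are stored in a sparse format or when structured pruning removes entire neurons or channels.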
In the article below, we discussed and implemented four model compression techniques that ML teams regularly use to save thousands of dollars when running ML models in production.