Sparse Random Projections

An alternative to PCA for high-dimensional datasets.


TODAY’S DAILY DOSE OF DATA SCIENCE

Sparse Random Projections

This is the time complexity of PCA:
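For the standard covariance-based implementation, the cost is commonly cited as roughly O(n·d² + d³), where n is the number of samples and d is the number of dimensions: about n·d² operations to build the covariance matrix, plus d³ for its eigendecomposition.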

The cubic relationship with the number of dimensions makes it impractical for high-dimensional datasets, say, with 1000, 2000, or more dimensions.

I find this aspect of PCA quite amusing.

Think about it: You want to reduce the dimension, but the technique itself isn’t feasible when you have lots of dimensions.

Anyway…

Sparse random projection is a neat solution to this problem.

Let me walk you through the intuition before formally defining it.


Consider the following 2000-dimensional dataset:

Let’s find the distance between two random points here:

Now imagine multiplying the above high-dimensional matrix X with a random projection matrix M of shape (m, d), which results in the projected matrix:

In this case, I projected X to 1000 dimensions from 2000 dimensions.

Now, if we find the distance between the same two points again, we get this:
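Here is a minimal sketch of this experiment; the dataset parameters, the two chosen points, and the Gaussian projection matrix are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_blobs

# Illustrative 2000-dimensional blobs dataset (exact parameters are assumptions)
X, y = make_blobs(n_samples=5_000, n_features=2000, centers=5, random_state=42)

# Distance between two (arbitrarily chosen) points in the original space
d_original = np.linalg.norm(X[0] - X[1])

# Random projection matrix, shaped (d, m) here so that X @ M works.
# The 1/sqrt(m) scaling preserves distances in expectation.
m = 1000
rng = np.random.default_rng(0)
M = rng.normal(size=(X.shape[1], m)) / np.sqrt(m)

X_projected = X @ M  # project from 2000 down to 1000 dimensions

# Distance between the same two points in the projected space
d_projected = np.linalg.norm(X_projected[0] - X_projected[1])

print(d_original, d_projected)  # the two values come out very close
```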

The two distance values are quite similar, aren’t they?

This hints that if we have a pretty high-dimensional dataset, we can transform it to a lower dimension while almost preserving the distance between any two points.

If you have read our deep dive on the curse of dimensionality, you may recall that we covered the mathematical details behind why distances remain almost preserved.

But let’s not stop there, since we also need to check whether the projected data is actually useful.

Recall that we created a blobs dataset.

So let’s utilize KMeans on both the high-dimensional original data and the projected data, and measure the clustering quality using silhouette score.

  • Clustering X is implemented below:
  • Clustering X_projected is implemented below (for random projection, we make use of SparseRandomProjection from sklearn):
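
A minimal sketch of both steps, continuing from the setup above (the number of clusters and n_components=1000 are assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.random_projection import SparseRandomProjection

# 1) Cluster the original high-dimensional data and score it
labels = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)
print("Silhouette on X:", silhouette_score(X, labels))

# 2) Project with a sparse random projection, cluster again, and score it
srp = SparseRandomProjection(n_components=1000, random_state=42)
X_projected = srp.fit_transform(X)

labels_proj = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X_projected)
print("Silhouette on X_projected:", silhouette_score(X_projected, labels_proj))
```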

Look at the silhouette scores:

  • On X, we get 0.8267 (the higher, the better).
  • On X_projected, we get 0.8233 (the higher, the better).

These two scores are highly similar, which indicates that both datasets produce a similar quality of clustering.

Formally speaking, random projections allow us to project high-dimensional datasets into relatively lower-dimensional spaces while almost preserving the distance between the data points.

This has a mathematical intuition attached to it as well, which we covered here: A mathematical deep dive on the curse of dimensionality.

I mentioned “relatively” above since this won’t work if the two spaces significantly differ in their dimensionality.

I ran the above experiment on several dimensions, and the results are shown below:

Original:
Silhouette = 0.8267

Projected:
Components = 2000, Silhouette = 0.8244
Components = 1500, Silhouette = 0.8221
Components = 1000, Silhouette = 0.8233
Components = 750,  Silhouette = 0.8232
Components = 500,  Silhouette = 0.8158
Components = 300,  Silhouette = 0.8212
Components = 200,  Silhouette = 0.8207
Components = 100,  Silhouette = 0.8102
Components = 50,   Silhouette = 0.8041
Components = 20,   Silhouette = 0.7783
Components = 10,   Silhouette = 0.6491
Components = 5,    Silhouette = 0.6665
Components = 2,    Silhouette = 0.5108

From the above results, it is clear that as the dimension we project the data to decreases, the clustering quality drops, which makes intuitive sense as well.

Thus, the dimension you should project to becomes a hyperparameter you can tune based on your specific use case and dataset.
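
For reference, a sweep like the one above can be run with a simple loop over candidate dimensions (a sketch, continuing from the earlier setup):

```python
for n_components in [2000, 1500, 1000, 750, 500, 300, 200, 100, 50, 20, 10, 5, 2]:
    # Project to n_components dimensions, cluster, and score the result
    X_p = SparseRandomProjection(n_components=n_components, random_state=42).fit_transform(X)
    labels = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X_p)
    print(f"Components = {n_components}, Silhouette = {silhouette_score(X_p, labels):.4f}")
```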

In my experience, random projections produce highly unstable results when the data you begin with is low-dimensional (<100 dimensions).

Thus, I only recommend it when the number of features is huge, say 700-800+, and PCA’s run-time is causing problems.

That said, random projection is also used in an LLM fine-tuning technique called VeRA:

Recall that in LoRA, every layer has a different pair of low-rank matrices A and B, and both matrices are trained.

In VeRA, however, matrices A and B are frozen, random, and shared across all model layers. VeRA focuses on learning small, layer-specific scaling vectors, denoted as b and d, which are the only trainable parameters in this setup.
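
To make that concrete, here is a rough, hypothetical PyTorch sketch of a VeRA-style layer (not the paper's official implementation; the initialization choices are assumptions):

```python
import torch
import torch.nn as nn

class VeRALinear(nn.Module):
    """Hypothetical sketch of a VeRA-style adapter around a frozen linear layer."""

    def __init__(self, base: nn.Linear, A: torch.Tensor, B: torch.Tensor):
        super().__init__()
        self.base = base  # pretrained weight, frozen
        for p in self.base.parameters():
            p.requires_grad = False

        # A (r x in_features) and B (out_features x r) are frozen random matrices,
        # shared across all layers, so they are passed in rather than created here.
        self.register_buffer("A", A)
        self.register_buffer("B", B)

        r = A.shape[0]
        # Layer-specific trainable scaling vectors d (for A) and b (for B);
        # these are the only trainable parameters of the adapter.
        self.d = nn.Parameter(torch.ones(r))
        self.b = nn.Parameter(torch.zeros(base.out_features))

    def forward(self, x):
        # h = W0 x + diag(b) @ B @ diag(d) @ A @ x
        delta = (x @ self.A.T) * self.d    # project down with frozen A, scale by d
        delta = (delta @ self.B.T) * self.b  # project up with frozen B, scale by b
        return self.base(x) + delta
```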

We covered 5 LLM fine-tuning techniques here:

5 LLM Fine-tuning Techniques
...explained visually

Also, we formulated PCA from scratch (with all underlying mathematical details) here:

Formulating the Principal Component Analysis (PCA) Algorithm From Scratch
Approaching PCA as an optimization problem.

👉 Over to you: What are some ways to handle high-dimensional datasets when you want to use PCA?

IN CASE YOU MISSED IT

Traditional RAG vs. HyDE

One critical problem with the traditional RAG system is that questions are not semantically similar to their answers.

As a result, several irrelevant contexts get retrieved during the retrieval step due to a higher cosine similarity than the documents actually containing the answer.

HyDE solves this.

The following visual depicts how HyDE differs from traditional RAG.

We covered this in detail in a newsletter issue published last week →

IN CASE YOU MISSED IT

Build a Reasoning Model Like DeepSeek-R1

If you have used DeepSeek-R1 (or any other reasoning model), you must have seen that they autonomously allocate thinking time before producing a response.

Last week, we shared how to embed reasoning capabilities into any LLM.

We trained our own reasoning model like DeepSeek-R1 (with code).

To do this, we used:

  • UnslothAI for efficient fine-tuning.
  • Llama 3.1-8B as the LLM to add reasoning capabilities to.

Find the implementation and a detailed walkthrough in the newsletter here →

THAT'S A WRAP

No-Fluff Industry ML Resources to Succeed in DS/ML Roles

At the end of the day, all businesses care about impact. That’s it!

  • Can you reduce costs?
  • Drive revenue?
  • Can you scale ML models?
  • Predict trends before they happen?

We have discussed several other topics (with implementations) in the past that align with these goals.

Here are some of them:

  • Learn sophisticated graph architectures and how to train them on graph data in this crash course.
  • So many real-world NLP systems rely on pairwise context scoring. Learn scalable approaches here.
  • Run large models on small devices using Quantization techniques.
  • Learn how to generate prediction intervals or sets with strong statistical guarantees for increasing trust using Conformal Predictions.
  • Learn how to identify causal relationships and answer business questions using causal inference in this crash course.
  • Learn how to scale and implement ML model training in this practical guide.
  • Learn 5 techniques with implementation to reliably test ML models in production.
  • Learn how to build and implement privacy-first ML systems using Federated Learning.
  • Learn 6 techniques with implementation to compress ML models.

All these resources will help you cultivate key skills that businesses and companies care about the most.


