[Hands-on] Multimodal RAG using DeepSeek's Janus
100% Local.

TODAY'S ISSUE
After DeepSeek-R1, DeepSeek released more open-weight multimodal models: Janus, Janus-Pro, and JanusFlow.
They can understand images and generate images from text input.
Moreover, they beat OpenAI's DALL-E 3 and Stable Diffusion on the GenEval and DPG-Bench benchmarks.
Today, let’s do a hands-on demo of building a multimodal RAG with Janus-Pro on a complex document shown below:

It has several complex diagrams, text within visualizations, and tables—perfect for multimodal RAG.
We’ll use:
- ColPali to understand and embed the document pages using its vision capabilities.
- Qdrant as the vector database.
- DeepSeek's Janus-Pro as the multimodal LLM.
- Streamlit for the interface.
This image shows the final outcome of today's issue (the code is available towards the end):
Let's build it!
We extract each document page as an image and embed it using ColPali.
We did a full architectural breakdown of ColPali in ​Part 9 of the RAG crash course​ and also optimized it with binary quantization.
ColPali uses vision capabilities to understand the context. It produces patches for every page, and each patch gets an embedding vector.
This is implemented below:
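Here's a minimal sketch of this step, assuming the colpali-engine and pdf2image packages, a CUDA GPU, the vidore/colpali-v1.2 checkpoint, and a file named document.pdf (swap in your own document):

```python
import torch
from pdf2image import convert_from_path
from colpali_engine.models import ColPali, ColPaliProcessor

# Load ColPali (checkpoint name is an assumption; any recent ColPali checkpoint works)
model_name = "vidore/colpali-v1.2"
colpali = ColPali.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda:0"
).eval()
colpali_processor = ColPaliProcessor.from_pretrained(model_name)

# Convert every page of the PDF into an image
pages = convert_from_path("document.pdf")

# Embed pages in small batches; each page yields a multi-vector embedding
# (one 128-dim vector per image patch)
page_embeddings = []
with torch.no_grad():
    for i in range(0, len(pages), 2):
        batch = colpali_processor.process_images(pages[i : i + 2]).to(colpali.device)
        embeddings = colpali(**batch)
        page_embeddings.extend(list(torch.unbind(embeddings.cpu())))
```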

Embeddings are ready. Next, we create a Qdrant vector database and store these embeddings in it, as demonstrated below:
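A minimal sketch of this step, assuming qdrant-client 1.10+ (for multi-vector support) and the page_embeddings list from the previous snippet; the collection name is our own:

```python
from qdrant_client import QdrantClient, models

# In-memory instance for the demo; point this at a local Qdrant server if you run one
client = QdrantClient(":memory:")

collection = "deepseek-janus-rag"  # hypothetical collection name

client.create_collection(
    collection_name=collection,
    vectors_config=models.VectorParams(
        size=128,  # dimension of each ColPali patch vector
        distance=models.Distance.COSINE,
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM  # late-interaction scoring
        ),
    ),
)

# One point per page; the payload records which page the embedding came from
client.upsert(
    collection_name=collection,
    points=[
        models.PointStruct(
            id=idx,
            vector=embedding.float().numpy().tolist(),
            payload={"page_index": idx},
        )
        for idx, embedding in enumerate(page_embeddings)
    ],
)
```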

Next, we set up DeepSeek's latest Janus-Pro by downloading it from Hugging Face.
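The sketch below follows the usage shown in DeepSeek's official Janus repository and assumes its janus package is installed; the 7B checkpoint is an assumption (the 1B variant also works on smaller GPUs):

```python
import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor

model_path = "deepseek-ai/Janus-Pro-7B"

# Processor handles the chat template and image preprocessing
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

# Load the multimodal LLM and move it to the GPU in bfloat16
vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()
```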

Next, we:
- Embed the user query with ColPali.
- Query the Qdrant vector database to retrieve the most relevant page.
- Pass the retrieved page image, along with the query, to Janus-Pro to generate a response (sketched below).
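Here's one way to stitch those steps together using the objects from the earlier snippets; the answer() helper name is our own, and the chat format mirrors the example in the official Janus repo:

```python
def answer(query: str) -> str:
    # 1) Embed the query with ColPali (also a multi-vector: one vector per query token)
    with torch.no_grad():
        q_batch = colpali_processor.process_queries([query]).to(colpali.device)
        q_embedding = colpali(**q_batch)[0].cpu().float().numpy().tolist()

    # 2) Retrieve the most relevant page from Qdrant via MaxSim late interaction
    hits = client.query_points(collection_name=collection, query=q_embedding, limit=1)
    page = pages[hits.points[0].payload["page_index"]]

    # 3) Ask Janus-Pro about the retrieved page image
    conversation = [
        {"role": "<|User|>", "content": f"<image_placeholder>\n{query}", "images": [page]},
        {"role": "<|Assistant|>", "content": ""},
    ]
    inputs = vl_chat_processor(
        conversations=conversation, images=[page], force_batchify=True
    ).to(vl_gpt.device)
    inputs_embeds = vl_gpt.prepare_inputs_embeds(**inputs)
    outputs = vl_gpt.language_model.generate(
        inputs_embeds=inputs_embeds,
        attention_mask=inputs.attention_mask,
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=512,
        do_sample=False,
        use_cache=True,
    )
    return tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
```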

Done!
We have implemented a 100% local Multimodal RAG powered by DeepSeek's latest Janus-Pro.
There’s also a Streamlit part (not covered in detail here), but after building it, we get this clean and neat interface.
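For reference, the Streamlit wrapper can be as small as the sketch below, assuming the answer() helper defined earlier:

```python
import streamlit as st

st.title("Multimodal RAG with DeepSeek Janus-Pro")

query = st.text_input("Ask a question about the document")
if query:
    with st.spinner("Retrieving the page and generating a response..."):
        response = answer(query)  # hypothetical helper from the earlier sketch
    st.write(response)
```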
In this example, it produces the right response by retrieving the correct page and understanding a complex visualization👇

Here's one more example with a correct response:

Wasn’t that easy and straightforward?
The code for today's demo is available here: ​Multimodal RAG with DeepSeek​.
👉 Over to you: What other demos would you like to see with DeepSeek?
Thanks for reading!
There are many issues with grid search and random search.
Bayesian optimization solves this.
It’s fast, informed, and performant, as depicted below:

Learning about optimized hyperparameter tuning and using it will be extremely helpful if you wish to build large ML models quickly.
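To make this concrete, here's a minimal sketch of Bayesian-style tuning with Optuna (our choice for illustration, not necessarily what the linked deep dive uses); its sampler proposes each new trial based on the results of previous ones instead of searching exhaustively or blindly:

```python
import optuna
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

def objective(trial):
    # Hyperparameters are suggested by the study's sampler, informed by past trials
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "max_depth": trial.suggest_int("max_depth", 3, 20),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 10),
    }
    clf = RandomForestClassifier(**params, random_state=0)
    return cross_val_score(clf, X, y, cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params, study.best_value)
```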
There’s so much data on your mobile phone right now — images, text messages, etc.
And this is just about one user — you.
But applications can have millions of users. The amount of data we can train ML models on is unfathomable.
The problem?
This data is private.
So consolidating this data into a single place to train a model isn't an option.
The solution?
​Federated learning​ is a smart way to address this challenge.
The core idea is to ship models to devices, train the model on the device, and retrieve the updates:
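Here's a minimal PyTorch sketch of that idea (federated averaging, simulated in a single process; client_loaders is a hypothetical list of per-device DataLoaders):

```python
import copy
import torch
import torch.nn as nn

def local_train(global_model, data_loader, epochs=1, lr=0.01):
    # "Ship" the model to the client: copy it and train on the client's local data
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in data_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model.state_dict()  # only the updated weights are sent back

def federated_average(global_model, client_loaders):
    # 1) Each client trains locally and returns its updated weights
    client_states = [local_train(global_model, dl) for dl in client_loaders]
    # 2) Aggregate: average the client weights (FedAvg) into the global model
    avg_state = copy.deepcopy(client_states[0])
    for key in avg_state:
        avg_state[key] = torch.stack([s[key].float() for s in client_states]).mean(0)
    global_model.load_state_dict(avg_state)
    return global_model
```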
But this isn't as simple as it sounds.
1) Since the model is trained on the client side, how do we keep it small enough to run on a device?
2) How do we aggregate different models received from the client side?
3) [IMPORTANT] Privacy-sensitive datasets are always biased by personal likings and beliefs. For instance, in an image-related task, the photos on each device reflect that user's interests, so the data distribution varies heavily from client to client.
​Learn how to implement federated learning systems (beginner-friendly) →