
[Hands-on] RAG over audio files

...using AssemblyAI and DeepSeek-R1

Avi Chawla

TODAY’S DAILY DOSE OF DATA SCIENCE

While most RAG systems are built over text, plenty of data exists as speech (audio), and we need reliable ways to run RAG over it.

So today, let’s build a RAG app over audio files with DeepSeek-R1.

Here’s an overview of our app:

  • Step 1) Takes an audio file and transcribes it using ​AssemblyAI​.
  • Steps 2-3) Embeds the transcript and stores it in a Qdrant vector database.
  • Steps 4-6) Queries the database to retrieve relevant context.
  • Steps 7-8) Uses DeepSeek-R1 as the LLM to generate a response.

​AssemblyAI​ has always been my go-to for building speech-driven AI applications.

It’s an AI transcription platform that provides state-of-the-art models for speech and audio understanding tasks.

Now let's jump into code!

The GitHub repo is linked towards the end of this issue.

Let's build

​Implementation​

To transcribe audio files, get an API key from ​AssemblyAI​ and store it in the `.env` file. ​Get the API key here →​

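Here’s a minimal sketch of that setup, assuming the key is stored under a variable named `ASSEMBLYAI_API_KEY` and loaded with `python-dotenv` (the variable name is our choice; the SDK only needs `aai.settings.api_key` to be set):

```python
import os

import assemblyai as aai
from dotenv import load_dotenv  # pip install python-dotenv

# Read ASSEMBLYAI_API_KEY from the .env file into the environment.
load_dotenv()

# Hand the key to the AssemblyAI SDK.
aai.settings.api_key = os.getenv("ASSEMBLYAI_API_KEY")
```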

Next, we use AssemblyAI to transcribe audio with speaker labels (see the sketch after this list). To do this:

  • We set up the transcriber object.
  • We enable speaker label detection in the config.
  • We transcribe the audio using ​AssemblyAI​.
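
Here’s a minimal sketch of this step; `transcribe_audio` and its return format are illustrative, not necessarily the exact code in the repo:

```python
import assemblyai as aai

def transcribe_audio(audio_path: str) -> list[str]:
    # Enable speaker diarization so every utterance carries a speaker label.
    config = aai.TranscriptionConfig(speaker_labels=True)
    transcriber = aai.Transcriber(config=config)

    # Upload and transcribe the local audio file.
    transcript = transcriber.transcribe(audio_path)

    # Turn the diarized utterances into "Speaker A: ..." style sentences.
    return [f"Speaker {u.speaker}: {u.text}" for u in transcript.utterances]
```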

Moving on, we embed the transcript and store it in a vector database (code sketch below). To do this, we:

  • Load the embedding model and generate embeddings.
  • Connect to Qdrant and create a collection.
  • Store the embeddings.
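
A sketch of the indexing step, assuming `sentence-transformers` (`all-MiniLM-L6-v2`, 384-dimensional vectors) as the embedding model and an in-memory Qdrant instance; the collection name `audio_rag` is arbitrary:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim sentence embeddings

def index_transcript(sentences: list[str], collection: str = "audio_rag") -> QdrantClient:
    client = QdrantClient(":memory:")  # swap for a persistent Qdrant URL in production

    # Create the collection with cosine similarity over 384-dim vectors.
    client.recreate_collection(
        collection_name=collection,
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )

    # Embed each transcript sentence and upsert it with the raw text as payload.
    vectors = embedder.encode(sentences).tolist()
    points = [
        PointStruct(id=i, vector=vec, payload={"text": text})
        for i, (vec, text) in enumerate(zip(vectors, sentences))
    ]
    client.upsert(collection_name=collection, points=points)
    return client
```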

Now comes retrieval, where we query the vector database for transcript sentences similar to the query (sketch after the list):

  • Convert the query into an embedding.
  • Search the vector database.
  • Retrieve the top results.
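
A sketch of retrieval, continuing from the previous snippet (it reuses `embedder` and the Qdrant client); `top_k` controls how many transcript sentences end up in the context:

```python
def retrieve_context(client: QdrantClient, query: str,
                     collection: str = "audio_rag", top_k: int = 3) -> str:
    # Embed the query with the same model used for the transcript sentences.
    query_vector = embedder.encode(query).tolist()

    # Fetch the top_k most similar sentences from Qdrant.
    hits = client.search(
        collection_name=collection,
        query_vector=query_vector,
        limit=top_k,
    )

    # Join the retrieved sentences into one context block for the prompt.
    return "\n".join(hit.payload["text"] for hit in hits)
```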

Finally, after retrieving the context (see the snippet below):

  • We construct a prompt.
  • We use DeepSeek-R1 through Ollama to generate a response.
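
A sketch of the generation step with the `ollama` Python client; it assumes `ollama pull deepseek-r1` has already been run locally, and the prompt wording is ours:

```python
import ollama

def answer(query: str, context: str) -> str:
    # Ground the model in the retrieved transcript context.
    prompt = (
        "Answer the question using only the transcript context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    response = ollama.chat(
        model="deepseek-r1",
        messages=[{"role": "user", "content": prompt}],
    )
    # Note: R1 may include its reasoning in <think>...</think> tags,
    # which you can strip before displaying the answer.
    return response["message"]["content"]
```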

To make this accessible, we wrap the entire app in a Streamlit interface. It’s a simple UI where you can upload an audio file and chat with it directly.
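
Here’s a bare-bones sketch of that wrapper, assuming the helper functions from the snippets above (`transcribe_audio`, `index_transcript`, `retrieve_context`, `answer`) live in the same module:

```python
import tempfile
from pathlib import Path

import streamlit as st

st.title("Chat with your audio")

uploaded = st.file_uploader("Upload an audio file", type=["mp3", "wav", "m4a"])
if uploaded:
    # Persist the upload to disk so AssemblyAI can transcribe it from a file path.
    suffix = Path(uploaded.name).suffix or ".mp3"
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(uploaded.read())
        audio_path = tmp.name

    sentences = transcribe_audio(audio_path)
    client = index_transcript(sentences)

    query = st.text_input("Ask a question about the audio")
    if query:
        context = retrieve_context(client, query)
        st.write(answer(query, context))
```

In a real app you’d cache the transcription and the index (for example, in `st.session_state`) so they aren’t recomputed on every rerun; the sketch keeps things linear for readability.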


That was simple, wasn’t it?

The code is available here: ​RAG over audio files​.

A departing note

​Why AssemblyAI?

We first used AssemblyAI over two years ago, and in our experience, it has the most developer-friendly and intuitive SDKs for integrating speech AI into applications.

​AssemblyAI​ first trained Universal-1 on 12.5 million hours of audio, outperforming every other model in the industry (from Google, OpenAI, etc.) across 15+ languages.

Recently, they released ​Universal-2​, their most advanced speech-to-text model yet.

Here’s how ​Universal-2​ compares with Universal-1:

  • 24% improvement in proper noun recognition
  • 21% improvement in alphanumeric accuracy
  • 15% better text formatting


Isn’t that impressive?

We love ​AssemblyAI’s​ mission of supporting developers in building next-gen voice applications in the simplest and most effective way possible.

They have already made a big dent in speech technology, and we're eager to see how they continue from here.

Get started by exploring their API docs here: AssemblyAI API docs.

🙌 A big thanks to AssemblyAI, who very kindly partnered with us on this demo and allowed us to showcase their industry-leading AI transcription services.

👉 Over to you: What would you use AssemblyAI for?

Thanks for reading!

Published on Mar 4, 2025