
Building a 100% local mini-ChatGPT

...using Llama 3.2 Vision and Chainlit.

Avi Chawla

TODAY’S DAILY DOSE OF DATA SCIENCE

Building a 100% local mini-ChatGPT using Llama 3.2 Vision

We built a mini-ChatGPT that runs locally on your computer and is powered by the open-source Llama3.2-vision model.

Here's a demo before we show you how we built it:

[Video demo, 0:33]

You can chat with it just like you would chat with ChatGPT, and provide multimodal prompts.

Here’s what we used:

  • Ollama for serving the open-source Llama3.2-vision model locally.
  • Chainlit, an open-source tool that lets you build production-ready conversational AI apps in minutes.

The code is available on GitHub here: Local ChatGPT.

Let's build it.

We'll assume you are familiar with multimodal prompting. If you are not, we covered it in Part 5 of our RAG crash course (open-access).


We begin with the import statements and define the start_chat method, which is invoked as soon as a new chat session starts:
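A minimal sketch of this step follows. It assumes the ollama Python package is used to talk to the locally served model and that the conversation history is kept under an "interaction" key in Chainlit's user session (the key name and the system prompt are illustrative):

```python
import chainlit as cl
import ollama


@cl.on_chat_start
async def start_chat():
    # Runs once per new chat session: initialize the interaction history
    # that every later turn will append to.
    cl.user_session.set(
        "interaction",
        [{"role": "system", "content": "You are a helpful assistant."}],
    )
```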

We use the @cl.on_chat_start decorator in the above method.

Next, we define another method, which is invoked to generate a response from the LLM (a code sketch follows this list):

  • The user inputs a prompt.
  • We add it to the interaction history.
  • We generate a response from the LLM.
  • We store the LLM response in the interaction history.
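Putting those four steps together, a sketch of this method could look like the following. It continues the same file as above; the generate_response name, the model tag, and the exact ollama call are assumptions:

```python
async def generate_response(prompt: str, images: list[str] | None = None) -> str:
    # Fetch the running interaction history from the user session.
    interaction = cl.user_session.get("interaction")

    # Add the user's prompt (and any attached images) to the history.
    # The ollama Python client accepts image file paths via the "images" key.
    user_turn: dict = {"role": "user", "content": prompt}
    if images:
        user_turn["images"] = images
    interaction.append(user_turn)

    # Generate a response from the locally served Llama 3.2 Vision model.
    response = await ollama.AsyncClient().chat(
        model="llama3.2-vision",
        messages=interaction,
    )
    # Dict-style access; newer ollama versions also expose response.message.content.
    answer = response["message"]["content"]

    # Store the LLM response in the interaction history.
    interaction.append({"role": "assistant", "content": answer})
    return answer
```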

Finally, we define the main method:
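One plausible shape for this is the standard Chainlit on-message handler, sketched below under the assumption that attached images arrive as message elements carrying a mime type and a local file path:

```python
@cl.on_message
async def main(message: cl.Message):
    # Collect the file paths of any images attached to the user's message.
    image_paths = [
        element.path
        for element in message.elements
        if element.mime and element.mime.startswith("image")
    ]

    # Generate the answer and send it back to the UI.
    answer = await generate_response(message.content, images=image_paths or None)
    await cl.Message(content=answer).send()
```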

Done!

Run the app as follows:
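Assuming the code above lives in app.py:

```bash
chainlit run app.py -w   # -w reloads the app whenever the file changes
```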

This launches the app shown in the demo at the top of this post:

The code is available on GitHub here: Local ChatGPT.

We launched this repo recently; it's where we'll publish the code for hands-on AI engineering newsletter issues like this one.

This repository will be dedicated to:

  • In-depth tutorials on LLMs and RAGs.
  • Real-world AI agent applications.
  • Examples to implement, adapt, and scale in your projects.

Find it here: AI Engineering Hub (and do star it).

👉 Over to you: What other features would you like to see in this app?

IN CASE YOU MISSED IT

LoRA/QLoRA: Explained From a Business Lens

Consider the size difference between BERT-large (~340M parameters) and GPT-3 (175B parameters): roughly a 500x jump. GPT-4 is another 10x bigger than GPT-3.

I have fine-tuned BERT-large several times on a single GPU using traditional fine-tuning.

But this is impossible with GPT-3, which has 175B parameters. That's 350GB of memory just to store model weights under float16 precision.

This means that if OpenAI used traditional fine-tuning within its fine-tuning API, it would have to maintain one model copy per user:

  • If 10 users fine-tuned GPT-3 → they need 3500 GB to store model weights.
  • If 1000 users fine-tuned GPT-3 → they need 350k GB to store model weights.
  • If 100k users fine-tuned GPT-3 → they need 35 million GB to store model weights.
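As a quick sanity check on those numbers, assuming 2 bytes per parameter (float16) and one full copy of the weights per fine-tuned user:

```python
# Back-of-the-envelope storage for per-user full fine-tuning of GPT-3.
params = 175e9        # GPT-3 parameter count
bytes_per_param = 2   # float16

per_copy_gb = params * bytes_per_param / 1e9   # ~350 GB per fine-tuned copy

for users in (10, 1_000, 100_000):
    print(f"{users:>7,} users -> {users * per_copy_gb:,.0f} GB of weights")
#      10 users -> 3,500 GB of weights
#   1,000 users -> 350,000 GB of weights
# 100,000 users -> 35,000,000 GB of weights
```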

And the problems don't end there:

  • OpenAI bills solely based on usage. What if someone fine-tunes the model for fun or learning purposes but never uses it?
  • Since a request can come anytime, should they always keep the fine-tuned model loaded in memory? Wouldn't that waste resources since several models may never be used?

LoRA (+ QLoRA and other variants) neatly solved this critical business problem.

We covered this in detail here →

Published on Dec 26, 2024