
Building a 100% local mini-ChatGPT

...using Llama 3.2 Vision and Chainlit.

Avi Chawla

TODAY’S DAILY DOSE OF DATA SCIENCE

Building a 100% local mini-ChatGPT using Llama 3.2 Vision

We built a mini-ChatGPT that runs locally on your computer and is powered by the open-source Llama3.2-vision model.

Here's a demo before we show you how we built it:

[Video demo, 0:33]

You can chat with it just like you would chat with ChatGPT, and provide multimodal prompts.

Here’s what we used:

  • Ollama for serving the open-source Llama3.2-vision model locally.
  • Chainlit, an open-source tool that lets you build production-ready conversational AI apps in minutes.

The code is available on GitHub here: Local ChatGPT.

Let's build it.

We'll assume you are familiar with multimodal prompting. If you are not, we covered it in Part 5 of our RAG crash course (open-access).


We begin with the import statements and define the start_chat method, which is invoked as soon as a new chat session starts:
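A minimal sketch of this step follows. It assumes the ollama Python package is used to talk to the locally served model and that the conversation history is kept under an "interaction" key in Chainlit's user session (the key name and the system prompt are illustrative):

```python
import chainlit as cl
import ollama


@cl.on_chat_start
async def start_chat():
    # Runs once per new chat session: initialize the interaction history
    # that every later turn will append to.
    cl.user_session.set(
        "interaction",
        [{"role": "system", "content": "You are a helpful assistant."}],
    )
```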

We use the @cl.on_chat_start decorator in the above method.

Next, we define another method, which is invoked to generate a response from the LLM (a code sketch follows this list):

  • The user inputs a prompt.
  • We add it to the interaction history.
  • We generate a response from the LLM.
  • We store the LLM response in the interaction history.
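Putting those four steps together, a sketch of this method could look like the following. It continues the same file as above; the generate_response name, the model tag, and the exact ollama call are assumptions:

```python
async def generate_response(prompt: str, images: list[str] | None = None) -> str:
    # Fetch the running interaction history from the user session.
    interaction = cl.user_session.get("interaction")

    # Add the user's prompt (and any attached images) to the history.
    # The ollama Python client accepts image file paths via the "images" key.
    user_turn: dict = {"role": "user", "content": prompt}
    if images:
        user_turn["images"] = images
    interaction.append(user_turn)

    # Generate a response from the locally served Llama 3.2 Vision model.
    response = await ollama.AsyncClient().chat(
        model="llama3.2-vision",
        messages=interaction,
    )
    # Dict-style access; newer ollama versions also expose response.message.content.
    answer = response["message"]["content"]

    # Store the LLM response in the interaction history.
    interaction.append({"role": "assistant", "content": answer})
    return answer
```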

Finally, we define the main method:
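One plausible shape for this is the standard Chainlit on-message handler, sketched below under the assumption that attached images arrive as message elements carrying a mime type and a local file path:

```python
@cl.on_message
async def main(message: cl.Message):
    # Collect the file paths of any images attached to the user's message.
    image_paths = [
        element.path
        for element in message.elements
        if element.mime and element.mime.startswith("image")
    ]

    # Generate the answer and send it back to the UI.
    answer = await generate_response(message.content, images=image_paths or None)
    await cl.Message(content=answer).send()
```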

Done!

Run the app as follows:
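Assuming the code above lives in app.py:

```bash
chainlit run app.py -w   # -w reloads the app whenever the file changes
```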

This launches the app shown in the demo at the top of this post:

The code is available on GitHub here: Local ChatGPT.

We launched this repo recently; it's where we'll publish the code for hands-on AI engineering newsletter issues like this one.

This repository will be dedicated to:

  • In-depth tutorials on LLMs and RAGs.
  • Real-world AI agent applications.
  • Examples to implement, adapt, and scale in your projects.

Find it here: AI Engineering Hub (and do star it).

👉 Over to you: What other features would you like to see in this app?

IN CASE YOU MISSED IT

LoRA/QLoRA: Explained From a Business Lens

Consider the size difference between BERT-large (~340M parameters) and GPT-3 (175B parameters): roughly a 500x jump. GPT-4 is another 10x bigger than GPT-3.

I have fine-tuned BERT-large several times on a single GPU using traditional fine-tuning.

But this is impossible with GPT-3, which has 175B parameters. That's 350GB of memory just to store model weights under float16 precision.

This means that if OpenAI used traditional fine-tuning within its fine-tuning API, it would have to maintain one model copy per user:

  • If 10 users fine-tuned GPT-3 → they need 3500 GB to store model weights.
  • If 1000 users fine-tuned GPT-3 → they need 350k GB to store model weights.
  • If 100k users fine-tuned GPT-3 → they need 35 million GB to store model weights.
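As a quick sanity check on those numbers, assuming 2 bytes per parameter (float16) and one full copy of the weights per fine-tuned user:

```python
# Back-of-the-envelope storage for per-user full fine-tuning of GPT-3.
params = 175e9        # GPT-3 parameter count
bytes_per_param = 2   # float16

per_copy_gb = params * bytes_per_param / 1e9   # ~350 GB per fine-tuned copy

for users in (10, 1_000, 100_000):
    print(f"{users:>7,} users -> {users * per_copy_gb:,.0f} GB of weights")
#      10 users -> 3,500 GB of weights
#   1,000 users -> 350,000 GB of weights
# 100,000 users -> 35,000,000 GB of weights
```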

And the problems don't end there:

  • OpenAI bills solely based on usage. What if someone fine-tunes the model for fun or learning purposes but never uses it?
  • Since a request can come anytime, should they always keep the fine-tuned model loaded in memory? Wouldn't that waste resources since several models may never be used?

LoRA (+ QLoRA and other variants) neatly solved this critical business problem.

We covered this in detail here →

Published on Dec 26, 2024