I write this article as a follow-up to my last post because I was asked on reddit how the AI part of my Obsidian setup works. I know (from some degree of academic background in the field) that artificial intelligence is currently not really intelligent … but we are working on it. Let’s call it “an algorithm that can understand, process and distribute knowledge with sufficient accuracy”. This article is neither about generating pseudo-literature nor AI art with such a model, yet we use exactly this kind of model to help us memorize and process knowledge … so that we don’t have to.
Using an LLM or any other kind of generative model locally is not only vastly more flexible than using online models hosted by corporations, it is also better for your privacy because you don’t give away any information about yourself, your system and your files.
First of all, LLM stands for large language model. Some people think that LLM equals ChatGPT like RPG equals D&D … this misconception is quite common. ChatGPT is built on an LLM, and so are Llama, Gemini and Mistral, and we will make use of the latter three. You might ask what an animal, a constellation and a Mediterranean northwesterly wind can do for us. Those three help us keep our privacy, because you can download openly released versions of all of them (in Gemini’s case, its open-weight sibling Gemma) in any form, shape, size and flavor you can imagine.
There is a whole community of people who take the base variants of said models and create fine-tuned sub-versions for special use cases. For example, Mistral-Nemo-Gutenberg-Doppel-12B-v2.Q4_K_M.GGUF is a special version of Mistral trained on public domain books from Project Gutenberg. Let’s assume we use this model for the rest of this post. But what exactly are those gibberish parts between the dashes?
Let’s focus on them one by one. Mistral-Nemo-Gutenberg-Doppel just tells us that this model is based on the open Mistral NeMo model by the French company Mistral AI and fine-tuned on two DPO datasets containing public domain books from Project Gutenberg. The 12B in the name tells us something about the size or complexity: it uses 12 billion parameters, which is fairly small compared to GPT-4, which reportedly uses around 1.8 trillion parameters. But exactly this low complexity gives us the chance to run it locally. Q4_K_M describes the quantization: the weights are compressed to roughly 4 bits each (K_M names the particular k-quant scheme), and the higher the Q number, the less aggressive the compression, the larger the file and the more accurate the output. The extension GGUF is the file format the model weights are stored in; it is used by llama.cpp and the tools built on top of it, LM Studio among them.
Most models are hosted on huggingface.co, but they can also be downloaded directly from within LM Studio, which we will talk about later.
When choosing a model, you always have to keep your hardware in mind. You should have at least a CUDA-enabled NVIDIA GeForce card like an RTX 3000- or 4000-series model, and the more VRAM the better. The more VRAM you have, the bigger the models you can use without constantly offloading into system RAM, which makes the whole process sluggish — and the larger the gap between your VRAM and the size of the model, the closer it gets to unusable. For example, Mistral-Nemo-Gutenberg-Doppel-12B-v2.Q4_K_M uses about 7.6 GB of VRAM, so you should be fine with an RTX 3000-series card with 8 GB. You can also use multiple graphics cards or a dedicated NVIDIA AI card.
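If you want a rough feel for those numbers before downloading, you can estimate the VRAM footprint from the parameter count and the quantization level. The sketch below is a back-of-the-envelope approximation only; actual usage also depends on context length and the KV cache, and the ~4.5 bits per weight for Q4_K_M is an assumption (the format mixes 4- and 6-bit blocks):

```python
# Back-of-the-envelope VRAM estimate for a quantized model -- a rough sketch
# only. Real usage also depends on context length, KV cache size and the
# runtime; the overhead_gb default is a guessed allowance for those.

def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead_gb: float = 1.0) -> float:
    """Approximate VRAM needed to fully load a quantized model."""
    weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1024**3
    return weights_gb + overhead_gb  # overhead: runtime buffers, KV cache, ...

# ~4.5 bits per weight for Q4_K_M is an assumption (mixed 4/6-bit blocks)
print(f"{estimate_vram_gb(12, 4.5):.1f} GB")  # ~7.3 GB, close to the 7.6 GB above
```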
OK, now you have the hardware and a basic idea of how to read the names of LLM models. What now? You simply have to find a way to put the model onto your card and communicate with it. In this example we do this with LM Studio, but there are alternatives like ollama and others. You can grab LM Studio for free from https://lmstudio.ai/ for Linux, Windows and macOS.
As I mentioned earlier, LM Studio lets us download a model right out of the app itself. After the installation and first startup, you will find four icons on the left side of the window: a speech bubble for local chats directly in LM Studio, a stylized command window to run a local server, a folder to manage our downloaded models, and a magnifying glass to search for models on Hugging Face. Let’s click on the magnifying glass and search for Mistral-Nemo-Gutenberg-Doppel-12B …

On the left side we find the search results, and on the right side a short description of the model itself. Do you see the download options like Q3_K_M there? Drop down the choices and select your desired version. You will see a little rocket on a green background next to the variants that fit entirely into the VRAM of your hardware. Some variants have a thumbs-up icon next to them, which tells us that this is the recommended version with the best ratio of speed and accuracy. After you have selected the model and variant, click Download and wait for it to complete.
You can now test your model in the built-in chat module by selecting the speech bubble on the left. At the top you will see a drop-down menu asking us to load a model into VRAM; select your model and wait for it to load. At the bottom of the window you see an input field, just like in ChatGPT, where you are asked to enter your message … come on, don’t be shy, try it! But be warned that most, if not all, of the models you will use from now on are NSFW, completely uncensored and without any moral compass. In the text field where you enter your messages there is a file attachment button. Currently you can add up to 5 files with 50 MB in total. After you attach a file, it gets loaded and you can query information from it, let the AI summarize topics or do other things. This is the RAG (retrieval-augmented generation) part we will use later in Obsidian. RAG allows us to extend the knowledge of the model without the need for cost- and labor-intensive specialized fine-tuning.
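To make the idea behind RAG more concrete, here is a deliberately tiny sketch of the principle: split a document into chunks, pick the chunks most relevant to the question, and prepend them to the prompt so the model can answer from material it was never trained on. Real implementations (including LM Studio’s) use vector embeddings for the retrieval step; the naive word-overlap scoring below is just a stand-in to keep the sketch dependency-free:

```python
# Toy illustration of the RAG principle, not a production retriever.

def split_into_chunks(text: str, size: int = 500) -> list[str]:
    """Cut the document into fixed-size chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks sharing the most words with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(chunks,
                    key=lambda c: len(q_words & set(c.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(question: str, document: str) -> str:
    """Stuff the retrieved context in front of the actual question."""
    context = "\n---\n".join(retrieve(question, split_into_chunks(document)))
    return f"Answer using the following context:\n{context}\n\nQuestion: {question}"
```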

After you have played around with it, eject the model at the top next to the selection field. Switch over to the server mode by clicking the command window icon on the left. Don’t be overwhelmed by all the settings within LM Studio; in most cases you can leave them at their defaults. When starting a server, make sure to enable CORS and leave everything else at default. Now load your desired model, wait for it to load and click Start Server. That’s it … you are now running a local LLM server.
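The server speaks the same API as OpenAI, so you can smoke-test it with the standard openai Python package (any OpenAI-compatible client works). This assumes the default port 1234 and that a model is already loaded; the api_key is a dummy value, as the local server does not check it:

```python
# Smoke test for the local LM Studio server via its OpenAI-compatible API.
# Requires: pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="local-model",  # LM Studio answers with whatever model is loaded
    messages=[{"role": "user", "content": "Who founded Night City?"}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```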
The RAG feature is currently experimental and only available within the chat module of LM Studio, but a server implementation should follow soon. Be warned that if you want to use a text document to provide information, you cannot do that through the Obsidian Copilot plugin at the moment. But we will have a workaround for that …
In Obsidian, open the settings window and select the settings for the Copilot plugin. First of all, we want to add a custom model by unfolding the Add Custom Model tab. Give the model a name, select lm-studio as the provider and fill in your local server address; you can find it in LM Studio on the server page on the right side. Append /v1 right after the port the server uses (1234 by default), so the address reads something like http://localhost:1234/v1, and click Add Model. Now find your newly added model in the list above, enable CORS and make it the default. Click Save and Reload and close the window.
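If Copilot refuses to connect, a quick sanity check is to ask the server for its model list; /v1/models is part of the OpenAI-compatible API LM Studio exposes. A minimal stdlib-only check, again assuming the default port:

```python
# Verify the address you entered in Copilot actually answers.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:1234/v1/models") as resp:
    print(json.dumps(json.load(resp), indent=2))  # lists the loaded model(s)
```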

We are now ready to go and can use the chat window of Copilot from within Obsidian. At startup, the window gives a few useful hints about how to include the active note or a specific note.
Now to the workaround I promised for bridging the time until the RAG feature is available in LM Studio’s server mode. Imagine the following situation: I’m going to play a session of Cyberpunk in Night City and want my setup to know the details of the setting. The server currently does not allow me to add PDFs, so I switch to the local chat mode in LM Studio, attach the books that contain the information I want to include and start to query them.
please summarize the weather conditions in night city.
I copy the result and paste it into a note in Obsidian. You can even build a small wiki inside Obsidian and fill it with specific data about your setting for general use.
please summarize the gangs of night city. include information like where and how they operate, their appearance etc.
Again … copy-pasta into a note. Continue until you are satisfied with the result. It is up to you whether to keep everything in a single note or split the information into smaller chunks.
Now we are back in server mode and Obsidian. Let’s say you want to check whether the neighborhood your PC is currently in is controlled by gang xyz.
what gang is controlling heywood. include information from [[<NAME_OF_YOUR_NOTE>]].
If you are lucky, there is a wiki on Fandom or somewhere else that you can use as a source of information to fill your own little setting wiki.
I think that might be everything you need to know to get started. Happy prompting!