Accelerate LLAMA & LangChain with Local GPU

Leonid Olevsky
8 min read · Dec 19, 2023

The past year has been very exciting, as ChatGPT has become widely used and a valuable tool for completing tasks more efficiently and saving time. My main usage of it so far has been text summarisation, grammar fixes (including for this article), finding useful information, trip planning, prompt generation, and many other things.

As part of exploring its capabilities beyond optimising prompts and identifying potential use cases, I discovered LangChain, which offers an amazing world of possibilities. It enables the chaining of multiple models and tools to achieve a specific result by building context-aware, reasoning applications.

AI and human working together

The main question is, why would I want to run it locally on my computer instead of using one of the available services?

There are a couple of reasons for this. Firstly, there is the cost. Today, GPT costs around $0.0010 per 1K tokens for input and $0.0020 per 1K tokens for output. If I were to use it heavily, with a load of 4K tokens for input and 4K for output, each call would be around $0.012; multiplied by 1 million calls (if I wanted to build an app and fill a database with chains), that would be around $12K.
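As a quick back-of-the-envelope check of that estimate (the prices and token counts are just the assumptions above, not a general pricing model):

# Rough API cost estimate using the prices and token counts assumed above
input_price_per_1k = 0.0010   # $ per 1K input tokens
output_price_per_1k = 0.0020  # $ per 1K output tokens
input_tokens, output_tokens = 4_000, 4_000
calls = 1_000_000

cost_per_call = (input_tokens / 1000) * input_price_per_1k + (output_tokens / 1000) * output_price_per_1k
print(f"per call: ${cost_per_call:.3f}, total: ${cost_per_call * calls:,.0f}")
# per call: $0.012, total: $12,000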

The second, and more significant, reason is the flexibility of trying different models and seeing how they can be used for various applications. There are several sources to get started, including the open-source LLaMa 2 models directly from Meta, and others available on Hugging Face, for example.

I decided to invest in buying a PC to learn more about how it works, and I made an initial mistake by not buying a suitable graphics card. However, I will explain how you can overcome this issue (I was eventually able to run a 13B model on the GPU and a 70B model on the CPU).

PC configuration

By exploring different options, I came up with a setup that should be sufficient to run all the tools and models I need (including multiple databases, Docker, IDEs, and the ability to load and train models in the future).

The basic setup consisted of an i7 CPU, 64GB of RAM, and a 2TB M2 SSD.

For the graphics card, I chose the Nvidia RTX 4070 Ti with 12GB. At first glance, the setup looked promising, but I soon discovered that 12GB of graphics memory is not enough to run larger models with more than 2.7B parameters. In fact, the 7B model, the smallest LLaMa 2 model provided by Meta, already needs around 13GB, so in practice a card with at least 16GB is required.

+---------------+-------------+
| LLaMa 2 Model | Memory Size |
+---------------+-------------+
| 7B            | 13 GB       |
| 13B           | 24 GB       |
| 30B           | 60 GB       |
| 65B           | 120 GB      |
+---------------+-------------+
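These numbers roughly follow the usual rule of thumb (my approximation, not an official figure from Meta): in fp16 a model needs about 2 bytes per parameter for the weights alone, before the KV cache and runtime overhead.

# Approximate fp16 weight memory: ~2 bytes per parameter (ignores KV cache and overhead)
def fp16_weights_gib(params_billion: float) -> float:
    return params_billion * 1e9 * 2 / (1024 ** 3)

for size in (7, 13, 30, 65):
    print(f"{size}B -> ~{fp16_weights_gib(size):.0f} GiB")
# 7B -> ~13 GiB, 13B -> ~24 GiB, 30B -> ~56 GiB, 65B -> ~121 GiB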

This was a major drawback, as the next-level graphics cards, the RTX 4080 with 16GB and the RTX 4090 with 24GB, cost around $1.6K and $2K for the card alone, which is a significant jump in price and a much higher investment.

However, after exploring different options, I found a few ways to run even larger models (up to 70B) on this PC, although it’s not recommended as my PC’s cooling system was overloaded.

Set up the environment

As the operating system, I chose Ubuntu, and I focused on setting up a Python environment since most of the frameworks I explored are Python-based.

There are multiple instructions available for setting up the environment, but my favourite video for a step-by-step setup is this one. However, I will also list the steps here for convenience.

To begin with, we need to add essential build support, curl, and Git.

sudo apt update
sudo apt upgrade
sudo apt install build-essential curl git
sudo apt install git-core

Then continue with setting up Python and Conda for easy Python environment management. Conda is not available in the default Ubuntu repositories, so install pip via apt and Conda via the Miniconda installer.

sudo apt install python3-pip
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

The final step before jumping into the frameworks for running models is to install the graphics card support from Nvidia; we will use CUDA for that.
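One way to do this (a minimal sketch, assuming the driver and CUDA toolkit packaged by Ubuntu are recent enough for your card; otherwise use the installers from Nvidia's site):

# Driver and CUDA toolkit from the Ubuntu repositories
sudo ubuntu-drivers autoinstall       # installs the recommended Nvidia driver
sudo apt install nvidia-cuda-toolkit  # provides nvcc, needed later to build llama.cpp with cuBLAS
nvidia-smi                            # verify the GPU and driver are visible (after a reboot)
nvcc --version                        # verify the CUDA compiler is available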

Frameworks I explored

The two main frameworks I explored for running models were OpenLLM and llama.cpp. While OpenLLM was easier to spin up, I had difficulty connecting it with LangChain and filed a bug about it. llama.cpp was more flexible, supports quantization for loading bigger models, and its integration with LangChain was smooth.

OpenLLM

It is an easy way to run LLM models locally: the framework provides simple installation, loading, and running of the model on your machine, along with a RESTful API, gRPC support, and a web UI.

I used the vLLM runtime implementation; it worked with the majority of the models.

Before starting, it is best to create a new environment so as not to disturb any other environment; we will use Conda for it.

conda create --name openllm python=3.11
conda activate openllm

To set up and run the model you need to install the framework and the runner. OpenLLM is distributed on PyPI, so use pip inside the Conda environment you just activated.

pip install openllm
pip install "openllm[vllm]"

To run the model (this is the biggest I could run on my machine):

openllm start facebook/opt-2.7b --backend vllm --port 3000

Then you can go to http://localhost:3000 to see the web UI, or play around using this code example.

import openllm

# Connect to the running OpenLLM server and send it a prompt
client = openllm.client.HTTPClient('http://localhost:3000')
client.query('Explain to me the difference between "further" and "farther"')

Some tips that took me some time to figure out:

  • Every new model you use is downloaded locally; to clean up models you no longer need, run
openllm prune
  • When you close the terminal without stopping the server, the port stays occupied; to stop the server and release the port, run
kill $(lsof -t -i:3000)
  • It is nice to watch the load on the GPU; you can do it by running
watch nvidia-smi

From LangChain code I was able to connect only by calling the HTTP server; invoking OpenLLM directly didn't work for me. I filed an issue in the project, so let me know if you are able to figure it out.
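For reference, the HTTP route looks roughly like this (a minimal sketch; it assumes the langchain.llms.OpenLLM wrapper and its server_url parameter exist in the LangChain version you installed, and that the server from the previous step is still running on port 3000):

from langchain.llms import OpenLLM

# Point the LangChain wrapper at the already running OpenLLM HTTP server
llm = OpenLLM(server_url='http://localhost:3000')
print(llm('What is the difference between "further" and "farther"?'))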

llama.cpp

This is an amazing project with a great community built around it. I found it easy to make it work and connect it with LangChain. The instructions are very clear and straightforward, so you can easily follow them or continue reading.

Navigate to the folder where you want the project to live and clone the code from GitHub.

git clone https://github.com/ggerganov/llama.cpp

Now you will need to build the code. To run it with GPU support you need to build with this specific flag, otherwise it will run on the CPU and be really slow! (I was able to run the 70B model only on the CPU, and it was very slow: the output was about one letter per second.)

cd llama.cpp
make clean && LLAMA_CUBLAS=1 make -j

Before you can run a model you will need to convert it to the GGUF format. I used the basic LLaMa models.

  • Copy the model to the models folder, including the tokenizer.model and params.json files.
  • From the llama.cpp root of the project, run the conversion (I was not able to run 7B as is because I did not have enough GPU memory; I could only run it after quantizing it):
python3 convert.py models/llama-2-7b/

Now, for the final stage, run this to start the model (keep in mind you can play around with --n-gpu-layers and -n to see what works best for you):

./main --model ../models/llama-2-7b/ggml-model-f16.gguf -n 128 --interactive -ins --n-gpu-layers 15000

A 2.7B model was the biggest I could run on the GPU (not the Meta one, as the 7B needs more than 13GB of memory on the graphics card), but you can use a quantization technique to make the model smaller. Just compare the sizes before and after (after quantization, the 13B was running smoothly):

+-------+-----------+-------+------+------+------+------+------+
| Model | Measure   | F16   | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 |
+-------+-----------+-------+------+------+------+------+------+
| 7B    | file size | 13.0G | 3.5G | 3.9G | 4.3G | 4.7G | 6.7G |
| 13B   | file size | 25.0G | 6.8G | 7.6G | 8.3G | 9.1G | 13G  |
+-------+-----------+-------+------+------+------+------+------+

To quantize the model you will need to execute the quantize script, but first you will need to install a couple more things.

To set up the environment we will use Conda.

conda create --name llama-cpp python=3.11
conda activate llama-cpp

Run this from the llama.cpp root folder:

python3 -m pip install -r requirements.txt

Execute the quantize command

./quantize ../models/llama-2-13b/ggml-model-f16.gguf ../models/llama-2-13b/ggml-model-q4_0.gguf q4_0

And now you are ready to run the 13B model, as it needs much less memory, only 6.8GB. Enjoy!

./main --model ../models/llama-2-13b/ggml-model-q4_0.gguf -n 128 --interactive -ins --n-gpu-layers 15000

LangChain

As described on their website: "Build context-aware, reasoning applications with LangChain's flexible abstractions and AI-first toolkit."

Long story short, you can use LangChain to build chatbots or personal assistants, to summarise, analyse, or generate Q&A over documents or structured data, to write or understand code, to interact with APIs, and to create other applications that take advantage of generative AI.

Now that you have everything ready, you can complete the setup and play around with it in a local environment (for full instructions, check the documentation).

We will start by setting up a new environment using Conda.

conda create --name langchain python=3.11
conda activate langchain

Install the llama-cpp-python package with GPU support.

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir

Then write this Python code and enjoy learning and playing around:

from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import LLMChain
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate


def run_myllm():
    template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""

    prompt = PromptTemplate(template=template, input_variables=["question"])

    # Callbacks support token-wise streaming
    callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

    n_gpu_layers = 15000  # Change this value based on your model and your GPU VRAM pool.
    n_batch = 512  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.

    # Make sure the model path is correct for your system!
    llm = LlamaCpp(
        model_path="[Path to your folder]/models/llama-2-13b/ggml-model-q4_0.gguf",
        temperature=0.75,
        max_tokens=2000,
        n_gpu_layers=n_gpu_layers,
        n_batch=n_batch,
        callback_manager=callback_manager,
        verbose=True,  # Verbose is required to pass to the callback manager
    )

    llm_chain = LLMChain(prompt=prompt, llm=llm)
    question = "Question: A rap battle between Stephen Colbert and John Oliver"

    llm_chain.run(question)


if __name__ == '__main__':
    run_myllm()
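To try it out, save the script and run it inside the langchain Conda environment (the filename below is just my choice). If the build picked up GPU support, the verbose output during model loading should show layers being offloaded to the GPU, and watch nvidia-smi in another terminal will show the memory being used.

python3 run_myllm.py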

To summarise

In this article, I demonstrated how to run LLAMA and LangChain accelerated by a GPU on a local machine, without relying on any cloud services. We discussed the reasons why running locally is beneficial and how to overcome the issue of insufficient GPU memory.

Step-by-step instructions were provided for setting up the environment, installing the necessary packages, and running the models. With these tools, you can unlock the full potential of LLAMA and LangChain and create your own AI applications. So don't wait any longer, and start experimenting with LLAMA and LangChain on your own machine today!

Let me know if you find any mistakes or have any suggestions for improvements.
