When I first considered adding a large language model (LLM) to a Docker image, I encountered a significant challenge: LLMs are enormous, and including them in a Docker container can drastically increase the container’s size. Moreover, running such a Docker image requires considerable system resources.
After extensive research, I discovered a few key strategies to address these issues:
- Quantized Models: It’s not necessary to load the full-precision model files. Many models are also distributed in the GGUF or GGML formats, which typically contain quantized versions of the weights. Quantization is a technique that reduces a model’s size by lowering numerical precision: the weights and biases are converted from floating-point numbers to low-bit integers, significantly reducing the memory required. To delve deeper into this technique, I referred to a research paper titled Post-training 4-bit Quantization for Large Language Models, and an insightful blog post by Maarten Grootendorst on quantization.

(Image credit: Visual Guide of Quantization)
Here’s a brief overview of what I learned:
- Quantization reduces model size by mapping continuous floating-point weights to discrete integer values such as INT8; this lowers precision but retains most of the model’s accuracy.

(Image credit: Visual Guide of Quantization)
For instance, in Symmetric Quantization, if we have weights ranging from -10.8 to 10.8, the zero point remains at zero. To convert these values to an 8-bit format, we proceed in two steps:
1- First Step: compute the scaling factor
s = scaling factor
b = number of bits = 8 for INT8
α = alpha = maximum absolute weight value = 10.8
s = (2^(b-1) - 1) / α = (128 - 1) / 10.8 = 127 / 10.8 ≈ 11.76
2- Second Step: quantize a weight
x = round(s · x')
For x' = 5.4: x = round(5.4 × 11.76) = round(63.5) = 64 (as shown in the image above).
- While this approach greatly reduces model size, it introduces quantization error and some loss of accuracy, which may require calibration techniques to mitigate. The key takeaway is that quantization error grows as the number of bits shrinks.
- Understanding this technique is crucial, because it is what makes LLMs small enough to run on everyday hardware; a short sketch of the arithmetic follows below.
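To make the arithmetic above concrete, here is a minimal sketch of symmetric INT8 quantization in Python with NumPy. The 10.8 range and the 5.4 weight come from the example above; the rest of the weights array is just illustrative.

import numpy as np

# Example float32 weights; the largest absolute value is 10.8, as above
weights = np.array([-10.8, -4.2, 0.0, 5.4, 10.8], dtype=np.float32)

# Step 1: scaling factor s = (2^(b-1) - 1) / alpha, with b = 8 bits
alpha = np.max(np.abs(weights))            # 10.8
scale = 127 / alpha                        # ~11.76

# Step 2: scale, round, and clip into the signed 8-bit range
quantized = np.clip(np.round(weights * scale), -127, 127).astype(np.int8)
print(quantized)                           # 5.4 maps to 64, as in the example

# Dequantizing shows the small error that rounding introduced
dequantized = quantized.astype(np.float32) / scale
print(np.abs(weights - dequantized))

Running this shows that 5.4 lands on the integer 64 and comes back as roughly 5.44 after dequantization; that small gap is the quantization error discussed above.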
Ollama:
To simplify the process of running these quantized LLMs on a local machine, I found Ollama, an open-source tool that streamlines the deployment and operation of LLMs locally. Ollama is actively maintained, lightweight, and easily extensible, allowing developers to build and manage LLMs on their machines without the need for complex configurations or reliance on external servers.
Running Ollama in Docker
To run Ollama in a Docker container, follow these steps:
Prerequisites: Ensure that Docker is installed on your system. You can find the full documentation here.
- Pull and run the Ollama Docker image:
- docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
- Run the Model:
- docker exec -it ollama ollama run llama3
And that’s it! Your LLM Docker image is now up and running.
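Because the docker run command above publishes port 11434, you can also talk to the running model programmatically through Ollama's REST API. Here is a minimal sketch in Python using the requests library (the prompt is just an example):

import requests

# Send a prompt to the llama3 model served by the Ollama container
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
)
print(response.json()["response"])

With "stream": False the server returns a single JSON object; leaving streaming enabled would instead return one JSON line per generated chunk.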
Download and Run a ChatGPT-Like Web UI on Ollama

1- docker run -d -p 3000:8080 -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama
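Once this container is up, the chat interface should be reachable in your browser at http://localhost:3000, since the -p 3000:8080 flag maps the container’s port 8080 to port 3000 on the host.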