LLM Everywhere: Docker for Local and Hugging Face Hosting
December 8, 2025 · 1641 words · 8 min
Hugging Face has become a powerhouse in the field of machine learning (ML). Its large collection of pretrained models and user-friendly interfaces have changed how we approach AI/ML deployment and spaces. If you’re interested in looking deeper into the integration of Docker and Hugging Face models, a comprehensive guide is available in a separate article.

The Large Language Model (LLM) is a remarkable advance in language generation. In this article, we’ll look at how to use the Hugging Face hosted Llama model in a Docker context, opening up new opportunities for natural language processing (NLP) enthusiasts and researchers.

Hugging Face (HF) provides a comprehensive platform for training, fine-tuning, and deploying models. LLMs, in turn, are state-of-the-art models capable of performing tasks such as text generation, completion, and classification.

Docker’s robust containerization technology makes it easier to package, distribute, and operate programs. By enclosing ML models within Docker containers, it ensures that they operate consistently across environments. Results become reproducible, and the age-old “it works on my machine” issue is resolved.

For the majority of models on Hugging Face, two options are available. GGML and GPTQ models are two well-known examples of quantized models. Quantization can be applied either during or after training. By reducing model weights to a lower precision, quantized models minimize model size and computational needs.

HF models load on the GPU, which performs inference significantly more quickly than the CPU. These models are generally large and need a lot of VRAM. In this article, we will use the GGML model, which runs well on CPU and is probably the faster choice if you don’t have a good GPU.
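To make the idea of lower-precision weights concrete, here is a minimal sketch of symmetric 8-bit quantization in plain Python. It is a generic illustration, not the scheme that GGML or GPTQ actually use; the function names `quantize_int8` and `dequantize` and the sample weights are made up for this example.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map float weights to signed 8-bit integers using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.8213, -1.27, 0.0349, 0.5, -0.6451]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)          # [82, -127, 3, 50, -65]
print(max_error)  # rounding error, at most half of one quantization step
```

Storing each weight as an 8-bit integer instead of a 32-bit float cuts storage to roughly a quarter, at the cost of a rounding error of at most half the scale per weight.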
We will also be using transformers and ctransformers in this demonstration. transformers is Hugging Face’s Python library for loading and running pretrained models, and ctransformers provides Python bindings for transformer models implemented in C/C++ with the GGML library.

To use the Llama model, sign up on Hugging Face and request access. To create an access token for later use, go to your Hugging Face profile settings and select the access token option from the left-hand sidebar (Figure 1). Save the value of the created access token.

Before exploring the LLM itself, we must first configure our Docker environment. Install Docker by following the official installation instructions for your operating system. After installation, confirm your setup from the command line.

Running a container with the Hugging Face image exposes a port from the container to the host machine and sets an environment variable to the value you provide. The run command also names the Python script that you want to run in the container. This starts the container and opens a terminal to it, where you can interact with the container and its processes and exit when you are done. To access the container’s web server, open a web browser and navigate to the address of the exposed port. You should see the landing page for your Hugging Face model (Figure 2).

To get started, you can clone or download the existing Hugging Face project.

A requirements file is a text file that lists the Python packages and modules that a project needs to run. It is used to manage the project’s dependencies and to ensure that all developers working on the project use the same versions of the required packages. The requirements file for this project lists the Python packages needed to run the Hugging Face model. Note that this model is large, and it may take some time to download and install. You may also need to increase the memory allocated to your Python process to run the model.

The following section provides a breakdown of the Dockerfile.
The first line tells Docker to use the official Python 3.9 image as the base image for our image. The next line creates a new user named user with the user ID 1000; a flag on that command creates a home directory for the user. The line after that sets the working directory for the container. The Dockerfile then copies the requirements file from the current directory into the container and upgrades the pip package manager in the container. Another line sets the default user for the container to user. A further line copies the contents of the current directory into the container. One flag on that line tells Docker to create hard links instead of copying the files, which can improve performance and reduce the size of the image; another flag changes the ownership of the copied files to the user user.

Once you have built the Docker image, you can run it with Docker’s run command. This starts a new container running the Python 3.9 image with the non-root user user. You can then interact with the container using the terminal.

The Python code shows how to use Gradio to create a demo for a text-generation model trained using transformers. The code lets users enter a text prompt and generates a continuation of the text. Gradio is a Python library for creating and sharing interactive machine learning demos. It provides a simple interface for building and deploying demos, and it supports a wide range of machine learning frameworks and libraries, including transformers.

This Python script is a Gradio demo for a text chatbot. It uses a pretrained text generation model to generate responses to user input. We’ll break down the file and look at each of the sections.

The first import brings in a type that represents a sequence of values that can be iterated over. The next line imports a library as well.
The following line imports a module from the transformers library, a popular machine learning library for natural language processing. Next, two functions are imported from the model module: one calculates the input token length of a text, and the other generates text using a pretrained text generation model. The next two lines configure the logging module to print information-level messages and to use the transformers logger.

The lines after that define constants used throughout the code, along with the text displayed in the Gradio demo. A further line logs an information-level message indicating that the code is starting.

The script then defines several helper functions. The first clears the textbox and saves the input message to the state variable. The second displays the input message in the chatbot and adds the message to the chat history. The third deletes the previous response from the chat history and returns the updated chat history and the previous response. The fourth generates text using the pretrained text generation model and the given parameters; it returns an iterator that yields a list of tuples, where each tuple contains the input message and the generated response. The fifth generates a response to the given message and returns the empty string and the generated response.

Two functions comprise the main part of the code. One of them is responsible for generating a response given a message, a history of previous messages, and various generation parameters. The UI component and the API server are handled by the main application file, which is where you initialize the application and other configuration.

The model module’s Python script is a chatbot that uses an LLM to generate responses to user input. The script generates a response in several steps. The main function of the text streamer class the script relies on is to store print-ready text in a queue.
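The queue-backed streaming pattern can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the streamer class used in the project: `QueueStreamer` and `fake_generate` are hypothetical names, and the "model" here simply echoes the prompt word by word.

```python
import queue
import threading
from typing import Iterator, List


class QueueStreamer:
    """Hypothetical streamer: a producer puts text chunks on a queue,
    and a consumer iterates over them as they arrive."""

    _STOP = object()  # sentinel that marks the end of the stream

    def __init__(self) -> None:
        self._queue: "queue.Queue[object]" = queue.Queue()

    def put(self, text: str) -> None:
        self._queue.put(text)

    def end(self) -> None:
        self._queue.put(self._STOP)

    def __iter__(self) -> Iterator[str]:
        return self

    def __next__(self) -> str:
        item = self._queue.get()  # waits only until the next chunk is ready
        if item is self._STOP:
            raise StopIteration
        return str(item)


def fake_generate(prompt: str, streamer: QueueStreamer) -> None:
    """Stand-in for a model call: emits the prompt back one word at a time."""
    for word in prompt.split():
        streamer.put(word + " ")
    streamer.end()


streamer = QueueStreamer()
# Generation runs in a background thread so the caller is free to consume.
worker = threading.Thread(
    target=fake_generate, args=("hello from the container", streamer)
)
worker.start()

chunks: List[str] = []
for chunk in streamer:
    chunks.append(chunk)  # a UI could update the chat window here
worker.join()

reply = "".join(chunks).strip()
print(reply)
```

Because generation runs in a background thread, the consumer can handle each chunk as soon as it arrives instead of waiting for the full response.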
This queue can then be used by a downstream application as an iterator to access the generated text in a non-blocking way.

The script begins by importing the modules needed for tokenizing and generating text with transformers. It then defines the model ID, which points to a scaled-down version of the Meta 7B chat Llama model. Once the modules are imported and the model is defined, the script loads the tokenizer and model from the Hugging Face Hub. The job of a tokenizer is to prepare the model’s inputs. Tokenizers for each model are available in the transformers library.

You also need to set several variables and values in the config. You can also create the space and commit files to it to host applications on Hugging Face and test directly.

The build command builds a Docker image for the model for a specified platform and tags the image with a name. The run command then starts a new container from that image and exposes a port on the host machine. An environment variable sets the Hugging Face Hub token, which is required to download the model from the Hugging Face Hub. Next, open the browser and go to the exposed address to see the local LLM Docker container output (Figure 3). You can also view containers via Docker Desktop (Figure 4).

Deploying the LLM GGML model locally with Docker is a convenient and effective way to use natural language processing. Dockerizing the model makes it easy to move it between different environments and ensures that it will run consistently. Testing the model in a browser provides a user-friendly interface and allows you to quickly evaluate its performance.
This setup gives you more control over your infrastructure and data and makes it easier to deploy advanced language models for a variety of applications. It is a significant step forward in the deployment of large language models.