
Image by the author | Ideogram
Running multiple large language models can be useful, whether to compare models, fail over when one model breaks, or tailor behavior (such as using one model for coding and another for technical writing). This is often how we use LLMs in practice. Applications like Poe.com offer this kind of setup: a single platform where you can run many LLMs. But what if you want to do all of this locally, save on API costs, and keep your data private?
Well, a real problem arises here. Setting this up usually means juggling different ports, running separate processes, and manually switching between them. Not ideal.
This is exactly the pain point llama-swap solves. It is a lightweight open-source proxy server (just a single binary) that lets you easily switch between multiple local LLMs. Simply put, it listens for OpenAI-compatible API requests on your machine and automatically starts or stops the appropriate model server based on the requested model. Let's break down how it works and walk through a step-by-step setup to run it on your local machine.
# How Does llama-swap Work?
Conceptually, llama-swap sits in front of your LLM servers as a smart router. When an API request comes in (e.g., a POST /v1/chat/completions call), it looks at the "model" field in the JSON body. It then launches the appropriate server process for that model, shutting down any other model if necessary. For example, if you first request model "A" and then request model "B", llama-swap will automatically stop the server for "A" and start the server for "B", so each request is served by the right model. This dynamic swapping is transparent: clients simply see the expected response without worrying about the underlying processes.
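The routing idea can be sketched in a few lines of Python. This is a toy illustration of the one-active-model policy, not llama-swap's actual implementation (which manages real llama-server subprocesses):

```python
class SwapRouter:
    """Toy illustration of llama-swap's one-active-model routing policy."""

    def __init__(self, model_commands):
        self.model_commands = model_commands  # model name -> launch command
        self.active = None                    # name of the running model, if any

    def handle_request(self, body):
        model = body["model"]
        if model not in self.model_commands:
            raise ValueError(f"unknown model: {model}")
        if self.active != model:
            # The real proxy would stop the old llama-server process here
            # and spawn the one configured for `model` before forwarding.
            self.active = model
        return f"served by {model}"

router = SwapRouter({"qwen2.5": "llama-server ...", "smollm2": "llama-server ..."})
print(router.handle_request({"model": "qwen2.5"}))  # served by qwen2.5
print(router.handle_request({"model": "smollm2"}))  # served by smollm2 (after a swap)
```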
By default, llama-swap allows only one model to run at a time (unloading others when switching). However, its groups feature lets you change this behavior. A group can list several models and control their swapping behavior. For example, setting swap: false in a group means all group members can run together without unloading each other. In practice, you could use one group for heavyweight models (only one active at a time) and another "parallel" group for small models you want to run simultaneously. This gives you full control over resource usage and model co-residency on a single server.
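For illustration, a groups section might look like the sketch below. The field names are based on llama-swap's sample configuration; double-check the project's current documentation, as the exact schema may differ:

```yaml
# Sketch of a groups section (verify field names against llama-swap's sample config)
groups:
  "heavy":
    swap: true          # only one member runs at a time
    members:
      - "qwen2.5"
  "parallel":
    swap: false         # all members may run at the same time
    members:
      - "smollm2"
```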
# Prerequisites
Before starting, make sure your system has the following:
- Python 3 (>= 3.8): Needed for basic scripting and tooling.
- Homebrew (on macOS): Makes installing LLM runtimes easy. For example, you can install the llama.cpp server with:
brew install llama.cpp
This provides the llama-server binary for hosting models locally.
- llama.cpp (llama-server): The OpenAI-compatible server binary (installed via Homebrew above or built from source) that actually runs the LLM models.
- Hugging Face CLI: For downloading models directly to your local machine without logging in on the website or manually browsing model pages. Install it with:
pip install -U "huggingface_hub[cli]"
- Hardware: Any modern CPU will work. For faster inference, a GPU helps. (On Apple Silicon Macs you can run on the CPU or try the PyTorch MPS backend for supported models. On Linux/Windows with an NVIDIA GPU, you can use CUDA-enabled Docker containers for acceleration.)
- Docker (optional): For running the pre-built Docker images. However, I chose not to use Docker for this guide, because those images are built mainly for x86 (Intel/AMD) systems and do not work reliably on Apple Silicon (M1/M2) Macs. Instead, I used the bare-metal installation method, which runs directly on macOS without a container layer.
To sum up, you will need a Python environment and a local LLM server (like the llama.cpp `llama-server`). We will use them to host two sample models on one machine.
# Step-by-Step Instructions
// 1. Install llama-swap
Download the latest llama-swap release for your operating system from the GitHub releases page. For example, v126 was the latest release at the time of writing. Run the following commands:
# Step 1: Download the correct file
curl -L -o llama-swap.tar.gz \
  https://github.com/mostlygeek/llama-swap/releases/download/v126/llama-swap_126_darwin_arm64.tar.gz
Output:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 3445k 100 3445k 0 0 1283k 0 0:00:02 0:00:02 --:--:-- 5417k
Now extract the archive, make the binary executable, and test it by checking the version:
# Step 2: Extract it
tar -xzf llama-swap.tar.gz
# Step 3: Make it executable
chmod +x llama-swap
# Step 4: Test it
./llama-swap --version
Output:
version: 126 (591a9cdf4d3314fe4b3906e939a17e76402e1655), built at 2025-06-16T23:53:50Z
// 2. Download and Prepare Two or More LLMs
Choose two sample models to run. We will use Qwen2.5-0.5B and SmolLM2-135M (both small models) from Hugging Face. You need the model files (in GGUF or a similar format) on your machine. For example, using the Hugging Face CLI:
mkdir -p ~/llm-models
huggingface-cli download bartowski/SmolLM2-135M-Instruct-GGUF \
  --include "SmolLM2-135M-Instruct-Q4_K_M.gguf" --local-dir ~/llm-models
huggingface-cli download bartowski/Qwen2.5-0.5B-Instruct-GGUF \
  --include "Qwen2.5-0.5B-Instruct-Q4_K_M.gguf" --local-dir ~/llm-models
This will:
- Create an llm-models directory in your home folder
- Safely download the GGUF files into that folder
After downloading, you can confirm the files are there:
ls ~/llm-models
Output:
SmolLM2-135M-Instruct-Q4_K_M.gguf
Qwen2.5-0.5B-Instruct-Q4_K_M.gguf
// 3. Create the llama-swap Configuration
llama-swap uses a single YAML file to define models and their server commands. Create a config.yaml file with the following content:
models:
  "smollm2":
    cmd: |
      llama-server
      --model /path/to/models/llm-models/SmolLM2-135M-Instruct-Q4_K_M.gguf
      --port ${PORT}
  "qwen2.5":
    cmd: |
      llama-server
      --model /path/to/models/llm-models/Qwen2.5-0.5B-Instruct-Q4_K_M.gguf
      --port ${PORT}
Replace /path/to/models/ with your actual local path. Each entry under models: gives an identifier (like "qwen2.5") and a shell cmd: to start that server. We use llama-server (from llama.cpp) with --model pointing to the GGUF file and --port ${PORT}. The ${PORT} macro tells llama-swap to automatically assign a free port to each model. The groups section is optional; I skipped it in this example, so by default llama-swap will run only one model at a time. You can also tune many per-model options here (aliases, timeouts, etc.). For more on the available options, see the full sample configuration file.
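To make the ${PORT} macro concrete, the substitution roughly amounts to the following. This is a toy sketch of the idea, not llama-swap's actual code:

```python
import socket

def free_port():
    # Ask the OS for any unused TCP port; binding to port 0 lets the
    # kernel pick one, which is what automatic port assignment boils down to.
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

# The cmd string from config.yaml, with the macro still in place
cmd_template = "llama-server --model model.gguf --port ${PORT}"

# llama-swap fills in a free port before launching the process
cmd = cmd_template.replace("${PORT}", str(free_port()))
print(cmd)  # e.g. llama-server --model model.gguf --port 52734
```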
// 4. Run llama-swap
With the binary and config.yaml ready, start llama-swap, pointing it at the configuration:
./llama-swap --config config.yaml --listen 127.0.0.1:8080
This launches the proxy server on localhost:8080. It reads config.yaml and (initially) loads no models until the first request arrives. llama-swap will now serve API requests on port 8080, handing them off to the appropriate underlying llama-server process based on the "model" parameter.
// 5. Interact with Your Models
Now you can make OpenAI-style API calls to test either model. Install jq (used below to pretty-print the JSON responses) if you don't already have it, then run the following commands:
// Using Qwen2.5
curl -s http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "model": "qwen2.5",
    "prompt": "User: What is Python?\nAssistant:",
    "max_tokens": 100
  }' | jq '.choices[0].text'
Output:
"Python is a popular general-purpose programming language. It is easy to learn, has a large standard library, and is compatible with many operating systems. Python is used for web development, data analysis, scientific computing, and machine learning.\nPython is a language that is popular for web development due to its simplicity, versatility and its use of modern features. It is used in a wide range of applications including web development, data analysis, scientific computing, machine learning and more. Python is a popular language in the"
// Using SmolLM2
curl -s http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "model": "smollm2",
    "prompt": "User: What is Python?\nAssistant:",
    "max_tokens": 100
  }' | jq '.choices[0].text'
Output:
"Python is a high-level programming language designed for simplicity and efficiency. It's known for its readability, syntax, and versatility, making it a popular choice for beginners and developers alike.\n\nWhat is Python?"
Each model responds according to its training. The beauty of llama-swap is that you don't have to start anything by hand: just change the "model" field and it handles the rest. As the examples above show:
- qwen2.5: a more detailed, technical response
- smollm2: a simpler, more concise answer
This confirms that llama-swap is routing requests to the right model!
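The same point can be seen from Python: the request body is identical for both models except for the "model" field. The helper below only builds the JSON payload (a minimal sketch; to actually get completions, POST it to the proxy from step 4, assumed to be at http://localhost:8080/v1/completions):

```python
import json

def completion_payload(model, prompt, max_tokens=100):
    # The request body is the same for every model; only "model" changes,
    # and llama-swap uses that field to pick which server handles it.
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
    })

prompt = "User: What is Python?\nAssistant:"
for model in ("qwen2.5", "smollm2"):
    # POST this payload to the llama-swap endpoint when the proxy is running
    print(completion_payload(model, prompt))
```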
# Conclusion
Congratulations! You have configured llama-swap to run two LLMs on one machine, and you can now switch between them on the fly via API calls. We installed the proxy, prepared a YAML configuration with two models, and saw how llama-swap routes requests to the right backend.
Next steps: You can extend this setup with:
- Larger models (like TinyLlama, Phi-2, or Mistral)
- Groups for serving several models simultaneously
- Integration with LangChain, FastAPI, or other frameworks
Have fun exploring different models and configurations!
Kanwal Mehreen is a machine learning engineer and a technical writer with a deep passion for data science and the intersection of AI with medicine. She co-authored the ebook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.
