The local AI gateway you've been waiting for.
Zallama is a memory-aware, multimodal local AI gateway powered by llama.cpp, parakeet.cpp, and kokoro.cpp. Spin up servers dynamically, query OpenAI-compatible text/voice endpoints, and let Zallama automatically unload models to fit your memory budget.
One-line installation
Ecosystem Engine
One Interface, Unlimited Local Modalities
Zallama wraps complex C++ machine learning runners into a unified, high-performance gateway matching OpenAI's API specs.
Text & Reasoning Core
Powered by llama-server, Zallama loads text generation and reasoning models. Watch thinking models like DeepSeek-R1 output reasoning blocks in real-time inside the interactive CLI or custom API clients.
<think>
Analyzing prompt 'Explain quantum tunneling'...
Let's breakdown: wave function, potential barriers, Schrödinger equation...
</think>
Quantum tunneling is a phenomenon where a particle passes through a potential energy barrier...
Memory-Budget Eviction
Say goodbye to manually terminating model servers. Define a memory budget (e.g., 12GB) and Zallama will evict the Least-Recently-Used (LRU) model servers to fit incoming models.
Multilingual ASR
Zallama abstracts parakeet-server to support local audio transcription at `/v1/audio/transcriptions`. Upload any audio (MP3, WebM, M4A, FLAC) and Zallama uses ffmpeg to auto-transcode and transcribe it instantly.
Vision (Multimodal)
Attach an mmproj projector artifact in your model registry. Zallama binds base models with visual projection matrices automatically, letting you analyze images natively.
chart.png
2.4 MB
Local Speech Synthesis
Generate lifelike voices locally using standalone `kokoro-server` based on `kokoro.cpp`. Standalone binary links ONNX Runtime statically for easy deployment. Query the OpenAI-compatible `/v1/audio/speech` endpoint.
$ zallama pull kokoro:82m
Loaded model. Speech audio file output saved successfully.
Compiling & Installation
How to Setup Zallama
Compile the C++ binaries for your platform, then run the unified Python-based daemon wrapper.
1. Compile Llama Engine
Compiles llama.cpp with CUDA support. You MUST pass a tag or branch name parameter (e.g. b4600).
2. Compile Parakeet ASR
Compiles the speech-to-text parakeet-server, copies shared libraries, and patches paths via patchelf.
3. Compile Kokoro TTS
Compiles the speech synthesis kokoro-server statically. ONNX Runtime is fetched and linked dynamically or statically.
4. Setup Zallama System
Initializes local Python virtual environment, registers symlinks, autocompletes, and registers an always-on daemon service. Run with root access.
Hugging Face Downloader
Accelerated Downloads
Zallama leverages aria2c to download files with up to 8 concurrent threads. When unavailable, it defaults to a parallel Python HTTP engine. Just query presets or use direct Hugging Face repository file routes.
Pro Tip: Modality Detection
ASR and TTS GGUF models are auto-detected and registered based on filenames. You can use the --type flag to force configuration when downloading custom models (e.g. --type tts).
Unsloth Llama 3.2 3B Preset
Preset ID: llama3.2:3b
Custom Hugging Face Model File
Qwen2.5 Coder Instruct GGUF
Parakeet Multilingual v3 ASR
Transcribe French, Spanish, German, etc.
Kokoro v0.19 82M TTS Model
Voice synthesis for English/French/etc.
Dynamic Process Scheduler
Interactive Eviction Visualizer
Experience how Zallama handles memory budgeting. Set a limit, load models, and watch the process manager automatically unload the Least-Recently-Used (LRU) models when the budget is reached.
System Load Monitor
Loaded Models
0
Estimated Headroom
12.0 GB
Daemon Memory Map (registry.yaml)
Developer Integration
OpenAI Compatible API
Integrate Zallama directly with your existing AI tools, frameworks, and agents by simply updating the API base url.
Systemd Service Setup
Run Zallama as an always-on backend service that boots up automatically with your server. If the installation was not executed with root access, you can register and launch the daemon manually with these standard systemctl commands.
Log Tracking
You can easily view logs for the service using: journalctl -u zallama -f
[Unit] Description=Zallama — Local LLM Server After=network.target [Service] Type=simple User=cook WorkingDirectory=/home/cook/Documents/Dev/Dev-ai/zallama ExecStart=/home/cook/Documents/Dev/Dev-ai/zallama/zallama serve Restart=on-failure RestartSec=5 [Install] WantedBy=multi-user.target