Zallama v1.0.0
Local AI Gateway & Process Manager

The local AI gateway you've been waiting for.

Zallama is a memory-aware, multimodal local AI gateway powered by llama.cpp, parakeet.cpp, and kokoro.cpp. Spin up servers dynamically, query OpenAI-compatible text/voice endpoints, and let Zallama automatically unload models to fit your memory budget.

One-line installation

git clone https://github.com/rzafiamy/zallama.git && cd zallama && sudo bash install.sh
zallama CLI — interactive
bash
Commands

Ecosystem Engine

One Interface, Unlimited Local Modalities

Zallama wraps complex C++ machine learning runners into a unified, high-performance gateway matching OpenAI's API specs.

Text & Reasoning Core

Powered by llama-server, Zallama loads text generation and reasoning models. Watch thinking models like DeepSeek-R1 output reasoning blocks in real-time inside the interactive CLI or custom API clients.

DeepSeek-R1-8B output

<think>

Analyzing prompt 'Explain quantum tunneling'...

Let's breakdown: wave function, potential barriers, Schrödinger equation...

</think>

Quantum tunneling is a phenomenon where a particle passes through a potential energy barrier...

Memory-Budget Eviction

Say goodbye to manually terminating model servers. Define a memory budget (e.g., 12GB) and Zallama will evict the Least-Recently-Used (LRU) model servers to fit incoming models.

Active RAM Budget
8.4GB / 12GB (70%)

Multilingual ASR

Zallama abstracts parakeet-server to support local audio transcription at `/v1/audio/transcriptions`. Upload any audio (MP3, WebM, M4A, FLAC) and Zallama uses ffmpeg to auto-transcode and transcribe it instantly.

Vision (Multimodal)

Attach an mmproj projector artifact in your model registry. Zallama binds base models with visual projection matrices automatically, letting you analyze images natively.

IMG

chart.png

2.4 MB

"Analyze this chart" → Qwen-VL extracts data...

Local Speech Synthesis

Generate lifelike voices locally using standalone `kokoro-server` based on `kokoro.cpp`. Standalone binary links ONNX Runtime statically for easy deployment. Query the OpenAI-compatible `/v1/audio/speech` endpoint.

Kokoro TTS execution

$ zallama pull kokoro:82m

Loaded model. Speech audio file output saved successfully.

Compiling & Installation

How to Setup Zallama

Compile the C++ binaries for your platform, then run the unified Python-based daemon wrapper.

Step 1 build-ggml-llama.sh

1. Compile Llama Engine

Compiles llama.cpp with CUDA support. You MUST pass a tag or branch name parameter (e.g. b4600).

./build-ggml-llama.cpp.sh b4600
Step 2 build-ggml-parakeet.sh

2. Compile Parakeet ASR

Compiles the speech-to-text parakeet-server, copies shared libraries, and patches paths via patchelf.

./build-ggml-parakeet.cpp.sh master
Step 3 build-ggml-kokoro.sh

3. Compile Kokoro TTS

Compiles the speech synthesis kokoro-server statically. ONNX Runtime is fetched and linked dynamically or statically.

./build-ggml-kokoro.cpp.sh v0.1.0
Step 4 install.sh

4. Setup Zallama System

Initializes local Python virtual environment, registers symlinks, autocompletes, and registers an always-on daemon service. Run with root access.

sudo bash install.sh

Hugging Face Downloader

Accelerated Downloads

Zallama leverages aria2c to download files with up to 8 concurrent threads. When unavailable, it defaults to a parallel Python HTTP engine. Just query presets or use direct Hugging Face repository file routes.

Pro Tip: Modality Detection

ASR and TTS GGUF models are auto-detected and registered based on filenames. You can use the --type flag to force configuration when downloading custom models (e.g. --type tts).

Unsloth Llama 3.2 3B Preset

Preset ID: llama3.2:3b

zallama pull llama3.2:3b

Custom Hugging Face Model File

Qwen2.5 Coder Instruct GGUF

zallama pull unsloth/Qwen2.5-Coder-7B-Instruct-GGUF/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf

Parakeet Multilingual v3 ASR

Transcribe French, Spanish, German, etc.

zallama pull mudler/parakeet-cpp-gguf/tdt-0.6b-v3-q8_0.gguf

Kokoro v0.19 82M TTS Model

Voice synthesis for English/French/etc.

zallama pull kokoro:82m

Dynamic Process Scheduler

Interactive Eviction Visualizer

Experience how Zallama handles memory budgeting. Set a limit, load models, and watch the process manager automatically unload the Least-Recently-Used (LRU) models when the budget is reached.

Memory Budget Config
RAM/VRAM Limit: 12.0 GB

System Load Monitor

Memory Budget Allocation 0.0GB / 12.0GB (0%)

Loaded Models

0

Estimated Headroom

12.0 GB

[MANAGER] Idle. Waiting for models to start.

Daemon Memory Map (registry.yaml)

Developer Integration

OpenAI Compatible API

Integrate Zallama directly with your existing AI tools, frameworks, and agents by simply updating the API base url.

Endpoint Modalities
Production Deploy

Systemd Service Setup

Run Zallama as an always-on backend service that boots up automatically with your server. If the installation was not executed with root access, you can register and launch the daemon manually with these standard systemctl commands.

Log Tracking

You can easily view logs for the service using: journalctl -u zallama -f

zallama.service
[Unit]
Description=Zallama — Local LLM Server
After=network.target

[Service]
Type=simple
User=cook
WorkingDirectory=/home/cook/Documents/Dev/Dev-ai/zallama
ExecStart=/home/cook/Documents/Dev/Dev-ai/zallama/zallama serve
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
Activate Commands
sudo systemctl enable --now zallama