Local AI Gateway & Process Manager

The local AI gateway you've been waiting for.

Zallama is a memory-aware, multimodal local AI gateway powered by llama.cpp, parakeet.cpp, and kokoro.cpp. Spin up servers dynamically, query OpenAI-compatible text/voice endpoints, and let Zallama automatically unload models to fit your memory budget.

Quick Setup Interactive Visualizer

One-line installation

git clone https://github.com/rzafiamy/zallama.git && cd zallama && sudo bash install.sh

zallama CLI — interactive

bash

Commands

Ecosystem Engine

One Interface, Unlimited Local Modalities

Zallama wraps complex C++ machine learning runners into a unified, high-performance gateway matching OpenAI's API specs.

Text & Reasoning Core

Powered by llama-server, Zallama loads text generation and reasoning models. Watch thinking models like DeepSeek-R1 output reasoning blocks in real-time inside the interactive CLI or custom API clients.

DeepSeek-R1-8B output

<think>

Analyzing prompt 'Explain quantum tunneling'...

Let's breakdown: wave function, potential barriers, Schrödinger equation...

</think>

Quantum tunneling is a phenomenon where a particle passes through a potential energy barrier...

Memory-Budget Eviction

Say goodbye to manually terminating model servers. Define a memory budget (e.g., 12GB) and Zallama will evict the Least-Recently-Used (LRU) model servers to fit incoming models.

Active RAM Budget

8.4GB / 12GB (70%)

Multilingual ASR

Zallama abstracts parakeet-server to support local audio transcription at `/v1/audio/transcriptions`. Upload any audio (MP3, WebM, M4A, FLAC) and Zallama uses ffmpeg to auto-transcode and transcribe it instantly.

Vision (Multimodal)

Attach an mmproj projector artifact in your model registry. Zallama binds base models with visual projection matrices automatically, letting you analyze images natively.

IMG

chart.png

2.4 MB

"Analyze this chart" → Qwen-VL extracts data...

Local Speech Synthesis

Generate lifelike voices locally using standalone `kokoro-server` based on `kokoro.cpp`. Standalone binary links ONNX Runtime statically for easy deployment. Query the OpenAI-compatible `/v1/audio/speech` endpoint.

Kokoro TTS execution

$ zallama pull kokoro:82m

Loaded model. Speech audio file output saved successfully.

Compiling & Installation

How to Setup Zallama

Compile the C++ binaries for your platform, then run the unified Python-based daemon wrapper.

Step 1 build-ggml-llama.sh

1. Compile Llama Engine

Compiles llama.cpp with CUDA support. You MUST pass a tag or branch name parameter (e.g. b4600).

./build-ggml-llama.cpp.sh b4600

Step 2 build-ggml-parakeet.sh

2. Compile Parakeet ASR

Compiles the speech-to-text parakeet-server, copies shared libraries, and patches paths via patchelf.

./build-ggml-parakeet.cpp.sh master

Step 3 build-ggml-kokoro.sh

3. Compile Kokoro TTS

Compiles the speech synthesis kokoro-server statically. ONNX Runtime is fetched and linked dynamically or statically.

./build-ggml-kokoro.cpp.sh v0.1.0

Step 4 install.sh

4. Setup Zallama System

Initializes local Python virtual environment, registers symlinks, autocompletes, and registers an always-on daemon service. Run with root access.

sudo bash install.sh

Hugging Face Downloader

Accelerated Downloads

Zallama leverages aria2c to download files with up to 8 concurrent threads. When unavailable, it defaults to a parallel Python HTTP engine. Just query presets or use direct Hugging Face repository file routes.

Pro Tip: Modality Detection

ASR and TTS GGUF models are auto-detected and registered based on filenames. You can use the --type flag to force configuration when downloading custom models (e.g. --type tts).

Unsloth Llama 3.2 3B Preset

Preset ID: llama3.2:3b

zallama pull llama3.2:3b

Custom Hugging Face Model File

Qwen2.5 Coder Instruct GGUF

zallama pull unsloth/Qwen2.5-Coder-7B-Instruct-GGUF/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf

Parakeet Multilingual v3 ASR

Transcribe French, Spanish, German, etc.

zallama pull mudler/parakeet-cpp-gguf/tdt-0.6b-v3-q8_0.gguf

Kokoro v0.19 82M TTS Model

Voice synthesis for English/French/etc.

zallama pull kokoro:82m

Dynamic Process Scheduler

Interactive Eviction Visualizer

Experience how Zallama handles memory budgeting. Set a limit, load models, and watch the process manager automatically unload the Least-Recently-Used (LRU) models when the budget is reached.

Memory Budget Config

RAM/VRAM Limit: 12.0 GB

System Load Monitor

Memory Budget Allocation 0.0GB / 12.0GB (0%)

Loaded Models

Estimated Headroom

12.0 GB

[MANAGER] Idle. Waiting for models to start.

Daemon Memory Map (`registry.yaml`)

Developer Integration

OpenAI Compatible API

Integrate Zallama directly with your existing AI tools, frameworks, and agents by simply updating the API base url.

Endpoint Modalities

Production Deploy

Systemd Service Setup

Run Zallama as an always-on backend service that boots up automatically with your server. If the installation was not executed with root access, you can register and launch the daemon manually with these standard systemctl commands.

Log Tracking

You can easily view logs for the service using: journalctl -u zallama -f

zallama.service

[Unit]
Description=Zallama — Local LLM Server
After=network.target

[Service]
Type=simple
User=cook
WorkingDirectory=/home/cook/Documents/Dev/Dev-ai/zallama
ExecStart=/home/cook/Documents/Dev/Dev-ai/zallama/zallama serve
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Activate Commands

sudo systemctl enable --now zallama

The local AI gateway you've been waiting for.

Ecosystem Engine

Text & Reasoning Core

Memory-Budget Eviction

Multilingual ASR

Vision (Multimodal)

Local Speech Synthesis

Compiling & Installation

1. Compile Llama Engine

2. Compile Parakeet ASR

3. Compile Kokoro TTS

4. Setup Zallama System

Hugging Face Downloader

Unsloth Llama 3.2 3B Preset

Custom Hugging Face Model File

Parakeet Multilingual v3 ASR

Kokoro v0.19 82M TTS Model

Dynamic Process Scheduler

System Load Monitor

Daemon Memory Map (`registry.yaml`)

Developer Integration

Chat Completion

Vision Analytics

Speech-to-Text ASR

Speech Synthesis (TTS)

Systemd Service Setup

The local AI gateway you've been waiting for.

Ecosystem Engine

Text & Reasoning Core

Memory-Budget Eviction

Multilingual ASR

Vision (Multimodal)

Local Speech Synthesis

Compiling & Installation

1. Compile Llama Engine

2. Compile Parakeet ASR

3. Compile Kokoro TTS

4. Setup Zallama System

Hugging Face Downloader

Unsloth Llama 3.2 3B Preset

Custom Hugging Face Model File

Parakeet Multilingual v3 ASR

Kokoro v0.19 82M TTS Model

Dynamic Process Scheduler

System Load Monitor

Daemon Memory Map (registry.yaml)

Developer Integration

Chat Completion

Vision Analytics

Speech-to-Text ASR

Speech Synthesis (TTS)

Systemd Service Setup

Daemon Memory Map (`registry.yaml`)