# Getting Started

High-performance LLM inference with zero dependencies.
## Install

Requires Python 3.10+. Install from PyPI, download a prebuilt release, or build from source.
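Assuming the package is published on PyPI under the same name as the CLI, installation is one command:

```bash
# Assumption: the PyPI package name matches the CLI name.
pip install talu
```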
## Quick Start (CLI)
### Download and list cached models
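The exact subcommand names aren't shown in this guide; the sketch below uses hypothetical `pull` and `ls` subcommands, so check `talu --help` for the real ones:

```bash
# Hypothetical subcommand names; verify with `talu --help`.
talu pull LiquidAI/LFM2-350M   # download a model from HuggingFace
talu ls                        # list models in the local cache
```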
### Ask a question directly with -m
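The `-m` flag is documented here; passing the prompt as a positional argument is an assumption of this sketch:

```bash
# Assumption: the prompt is given as a positional argument.
talu -m LiquidAI/LFM2-350M "What is the capital of France?"
```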
### Set a default model so -m is optional
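How the default is persisted isn't specified here; a hypothetical `config`-style subcommand might look like this:

```bash
# Hypothetical: store a default model, then omit -m afterwards.
talu config set default-model LiquidAI/LFM2-350M
talu "What is the capital of France?"
```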
### Quantize a model
Quantization converts the model to 4-bit by default (scheme `gaf4_64`). Converted models are saved with a `-GAF4` suffix; the original remains available.
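A sketch, assuming the subcommand is named `quantize`; the scheme names come from the table below:

```bash
# Hypothetical subcommand name; gaf4_64 is the documented default scheme.
talu quantize LiquidAI/LFM2-350M --scheme gaf4_64
# Output is written with a -GAF4 suffix; the original model is kept.
```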
| Scheme | Description |
|---|---|
| `gaf4_32` | 4-bit, group 32: highest accuracy, largest files |
| `gaf4_64` | 4-bit, group 64: balanced (default) |
| `gaf4_128` | 4-bit, group 128: smallest 4-bit |
| `gaf8_32` | 8-bit, group 32: near-original quality |
| `gaf8_64` | 8-bit, group 64 |
| `gaf8_128` | 8-bit, group 128 |
4-bit reduces model size ~4x with some quality loss. 8-bit preserves more quality at ~2x reduction. Smaller group sizes improve accuracy but increase file size.
### Start the HTTP server
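Launch it with:

```bash
talu serve
```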
By default, the server listens on `http://127.0.0.1:8258`:

- Console UI: `http://127.0.0.1:8258/`
- OpenResponses-compatible API: `http://127.0.0.1:8258/v1`
- Override the port with `talu serve --port 9000`
API compatibility target: OpenResponses.
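Assuming the API follows the OpenAI-style `/v1/responses` shape that OpenResponses mirrors, a request sketch:

```bash
# Sketch only: exact fields depend on the OpenResponses spec version served.
curl http://127.0.0.1:8258/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model": "LiquidAI/LFM2-350M", "input": "What is the capital of France?"}'
```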
## Python
Basic chat:

```python
from talu import Chat

# Start a chat session; the model is fetched from HuggingFace on first use.
chat = Chat("LiquidAI/LFM2-350M", system="You are helpful.")

response = chat("What is the capital of France?")
print(response)

# Append a follow-up turn to continue the conversation.
response = response.append("Now answer in one sentence.")
print(response)
```
Shared client for multiple chats:

```python
from talu import Client

# One client loads the model once; each chat keeps its own history.
client = Client("LiquidAI/LFM2-350M")
alice = client.chat(system="You are concise.")
bob = client.chat(system="You are detailed.")

print(alice("Explain recursion."))
print(bob("Explain recursion."))

# Release the model when done.
client.close()
```
Persistent profile-backed sessions:

```python
import talu

# Sessions created under a profile are persisted and can be listed later.
profile = talu.Profile("work")
chat = talu.Chat("LiquidAI/LFM2-350M", profile=profile)

chat("Draft release notes.", stream=False)
print(talu.list_sessions(profile="work", limit=1))
```
## Supported Models
Models are downloaded from HuggingFace on first use. The models below have been verified; other sizes and variants built on the same architectures are expected to work as well. This list is updated as coverage expands.
### Qwen

- `Qwen/Qwen3-0.6B`
- `Qwen/Qwen3-1.7B`
- `Qwen/Qwen3-4B`

### LLaMA

- `meta-llama/Llama-3.2-1B`
- `meta-llama/Llama-3.2-1B-Instruct`

### Mistral

- `mistralai/Ministral-3B-Instruct`

### Gemma

- `google/gemma-3-270m-it`
- `google/gemma-3-1b-it`

### Phi

- `microsoft/Phi-3-mini-128k-instruct`
- `microsoft/Phi-3.5-mini-instruct`
- `microsoft/Phi-4-mini-instruct`
- `microsoft/Phi-4-mini-reasoning`

### Granite

- `ibm-granite/granite-4.0-h-350m`
- `ibm-granite/granite-4.0-micro`

### LFM

- `LiquidAI/LFM2-350M`
- `LiquidAI/LFM2-1.2B`
- `LiquidAI/LFM2-2.6B`
- `LiquidAI/LFM2.5-1.2B-Instruct`
- `LiquidAI/LFM2.5-1.2B-Thinking`