Getting Started

High-performance LLM inference with zero dependencies

Install

$ pip install talu

Python 3.10+ · PyPI · Releases · Build from source

Quick Start (CLI)

CLI Guide →

Download and list cached models

$ talu get LiquidAI/LFM2-350M
$ talu ls

Ask a question directly with -m

$ talu ask -m LiquidAI/LFM2-350M "What is 2+2?"

Set a default model so -m is optional

$ talu set LiquidAI/LFM2-350M
$ talu set show
$ talu ask "Tell me a short joke."

Quantize a model

$ talu convert LiquidAI/LFM2-350M
$ talu set LiquidAI/LFM2-350M-GAF4
$ talu ask "Explain quantization in one sentence."

Converts to 4-bit (default scheme: gaf4_64). Converted models are saved with a -GAF4 suffix; the original remains available.

Scheme     Description
gaf4_32    4-bit, group 32 — highest accuracy, largest
gaf4_64    4-bit, group 64 — balanced (default)
gaf4_128   4-bit, group 128 — smallest 4-bit
gaf8_32    8-bit, group 32 — near-original quality
gaf8_64    8-bit, group 64
gaf8_128   8-bit, group 128

4-bit reduces model size ~4x with some quality loss. 8-bit preserves more quality at ~2x reduction. Smaller group sizes improve accuracy but increase file size.
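To make the group-size tradeoff concrete, here is a minimal sketch of generic group-wise affine quantization in plain Python. This illustrates the technique, not talu's actual gaf4 on-disk format, which is not documented here; the function names are hypothetical.

```python
# Illustrative group-wise affine quantization (generic technique; the exact
# gaf4 layout used by talu is an assumption, not shown in these docs).

def quantize_group(values, bits=4):
    """Quantize one group of floats to unsigned ints plus a scale and offset."""
    lo, hi = min(values), max(values)
    qmax = (1 << bits) - 1                 # 15 levels above zero for 4-bit
    scale = (hi - lo) / qmax or 1.0        # guard against a flat group
    q = [round((v - lo) / scale) for v in values]
    return q, scale, lo

def dequantize_group(q, scale, lo):
    """Map quantized ints back to approximate floats."""
    return [x * scale + lo for x in q]

weights = [0.02 * i - 0.5 for i in range(64)]   # one group of 64 weights
q, scale, lo = quantize_group(weights)
restored = dequantize_group(q, scale, lo)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

# Rounding error is at most half a quantization step. Smaller groups cover a
# narrower value range, so the step (and the error) shrinks, at the cost of
# storing one scale/offset pair per group.
assert max_err <= scale / 2 + 1e-9
```

This is why gaf4_32 is more accurate but larger than gaf4_128: the per-group metadata is amortized over fewer weights.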

Start the HTTP server

$ talu serve

By default, the server listens on http://127.0.0.1:8258.

API compatibility target: OpenResponses.
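Once the server is running, you can call it from any HTTP client. The sketch below assumes an OpenResponses-style POST /v1/responses endpoint with a model/input JSON body; the exact route and payload shape are assumptions based on the stated compatibility target, so check the server's actual API before relying on them.

```python
# Hypothetical client for the local talu server (endpoint path and payload
# fields are assumptions inferred from the OpenResponses compatibility target).
import json
import urllib.request

def build_request(prompt, model="LiquidAI/LFM2-350M",
                  base_url="http://127.0.0.1:8258"):
    payload = {"model": model, "input": prompt}
    return urllib.request.Request(
        f"{base_url}/v1/responses",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request("What is 2+2?")
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        print(json.load(resp))
except OSError:
    print("server not reachable; start it with `talu serve`")
```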

Python

Python Examples →

Basic chat:

from talu import Chat

chat = Chat("LiquidAI/LFM2-350M", system="You are helpful.")
response = chat("What is the capital of France?")
print(response)

response = response.append("Now answer in one sentence.")
print(response)

Shared client for multiple chats:

from talu import Client

client = Client("LiquidAI/LFM2-350M")
alice = client.chat(system="You are concise.")
bob = client.chat(system="You are detailed.")

print(alice("Explain recursion."))
print(bob("Explain recursion."))
client.close()

Persistent profile-backed sessions:

import talu

profile = talu.Profile("work")
chat = talu.Chat("LiquidAI/LFM2-350M", profile=profile)
chat("Draft release notes.", stream=False)

print(talu.list_sessions(profile="work", limit=1))

Supported Models

Models are downloaded from Hugging Face on first use. The models listed below have been verified; other sizes and variants of the same architectures are expected to work as well. This list is updated as coverage expands.

Qwen

LLaMA

Mistral

Gemma

Phi

Granite

LFM