Getting Started

High-performance LLM inference with zero dependencies

Install

$ pip install talu

Python 3.10+ · PyPI · Releases · Build from source

Quick Start (CLI)

CLI Guide →

Download and list cached models

$ talu get LiquidAI/LFM2-350M
$ talu ls

Ask a question directly with -m

$ talu ask -m LiquidAI/LFM2-350M "What is 2+2?"

Set a default model so -m is optional

$ talu set LiquidAI/LFM2-350M
$ talu set show
$ talu ask "Tell me a short joke."

Quantize a model

$ talu convert LiquidAI/LFM2-350M
$ talu set LiquidAI/LFM2-350M-GAF4
$ talu ask "Explain quantization in one sentence."

Converts to 4-bit (default scheme: gaf4_64). Converted models are saved with a -GAF4 suffix; the original remains available.

Scheme     Description
gaf4_32    4-bit, group 32 — highest accuracy, largest
gaf4_64    4-bit, group 64 — balanced (default)
gaf4_128   4-bit, group 128 — smallest 4-bit
gaf8_32    8-bit, group 32 — near-original quality
gaf8_64    8-bit, group 64
gaf8_128   8-bit, group 128

4-bit reduces model size ~4x with some quality loss. 8-bit preserves more quality at ~2x reduction. Smaller group sizes improve accuracy but increase file size.
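To make the group-size tradeoff concrete, here is a minimal sketch of generic group-wise affine quantization in plain Python. This illustrates the technique, not talu's actual gaf4 on-disk format, which is not documented here; the function names are hypothetical.

```python
# Illustrative group-wise affine quantization (generic technique; the exact
# gaf4 layout used by talu is an assumption, not shown in these docs).

def quantize_group(values, bits=4):
    """Quantize one group of floats to unsigned ints plus a scale and offset."""
    lo, hi = min(values), max(values)
    qmax = (1 << bits) - 1                 # 15 levels above zero for 4-bit
    scale = (hi - lo) / qmax or 1.0        # guard against a flat group
    q = [round((v - lo) / scale) for v in values]
    return q, scale, lo

def dequantize_group(q, scale, lo):
    """Map quantized ints back to approximate floats."""
    return [x * scale + lo for x in q]

weights = [0.02 * i - 0.5 for i in range(64)]   # one group of 64 weights
q, scale, lo = quantize_group(weights)
restored = dequantize_group(q, scale, lo)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

# Rounding error is at most half a quantization step. Smaller groups cover a
# narrower value range, so the step (and the error) shrinks, at the cost of
# storing one scale/offset pair per group.
assert max_err <= scale / 2 + 1e-9
```

This is why gaf4_32 is more accurate but larger than gaf4_128: the per-group metadata is amortized over fewer weights.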

Start the HTTP server

$ talu serve

By default, the server listens on http://127.0.0.1:8258.

API compatibility target: OpenResponses.
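Once the server is running, you can call it from any HTTP client. The sketch below assumes an OpenResponses-style POST /v1/responses endpoint with a model/input JSON body; the exact route and payload shape are assumptions based on the stated compatibility target, so check the server's actual API before relying on them.

```python
# Hypothetical client for the local talu server (endpoint path and payload
# fields are assumptions inferred from the OpenResponses compatibility target).
import json
import urllib.request

def build_request(prompt, model="LiquidAI/LFM2-350M",
                  base_url="http://127.0.0.1:8258"):
    payload = {"model": model, "input": prompt}
    return urllib.request.Request(
        f"{base_url}/v1/responses",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request("What is 2+2?")
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        print(json.load(resp))
except OSError:
    print("server not reachable; start it with `talu serve`")
```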

Python

Python Examples →

Basic chat:

from talu import Chat

chat = Chat("LiquidAI/LFM2-350M", system="You are helpful.")
response = chat("What is the capital of France?")
print(response)

response = response.append("Now answer in one sentence.")
print(response)

Shared client for multiple chats:

from talu import Client

client = Client("LiquidAI/LFM2-350M")
alice = client.chat(system="You are concise.")
bob = client.chat(system="You are detailed.")

print(alice("Explain recursion."))
print(bob("Explain recursion."))
client.close()

Persistent profile-backed sessions:

import talu

profile = talu.Profile("work")
chat = talu.Chat("LiquidAI/LFM2-350M", profile=profile)
chat("Draft release notes.", stream=False)

print(talu.list_sessions(profile="work", limit=1))

Supported Models

Models are downloaded from Hugging Face on first use. The models listed below have been verified; other sizes and variants of the same architectures are expected to work as well. This list is updated as coverage expands.

Qwen

LLaMA

Mistral

Gemma

Phi

Granite

LFM