Top-Level API

Talu - Fast LLM inference in pure Python

Functions

Function Description
ask() One-shot question-answer with automatic resource cleanup.
list_sessions() List sessions in a profile.
convert() Convert a model to an optimized format for efficient inference.

def ask

talu.ask(
    model: str,
    prompt: str,
    system: str | None = None,
    config: GenerationConfig | None = None,
    **kwargs: Any
) -> Response

One-shot question-answer with automatic resource cleanup.

This is the simplest way to get a single response from a model. Resources are automatically released when the function returns, preventing memory leaks in loops.

Warning

Performance: This function loads the model on every call.

For repeated queries, use Client or Chat instead to avoid paying model load time on each call:

# SLOW: Loads model 100 times
for q in questions:
    r = talu.ask("Qwen/Qwen3-0.6B", q)  # ~2-5s load + generation
# FAST: Loads model once
chat = talu.Chat("Qwen/Qwen3-0.6B")
for q in questions:
    r = chat(q)  # Just generation time

The returned Response is "detached" - calling .append() on it will raise an error. For multi-turn conversations, use talu.Chat() instead.

Parameters
model

Model identifier (local path, HuggingFace ID, or URI).

prompt

The message to send (required).

system

Optional system prompt.

config

Generation configuration.

**kwargs

Generation overrides (temperature, max_tokens, etc.); see the overrides example below.

Returns

Response object (str-able, with metadata access). Cannot be replied to.

Raises
ModelError

If the model cannot be loaded.

GenerationError

If generation fails.

Example - Simple question
>>> response = talu.ask("Qwen/Qwen3-0.6B", "What is 2+2?")
>>> print(response)
4
Example - With system prompt
>>> response = talu.ask(
...     "Qwen/Qwen3-0.6B",
...     "Hello!",
...     system="You are a pirate.",
... )
>>> print(response)
Ahoy there, matey!
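Example - Generation overrides via kwargs (a minimal sketch; temperature and max_tokens are the override names listed under **kwargs above, and actual output will vary)
>>> response = talu.ask(
...     "Qwen/Qwen3-0.6B",
...     "Summarize photosynthesis in one sentence.",
...     temperature=0.2,
...     max_tokens=64,
... )
>>> print(response)  # output varies by model and sampling settings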
Example - Safe in loops (no resource leak)
>>> for question in questions:
...     response = talu.ask(model, question)  # Auto-cleans each iteration
...     results.append(str(response))

For repeated use, prefer Chat (loads model once):
>>> chat = talu.Chat("Qwen/Qwen3-0.6B")
>>> for question in questions:
...     response = chat(question)  # Fast: model already loaded
...     results.append(str(response))

For conversations with append
>>> chat = talu.Chat("Qwen/Qwen3-0.6B")
>>> r1 = chat("Hello!")
>>> r2 = r1.append("Tell me more")  # Works because Chat is attached

def list_sessions

talu.list_sessions(
    profile: str | None = None,
    search: str | None = None,
    limit: int = 50
) -> list[dict]

List sessions in a profile.

Parameters
profile

Profile name, or None for the default profile.

search

Filter sessions by text content.

limit

Maximum number of sessions to return.
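
Example - A minimal usage sketch (the "pirate" query is illustrative, and the keys of each session dict are not documented here, so only the count is checked)
>>> import talu
>>> sessions = talu.list_sessions(search="pirate", limit=10)
>>> len(sessions) <= 10
True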


def convert

talu.convert(
    model: str,
    scheme: str | None = None,
    platform: str | None = None,
    quant: str | None = None,
    output_dir: str | None = None,
    destination: str | None = None,
    force: bool = False,
    offline: bool = False,
    verify: bool = False,
    overrides: dict[str, str] | None = None,
    max_shard_size: int | str | None = None,
    dry_run: bool = False
) -> str | dict

Convert a model to an optimized format for efficient inference.

Parameters
model : str

Model to convert. Can be:

  • Model ID: "Qwen/Qwen3-0.6B", "meta-llama/Llama-3-8B"
  • Local path: "./my-model" or "/path/to/model"

scheme : str, optional

Explicit quantization scheme. Each scheme encodes all necessary parameters (method, bits, group_size). You can use specific keys or aliases.

If not set, the scheme is resolved automatically from platform and quant. When neither scheme nor platform/quant is specified, defaults to gaf4_64.

User-Friendly Aliases (Recommended):

  • "4bit" / "q4" / "int4": Maps to gaf4_64 (balanced 4-bit)
  • "8bit" / "q8" / "int8": Maps to gaf8_64 (near-lossless)
  • "mlx" / "mlx4" / "gaf4": Maps to gaf4_64 (Apple Silicon optimized)
  • "mlx8" / "gaf8": Maps to gaf8_64 (8-bit MLX)
  • "fp8": Maps to fp8_e4m3 (H100/vLLM inference)

Grouped Affine (MLX compatible): gaf4_32, gaf4_64, gaf4_128, gaf8_32, gaf8_64, gaf8_128

Hardware float (not yet implemented): fp8_e4m3, fp8_e5m2, mxfp4, nvfp4

platform : str, optional

Target platform for scheme resolution ("cpu", "metal", "cuda"). When set, resolves to the appropriate scheme for that platform and quant level. Ignored if scheme is explicitly set.

quant : str, optional

Quantization level ("4bit", "8bit"). Used with platform. Defaults to "4bit" if platform is set but quant is not.

output_dir : str, optional

Parent directory for auto-named output. Defaults to ~/.cache/talu/models (or $TALU_HOME/models). Ignored if destination is set.

destination : str, optional

Explicit output path (overrides output_dir).

force : bool, default False

Overwrite existing output directory.

offline : bool, default False

If True, do not use network access when resolving model URIs.

verify : bool, default False

After conversion, verify the model by loading it and generating a few tokens. Catches corruption, missing files, and basic inference failures early.

overrides : dict, optional

Reserved for future use. Not currently supported.

max_shard_size : int | str, optional

Maximum size per shard file. When set, splits large models into multiple SafeTensors files. Can be bytes (int) or human-readable string (e.g., "5GB", "500MB").

dry_run : bool, default False

If True, estimate conversion without writing files. Returns a dict with estimation results (total_params, estimated_size_bytes, shard_count, scheme, bits_per_param).

Returns

str | dict

When dry_run=False: absolute path to the converted model directory.
When dry_run=True: dict with estimation results.

Raises
ConvertError

If conversion fails (network error, unsupported format, etc.), or if verification fails when verify=True.

ValueError

If invalid scheme or override is provided.

Examples
>>> import talu
>>> path = talu.convert("Qwen/Qwen3-0.6B")  # Uses gaf4_64 by default
>>> path = talu.convert("Qwen/Qwen3-0.6B", scheme="gaf4_32")  # Higher quality
>>> # Platform-aware conversion
>>> path = talu.convert("Qwen/Qwen3-0.6B", platform="metal")  # → gaf4_64
>>> # Sharded output for large models
>>> path = talu.convert("meta-llama/Llama-3-70B", max_shard_size="5GB")

See Also

list_schemes : List all available quantization schemes.
verify : Verify a converted model.