Top-Level API

Talu - Fast LLM inference in pure Python

Functions

Function Description
ask() One-shot question-answer with automatic resource cleanup.
list_sessions() List sessions in a profile.
convert() Convert a model to an optimized format for efficient inference.

def ask

talu.ask(
    model: str,
    prompt: str,
    system: str | None = None,
    config: GenerationConfig | None = None,
    **kwargs: Any
) -> Response

One-shot question-answer with automatic resource cleanup.

This is the simplest way to get a single response from a model. Resources are automatically released when the function returns, preventing memory leaks in loops.

Warning

Performance: This function loads the model on every call.

For repeated queries, use Client or Chat instead to avoid paying model load time on each call:

# SLOW: Loads model 100 times
for q in questions:
    r = talu.ask("Qwen/Qwen3-0.6B", q)  # ~2-5s load + generation
# FAST: Loads model once
chat = talu.Chat("Qwen/Qwen3-0.6B")
for q in questions:
    r = chat(q)  # Just generation time

The returned Response is "detached" - calling .append() on it will raise an error. For multi-turn conversations, use talu.Chat() instead.

Parameters
model

Model identifier (local path, HuggingFace ID, or URI).

prompt

The message to send (required).

system

Optional system prompt.

config

Generation configuration.

**kwargs

Generation overrides (temperature, max_tokens, etc.); see the overrides example below.

Returns

Response object (str-able, with metadata access). Cannot be replied to.

Raises
ModelError

If the model cannot be loaded.

GenerationError

If generation fails.

Example - Simple question
>>> response = talu.ask("Qwen/Qwen3-0.6B", "What is 2+2?")
>>> print(response)
4
Example - With system prompt
>>> response = talu.ask(
...     "Qwen/Qwen3-0.6B",
...     "Hello!",
...     system="You are a pirate.",
... )
>>> print(response)
Ahoy there, matey!
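Example - Generation overrides via kwargs (a minimal sketch; temperature and max_tokens are the override names listed under **kwargs above, and actual output will vary)
>>> response = talu.ask(
...     "Qwen/Qwen3-0.6B",
...     "Summarize photosynthesis in one sentence.",
...     temperature=0.2,
...     max_tokens=64,
... )
>>> print(response)  # output varies by model and sampling settings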
Example - Safe in loops (no resource leak)
>>> for question in questions:
...     response = talu.ask(model, question)  # Auto-cleans each iteration
...     results.append(str(response))

For repeated use, prefer Chat (loads model once):
>>> chat = talu.Chat("Qwen/Qwen3-0.6B")
>>> for question in questions:
...     response = chat(question)  # Fast: model already loaded
...     results.append(str(response))

For conversations with append
>>> chat = talu.Chat("Qwen/Qwen3-0.6B")
>>> r1 = chat("Hello!")
>>> r2 = r1.append("Tell me more")  # Works because Chat is attached

def list_sessions

talu.list_sessions(
    profile: str | None = None,
    search: str | None = None,
    limit: int = 50
) -> list[dict]

List sessions in a profile.

Parameters
profile

Profile name, or None for the default profile.

search

Filter sessions by text content.

limit

Maximum number of sessions to return.
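
Example - A minimal usage sketch (the "pirate" query is illustrative, and the keys of each session dict are not documented here, so only the count is checked)
>>> import talu
>>> sessions = talu.list_sessions(search="pirate", limit=10)
>>> len(sessions) <= 10
True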


def convert

talu.convert(
    model: str,
    scheme: str | None = None,
    platform: str | None = None,
    quant: str | None = None,
    output_dir: str | None = None,
    destination: str | None = None,
    force: bool = False,
    offline: bool = False,
    verify: bool = False,
    overrides: dict[str, str] | None = None,
    max_shard_size: int | str | None = None,
    dry_run: bool = False
) -> str | dict

Convert a model to an optimized format for efficient inference.

Parameters
model : str

Model to convert. Can be:

  • Model ID: "Qwen/Qwen3-0.6B", "meta-llama/Llama-3-8B"
  • Local path: "./my-model" or "/path/to/model"

scheme : str, optional

Explicit quantization scheme. Each scheme encodes all necessary parameters (method, bits, group_size). You can use specific keys or aliases.

If not set, the scheme is resolved automatically from platform and quant. When neither scheme nor platform/quant is specified, defaults to gaf4_64.

User-Friendly Aliases (Recommended):

  • "4bit" / "q4" / "int4": Maps to gaf4_64 (balanced 4-bit)
  • "8bit" / "q8" / "int8": Maps to gaf8_64 (near-lossless)
  • "mlx" / "mlx4" / "gaf4": Maps to gaf4_64 (Apple Silicon optimized)
  • "mlx8" / "gaf8": Maps to gaf8_64 (8-bit MLX)
  • "fp8": Maps to fp8_e4m3 (H100/vLLM inference)

Grouped Affine (MLX compatible): gaf4_32, gaf4_64, gaf4_128, gaf8_32, gaf8_64, gaf8_128

Hardware float (not yet implemented): fp8_e4m3, fp8_e5m2, mxfp4, nvfp4

platform : str, optional

Target platform for scheme resolution ("cpu", "metal", "cuda"). When set, resolves to the appropriate scheme for that platform and quant level. Ignored if scheme is explicitly set.

quant : str, optional

Quantization level ("4bit", "8bit"). Used with platform. Defaults to "4bit" if platform is set but quant is not.

output_dir : str, optional

Parent directory for auto-named output. Defaults to ~/.cache/talu/models (or $TALU_HOME/models). Ignored if destination is set.

destination : str, optional

Explicit output path (overrides output_dir).

force : bool, default False

Overwrite existing output directory.

offline : bool, default False

If True, do not use network access when resolving model URIs.

verify : bool, default False

After conversion, verify the model by loading it and generating a few tokens. Catches corruption, missing files, and basic inference failures early.

overrides : dict, optional

Reserved for future use. Not currently supported.

max_shard_size : int | str, optional

Maximum size per shard file. When set, splits large models into multiple SafeTensors files. Can be bytes (int) or human-readable string (e.g., "5GB", "500MB").

dry_run : bool, default False

If True, estimate conversion without writing files. Returns a dict with estimation results (total_params, estimated_size_bytes, shard_count, scheme, bits_per_param).

Returns

str | dict

When dry_run=False: absolute path to the converted model directory.
When dry_run=True: dict with estimation results.

Raises
ConvertError

If conversion fails (network error, unsupported format, etc.), or if verification fails when verify=True.

ValueError

If invalid scheme or override is provided.

Examples
>>> import talu
>>> path = talu.convert("Qwen/Qwen3-0.6B")  # Uses gaf4_64 by default
>>> path = talu.convert("Qwen/Qwen3-0.6B", scheme="gaf4_32")  # Higher quality
>>> # Platform-aware conversion
>>> path = talu.convert("Qwen/Qwen3-0.6B", platform="metal")  # → gaf4_64
>>> # Sharded output for large models
>>> path = talu.convert("meta-llama/Llama-3-70B", max_shard_size="5GB")

See Also

list_schemes : List all available quantization schemes.
verify : Verify a converted model.