Top-Level API
Talu - Fast LLM inference in pure Python
Functions
| Function | Description |
|---|---|
| ask() | One-shot question-answer with automatic resource cleanup. |
| list_sessions() | List sessions in a profile. |
| convert() | Convert a model to an optimized format for efficient inference. |
def ask
talu.ask(
    model: str,
    prompt: str,
    system: str | None = None,
    config: GenerationConfig | None = None,
    **kwargs: Any
) → Response
One-shot question-answer with automatic resource cleanup.
This is the simplest way to get a single response from a model. Resources are automatically released when the function returns, preventing memory leaks in loops.
Performance: This function loads the model on every call.
For repeated queries, use Client or Chat instead to avoid paying model load time on each call:
# SLOW: Loads model 100 times
for q in questions:
    r = talu.ask("Qwen/Qwen3-0.6B", q)  # ~2-5s load + generation

# FAST: Loads model once
chat = talu.Chat("Qwen/Qwen3-0.6B")
for q in questions:
    r = chat(q)  # Just generation time
The returned Response is "detached" - calling .append() on it will raise an error. For multi-turn conversations, use talu.Chat() instead.
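The detached behaviour in practice (a sketch; the exact exception type raised by a detached .append() is not specified here):
>>> response = talu.ask("Qwen/Qwen3-0.6B", "Hello!")
>>> try:
...     response.append("Tell me more")
... except Exception as exc:  # detached responses cannot be appended to
...     print("detached:", exc)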
Parameters
model: Model identifier (local path, HuggingFace ID, or URI).
prompt: The message to send (required).
system: Optional system prompt.
config: Generation configuration.
**kwargs: Generation overrides (temperature, max_tokens, etc.).
Returns
Response object (str-able, with metadata access). Cannot be replied to.
Raises
ModelError: If the model cannot be loaded.
GenerationError: If generation fails.
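Example - Handling errors (a sketch; it assumes ModelError and GenerationError are exposed on the top-level talu namespace)
>>> try:
...     response = talu.ask("Qwen/Qwen3-0.6B", "What is 2+2?")
... except talu.ModelError as exc:
...     print("model could not be loaded:", exc)
... except talu.GenerationError as exc:
...     print("generation failed:", exc)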
Example - Simple question
>>> response = talu.ask("Qwen/Qwen3-0.6B", "What is 2+2?")
>>> print(response)
4
Example - With system prompt
>>> response = talu.ask(
... "Qwen/Qwen3-0.6B",
... "Hello!",
... system="You are a pirate.",
... )
>>> print(response)
Ahoy there, matey!
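Example - Generation overrides via keyword arguments (a sketch; temperature and max_tokens are the overrides named above, other keys are not assumed)
>>> response = talu.ask(
...     "Qwen/Qwen3-0.6B",
...     "Write a haiku about the sea.",
...     temperature=0.2,
...     max_tokens=64,
... )
>>> print(response)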
Example - Safe in loops (no resource leak)
>>> for question in questions:
...     response = talu.ask(model, question)  # Auto-cleans each iteration
...     results.append(str(response))
For repeated use, prefer Chat (loads model once):
>>> chat = talu.Chat("Qwen/Qwen3-0.6B")
>>> for question in questions:
...     response = chat(question)  # Fast: model already loaded
...     results.append(str(response))
Example - Conversations with append
>>> chat = talu.Chat("Qwen/Qwen3-0.6B")
>>> r1 = chat("Hello!")
>>> r2 = r1.append("Tell me more") # Works because Chat is attached
def list_sessions
talu.list_sessions(
    profile: str | None = None,
    search: str | None = None,
    limit: int = 50
) → list[dict]
List sessions in a profile.
Parameters
profile: Profile name, or None for the default profile.
search: Filter sessions by text content.
limit: Maximum number of sessions to return.
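Example (a sketch; the keys of each session dict are not documented here, so entries are printed whole):
>>> import talu
>>> for session in talu.list_sessions(search="pirate", limit=10):
...     print(session)  # each entry is a dict describing one session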
def convert
talu.convert(
    model: str,
    scheme: str | None = None,
    platform: str | None = None,
    quant: str | None = None,
    output_dir: str | None = None,
    destination: str | None = None,
    force: bool = False,
    offline: bool = False,
    verify: bool = False,
    overrides: dict[str, str] | None = None,
    max_shard_size: int | str | None = None,
    dry_run: bool = False
) → str | dict
Convert a model to an optimized format for efficient inference.
Parameters
model: str
    Model to convert. Can be:
    - Model ID: "Qwen/Qwen3-0.6B", "meta-llama/Llama-3-8B"
    - Local path: "./my-model" or "/path/to/model"
scheme: str, optional
    Explicit quantization scheme. Each scheme encodes all necessary parameters (method, bits, group_size). You can use specific keys or aliases. If not set, uses platform and quant for automatic resolution. When neither scheme nor platform/quant is specified, defaults to gaf4_64.
    User-friendly aliases (recommended):
    - "4bit" / "q4" / "int4": maps to gaf4_64 (balanced 4-bit)
    - "8bit" / "q8" / "int8": maps to gaf8_64 (near-lossless)
    - "mlx" / "mlx4" / "gaf4": maps to gaf4_64 (Apple Silicon optimized)
    - "mlx8" / "gaf8": maps to gaf8_64 (8-bit MLX)
    - "fp8": maps to fp8_e4m3 (H100/vLLM inference)
    Grouped affine (MLX compatible): gaf4_32, gaf4_64, gaf4_128, gaf8_32, gaf8_64, gaf8_128
    Hardware float (not yet implemented): fp8_e4m3, fp8_e5m2, mxfp4, nvfp4
platform: str, optional
    Target platform for scheme resolution ("cpu", "metal", "cuda"). When set, resolves to the appropriate scheme for that platform and quant level. Ignored if scheme is explicitly set.
quant: str, optional
    Quantization level ("4bit", "8bit"). Used with platform. Defaults to "4bit" if platform is set but quant is not.
output_dir: str, optional
    Parent directory for auto-named output. Defaults to ~/.cache/talu/models (or $TALU_HOME/models). Ignored if destination is set.
destination: str, optional
    Explicit output path (overrides output_dir).
force: bool, default False
    Overwrite an existing output directory.
offline: bool, default False
    If True, do not use network access when resolving model URIs.
verify: bool, default False
    After conversion, verify the model by loading it and generating a few tokens. Catches corruption, missing files, and basic inference failures early.
overrides: dict, optional
    Reserved for future use. Not currently supported.
max_shard_size: int | str, optional
    Maximum size per shard file. When set, splits large models into multiple SafeTensors files. Can be bytes (int) or a human-readable string (e.g., "5GB", "500MB").
dry_run: bool, default False
    If True, estimate the conversion without writing files. Returns a dict with estimation results (total_params, estimated_size_bytes, shard_count, scheme, bits_per_param).
Returns
str | dict
    When dry_run=False: absolute path to the converted model directory.
    When dry_run=True: dict with estimation results.
Raises
ConvertError: If conversion fails (network error, unsupported format, etc.), or if verification fails when verify=True.
ValueError: If an invalid scheme or override is provided.
Examples
>>> import talu
>>> path = talu.convert("Qwen/Qwen3-0.6B") # Uses gaf4_64 by default
>>> path = talu.convert("Qwen/Qwen3-0.6B", scheme="gaf4_32") # Higher quality
>>> # Platform-aware conversion
>>> path = talu.convert("Qwen/Qwen3-0.6B", platform="metal") # → gaf4_64
>>> # Sharded output for large models
>>> path = talu.convert("meta-llama/Llama-3-70B", max_shard_size="5GB")
See Also
list_schemes: List all available quantization schemes.
verify: Verify a converted model.