Converter

Model Conversion API

Classes

Class Description
VerificationResult Result of model verification.
ModelInfo Model architecture and configuration information.

Functions

Function Description
convert() Convert a model to an optimized format for efficient inference.
verify() Verify a model can load and generate text.
list_schemes() List available quantization schemes with descriptions and aliases.
describe() Get model architecture and configuration information.

class VerificationResult

VerificationResult(
    self,
    success: bool,
    model_path: str,
    output: str = '',
    tokens_generated: int = 0,
    error: str | None = None
)

Result of model verification.

Attributes
success : bool

Whether verification passed.

model_path : str

Path to the verified model.

output : str

Generated output from the test prompt.

tokens_generated : int

Number of tokens successfully generated.

error : str | None

Error message if verification failed.

Example
>>> result = VerificationResult(success=True, model_path="/path/to/model")
>>> bool(result)
True
>>> result.success
True
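
As the example suggests, truthiness mirrors success, so a failed result can be handled the same way; the error string below is illustrative:

>>> failed = VerificationResult(
...     success=False,
...     model_path="/path/to/model",
...     error="missing tokenizer.json",
... )
>>> bool(failed)
False
>>> failed.error
'missing tokenizer.json'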

class ModelInfo

ModelInfo(
    self,
    vocab_size: int,
    hidden_size: int,
    num_layers: int,
    num_heads: int,
    num_kv_heads: int,
    intermediate_size: int,
    max_seq_len: int,
    head_dim: int,
    rope_theta: float,
    norm_eps: float,
    quant_bits: int,
    quant_group_size: int,
    model_type: str | None,
    architecture: str | None,
    tie_word_embeddings: bool,
    use_gelu: bool,
    num_experts: int,
    experts_per_token: int
)

Model architecture and configuration information.

This class provides a read-only view of model metadata extracted from config.json without loading the model weights. Useful for pre-flight checks before conversion and for understanding model structure.

Attributes
vocab_size : int

Vocabulary size

hidden_size : int

Hidden dimension (d_model)

num_layers : int

Number of transformer layers

num_heads : int

Number of attention heads

num_kv_heads : int

Number of key-value heads (for GQA)

intermediate_size : int

FFN intermediate dimension

max_seq_len : int

Maximum sequence length

head_dim : int

Dimension per attention head

rope_theta : float

RoPE base frequency

norm_eps : float

Layer norm epsilon

quant_bits : int

Quantization bits (4, 8, or 16)

quant_group_size : int

Quantization group size

model_type : str or None

Model type string (e.g., "qwen3", "llama")

architecture : str or None

Architecture class name

tie_word_embeddings : bool

Whether embeddings are tied

use_gelu : bool

Whether GELU activation is used

num_experts : int

Number of MoE experts (0 if not MoE)

experts_per_token : int

Experts used per token

Example
>>> from talu.converter import describe
>>> info = describe("Qwen/Qwen3-0.6B")  # doctest: +SKIP
>>> info.num_layers  # doctest: +SKIP
28
>>> info.is_quantized  # doctest: +SKIP
False

Quick Reference

Properties

Name Type
is_moe bool
is_quantized bool

Properties

is_moe: bool

Whether the model uses Mixture of Experts.

is_quantized: bool

Whether the model is quantized (not fp16).
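
These flags support quick pre-flight checks before conversion. A minimal sketch using the documented describe and convert functions (the model ID is illustrative):

>>> from talu.converter import convert, describe
>>> info = describe("Qwen/Qwen3-0.6B")  # doctest: +SKIP
>>> if not (info.is_quantized or info.is_moe):  # doctest: +SKIP
...     path = convert("Qwen/Qwen3-0.6B", scheme="gaf4_64")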


def convert

talu.converter.convert(
    model: str,
    scheme: str | None = None,
    platform: str | None = None,
    quant: str | None = None,
    output_dir: str | None = None,
    destination: str | None = None,
    force: bool = False,
    offline: bool = False,
    verify: bool = False,
    overrides: dict[str, str] | None = None,
    max_shard_size: int | str | None = None,
    dry_run: bool = False
) -> str | dict

Convert a model to an optimized format for efficient inference.

Parameters
model : str

Model to convert. Can be:

  • Model ID: "Qwen/Qwen3-0.6B", "meta-llama/Llama-3-8B"
  • Local path: "./my-model" or "/path/to/model"

scheme : str, optional

Explicit quantization scheme. Each scheme encodes all necessary parameters (method, bits, group_size). You can pass either an exact scheme name or one of the aliases below.

If not set, platform and quant are used for automatic resolution. When neither scheme nor platform/quant is specified, defaults to gaf4_64.

User-Friendly Aliases (Recommended):

  • "4bit" / "q4" / "int4": Maps to gaf4_64 (balanced 4-bit)
  • "8bit" / "q8" / "int8": Maps to gaf8_64 (near-lossless)
  • "mlx" / "mlx4" / "gaf4": Maps to gaf4_64 (Apple Silicon optimized)
  • "mlx8" / "gaf8": Maps to gaf8_64 (8-bit MLX)
  • "fp8": Maps to fp8_e4m3 (H100/vLLM inference)

Grouped Affine (MLX compatible): gaf4_32, gaf4_64, gaf4_128, gaf8_32, gaf8_64, gaf8_128

Hardware float (not yet implemented): fp8_e4m3, fp8_e5m2, mxfp4, nvfp4

platform : str, optional

Target platform for scheme resolution ("cpu", "metal", "cuda"). When set, resolves to the appropriate scheme for that platform and quant level. Ignored if scheme is explicitly set.

quant : str, optional

Quantization level ("4bit", "8bit"). Used with platform. Defaults to "4bit" if platform is set but quant is not.

output_dir : str, optional

Parent directory for auto-named output. Defaults to ~/.cache/talu/models (or $TALU_HOME/models). Ignored if destination is set.

destination : str, optional

Explicit output path (overrides output_dir).

force : bool, default False

Overwrite existing output directory.

offline : bool, default False

If True, do not use network access when resolving model URIs.

verify : bool, default False

After conversion, verify the model by loading it and generating a few tokens. Catches corruption, missing files, and basic inference failures early.

overrides : dict, optional

Reserved for future use. Not currently supported.

max_shard_size : int | str, optional

Maximum size per shard file. When set, splits large models into multiple SafeTensors files. Can be a byte count (int) or a human-readable string (e.g., "5GB", "500MB").

dry_run : bool, default False

If True, estimate conversion without writing files. Returns a dict with estimation results (total_params, estimated_size_bytes, shard_count, scheme, bits_per_param).

Returns

str | dict

When dry_run=False: absolute path to the converted model directory. When dry_run=True: dict with estimation results.

Raises
ConvertError

If conversion fails (network error, unsupported format, etc.), or if verification fails when verify=True.

ValueError

If an invalid scheme or override is provided.

Examples
>>> import talu
>>> path = talu.convert("Qwen/Qwen3-0.6B")  # Uses gaf4_64 by default
>>> path = talu.convert("Qwen/Qwen3-0.6B", scheme="gaf4_32")  # Higher quality
>>> # Platform-aware conversion
>>> path = talu.convert("Qwen/Qwen3-0.6B", platform="metal")  # → gaf4_64
>>> # Sharded output for large models
>>> path = talu.convert("meta-llama/Llama-3-70B", max_shard_size="5GB")

See Also

list_schemes : List all available quantization schemes.
verify : Verify a converted model.


def verify

talu.converter.verify(
    model_path: str,
    prompt: str | None = None,
    max_tokens: int = 5
) -> VerificationResult

Verify a model can load and generate text.

Performs a quick sanity check by loading the model and generating a few tokens. This catches corruption, missing files, and basic inference failures early.

Parameters
model_path : str

Path to the model directory to verify.

prompt : str, optional

Custom prompt to use. Defaults to "The capital of France is".

max_tokens : int, default 5

Number of tokens to generate. Keep small for speed.

Returns

VerificationResult

Result with success status, output, and any error message.

Examples

Basic verification:

>>> result = talu.verify("./models/qwen3-q4")
>>> if result:
...     print(f"OK: generated {result.tokens_generated} tokens")
... else:
...     print(f"FAILED: {result.error}")

With custom prompt:

>>> result = talu.verify(
...     "./models/qwen3-q4",
...     prompt="2 + 2 =",
...     max_tokens=3,
... )
>>> print(result.output)  # Should be " 4" or similar
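
Verifying right after a conversion (a sketch; unlike convert(..., verify=True), which raises ConvertError on failure, this returns the result object for inspection):

>>> path = talu.convert("Qwen/Qwen3-0.6B")  # doctest: +SKIP
>>> result = talu.verify(path)  # doctest: +SKIP
>>> result.success  # doctest: +SKIP
True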

def list_schemes

talu.converter.list_schemes(
    include_unimplemented: bool = False,
    category: str | None = None
) -> dict[str, dict]

List available quantization schemes with descriptions and aliases.

Returns detailed information about each scheme to help users make informed decisions about which scheme to use. Fetches live alias information from the core runtime (Zig) to ensure consistency.

Parameters
include_unimplemented : bool, default False

If True, include schemes that are not yet implemented (fp8_e4m3, fp8_e5m2, mxfp4, nvfp4).

category : str, optional

Filter by category: "gaf" or "hardware". If None, returns all categories.

Returns

dict

Dictionary mapping scheme names to their metadata:

  • category: "gaf" or "hardware"
  • bits: Bit width
  • group_size: (gaf/hardware only) Group size
  • description: What this scheme does
  • quality: Relative quality (fair/good/better/high/near-lossless/lossless)
  • size: Approximate size for a 7B model
  • status: "stable" or "not implemented"
  • mlx_compatible: (gaf only) True if compatible with MLX
  • aliases: List of user-friendly aliases (e.g., ["4bit", "q4"])

Example
>>> import talu
>>> schemes = talu.list_schemes()
>>> "gaf4_64" in schemes
True
>>> "description" in schemes["gaf4_64"]
True
>>> "4bit" in schemes["gaf4_64"]["aliases"]
True
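
A sketch that shortlists stable, MLX-compatible schemes using the documented metadata keys:

>>> for name, meta in talu.list_schemes(category="gaf").items():  # doctest: +SKIP
...     if meta["status"] == "stable" and meta["mlx_compatible"]:
...         print(name, meta["bits"], meta["quality"], meta["size"])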

See Also

convert : Convert a model using one of these schemes.


def describe

talu.converter.describe(model: str) -> ModelInfo

Get model architecture and configuration information.

Reads config.json without loading model weights. This is useful for pre-flight checks before conversion or understanding model structure.

Parameters
model : str

Path to model directory or HuggingFace model ID.

Returns

ModelInfo

Object containing model configuration details.

Raises
ModelError

If model cannot be loaded or parsed.

Example
>>> from talu.converter import describe
>>> info = describe("Qwen/Qwen3-0.6B")  # doctest: +SKIP
>>> info.num_layers > 0  # doctest: +SKIP
True
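
Derived quantities follow directly from these attributes; for example, the GQA ratio (query heads per key-value head). The printed value is illustrative:

>>> info.num_heads // info.num_kv_heads  # doctest: +SKIP
2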