Converter
Model Conversion API
Classes
| Class | Description |
|---|---|
| VerificationResult | Result of model verification. |
| ModelInfo | Model architecture and configuration information. |
Functions
| Function | Description |
|---|---|
| convert() | Convert a model to an optimized format for efficient inference. |
| verify() | Verify a model can load and generate text. |
| list_schemes() | List available quantization schemes with descriptions and aliases. |
| describe() | Get model architecture and configuration information. |
class VerificationResult
VerificationResult(
self,
success: bool,
model_path: str,
output: str = '',
tokens_generated: int = 0,
error: str | None = None
)
Result of model verification.
Attributes
success: bool
Whether verification passed.
model_path: str
Path to the verified model.
output: str
Generated output from the test prompt.
tokens_generated: int
Number of tokens successfully generated.
error: str | None
Error message if verification failed.
Example
>>> result = VerificationResult(success=True, model_path="/path/to/model")
>>> bool(result)
True
>>> result.success
True
class ModelInfo
ModelInfo(
self,
vocab_size: int,
hidden_size: int,
num_layers: int,
num_heads: int,
num_kv_heads: int,
intermediate_size: int,
max_seq_len: int,
head_dim: int,
rope_theta: float,
norm_eps: float,
quant_bits: int,
quant_group_size: int,
model_type: str | None,
architecture: str | None,
tie_word_embeddings: bool,
use_gelu: bool,
num_experts: int,
experts_per_token: int
)
Model architecture and configuration information.
This class provides a read-only view of model metadata extracted from config.json without loading the model weights. Useful for:
- Pre-flight checks before conversion
- Comparing model architectures
- Determining quantization status
- Checking MoE configuration
Attributes
vocab_size: int
Vocabulary size.
hidden_size: int
Hidden dimension (d_model).
num_layers: int
Number of transformer layers.
num_heads: int
Number of attention heads.
num_kv_heads: int
Number of key-value heads (for GQA).
intermediate_size: int
FFN intermediate dimension.
max_seq_len: int
Maximum sequence length.
head_dim: int
Dimension per attention head.
rope_theta: float
RoPE base frequency.
norm_eps: float
Layer norm epsilon.
quant_bits: int
Quantization bits (4, 8, or 16).
quant_group_size: int
Quantization group size.
model_type: str or None
Model type string (e.g., "qwen3", "llama").
architecture: str or None
Architecture class name.
tie_word_embeddings: bool
Whether embeddings are tied.
use_gelu: bool
Whether GELU activation is used.
num_experts: int
Number of MoE experts (0 if not MoE).
experts_per_token: int
Experts used per token.
Example
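A minimal sketch, assuming describe() (documented below) is used to obtain the ModelInfo; attribute values depend on the model:
>>> import talu
>>> info = talu.describe("Qwen/Qwen3-0.6B")
>>> info.hidden_size > 0  # architecture fields are plain ints
True
>>> info.is_moe  # dense model: no experts configured
False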
Quick Reference
Properties
| Name | Type |
|---|---|
| is_moe | bool |
| is_quantized | bool |
Properties
is_moe: bool
Whether the model uses Mixture of Experts.
is_quantized: bool
Whether the model is quantized (not fp16).
def convert
talu.converter.convert(
model: str,
scheme: str | None = None,
platform: str | None = None,
quant: str | None = None,
output_dir: str | None = None,
destination: str | None = None,
force: bool = False,
offline: bool = False,
verify: bool = False,
overrides: dict[str, str] | None = None,
max_shard_size: int | str | None = None,
dry_run: bool = False
) → str | dict
Convert a model to an optimized format for efficient inference.
Parameters
model: str
Model to convert. Can be:
- Model ID: "Qwen/Qwen3-0.6B", "meta-llama/Llama-3-8B"
- Local path: "./my-model" or "/path/to/model"
scheme: str, optional
Explicit quantization scheme. Each scheme encodes all necessary parameters (method, bits, group_size). You can use specific keys or aliases. If not set, uses platform and quant for automatic resolution. When neither scheme nor platform/quant is specified, defaults to gaf4_64.
User-Friendly Aliases (Recommended):
- "4bit" / "q4" / "int4": Maps to gaf4_64 (balanced 4-bit)
- "8bit" / "q8" / "int8": Maps to gaf8_64 (near-lossless)
- "mlx" / "mlx4" / "gaf4": Maps to gaf4_64 (Apple Silicon optimized)
- "mlx8" / "gaf8": Maps to gaf8_64 (8-bit MLX)
- "fp8": Maps to fp8_e4m3 (H100/vLLM inference)
Grouped Affine (MLX compatible): gaf4_32, gaf4_64, gaf4_128, gaf8_32, gaf8_64, gaf8_128
Hardware float (not yet implemented): fp8_e4m3, fp8_e5m2, mxfp4, nvfp4
platform: str, optional
Target platform for scheme resolution ("cpu", "metal", "cuda"). When set, resolves to the appropriate scheme for that platform and quant level. Ignored if scheme is explicitly set.
quant: str, optional
Quantization level ("4bit", "8bit"). Used with platform. Defaults to "4bit" if platform is set but quant is not.
output_dir: str, optional
Parent directory for auto-named output. Defaults to ~/.cache/talu/models (or $TALU_HOME/models). Ignored if destination is set.
destination: str, optional
Explicit output path (overrides output_dir).
force: bool, default False
Overwrite an existing output directory.
offline: bool, default False
If True, do not use network access when resolving model URIs.
verify: bool, default False
After conversion, verify the model by loading it and generating a few tokens. Catches corruption, missing files, and basic inference failures early.
overrides: dict, optional
Reserved for future use. Not currently supported.
max_shard_size: int | str, optional
Maximum size per shard file. When set, splits large models into multiple SafeTensors files. Can be bytes (int) or a human-readable string (e.g., "5GB", "500MB").
dry_run: bool, default False
If True, estimate the conversion without writing files. Returns a dict with estimation results (total_params, estimated_size_bytes, shard_count, scheme, bits_per_param).
Returns
str | dict
When dry_run=False: absolute path to the converted model directory. When dry_run=True: dict with estimation results.
Raises
ConvertError
If conversion fails (network error, unsupported format, etc.), or if verification fails when verify=True.
ValueError
If an invalid scheme or override is provided.
Examples
>>> import talu
>>> path = talu.convert("Qwen/Qwen3-0.6B") # Uses gaf4_64 by default
>>> path = talu.convert("Qwen/Qwen3-0.6B", scheme="gaf4_32") # Higher quality
>>> # Platform-aware conversion
>>> path = talu.convert("Qwen/Qwen3-0.6B", platform="metal") # → gaf4_64
>>> # Sharded output for large models
>>> path = talu.convert("meta-llama/Llama-3-70B", max_shard_size="5GB")
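A sketch of alias and dry-run usage; it assumes the alias mapping above and that the dry-run dict exposes the keys listed under dry_run:
>>> path = talu.convert("Qwen/Qwen3-0.6B", scheme="4bit")  # alias for gaf4_64
>>> est = talu.convert("Qwen/Qwen3-0.6B", dry_run=True)  # no files written
>>> est["shard_count"] >= 1 and est["total_params"] > 0
True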
See Also
list_schemes: List all available quantization schemes.
verify: Verify a converted model.
def verify
talu.converter.verify(
model_path: str,
prompt: str | None = None,
max_tokens: int = 5
) → VerificationResult
Verify a model can load and generate text.
Performs a quick sanity check by loading the model and generating a few tokens. This catches corruption, missing files, and basic inference failures early.
Parameters
model_path: str
Path to the model directory to verify.
prompt: str, optional
Custom prompt to use. Defaults to "The capital of France is".
max_tokens: int, default 5
Number of tokens to generate. Keep small for speed.
Returns
VerificationResult
Result with success status, output, and any error message.
Examples
Basic verification:
>>> result = talu.verify("./models/qwen3-q4")
>>> if result:
... print(f"OK: generated {result.tokens_generated} tokens")
... else:
... print(f"FAILED: {result.error}")
With custom prompt:
>>> result = talu.verify(
... "./models/qwen3-q4",
... prompt="2 + 2 =",
... max_tokens=3,
... )
>>> print(result.output) # Should be " 4" or similar
def list_schemes
talu.converter.list_schemes(include_unimplemented: bool = False, category: str | None = None) → dict[str, dict]
List available quantization schemes with descriptions and aliases.
Returns detailed information about each scheme to help users make informed decisions about which scheme to use. Fetches live alias information from the core runtime (Zig) to ensure consistency.
Parameters
include_unimplemented: bool, default False
If True, include schemes that are not yet implemented (fp8_e4m3, fp8_e5m2, mxfp4, nvfp4).
category: str, optional
Filter by category: "gaf" or "hardware". If None, returns all categories.
Returns
dict
Dictionary mapping scheme names to their metadata:
- category: "gaf" or "hardware"
- bits: Bit width
- group_size: (gaf/hardware only) Group size
- description: What this scheme does
- quality: Relative quality (fair/good/better/high/near-lossless/lossless)
- size: Approximate size for a 7B model
- status: "stable" or "not implemented"
- mlx_compatible: (gaf only) True if compatible with MLX
- aliases: List of user-friendly aliases (e.g., ["4bit", "q4"])
Example
>>> import talu
>>> schemes = talu.list_schemes()
>>> "gaf4_64" in schemes
True
>>> "description" in schemes["gaf4_64"]
True
>>> "4bit" in schemes["gaf4_64"]["aliases"]
True
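Filtering by category, as a sketch; it assumes each entry's category field matches the filter, per the Returns description above:
>>> gaf = talu.list_schemes(category="gaf")
>>> all(meta["category"] == "gaf" for meta in gaf.values())
True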
See Also
convert: Convert a model using one of these schemes.
def describe
talu.converter.describe(model: str) → ModelInfo
Get model architecture and configuration information.
Reads config.json without loading model weights. This is useful for pre-flight checks before conversion or understanding model structure.
Parameters
model: str
Path to model directory or HuggingFace model ID.
Returns
ModelInfo
Object containing model configuration details.
Raises
ModelError
If the model cannot be loaded or parsed.
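Example
A pre-flight sketch combining describe() with a dry-run convert(); the control flow is illustrative, not part of the API:
>>> import talu
>>> info = talu.describe("./my-model")
>>> if not info.is_quantized:  # skip models that are already quantized
...     est = talu.convert("./my-model", dry_run=True)
...     print(f"~{est['estimated_size_bytes'] / 1e9:.1f} GB at {est['bits_per_param']} bits/param")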