Tokenizer
Tokenizer module - Text encoding and decoding
Classes
| Class | Description |
|---|---|
| Tokenizer | Text-to-token encoder and token-to-text decoder. |
| TokenArray | Sequence of token IDs with zero-copy NumPy and DLPack support. |
| TokenArrayView | Lightweight view into a BatchEncoding's contiguous buffer. |
| TokenOffset | Byte offset pair mapping a token to its position in source text. |
| BatchEncoding | Lazy container for batch tokenization results. |
class Tokenizer
Tokenizer(
    self,
    model: str,
    padding_side: str = 'left',
    truncation_side: str = 'right'
)
Text-to-token encoder and token-to-text decoder.
Converts text into token IDs from the model's vocabulary and back. Thread-safe after construction.
Attributes
- model_path (str): Resolved path to the model directory.
- vocab_size (int): Number of tokens in the vocabulary.
- eos_token_ids (list[int]): Token IDs that signal end of generation.
- bos_token_id (int | None): Beginning-of-sequence token ID.
- pad_token_id (int | None): Padding token ID.
- padding_side (str): Side for padding ("left" or "right").
Example
>>> tokenizer = Tokenizer("Qwen/Qwen3-0.6B")
>>> tokens = tokenizer.encode("Hello world")
>>> text = tokenizer.decode(tokens)
Quick Reference
Properties
| Name | Type |
|---|---|
| bos_token | str \| None |
| bos_token_id | int \| None |
| eos_token_ids | tuple[int, ...] |
| eos_tokens | list[str] |
| model_max_length | int |
| model_path | str |
| pad_token | str \| None |
| pad_token_id | int \| None |
| padding_side | str |
| special_ids | frozenset[int] |
| truncation_side | str |
| unk_token | str \| None |
| unk_token_id | int \| None |
| vocab_size | int |
Methods
| Method | Description |
|---|---|
| __call__() | Callable interface for tokenization with zero-copy tensor access. |
| apply_chat_template() | Format a conversation using the model's chat template. |
| close() | Release native tokenizer resources. |
| convert_ids_to_tokens() | Convert a list of token IDs to their string representations. |
| convert_tokens_to_ids() | Convert a list of token strings to their IDs. |
| count_tokens() | Count the number of tokens in text. |
| decode() | Convert token IDs back to text. |
| encode() | Convert text to token IDs. |
| from_json() | Create a tokenizer directly from JSON content. |
| get_vocab() | Get the complete vocabulary as a dictionary. |
| id_to_token() | Get the string representation of a token ID. |
| is_special_id() | Check if a token ID is a special token. |
| primary_eos_token_id() | Get the primary EOS token ID for insertion. |
| token_to_id() | Get the ID of a token string. |
| tokenize() | Split text into token strings. |
Properties
bos_token: str | None
String representation of the BOS token.
Example
>>> tokenizer.bos_token # Llama 3.2
'<|begin_of_text|>'
bos_token_id: int | None
Beginning-of-sequence token ID.
Returns None if the model doesn't use a BOS token (e.g., Qwen3).
Example
>>> tokenizer.bos_token_id # Llama 3.2
128000
>>> tokenizer.bos_token_id # Qwen3
None
eos_token_ids: tuple[int, ...]
Token IDs that signal end of generation.
Returns an immutable, ordered tuple of deduplicated EOS token IDs. Many models have multiple EOS tokens (e.g., Qwen, Llama 3, Gemma 3).
Returns
Tuple of EOS token IDs. May be empty if no EOS tokens are configured.
Example
>>> tokenizer.eos_token_ids
(151643, 151644, 151645)
eos_tokens: list[str]
String representations of all EOS tokens.
Returns
List of EOS token strings.
Example
>>> tokenizer.eos_tokens
['<|endoftext|>', '<|im_end|>', '<|end|>']
model_max_length: int
Maximum sequence length the model supports.
This value is read from tokenizer_config.json (model_max_length field). It represents the maximum context length the model was trained with.
When truncation=True is used without an explicit max_length, this value is used as the default truncation limit.
Returns 0 if the model does not specify a maximum length.
Example
>>> tokenizer.model_max_length
32768
model_path: str
The resolved path to the model directory.
pad_token: str | None
String representation of the padding token.
pad_token_id: int | None
Padding token ID.
padding_side: str
Side for padding: "left" (default, for generation) or "right".
Raises
- ValidationError: If set to a value other than "left" or "right".
special_ids: frozenset[int]
Immutable set of all special token IDs.
Includes EOS, BOS, UNK, and PAD tokens. Use for fast O(1) membership testing.
Returns
Frozen set of all special token IDs.
Example
>>> 128001 in tokenizer.special_ids
True
truncation_side: str
Side for truncation: "right" (default, keeps beginning) or "left" (keeps end).
Raises
- ValidationError: If set to a value other than "left" or "right".
unk_token: str | None
String representation of the unknown token.
unk_token_id: int | None
Unknown token ID.
vocab_size: int
Number of tokens in the vocabulary.
Methods
def __call__(
    self,
    text: str | list[str],
    text_pair: str | list[str] | None = None,
    special_tokens: bool | set[str] = True,
    truncation: bool | str = False,
    max_length: int | None = None,
    return_tensors: str | None = None,
    **kwargs: Any
) → BatchEncoding
Callable interface for tokenization with zero-copy tensor access.
Returns a BatchEncoding object that provides dict-like access to input_ids and attention_mask via the DLPack protocol. This enables zero-copy transfer to PyTorch, JAX, or NumPy.
Padding is applied automatically when exporting to tensors. Control the padding side via tokenizer.padding_side or batch.padding_side.
Parameters
- text: Text(s) to tokenize. Single string is wrapped as a batch of 1.
- text_pair: Not supported.
- special_tokens: Control special token insertion. Can be:
  - True: Add all special tokens (BOS, EOS) - default
  - False: No special tokens
  - {"bos"}, {"eos"}, {"bos", "eos"}: Granular control
- truncation: False or True.
- max_length: Maximum sequence length.
- return_tensors: Ignored (use DLPack instead).
Returns
BatchEncoding with a dict-like interface for zero-copy tensor access:
- batch["input_ids"] → DLPack-compatible accessor
- batch["attention_mask"] → DLPack-compatible accessor
Raises
- NotImplementedError: If text_pair is provided.
- ValidationError: If truncation=True but no max_length is specified.
Example
>>> batch = tokenizer(["Hello", "World!"])
>>> input_ids = torch.from_dlpack(batch["input_ids"])
>>> attention_mask = torch.from_dlpack(batch["attention_mask"])
>>> model(input_ids, attention_mask=attention_mask)
For Python list output (debugging only), use batch.to_list().
def apply_chat_template(
    self,
    messages: list,
    add_generation_prompt: bool = True,
    tokenize: bool = False
) → str | TokenArray | BatchEncoding
Format a conversation using the model's chat template.
Parameters
- messages: List of message dicts with 'role' and 'content'.
- add_generation_prompt: If True, add assistant turn marker.
- tokenize: If True, return TokenArray instead of string.
Returns
Formatted prompt string, or TokenArray if tokenize=True.
Raises
- StateError: If no chat template is available (neither from the model nor from_json()).
When tokenize=True, BOS tokens are handled automatically:
- If the chat template includes BOS in output, it won't be doubled
- EOS is not added (chat templates use turn markers instead)
This prevents the "double-BOS" bug where models like Llama-3 would receive two BOS tokens (one from the template, one from encode), which can degrade generation quality.
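Example (a sketch of the call pattern; the exact prompt text depends on the model's chat template)
>>> messages = [
...     {"role": "system", "content": "You are a helpful assistant."},
...     {"role": "user", "content": "Hi"},
... ]
>>> prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
>>> tokens = tokenizer.apply_chat_template(messages, tokenize=True)  # TokenArray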
def close(self) → None
Release native tokenizer resources.
After calling close(), the tokenizer cannot be used. Safe to call multiple times (idempotent).
Raises
None. This method never raises.
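Example (illustrative usage)
>>> tokenizer = Tokenizer("Qwen/Qwen3-0.6B")
>>> tokens = tokenizer.encode("Hello")
>>> tokenizer.close()  # release native resources
>>> tokenizer.close()  # idempotent; safe to call again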
def convert_ids_to_tokens(self, ids: list[int]) → list[str | None]
Convert a list of token IDs to their string representations.
Parameters
- ids: Token IDs to convert.
def convert_tokens_to_ids(self, tokens: list[str]) → list[int | None]
Convert a list of token strings to their IDs.
Parameters
- tokens: Token strings to convert.
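Example (a sketch; the token strings and IDs are illustrative and model-dependent, and entries missing from the vocabulary are assumed to map to None)
>>> tokenizer.convert_tokens_to_ids(['Hello', ' world'])
[9707, 1879]
>>> tokenizer.convert_ids_to_tokens([9707, 1879])
['Hello', ' world']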
def count_tokens(
    self,
    text: str,
    special_tokens: bool | set[str] = True
) → int
Count the number of tokens in text.
This returns the exact token count that would be used in generation, including BOS/EOS tokens by default. Use this to check if prompts fit within context windows.
Parameters
- text: Text to count tokens for.
- special_tokens: Control special token counting. Can be:
  - True (default): Include all special tokens (matches generation)
  - False: Count only content tokens
  - {"bos"}, {"eos"}, {"bos", "eos"}: Include specific tokens
Returns
Number of tokens.
Example
>>> # Check if prompt fits in context window
>>> tokens = tokenizer.count_tokens(prompt)
>>> if tokens > 4096:
... print("Prompt too long!")
>>> # Count without special tokens
>>> content_tokens = tokenizer.count_tokens(text, special_tokens=False)
def decode(
    self,
    tokens: TokenArray | list[int],
    num_tokens: int | None = None,
    skip_special_tokens: bool = True
) → str
Convert token IDs back to text.
Parameters
- tokens: Token IDs to decode.
- num_tokens: Number of tokens (only for raw pointers).
- skip_special_tokens: If True, omit special tokens from output.
Returns
Decoded text string.
Raises
- ValidationError: If num_tokens is required but not provided.
- TokenizerError: If decoding fails.
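Example (illustrative; the token IDs shown match the encode() examples and depend on the model's vocabulary)
>>> tokens = tokenizer.encode("Hello world")
>>> tokenizer.decode(tokens)
'Hello world'
>>> tokenizer.decode([9707, 1879])
'Hello world'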
def encode(
    self,
    text: str | list[str],
    special_tokens: bool | set[str] = True,
    truncation: bool = False,
    max_length: int | None = None,
    truncation_side: str | None = None
) → TokenArray | BatchEncoding
Convert text to token IDs.
Parameters
- text: Text to tokenize. String returns TokenArray, list returns BatchEncoding.
- special_tokens: Control special token insertion. Can be:
  - True: Add all special tokens (BOS, EOS) - default
  - False: No special tokens (raw tokenization)
  - {"bos"}: Add only BOS token
  - {"eos"}: Add only EOS token
  - {"bos", "eos"}: Add both (same as True)
- truncation: If True, truncate to max_length. When max_length is not specified, uses model_max_length from tokenizer_config.json.
- max_length: Maximum sequence length. If None and truncation=True, uses model_max_length.
- truncation_side: "left" or "right". Overrides the tokenizer default.
  - "right" (default): Keep beginning, truncate end
  - "left": Keep end, truncate beginning (useful for RAG)
Returns
TokenArray for single string, BatchEncoding for list.
Raises
- ValidationError: If special_tokens is not bool or set[str], or if text is not str or list[str].
- TokenizerError: If encoding fails.
- MemoryError: If buffer allocation fails.
Special token behavior depends on the model's tokenizer configuration:
- Models with postprocessor (BERT, RoBERTa): BOS/EOS are added via the tokenizer's postprocessor when special_tokens=True.
- Chat models (Llama 3, Qwen3, Gemma): Special tokens are typically added via chat templates, not the postprocessor. For these models, special_tokens=True may not add BOS/EOS to raw text. Use apply_chat_template() for proper formatting.
- If bos_token_id is None (e.g., Qwen3), requesting BOS is a no-op. Check tokenizer.bos_token_id to verify special token availability.
For batch encoding, padding_side is inherited from the tokenizer's padding_side property (default "left" for generation models). To override for a specific batch, set the property on the result:
batch = tokenizer.encode(["Hello", "World"])
batch.padding_side = "right" # Override before converting
tensor = torch.from_dlpack(batch["input_ids"])
Example
>>> # Default: add all special tokens
>>> tokens = tokenizer.encode("Hello world")
>>> # No special tokens
>>> tokens = tokenizer.encode("Hello world", special_tokens=False)
>>> # BOS only (useful for document snippets)
>>> tokens = tokenizer.encode("Document...", special_tokens={"bos"})
>>> # With truncation (uses model_max_length if no explicit max_length)
>>> tokens = tokenizer.encode(long_text, truncation=True)
>>> # With explicit max_length
>>> tokens = tokenizer.encode(text, truncation=True, max_length=512)
>>> # Left truncation for RAG (keep recent context)
>>> tokens = tokenizer.encode(text, truncation=True, max_length=512,
... truncation_side="left")
>>> # Check if model has BOS token
>>> if tokenizer.bos_token_id is not None:
... print(f"BOS token: {tokenizer.bos_token}")
def from_json(
    cls,
    json_content: str | bytes,
    chat_template: str | None = None,
    bos_token: str = '',
    eos_token: str = '',
    padding_side: str = 'left',
    truncation_side: str = 'right'
) → Tokenizer
Create a tokenizer directly from JSON content.
Creates a standalone tokenizer without needing a model directory. Useful for custom tokenizers, serverless deployments, or testing.
Parameters
- json_content: The tokenizer.json content as string or bytes.
- chat_template: Optional Jinja2 chat template string. If provided, enables apply_chat_template() on this tokenizer.
- bos_token: Beginning-of-sequence token string for chat templates.
- eos_token: End-of-sequence token string for chat templates.
- padding_side: Default padding side ("left" or "right"). Default "left".
- truncation_side: Default truncation side ("left" or "right"). Default "right".
Returns
A new Tokenizer instance.
Raises
- TokenizerError: If the JSON content is invalid.
- ValidationError: If padding_side or truncation_side is invalid.
Example
>>> json = '{"version": "1.0", "model": {"type": "BPE", ...}}'
>>> template = "{% for m in messages %}{{ m.content }}{% endfor %}"
>>> tok = Tokenizer.from_json(json, chat_template=template)
>>> prompt = tok.apply_chat_template([{"role": "user", "content": "Hi"}])
def get_vocab(self) → dict[str, int]
Get the complete vocabulary as a dictionary.
Returns
Dictionary mapping token strings to their IDs.
Raises
- TokenizerError: If vocabulary retrieval fails.
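Example (illustrative; the ID for 'Hello' is a placeholder and depends on the model's vocabulary)
>>> vocab = tokenizer.get_vocab()
>>> vocab['Hello']  # same mapping as token_to_id('Hello')
9707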
def id_to_token(self, token_id: int) → str | None
Get the string representation of a token ID.
Parameters
- token_id: The token ID to convert.
Returns
The token string, or None if the ID is invalid.
Example
>>> tokenizer.id_to_token(9707)
'Hello'
def is_special_id(self, token_id: int) → bool
Check if a token ID is a special token.
Parameters
- token_id: The token ID to check.
Returns
True if the token ID is a special token.
Example
>>> tokenizer.is_special_id(128001) # EOS token
True
def primary_eos_token_id(self) → int | None
Get the primary EOS token ID for insertion.
Use this when inserting an EOS token. For detection/stopping, use token_id in eos_token_ids instead.
Returns
The primary EOS token ID, or None if no EOS tokens are configured.
Example
>>> eos_id = tokenizer.primary_eos_token_id()
>>> tokens.append(eos_id)
def token_to_id(self, token: str) → int | None
Get the ID of a token string.
Parameters
- token: The token string to convert.
Returns
The token ID, or None if the token is not in the vocabulary.
Example
>>> tokenizer.token_to_id('Hello')
9707
def tokenize(
    self,
    text: str,
    return_bytes: bool = False
) → list[str] | list[bytes]
Split text into token strings.
This is useful for debugging tokenization - seeing exactly how text is segmented before being converted to token IDs.
Parameters
- text: Text to tokenize.
- return_bytes: If True, return raw bytes instead of strings. Use this for debugging when you need to see exact byte representations (e.g., for tokens with invalid UTF-8 or special byte sequences).
Returns
List of token strings (default) or bytes (if return_bytes=True).
Raises
- TokenizerError: If tokenization fails.
Example
>>> tokenizer.tokenize("Hello world")
['Hello', ' world']
>>> # Debug mode - see raw bytes
>>> tokenizer.tokenize("Hello", return_bytes=True)
[b'Hello']
>>> # Useful for debugging unicode edge cases
>>> tokenizer.tokenize("café", return_bytes=True)
[b'caf', b'\xc3\xa9'] # Shows UTF-8 encoding
class TokenArray
TokenArray(
    self,
    tokens_ptr: Any,
    num_tokens: int,
    source_text: bytes | None = None,
    tokenizer: Tokenizer | None = None,
    _buffer_handle: Any = None
)
Sequence of token IDs with zero-copy NumPy and DLPack support.
Returned by Tokenizer.encode() and provides efficient access to token data. The underlying memory is managed by a refcounted buffer in the Zig runtime. Implements collections.abc.Sequence[int].
Key Features
Zero-copy NumPy conversion - No data copying when converting to NumPy:
>>> import numpy as np
>>> tokens = tokenizer.encode("Hello world")
>>> arr = np.asarray(tokens) # Zero-copy view
>>> print(arr.dtype)
uint32
Safe DLPack export - Zero-copy export to PyTorch/JAX without invalidation:
>>> tensor = torch.from_dlpack(tokens) # Zero-copy!
>>> len(tokens) # Still valid! TokenArray not invalidated
2
>>> tensor2 = torch.from_dlpack(tokens) # Multiple exports safe
Standard sequence operations - Works like a Python list:
>>> len(tokens)
2
>>> tokens[0]
9707
>>> tokens[-1] # Negative indexing
1879
Convert to list - When you need a regular Python list:
>>> tokens.tolist()
[9707, 1879]
Token offset mapping - Map tokens back to source text positions:
>>> text = "Hello world"
>>> tokens = tokenizer.encode(text)
>>> tokens.offsets # Lazy, computed on first access
[(0, 5), (5, 11)]
>>> tokens.offsets[0].slice(text) # Recommended
'Hello'
Memory Management
TokenArray uses a refcounted buffer. The buffer is freed when all references are released (TokenArray deleted + all DLPack exports consumed).
NumPy arrays from `np.asarray()` are views - they become invalid if the TokenArray is deleted AND no DLPack exports exist:
>>> tokens = tokenizer.encode("Hello")
>>> arr = np.asarray(tokens)
>>> del tokens # Only safe if no other references exist!
>>> arr[0] # May be undefined if buffer was freed
To keep the data safely, either:
1. Keep the TokenArray alive while using the NumPy view
2. Copy it: `arr = np.array(tokens)`
3. Use DLPack: `tensor = torch.from_dlpack(tokens)` (keeps buffer alive)
See Also
- Tokenizer.encode: Creates TokenArrays from text.
- Tokenizer.decode: Converts TokenArrays back to text.
- TokenOffset: The offset type returned by the offsets property.
Quick Reference
Properties
| Name | Type |
|---|---|
| offsets | list[TokenOffset] |
Methods
| Method | Description |
|---|---|
| close() | Release the underlying native buffer. |
| count() | Return number of occurrences of value. |
| index() | Return index of first occurrence of value. |
| tolist() | Convert to a Python list. |
Properties
offsets: list[TokenOffset]
Lazy byte offsets mapping tokens back to source text.
Each offset is a (start, end) pair of UTF-8 byte indices into the original text. The first access triggers computation via the Zig runtime; subsequent accesses return the cached result.
Special tokens (BOS, EOS, PAD) that don't correspond to source text are assigned (0, 0).
Returns
List of TokenOffset objects, one per token.
Raises
- RuntimeError: If source text was not preserved during encoding, or if offset computation fails.
Example
>>> text = "Hello 🎉 world"
>>> tokens = tokenizer.encode(text)
>>> tokens.offsets
[(0, 5), (5, 10), ...]
Use the slice() helper to extract text (handles Unicode correctly):
>>> tokens.offsets[0].slice(text)
'Hello'
>>> tokens.offsets[-1].slice(text)
' world'
Methods
def close(self) → None
Release the underlying native buffer.
After calling close(), the TokenArray cannot be used for data access, DLPack export, or NumPy conversion. Safe to call multiple times (idempotent).
def count(self, value: int) → int
Return number of occurrences of value.
Parameters
- value: Token ID to count.
Returns
Number of times value appears in the array.
Example
>>> tokens = tokenizer.encode("hello hello hello")
>>> tokens.count(9707) # Count occurrences
3
def index(
    self,
    value: int,
    start: int = 0,
    stop: int | None = None
) → int
Return index of first occurrence of value.
Parameters
- value: Token ID to search for.
- start: Start index for search (default 0).
- stop: Stop index for search (default end of array).
Returns
Index of first occurrence of value.
Raises
- ValueError: If value is not in the array.
Example
>>> tokens = tokenizer.encode("Hello world")
>>> tokens.index(9707) # Find first occurrence
0
def tolist(self) → list[int]
Convert to a Python list.
This copies the data into a new Python list. Use this when you need a regular list, or when you need to keep the data after the TokenArray is deleted.
Returns
A new list containing the token IDs.
Example
>>> tokens = tokenizer.encode("Hello world")
>>> token_list = tokens.tolist()
>>> print(token_list)
[9707, 1879]
>>> type(token_list)
<class 'list'>
class TokenArrayView
TokenArrayView(
    self,
    ids_ptr: Any,
    start: int,
    length: int,
    parent: BatchEncoding
)
Lightweight view into a BatchEncoding's contiguous buffer.
Zero-copy view that slices into the parent's memory. The parent BatchEncoding must remain alive while this view is used.
Users interact with this class when indexing into a BatchEncoding:
>>> batch = tokenizer.encode(["Hello", "World"])
>>> view = batch[0] # Returns TokenArrayView
>>> list(view)
[9707]
Quick Reference
Methods
| Method | Description |
|---|---|
| count() | Return number of occurrences of value. |
| index() | Return index of first occurrence of value. |
| tolist() | Convert to a Python list (copies data). |
Methods
def count(self, value: int) → int
Return number of occurrences of value.
Parameters
- value: Token ID to count.
def index(
    self,
    value: int,
    start: int = 0,
    stop: int | None = None
) → int
Return index of first occurrence of value.
Parameters
- value: Token ID to search for.
- start: Start index for search.
- stop: Stop index for search, or None for end of view.
def tolist(self) → list[int]
Convert to a Python list (copies data).
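Example (a sketch; the token IDs are illustrative placeholders)
>>> batch = tokenizer.encode(["Hello world", "Hello"])
>>> view = batch[0]      # TokenArrayView into the batch buffer
>>> view.tolist()
[9707, 1879]
>>> view.index(1879)
1
>>> view.count(9707)
1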
class TokenOffset
TokenOffset(
    self,
    start: int,
    end: int
)
Byte offset pair mapping a token to its position in source text.
Represents a (start, end) byte range in the original UTF-8 encoded text. Offsets are byte indices, not character indices, enabling O(1) slicing regardless of Unicode content.
Methods
slice(text, errors="strict") Extracts the text span for this token. This is the recommended way to use offsets as it handles byte-to-string conversion automatically.
Attributes
- start (int): Start byte offset in source text (inclusive).
- end (int): End byte offset in source text (exclusive).
Notes
- Offsets are UTF-8 byte indices, not character indices. Do NOT use them directly on Python strings (e.g., text[start:end]). Use the slice() method instead.
- For byte-level BPE tokenizers (GPT-2, Qwen), individual tokens may split multi-byte UTF-8 sequences. Use errors="replace" in slice() to handle these gracefully.
- Special tokens (BOS, EOS) that don't correspond to source text are assigned (0, 0).
Examples
Extract text using slice() method (recommended)
>>> text = "Hello 🎉 world"
>>> tokens = tokenizer.encode(text)
>>> tokens.offsets[-1].slice(text)
' world'
Handle byte-level BPE tokens that split UTF-8 sequences:
>>> tokens.offsets[1].slice(text, errors="replace")
' ��' # Replacement chars for partial emoji bytes
Tuple unpacking for raw byte indices
>>> offset = TokenOffset(0, 5)
>>> start, end = offset
>>> print(start, end)
0 5
Comparing with tuples
>>> offset = TokenOffset(0, 5)
>>> offset == (0, 5)
True
Quick Reference
Methods
| Method | Description |
|---|---|
| slice() | Extract the text span corresponding to this offset. |
Methods
def slice(
    self,
    text: str | bytes,
    errors: str = 'strict'
) → str
Extract the text span corresponding to this offset.
This is the recommended way to use offsets. It handles the conversion between byte offsets and Python string indexing automatically.
Parameters
- text: The original source text (str or bytes).
- errors: How to handle decode errors. Default "strict" raises UnicodeDecodeError. Use "replace" to substitute invalid bytes with the replacement character. This can happen with byte-level BPE tokenizers that split multi-byte UTF-8 sequences.
Returns
The substring corresponding to this token.
Raises
- UnicodeDecodeError: If the byte span is not valid UTF-8 and errors="strict" (default). This can happen with byte-level BPE tokenizers (GPT-2, Qwen) that split multi-byte characters across tokens.
Example
>>> tokens = tokenizer.encode("Hello 🎉 world")
>>> tokens.offsets[-1].slice("Hello 🎉 world")
' world'
For byte-level BPE tokens that may split UTF-8 sequences:
>>> tokens.offsets[1].slice(text, errors="replace")
' ��' # Replacement characters for partial bytes
class BatchEncoding
BatchEncoding(
    self,
    ids_ptr: Any = None,
    offsets_ptr: Any = None,
    total_tokens: int = 0,
    num_sequences: int = 0,
    padding_side: str = 'left',
    pad_token_id: int | None = None
)
Lazy container for batch tokenization results.
Holds all token IDs in a single contiguous memory block, with lazy creation of TokenArray views when accessing individual sequences.
Key Features
List-like interface - Works like a list of TokenArrays:
>>> batch = tokenizer.encode(["Hello", "World"])
>>> len(batch)
2
>>> batch[0] # Returns lazy TokenArray view
TokenArray([...], len=...)
Lazy evaluation - TokenArray views are created on-demand:
>>> # Only creates view when accessed
>>> first = batch[0]
Dictionary-like access - Compatible with HuggingFace patterns:
>>> input_ids = torch.from_dlpack(batch["input_ids"])
>>> attention_mask = torch.from_dlpack(batch["attention_mask"])
ML Framework Integration
To convert to PyTorch or JAX, use the specific tensor accessors. Do NOT try to convert the batch object itself - it contains multiple tensors and will raise an InteropError.
Correct usage:
>>> import torch
>>> input_ids = torch.from_dlpack(batch.input_ids)
>>> attention_mask = torch.from_dlpack(batch.attention_mask)
Incorrect usage (raises InteropError):
>>> tensor = torch.from_dlpack(batch) # Error: Ambiguous - which tensor?
Memory Management
BatchEncoding owns the underlying memory and frees it when garbage collected. The TokenArray views returned by indexing are lightweight wrappers that point into this shared memory - they become invalid if the BatchEncoding is deleted.
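Example (a sketch of the intended lifetime rules)
>>> batch = tokenizer.encode(["Hello", "World"])
>>> first = batch[0]      # lightweight view into the batch's memory
>>> ids = first.tolist()  # copy out if the batch may be released
>>> batch.close()         # after this, `first` must no longer be used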
Quick Reference
Properties
| Name | Type |
|---|---|
| attention_mask | _DLPackAccessor |
| input_ids | _DLPackAccessor |
| pad_token_id | int \| None |
| padding_side | str |
| total_tokens | int |
Methods
| Method | Description |
|---|---|
| close() | Release the native batch encoding memory. |
| keys() | Return available tensor keys (dict-like interface). |
| lengths() | Get the length of each sequence in the batch. |
| max_length() | Get the maximum sequence length in the batch. |
| to_list() | Convert batch to padded Python lists. |
Properties
attention_mask: _DLPackAccessor
DLPack interface for the attention mask tensor.
Returns an accessor object that implements the DLPack protocol. Use with torch.from_dlpack() or np.from_dlpack() to get a 2D tensor of shape (num_sequences, padded_length) with values:
- 1 for real tokens
- 0 for padding tokens
NOTE: Each export allocates a new mask buffer computed from sequence lengths and padding configuration.
Example
>>> batch = tokenizer.encode(["Hello", "World!"])
>>> input_ids = torch.from_dlpack(batch.input_ids)
>>> attention_mask = torch.from_dlpack(batch.attention_mask)
>>> model(input_ids, attention_mask=attention_mask)
input_ids: _DLPackAccessor
DLPack interface for the input_ids tensor.
Returns an accessor object that implements the DLPack protocol. Use with torch.from_dlpack() or np.from_dlpack() to get a 2D tensor of shape (num_sequences, padded_length).
NOTE: Each export allocates a new padded buffer. The internal CSR storage is materialized to dense format for ML framework consumption.
Example
>>> batch = tokenizer.encode(["Hello", "World!"])
>>> input_ids = torch.from_dlpack(batch.input_ids)
>>> attention_mask = torch.from_dlpack(batch.attention_mask)
>>> model(input_ids, attention_mask=attention_mask)
pad_token_id: int | None
Padding token ID used for this batch.
padding_side: str
Side for padding: "left" (default for generation) or "right".
This property is inherited from the tokenizer when the batch is created. Set this property to override padding behavior before calling to_list() or using torch.from_dlpack().
Example
>>> batch = tokenizer.encode(["Hello", "World"])
>>> batch.padding_side
'left'
>>> batch.padding_side = "right" # Override for this batch
>>> tensor = torch.from_dlpack(batch["input_ids"])  # Uses right padding
total_tokens: int
Total number of tokens across all sequences.
Methods
def close(self) → None
Release the native batch encoding memory.
After calling close(), the BatchEncoding cannot be used for data access, DLPack export, or iteration. Safe to call multiple times (idempotent).
def keys(self) → list[str]
Return available tensor keys (dict-like interface).
def lengths(self) → list[int]
Get the length of each sequence in the batch.
def max_length(self) → int
Get the maximum sequence length in the batch.
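Example (a sketch; the key names follow the documented "input_ids"/"attention_mask" tensors, and the sequence lengths are illustrative)
>>> batch = tokenizer.encode(["Hello world", "Hi"])
>>> batch.keys()
['input_ids', 'attention_mask']
>>> batch.lengths()
[2, 1]
>>> batch.max_length()
2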
def to_list(
    self,
    padding: bool = True,
    pad_id: int | None = None,
    padding_side: str | None = None,
    max_length: int | None = None,
    truncation: bool = False,
    return_attention_mask: bool = True
) → dict[str, list[list[int]]]
Convert batch to padded Python lists.
This is primarily useful for debugging or when you need plain Python data structures. For ML workloads, use torch.from_dlpack(batch.input_ids) and torch.from_dlpack(batch.attention_mask) directly.
Parameters
- padding: If True (default), pad shorter sequences.
- pad_id: Token ID to use for padding. Defaults to the stored pad_token_id from the tokenizer, or 0 if not set.
- padding_side: Where to add padding tokens. Defaults to the value passed to encode(), which defaults to tokenizer.padding_side.
  - "right": Pad at end (encoder models)
  - "left": Pad at start (decoder/generation models)
- max_length: Maximum length to pad to. If None, uses longest sequence.
- truncation: If True, truncate sequences longer than max_length.
- return_attention_mask: If True (default), include attention_mask.
Returns
Dictionary with:
- "input_ids": 2D list of padded token IDs
- "attention_mask": 2D list of masks (1=real, 0=padding)
Raises
- ValidationError: If padding_side is not 'left' or 'right'.
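Example (a sketch; the token IDs are illustrative placeholders)
>>> batch = tokenizer.encode(["Hello world", "Hello"])
>>> out = batch.to_list(padding_side="right", pad_id=0)
>>> out["input_ids"]
[[9707, 1879], [9707, 0]]
>>> out["attention_mask"]
[[1, 1], [1, 0]]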