Tokenizer
Tokenizer module - Text encoding and decoding
Classes
| Class | Description |
|---|---|
| Tokenizer | Text-to-token encoder and token-to-text decoder. |
| TokenArray | Sequence of token IDs with zero-copy NumPy and DLPack support. |
| TokenArrayView | Lightweight view into a BatchEncoding's contiguous buffer. |
| TokenOffset | Byte offset pair mapping a token to its position in source text. |
| BatchEncoding | Lazy container for batch tokenization results. |
class Tokenizer
Tokenizer(
    self,
    model: str,
    padding_side: str = 'left',
    truncation_side: str = 'right'
)
Text-to-token encoder and token-to-text decoder.
Converts text into token IDs from the model's vocabulary and back. Thread-safe after construction.
Attributes
- model_path (str): Resolved path to the model directory.
- vocab_size (int): Number of tokens in the vocabulary.
- eos_token_ids (list[int]): Token IDs that signal end of generation.
- bos_token_id (int | None): Beginning-of-sequence token ID.
- pad_token_id (int | None): Padding token ID.
- padding_side (str): Side for padding ("left" or "right").
Example
>>> tokenizer = Tokenizer("Qwen/Qwen3-0.6B")
>>> tokens = tokenizer.encode("Hello world")
>>> text = tokenizer.decode(tokens)
Quick Reference
Properties
| Name | Type |
|---|---|
| bos_token | str \| None |
| bos_token_id | int \| None |
| eos_token_ids | tuple[int, ...] |
| eos_tokens | list[str] |
| model_max_length | int |
| model_path | str |
| pad_token | str \| None |
| pad_token_id | int \| None |
| padding_side | str |
| special_ids | frozenset[int] |
| truncation_side | str |
| unk_token | str \| None |
| unk_token_id | int \| None |
| vocab_size | int |
Methods
| Method | Description |
|---|---|
| __call__() | Callable interface for tokenization with zero-copy tensor access. |
| apply_chat_template() | Format a conversation using the model's chat template. |
| close() | Release native tokenizer resources. |
| convert_ids_to_tokens() | Convert a list of token IDs to their string representations. |
| convert_tokens_to_ids() | Convert a list of token strings to their IDs. |
| count_tokens() | Count the number of tokens in text. |
| decode() | Convert token IDs back to text. |
| encode() | Convert text to token IDs. |
| from_json() | Create a tokenizer directly from JSON content. |
| get_vocab() | Get the complete vocabulary as a dictionary. |
| id_to_token() | Get the string representation of a token ID. |
| is_special_id() | Check if a token ID is a special token. |
| primary_eos_token_id() | Get the primary EOS token ID for insertion. |
| token_to_id() | Get the ID of a token string. |
| tokenize() | Split text into token strings. |
Properties
bos_token: str | None
String representation of the BOS token.
Example
>>> tokenizer.bos_token # Llama 3.2
'<|begin_of_text|>'
bos_token_id: int | None
Beginning-of-sequence token ID.
Returns None if the model doesn't use a BOS token (e.g., Qwen3).
Example
>>> tokenizer.bos_token_id # Llama 3.2
128000
>>> tokenizer.bos_token_id # Qwen3
None
eos_token_ids: tuple[int, ...]
Token IDs that signal end of generation.
Returns an immutable, ordered tuple of deduplicated EOS token IDs. Many models have multiple EOS tokens (e.g., Qwen, Llama 3, Gemma 3).
Returns
Tuple of EOS token IDs. May be empty if no EOS tokens are configured.
Example
>>> tokenizer.eos_token_ids
(151643, 151644, 151645)
eos_tokens: list[str]
String representations of all EOS tokens.
Returns
List of EOS token strings.
Example
>>> tokenizer.eos_tokens
['<|endoftext|>', '<|im_end|>', '<|end|>']
model_max_length: int
Maximum sequence length the model supports.
This value is read from tokenizer_config.json (model_max_length field). It represents the maximum context length the model was trained with.
When truncation=True is used without an explicit max_length, this value is used as the default truncation limit.
Returns 0 if the model does not specify a maximum length.
Example
>>> tokenizer.model_max_length
32768
model_path: str
The resolved path to the model directory.
pad_token: str | None
String representation of the padding token.
pad_token_id: int | None
Padding token ID.
padding_side: str
Side for padding: "left" (default, for generation) or "right".
Raises
- ValidationError: If set to a value other than "left" or "right".
special_ids: frozenset[int]
Immutable set of all special token IDs.
Includes EOS, BOS, UNK, and PAD tokens. Use for fast O(1) membership testing.
Returns
Frozen set of all special token IDs.
Example
>>> 128001 in tokenizer.special_ids
True
truncation_side: str
Side for truncation: "right" (default, keeps beginning) or "left" (keeps end).
Raises
- ValidationError: If set to a value other than "left" or "right".
unk_token: str | None
String representation of the unknown token.
unk_token_id: int | None
Unknown token ID.
vocab_size: int
Number of tokens in the vocabulary.
Methods
def __call__(
    self,
    text: str | list[str],
    text_pair: str | list[str] | None = None,
    special_tokens: bool | set[str] = True,
    truncation: bool | str = False,
    max_length: int | None = None,
    return_tensors: str | None = None,
    **kwargs: Any
) → BatchEncoding
Callable interface for tokenization with zero-copy tensor access.
Returns a BatchEncoding object that provides dict-like access to input_ids and attention_mask via the DLPack protocol. This enables zero-copy transfer to PyTorch, JAX, or NumPy.
Padding is applied automatically when exporting to tensors. Control the padding side via tokenizer.padding_side or batch.padding_side.
Parameters
- text: Text(s) to tokenize. Single string is wrapped as a batch of 1.
- text_pair: Not supported.
- special_tokens: Control special token insertion. Can be:
  - True: Add all special tokens (BOS, EOS) - default
  - False: No special tokens
  - {"bos"}, {"eos"}, {"bos", "eos"}: Granular control
- truncation: False or True.
- max_length: Maximum sequence length.
- return_tensors: Ignored (use DLPack instead).
Returns
BatchEncoding with a dict-like interface for zero-copy tensor access:
- batch["input_ids"] → DLPack-compatible accessor
- batch["attention_mask"] → DLPack-compatible accessor
Raises
- NotImplementedError: If text_pair is provided.
- ValidationError: If truncation=True but no max_length is specified.
Example
>>> batch = tokenizer(["Hello", "World!"])
>>> input_ids = torch.from_dlpack(batch["input_ids"])
>>> attention_mask = torch.from_dlpack(batch["attention_mask"])
>>> model(input_ids, attention_mask=attention_mask)
For Python list output (debugging only), use batch.to_list().
def apply_chat_template(
    self,
    messages: list,
    add_generation_prompt: bool = True,
    tokenize: bool = False
) → str | TokenArray | BatchEncoding
Format a conversation using the model's chat template.
Parameters
- messages: List of message dicts with 'role' and 'content'.
- add_generation_prompt: If True, add assistant turn marker.
- tokenize: If True, return TokenArray instead of string.
Returns
Formatted prompt string, or TokenArray if tokenize=True.
Raises
- StateError: If no chat template is available (neither from the model nor from_json()).
When tokenize=True, BOS tokens are handled automatically:
- If the chat template includes BOS in output, it won't be doubled
- EOS is not added (chat templates use turn markers instead)
This prevents the "double-BOS" bug where models like Llama-3 would receive two BOS tokens (one from the template, one from encode), which can degrade generation quality.
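Example (a sketch of the call pattern; the exact prompt text depends on the model's chat template)
>>> messages = [
...     {"role": "system", "content": "You are a helpful assistant."},
...     {"role": "user", "content": "Hi"},
... ]
>>> prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
>>> tokens = tokenizer.apply_chat_template(messages, tokenize=True)  # TokenArray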
def close(self) → None
Release native tokenizer resources.
After calling close(), the tokenizer cannot be used. Safe to call multiple times (idempotent).
Raises
None. This method never raises.
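Example (illustrative usage)
>>> tokenizer = Tokenizer("Qwen/Qwen3-0.6B")
>>> tokens = tokenizer.encode("Hello")
>>> tokenizer.close()  # release native resources
>>> tokenizer.close()  # idempotent; safe to call again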
def convert_ids_to_tokens(self, ids: list[int]) → list[str | None]
Convert a list of token IDs to their string representations.
Parameters
- ids: Token IDs to convert.
def convert_tokens_to_ids(self, tokens: list[str]) → list[int | None]
Convert a list of token strings to their IDs.
Parameters
- tokens: Token strings to convert.
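Example (a sketch; the token strings and IDs are illustrative and model-dependent, and entries missing from the vocabulary are assumed to map to None)
>>> tokenizer.convert_tokens_to_ids(['Hello', ' world'])
[9707, 1879]
>>> tokenizer.convert_ids_to_tokens([9707, 1879])
['Hello', ' world']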
def count_tokens(
    self,
    text: str,
    special_tokens: bool | set[str] = True
) → int
Count the number of tokens in text.
This returns the exact token count that would be used in generation, including BOS/EOS tokens by default. Use this to check if prompts fit within context windows.
Parameters
- text: Text to count tokens for.
- special_tokens: Control special token counting. Can be:
  - True (default): Include all special tokens (matches generation)
  - False: Count only content tokens
  - {"bos"}, {"eos"}, {"bos", "eos"}: Include specific tokens
Returns
Number of tokens.
Example
>>> # Check if prompt fits in context window
>>> tokens = tokenizer.count_tokens(prompt)
>>> if tokens > 4096:
... print("Prompt too long!")
>>> # Count without special tokens
>>> content_tokens = tokenizer.count_tokens(text, special_tokens=False)
def decode(
    self,
    tokens: TokenArray | list[int],
    num_tokens: int | None = None,
    skip_special_tokens: bool = True
) → str
Convert token IDs back to text.
Parameters
- tokens: Token IDs to decode.
- num_tokens: Number of tokens (only for raw pointers).
- skip_special_tokens: If True, omit special tokens from output.
Returns
Decoded text string.
Raises
- ValidationError: If num_tokens is required but not provided.
- TokenizerError: If decoding fails.
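Example (illustrative; the token IDs shown match the encode() examples and depend on the model's vocabulary)
>>> tokens = tokenizer.encode("Hello world")
>>> tokenizer.decode(tokens)
'Hello world'
>>> tokenizer.decode([9707, 1879])
'Hello world'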
def encode(
    self,
    text: str | list[str],
    special_tokens: bool | set[str] = True,
    truncation: bool = False,
    max_length: int | None = None,
    truncation_side: str | None = None
) → TokenArray | BatchEncoding
Convert text to token IDs.
Parameters
- text: Text to tokenize. String returns TokenArray, list returns BatchEncoding.
- special_tokens: Control special token insertion. Can be:
  - True: Add all special tokens (BOS, EOS) - default
  - False: No special tokens (raw tokenization)
  - {"bos"}: Add only BOS token
  - {"eos"}: Add only EOS token
  - {"bos", "eos"}: Add both (same as True)
- truncation: If True, truncate to max_length. When max_length is not specified, uses model_max_length from tokenizer_config.json.
- max_length: Maximum sequence length. If None and truncation=True, uses model_max_length.
- truncation_side: "left" or "right". Overrides the tokenizer default.
  - "right" (default): Keep beginning, truncate end
  - "left": Keep end, truncate beginning (useful for RAG)
Returns
TokenArray for single string, BatchEncoding for list.
Raises
- ValidationError: If special_tokens is not bool or set[str], or if text is not str or list[str].
- TokenizerError: If encoding fails.
- MemoryError: If buffer allocation fails.
Special token behavior depends on the model's tokenizer configuration:
- Models with postprocessor (BERT, RoBERTa): BOS/EOS are added via the tokenizer's postprocessor when special_tokens=True.
- Chat models (Llama 3, Qwen3, Gemma): Special tokens are typically added via chat templates, not the postprocessor. For these models, special_tokens=True may not add BOS/EOS to raw text. Use apply_chat_template() for proper formatting.
- If bos_token_id is None (e.g., Qwen3), requesting BOS is a no-op. Check tokenizer.bos_token_id to verify special token availability.
For batch encoding, padding_side is inherited from the tokenizer's padding_side property (default "left" for generation models). To override for a specific batch, set the property on the result:
batch = tokenizer.encode(["Hello", "World"])
batch.padding_side = "right" # Override before converting
tensor = torch.from_dlpack(batch["input_ids"])
Example
>>> # Default: add all special tokens
>>> tokens = tokenizer.encode("Hello world")
>>> # No special tokens
>>> tokens = tokenizer.encode("Hello world", special_tokens=False)
>>> # BOS only (useful for document snippets)
>>> tokens = tokenizer.encode("Document...", special_tokens={"bos"})
>>> # With truncation (uses model_max_length if no explicit max_length)
>>> tokens = tokenizer.encode(long_text, truncation=True)
>>> # With explicit max_length
>>> tokens = tokenizer.encode(text, truncation=True, max_length=512)
>>> # Left truncation for RAG (keep recent context)
>>> tokens = tokenizer.encode(text, truncation=True, max_length=512,
... truncation_side="left")
>>> # Check if model has BOS token
>>> if tokenizer.bos_token_id is not None:
... print(f"BOS token: {tokenizer.bos_token}")
def from_json(
    cls,
    json_content: str | bytes,
    chat_template: str | None = None,
    bos_token: str = '',
    eos_token: str = '',
    padding_side: str = 'left',
    truncation_side: str = 'right'
) → Tokenizer
Create a tokenizer directly from JSON content.
Creates a standalone tokenizer without needing a model directory. Useful for custom tokenizers, serverless deployments, or testing.
Parameters
- json_content: The tokenizer.json content as string or bytes.
- chat_template: Optional Jinja2 chat template string. If provided, enables apply_chat_template() on this tokenizer.
- bos_token: Beginning-of-sequence token string for chat templates.
- eos_token: End-of-sequence token string for chat templates.
- padding_side: Default padding side ("left" or "right"). Default "left".
- truncation_side: Default truncation side ("left" or "right"). Default "right".
Returns
A new Tokenizer instance.
Raises
- TokenizerError: If the JSON content is invalid.
- ValidationError: If padding_side or truncation_side is invalid.
Example
>>> json = '{"version": "1.0", "model": {"type": "BPE", ...}}'
>>> template = "{% for m in messages %}{{ m.content }}{% endfor %}"
>>> tok = Tokenizer.from_json(json, chat_template=template)
>>> prompt = tok.apply_chat_template([{"role": "user", "content": "Hi"}])
def get_vocab(self) → dict[str, int]
Get the complete vocabulary as a dictionary.
Returns
Dictionary mapping token strings to their IDs.
Raises
- TokenizerError: If vocabulary retrieval fails.
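Example (illustrative; the ID for 'Hello' is a placeholder and depends on the model's vocabulary)
>>> vocab = tokenizer.get_vocab()
>>> vocab['Hello']  # same mapping as token_to_id('Hello')
9707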
def id_to_token(self, token_id: int) → str | None
Get the string representation of a token ID.
Parameters
- token_id: The token ID to convert.
Returns
The token string, or None if the ID is invalid.
Example
>>> tokenizer.id_to_token(9707)
'Hello'
def is_special_id(self, token_id: int) → bool
Check if a token ID is a special token.
Parameters
- token_id: The token ID to check.
Returns
True if the token ID is a special token.
Example
>>> tokenizer.is_special_id(128001) # EOS token
True
def primary_eos_token_id(self) → int | None
Get the primary EOS token ID for insertion.
Use this when inserting an EOS token. For detection/stopping, use token_id in eos_token_ids instead.
Returns
The primary EOS token ID, or None if no EOS tokens are configured.
Example
>>> eos_id = tokenizer.primary_eos_token_id()
>>> tokens.append(eos_id)
def token_to_id(self, token: str) → int | None
Get the ID of a token string.
Parameters
- token: The token string to convert.
Returns
The token ID, or None if the token is not in the vocabulary.
Example
>>> tokenizer.token_to_id('Hello')
9707
def tokenize(
    self,
    text: str,
    return_bytes: bool = False
) → list[str] | list[bytes]
Split text into token strings.
This is useful for debugging tokenization - seeing exactly how text is segmented before being converted to token IDs.
Parameters
- text: Text to tokenize.
- return_bytes: If True, return raw bytes instead of strings. Use this for debugging when you need to see exact byte representations (e.g., for tokens with invalid UTF-8 or special byte sequences).
Returns
List of token strings (default) or bytes (if return_bytes=True).
Raises
- TokenizerError: If tokenization fails.
Example
>>> tokenizer.tokenize("Hello world")
['Hello', ' world']
>>> # Debug mode - see raw bytes
>>> tokenizer.tokenize("Hello", return_bytes=True)
[b'Hello']
>>> # Useful for debugging unicode edge cases
>>> tokenizer.tokenize("café", return_bytes=True)
[b'caf', b'\xc3\xa9'] # Shows UTF-8 encoding
class TokenArray
TokenArray(
    self,
    tokens_ptr: Any,
    num_tokens: int,
    source_text: bytes | None = None,
    tokenizer: Tokenizer | None = None,
    _buffer_handle: Any = None
)
Sequence of token IDs with zero-copy NumPy and DLPack support.
Returned by Tokenizer.encode() and provides efficient access to token data. The underlying memory is managed by a refcounted buffer in the Zig runtime. Implements collections.abc.Sequence[int].
Key Features
Zero-copy NumPy conversion - No data copying when converting to NumPy:
>>> import numpy as np
>>> tokens = tokenizer.encode("Hello world")
>>> arr = np.asarray(tokens) # Zero-copy view
>>> print(arr.dtype)
uint32
Safe DLPack export - Zero-copy export to PyTorch/JAX without invalidation:
>>> tensor = torch.from_dlpack(tokens) # Zero-copy!
>>> len(tokens) # Still valid! TokenArray not invalidated
2
>>> tensor2 = torch.from_dlpack(tokens) # Multiple exports safe
Standard sequence operations - Works like a Python list:
>>> len(tokens)
2
>>> tokens[0]
9707
>>> tokens[-1] # Negative indexing
1879
Convert to list - When you need a regular Python list:
>>> tokens.tolist()
[9707, 1879]
Token offset mapping - Map tokens back to source text positions:
>>> text = "Hello world"
>>> tokens = tokenizer.encode(text)
>>> tokens.offsets # Lazy, computed on first access
[(0, 5), (5, 11)]
>>> tokens.offsets[0].slice(text) # Recommended
'Hello'
Memory Management
TokenArray uses a refcounted buffer. The buffer is freed when all references are released (TokenArray deleted + all DLPack exports consumed).
NumPy arrays from `np.asarray()` are views - they become invalid if the TokenArray is deleted AND no DLPack exports exist:
>>> tokens = tokenizer.encode("Hello")
>>> arr = np.asarray(tokens)
>>> del tokens # Only safe if no other references exist!
>>> arr[0] # May be undefined if buffer was freed
To keep the data safely, either:
1. Keep the TokenArray alive while using the NumPy view
2. Copy it: `arr = np.array(tokens)`
3. Use DLPack: `tensor = torch.from_dlpack(tokens)` (keeps buffer alive)
See Also
- Tokenizer.encode: Creates TokenArrays from text.
- Tokenizer.decode: Converts TokenArrays back to text.
- TokenOffset: The offset type returned by the offsets property.
Quick Reference
Properties
| Name | Type |
|---|---|
| offsets | list[TokenOffset] |
Methods
| Method | Description |
|---|---|
| close() | Release the underlying native buffer. |
| count() | Return number of occurrences of value. |
| index() | Return index of first occurrence of value. |
| tolist() | Convert to a Python list. |
Properties
offsets: list[TokenOffset]
Lazy byte offsets mapping tokens back to source text.
Each offset is a (start, end) pair of UTF-8 byte indices into the original text. The first access triggers computation via the Zig runtime; subsequent accesses return the cached result.
Special tokens (BOS, EOS, PAD) that don't correspond to source text are assigned (0, 0).
Returns
List of TokenOffset objects, one per token.
Raises
- RuntimeError: If source text was not preserved during encoding, or if offset computation fails.
Example
>>> text = "Hello 🎉 world"
>>> tokens = tokenizer.encode(text)
>>> tokens.offsets
[(0, 5), (5, 10), ...]
Use the slice() helper to extract text (handles Unicode correctly):
>>> tokens.offsets[0].slice(text)
'Hello'
>>> tokens.offsets[-1].slice(text)
' world'
Methods
def close(self) → None
Release the underlying native buffer.
After calling close(), the TokenArray cannot be used for data access, DLPack export, or NumPy conversion. Safe to call multiple times (idempotent).
def count(self, value: int) → int
Return number of occurrences of value.
Parameters
- value: Token ID to count.
Returns
Number of times value appears in the array.
Example
>>> tokens = tokenizer.encode("hello hello hello")
>>> tokens.count(9707) # Count occurrences
3
def index(
    self,
    value: int,
    start: int = 0,
    stop: int | None = None
) → int
Return index of first occurrence of value.
Parameters
- value: Token ID to search for.
- start: Start index for search (default 0).
- stop: Stop index for search (default end of array).
Returns
Index of first occurrence of value.
Raises
- ValueError: If value is not in the array.
Example
>>> tokens = tokenizer.encode("Hello world")
>>> tokens.index(9707) # Find first occurrence
0
def tolist(self) → list[int]
Convert to a Python list.
This copies the data into a new Python list. Use this when you need a regular list, or when you need to keep the data after the TokenArray is deleted.
Returns
A new list containing the token IDs.
Example
>>> tokens = tokenizer.encode("Hello world")
>>> token_list = tokens.tolist()
>>> print(token_list)
[9707, 1879]
>>> type(token_list)
<class 'list'>
class TokenArrayView
TokenArrayView(
    self,
    ids_ptr: Any,
    start: int,
    length: int,
    parent: BatchEncoding
)
Lightweight view into a BatchEncoding's contiguous buffer.
Zero-copy view that slices into the parent's memory. The parent BatchEncoding must remain alive while this view is used.
Users interact with this class when indexing into a BatchEncoding:
>>> batch = tokenizer.encode(["Hello", "World"])
>>> view = batch[0] # Returns TokenArrayView
>>> list(view)
[9707]
Quick Reference
Methods
| Method | Description |
|---|---|
| count() | Return number of occurrences of value. |
| index() | Return index of first occurrence of value. |
| tolist() | Convert to a Python list (copies data). |
Methods
def count(self, value: int) → int
Return number of occurrences of value.
Parameters
- value: Token ID to count.
def index(
    self,
    value: int,
    start: int = 0,
    stop: int | None = None
) → int
Return index of first occurrence of value.
Parameters
- value: Token ID to search for.
- start: Start index for search.
- stop: Stop index for search, or None for end of view.
def tolist(self) → list[int]
Convert to a Python list (copies data).
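Example (a sketch; the token IDs are illustrative placeholders)
>>> batch = tokenizer.encode(["Hello world", "Hello"])
>>> view = batch[0]      # TokenArrayView into the batch buffer
>>> view.tolist()
[9707, 1879]
>>> view.index(1879)
1
>>> view.count(9707)
1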
class TokenOffset
TokenOffset(
    self,
    start: int,
    end: int
)
Byte offset pair mapping a token to its position in source text.
Represents a (start, end) byte range in the original UTF-8 encoded text. Offsets are byte indices, not character indices, enabling O(1) slicing regardless of Unicode content.
Methods
slice(text, errors="strict") Extracts the text span for this token. This is the recommended way to use offsets as it handles byte-to-string conversion automatically.
Attributes
- start (int): Start byte offset in source text (inclusive).
- end (int): End byte offset in source text (exclusive).
Notes
- Offsets are UTF-8 byte indices, not character indices. Do NOT use them directly on Python strings (e.g., text[start:end]). Use the slice() method instead.
- For byte-level BPE tokenizers (GPT-2, Qwen), individual tokens may split multi-byte UTF-8 sequences. Use errors="replace" in slice() to handle these gracefully.
- Special tokens (BOS, EOS) that don't correspond to source text are assigned (0, 0).
Examples
Extract text using slice() method (recommended)
>>> text = "Hello 🎉 world"
>>> tokens = tokenizer.encode(text)
>>> tokens.offsets[-1].slice(text)
' world'
Handle byte-level BPE tokens that split UTF-8 sequences:
>>> tokens.offsets[1].slice(text, errors="replace")
' ��' # Replacement chars for partial emoji bytes
Tuple unpacking for raw byte indices
>>> offset = TokenOffset(0, 5)
>>> start, end = offset
>>> print(start, end)
0 5
Comparing with tuples
>>> offset = TokenOffset(0, 5)
>>> offset == (0, 5)
True
Quick Reference
Methods
| Method | Description |
|---|---|
| slice() | Extract the text span corresponding to this offset. |
Methods
def slice(
    self,
    text: str | bytes,
    errors: str = 'strict'
) → str
Extract the text span corresponding to this offset.
This is the recommended way to use offsets. It handles the conversion between byte offsets and Python string indexing automatically.
Parameters
- text: The original source text (str or bytes).
- errors: How to handle decode errors. Default "strict" raises UnicodeDecodeError. Use "replace" to substitute invalid bytes with the replacement character. This can happen with byte-level BPE tokenizers that split multi-byte UTF-8 sequences.
Returns
The substring corresponding to this token.
Raises
- UnicodeDecodeError: If the byte span is not valid UTF-8 and errors="strict" (default). This can happen with byte-level BPE tokenizers (GPT-2, Qwen) that split multi-byte characters across tokens.
Example
>>> tokens = tokenizer.encode("Hello 🎉 world")
>>> tokens.offsets[-1].slice("Hello 🎉 world")
' world'
For byte-level BPE tokens that may split UTF-8 sequences:
>>> tokens.offsets[1].slice(text, errors="replace")
' ��' # Replacement characters for partial bytes
class BatchEncoding
BatchEncoding(
    self,
    ids_ptr: Any = None,
    offsets_ptr: Any = None,
    total_tokens: int = 0,
    num_sequences: int = 0,
    padding_side: str = 'left',
    pad_token_id: int | None = None
)
Lazy container for batch tokenization results.
Holds all token IDs in a single contiguous memory block, with lazy creation of TokenArray views when accessing individual sequences.
Key Features
List-like interface - Works like a list of TokenArrays:
>>> batch = tokenizer.encode(["Hello", "World"])
>>> len(batch)
2
>>> batch[0] # Returns lazy TokenArray view
TokenArray([...], len=...)
Lazy evaluation - TokenArray views are created on-demand:
>>> # Only creates view when accessed
>>> first = batch[0]
Dictionary-like access - Compatible with HuggingFace patterns:
>>> input_ids = torch.from_dlpack(batch["input_ids"])
>>> attention_mask = torch.from_dlpack(batch["attention_mask"])
ML Framework Integration
To convert to PyTorch or JAX, use the specific tensor accessors. Do NOT try to convert the batch object itself - it contains multiple tensors and will raise an InteropError.
Correct usage:
>>> import torch
>>> input_ids = torch.from_dlpack(batch.input_ids)
>>> attention_mask = torch.from_dlpack(batch.attention_mask)
Incorrect usage (raises InteropError):
>>> tensor = torch.from_dlpack(batch) # Error: Ambiguous - which tensor?
Memory Management
BatchEncoding owns the underlying memory and frees it when garbage collected. The TokenArray views returned by indexing are lightweight wrappers that point into this shared memory - they become invalid if the BatchEncoding is deleted.
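Example (a sketch of the intended lifetime rules)
>>> batch = tokenizer.encode(["Hello", "World"])
>>> first = batch[0]      # lightweight view into the batch's memory
>>> ids = first.tolist()  # copy out if the batch may be released
>>> batch.close()         # after this, `first` must no longer be used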
Quick Reference
Properties
| Name | Type |
|---|---|
| attention_mask | _DLPackAccessor |
| input_ids | _DLPackAccessor |
| pad_token_id | int \| None |
| padding_side | str |
| total_tokens | int |
Methods
| Method | Description |
|---|---|
| close() | Release the native batch encoding memory. |
| keys() | Return available tensor keys (dict-like interface). |
| lengths() | Get the length of each sequence in the batch. |
| max_length() | Get the maximum sequence length in the batch. |
| to_list() | Convert batch to padded Python lists. |
Properties
attention_mask: _DLPackAccessor
DLPack interface for the attention mask tensor.
Returns an accessor object that implements the DLPack protocol. Use with torch.from_dlpack() or np.from_dlpack() to get a 2D tensor of shape (num_sequences, padded_length) with values:
- 1 for real tokens
- 0 for padding tokens
NOTE: Each export allocates a new mask buffer computed from sequence lengths and padding configuration.
Example
>>> batch = tokenizer.encode(["Hello", "World!"])
>>> input_ids = torch.from_dlpack(batch.input_ids)
>>> attention_mask = torch.from_dlpack(batch.attention_mask)
>>> model(input_ids, attention_mask=attention_mask)
input_ids: _DLPackAccessor
DLPack interface for the input_ids tensor.
Returns an accessor object that implements the DLPack protocol. Use with torch.from_dlpack() or np.from_dlpack() to get a 2D tensor of shape (num_sequences, padded_length).
NOTE: Each export allocates a new padded buffer. The internal CSR storage is materialized to dense format for ML framework consumption.
Example
>>> batch = tokenizer.encode(["Hello", "World!"])
>>> input_ids = torch.from_dlpack(batch.input_ids)
>>> attention_mask = torch.from_dlpack(batch.attention_mask)
>>> model(input_ids, attention_mask=attention_mask)
pad_token_id: int | None
Padding token ID used for this batch.
padding_side: str
Side for padding: "left" (default for generation) or "right".
This property is inherited from the tokenizer when the batch is created. Set this property to override padding behavior before calling to_list() or using torch.from_dlpack().
Example
>>> batch = tokenizer.encode(["Hello", "World"])
>>> batch.padding_side
'left'
>>> batch.padding_side = "right" # Override for this batch
>>> tensor = torch.from_dlpack(batch["input_ids"])  # Uses right padding
total_tokens: int
Total number of tokens across all sequences.
Methods
def close(self) → None
Release the native batch encoding memory.
After calling close(), the BatchEncoding cannot be used for data access, DLPack export, or iteration. Safe to call multiple times (idempotent).
def keys(self) → list[str]
Return available tensor keys (dict-like interface).
def lengths(self) → list[int]
Get the length of each sequence in the batch.
def max_length(self) → int
Get the maximum sequence length in the batch.
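Example (a sketch; the key names follow the documented "input_ids"/"attention_mask" tensors, and the sequence lengths are illustrative)
>>> batch = tokenizer.encode(["Hello world", "Hi"])
>>> batch.keys()
['input_ids', 'attention_mask']
>>> batch.lengths()
[2, 1]
>>> batch.max_length()
2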
def to_list(
    self,
    padding: bool = True,
    pad_id: int | None = None,
    padding_side: str | None = None,
    max_length: int | None = None,
    truncation: bool = False,
    return_attention_mask: bool = True
) → dict[str, list[list[int]]]
Convert batch to padded Python lists.
This is primarily useful for debugging or when you need plain Python data structures. For ML workloads, use torch.from_dlpack(batch.input_ids) and torch.from_dlpack(batch.attention_mask) directly.
Parameters
- padding: If True (default), pad shorter sequences.
- pad_id: Token ID to use for padding. Defaults to the stored pad_token_id from the tokenizer, or 0 if not set.
- padding_side: Where to add padding tokens. Defaults to the value passed to encode(), which defaults to tokenizer.padding_side.
  - "right": Pad at end (encoder models)
  - "left": Pad at start (decoder/generation models)
- max_length: Maximum length to pad to. If None, uses longest sequence.
- truncation: If True, truncate sequences longer than max_length.
- return_attention_mask: If True (default), include attention_mask.
Returns
Dictionary with:
- "input_ids": 2D list of padded token IDs
- "attention_mask": 2D list of masks (1=real, 0=padding)
Raises
- ValidationError: If padding_side is not 'left' or 'right'.
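Example (a sketch; the token IDs are illustrative placeholders)
>>> batch = tokenizer.encode(["Hello world", "Hello"])
>>> out = batch.to_list(padding_side="right", pad_id=0)
>>> out["input_ids"]
[[9707, 1879], [9707, 0]]
>>> out["attention_mask"]
[[1, 1], [1, 0]]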