Tokenizer

Tokenizer module - Text encoding and decoding

Classes

Class Description
Tokenizer Text-to-token encoder and token-to-text decoder.
TokenArray Sequence of token IDs with zero-copy NumPy and DLPack support.
TokenArrayView Lightweight view into a BatchEncoding's contiguous buffer.
TokenOffset Byte offset pair mapping a token to its position in source text.
BatchEncoding Lazy container for batch tokenization results.

class Tokenizer

Tokenizer(
    self,
    model: str,
    padding_side: str = 'left',
    truncation_side: str = 'right'
)

Text-to-token encoder and token-to-text decoder.

Converts text into token IDs from the model's vocabulary and back. Thread-safe after construction.

Attributes
model_path : str

Resolved path to the model directory.

vocab_size : int

Number of tokens in the vocabulary.

eos_token_ids : tuple[int, ...]

Token IDs that signal end of generation.

bos_token_id : int | None

Beginning-of-sequence token ID.

pad_token_id : int | None

Padding token ID.

padding_side : str

Side for padding ("left" or "right").

Example
>>> tokenizer = Tokenizer("Qwen/Qwen3-0.6B")
>>> tokens = tokenizer.encode("Hello world")
>>> text = tokenizer.decode(tokens)

Quick Reference

Properties

Name Type
bos_token str | None
bos_token_id int | None
eos_token_ids tuple[int, ...]
eos_tokens list[str]
model_max_length int
model_path str
pad_token str | None
pad_token_id int | None
padding_side str
special_ids frozenset[int]
truncation_side str
unk_token str | None
unk_token_id int | None
vocab_size int

Methods

Method Description
__call__() Callable interface for tokenization with zero-copy tensor access.
apply_chat_template() Format a conversation using the model's chat template.
close() Release native tokenizer resources.
convert_ids_to_tokens() Convert a list of token IDs to their string representations.
convert_tokens_to_ids() Convert a list of token strings to their IDs.
count_tokens() Count the number of tokens in text.
decode() Convert token IDs back to text.
encode() Convert text to token IDs.
from_json() Create a tokenizer directly from JSON content.
get_vocab() Get the complete vocabulary as a dictionary.
id_to_token() Get the string representation of a token ID.
is_special_id() Check if a token ID is a special token.
primary_eos_token_id() Get the primary EOS token ID for insertion.
token_to_id() Get the ID of a token string.
tokenize() Split text into token strings.

Properties

bos_token: str | None

String representation of the BOS token.

Example
>>> tokenizer.bos_token  # Llama 3.2
'<|begin_of_text|>'

bos_token_id: int | None

Beginning-of-sequence token ID.

Returns None if the model doesn't use a BOS token (e.g., Qwen3).

Example
>>> tokenizer.bos_token_id  # Llama 3.2
128000
>>> tokenizer.bos_token_id  # Qwen3
None

eos_token_ids: tuple[int, ...]

Token IDs that signal end of generation.

Returns an immutable, ordered tuple of deduplicated EOS token IDs. Many models have multiple EOS tokens (e.g., Qwen, Llama 3, Gemma 3).

Returns

Tuple of EOS token IDs. May be empty if no EOS tokens are configured.

Example
>>> tokenizer.eos_token_ids
(151643, 151644, 151645)

eos_tokens: list[str]

String representations of all EOS tokens.

Returns

List of EOS token strings.

Example
>>> tokenizer.eos_tokens
['<|endoftext|>', '<|im_end|>', '<|end|>']

model_max_length: int

Maximum sequence length the model supports.

This value is read from tokenizer_config.json (model_max_length field). It represents the maximum context length the model was trained with.

When truncation=True is used without an explicit max_length, this value is used as the default truncation limit.

Returns 0 if the model does not specify a maximum length.

Example
>>> tokenizer.model_max_length
32768

model_path: str

The resolved path to the model directory.

pad_token: str | None

String representation of the padding token.

pad_token_id: int | None

Padding token ID.

padding_side: str

Side for padding: "left" (default, for generation) or "right".

Raises
ValidationError

If set to a value other than "left" or "right".

special_ids: frozenset[int]

Immutable set of all special token IDs.

Includes EOS, BOS, UNK, and PAD tokens. Use for fast O(1) membership testing.

Returns

Frozen set of all special token IDs.

Example
>>> 128001 in tokenizer.special_ids
True

truncation_side: str

Side for truncation: "right" (default, keeps beginning) or "left" (keeps end).

Raises
ValidationError

If set to a value other than "left" or "right".

unk_token: str | None

String representation of the unknown token.

unk_token_id: int | None

Unknown token ID.

vocab_size: int

Number of tokens in the vocabulary.

Methods

def __call__(
    self,
    text: str | list[str],
    text_pair: str | list[str] | None = None,
    special_tokens: bool | set[str] = True,
    truncation: bool | str = False,
    max_length: int | None = None,
    return_tensors: str | None = None,
    **kwargs: Any
) -> BatchEncoding

Callable interface for tokenization with zero-copy tensor access.

Returns a BatchEncoding object that provides dict-like access to input_ids and attention_mask via the DLPack protocol. This enables zero-copy transfer to PyTorch, JAX, or NumPy.

Padding is applied automatically when exporting to tensors. Control the padding side via tokenizer.padding_side or batch.padding_side.

Parameters
text

Text(s) to tokenize. Single string is wrapped as batch of 1.

text_pair

Not supported.

special_tokens

Control special token insertion. Can be:
  • True: Add all special tokens (BOS, EOS) - default
  • False: No special tokens
  • {"bos"}, {"eos"}, {"bos", "eos"}: Granular control

truncation

If True, truncate sequences to max_length. An explicit max_length is required when truncation is enabled (see Raises).

max_length

Maximum sequence length.

return_tensors

Ignored (use DLPack instead).

Returns

BatchEncoding with dict-like interface for zero-copy tensor access:
  • batch["input_ids"] → DLPack-compatible accessor
  • batch["attention_mask"] → DLPack-compatible accessor

Raises
NotImplementedError

If text_pair is provided.

ValidationError

If truncation=True but no max_length specified.

Example
>>> batch = tokenizer(["Hello", "World!"])
>>> input_ids = torch.from_dlpack(batch["input_ids"])
>>> attention_mask = torch.from_dlpack(batch["attention_mask"])
>>> model(input_ids, attention_mask=attention_mask)
Note

For Python list output (debugging only), use batch.to_list().

def apply_chat_template(
    self,
    messages: list,
    add_generation_prompt: bool = True,
    tokenize: bool = False
) -> str | TokenArray | BatchEncoding

Format a conversation using the model's chat template.

Parameters
messages

List of message dicts with 'role' and 'content'.

add_generation_prompt

If True, add assistant turn marker.

tokenize

If True, return TokenArray instead of string.

Returns

Formatted prompt string, or TokenArray if tokenize=True.

Raises
StateError

If no chat template is available (neither from model nor from_json()).

Note

When tokenize=True, BOS tokens are handled automatically:

  • If the chat template includes BOS in output, it won't be doubled
  • EOS is not added (chat templates use turn markers instead)

This prevents the "double-BOS" bug where models like Llama-3 would receive two BOS tokens (one from the template, one from encode), which can degrade generation quality.
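
Example

A minimal sketch; the rendered prompt depends entirely on the model's chat template, and the message contents below are illustrative:

>>> messages = [
...     {"role": "system", "content": "You are a helpful assistant."},
...     {"role": "user", "content": "Hello!"},
... ]
>>> prompt = tokenizer.apply_chat_template(messages)                  # str, ends with an assistant turn marker
>>> tokens = tokenizer.apply_chat_template(messages, tokenize=True)   # TokenArray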

def close(self) -> None

Release native tokenizer resources.

After calling close(), the tokenizer cannot be used. Safe to call multiple times (idempotent).

Raises

None. This method never raises.
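
Example

A minimal sketch of the idempotent contract described above:

>>> tokenizer.close()
>>> tokenizer.close()  # Safe: the second call is a no-op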

def convert_ids_to_tokens(self, ids: list[int]) -> list[str | None]

Convert a list of token IDs to their string representations.

Parameters
ids

Token IDs to convert.
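
Example

A minimal sketch; the IDs are taken from the encode examples in this document, and the exact token strings returned are model-specific:

>>> tokenizer.convert_ids_to_tokens([9707, 1879])
['Hello', ' world']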

def convert_tokens_to_ids(self, tokens: list[str]) -> list[int | None]

Convert a list of token strings to their IDs.

Parameters
tokens

Token strings to convert.
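
Example

A minimal sketch mirroring the example above; tokens not present in the vocabulary map to None:

>>> tokenizer.convert_tokens_to_ids(['Hello', ' world'])
[9707, 1879]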

def count_tokens(
    self,
    text: str,
    special_tokens: bool | set[str] = True
) -> int

Count the number of tokens in text.

This returns the exact token count that would be used in generation, including BOS/EOS tokens by default. Use this to check if prompts fit within context windows.

Parameters
text

Text to count tokens for.

special_tokens

Control special token counting. Can be:
  • True (default): Include all special tokens (matches generation)
  • False: Count only content tokens
  • {"bos"}, {"eos"}, {"bos", "eos"}: Include specific tokens

Returns

Number of tokens.

Example
>>> # Check if prompt fits in context window
>>> tokens = tokenizer.count_tokens(prompt)
>>> if tokens > 4096:
...     print("Prompt too long!")

>>> # Count without special tokens
>>> content_tokens = tokenizer.count_tokens(text, special_tokens=False)

def decode(
    self,
    tokens: TokenArray | list[int],
    num_tokens: int | None = None,
    skip_special_tokens: bool = True
) -> str

Convert token IDs back to text.

Parameters
tokens

Token IDs to decode.

num_tokens

Number of tokens (only for raw pointers).

skip_special_tokens

If True, omit special tokens from output.

Returns

Decoded text string.

Raises
ValidationError

If num_tokens is required but not provided.

TokenizerError

If decoding fails.
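
Example

A minimal round-trip sketch; exact behavior depends on the model's special tokens:

>>> tokens = tokenizer.encode("Hello world")
>>> tokenizer.decode(tokens)
'Hello world'
>>> text_with_specials = tokenizer.decode(tokens, skip_special_tokens=False)  # Keeps any special tokens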

def encode(
    self,
    text: str | list[str],
    special_tokens: bool | set[str] = True,
    truncation: bool = False,
    max_length: int | None = None,
    truncation_side: str | None = None
) -> TokenArray | BatchEncoding

Convert text to token IDs.

Parameters
text

Text to tokenize. String returns TokenArray, list returns BatchEncoding.

special_tokens

Control special token insertion. Can be:
  • True: Add all special tokens (BOS, EOS) - default
  • False: No special tokens (raw tokenization)
  • {"bos"}: Add only BOS token
  • {"eos"}: Add only EOS token
  • {"bos", "eos"}: Add both (same as True)

truncation

If True, truncate to max_length. When max_length is not specified, uses model_max_length from tokenizer_config.json.

max_length

Maximum sequence length. If None and truncation=True, uses model_max_length.

truncation_side
  • "left" or "right". Overrides tokenizer default.
  • "right" (default): Keep beginning, truncate end
  • "left": Keep end, truncate beginning (useful for RAG)
Returns

TokenArray for single string, BatchEncoding for list.

Raises
ValidationError

If special_tokens is not bool or set[str], or if text is not str or list[str].

TokenizerError

If encoding fails.

MemoryError

If buffer allocation fails.

Note

Special token behavior depends on the model's tokenizer configuration:

  • Models with postprocessor (BERT, RoBERTa): BOS/EOS are added via the tokenizer's postprocessor when special_tokens=True.
  • Chat models (Llama 3, Qwen3, Gemma): Special tokens are typically added via chat templates, not the postprocessor. For these models, special_tokens=True may not add BOS/EOS to raw text. Use apply_chat_template() for proper formatting.
  • If bos_token_id is None (e.g., Qwen3), requesting BOS is a no-op. Check tokenizer.bos_token_id to verify special token availability.

For batch encoding, padding_side is inherited from the tokenizer's padding_side property (default "left" for generation models). To override for a specific batch, set the property on the result:

batch = tokenizer.encode(["Hello", "World"])
batch.padding_side = "right"  # Override before converting
tensor = torch.from_dlpack(batch.input_ids)
Example
>>> # Default: add all special tokens
>>> tokens = tokenizer.encode("Hello world")

>>> # No special tokens
>>> tokens = tokenizer.encode("Hello world", special_tokens=False)

>>> # BOS only (useful for document snippets)
>>> tokens = tokenizer.encode("Document...", special_tokens={"bos"})

>>> # With truncation (uses model_max_length if no explicit max_length)
>>> tokens = tokenizer.encode(long_text, truncation=True)

>>> # With explicit max_length
>>> tokens = tokenizer.encode(text, truncation=True, max_length=512)

>>> # Left truncation for RAG (keep recent context)
>>> tokens = tokenizer.encode(text, truncation=True, max_length=512,
...                           truncation_side="left")

>>> # Check if model has BOS token
>>> if tokenizer.bos_token_id is not None:
...     print(f"BOS token: {tokenizer.bos_token}")

def from_json(
    cls,
    json_content: str | bytes,
    chat_template: str | None = None,
    bos_token: str = '',
    eos_token: str = '',
    padding_side: str = 'left',
    truncation_side: str = 'right'
) -> Tokenizer

Create a tokenizer directly from JSON content.

Creates a standalone tokenizer without needing a model directory. Useful for custom tokenizers, serverless deployments, or testing.

Parameters
json_content

The tokenizer.json content as string or bytes.

chat_template

Optional Jinja2 chat template string. If provided, enables apply_chat_template() on this tokenizer.

bos_token

Beginning-of-sequence token string for chat templates.

eos_token

End-of-sequence token string for chat templates.

padding_side

Default padding side ("left" or "right"). Default "left".

truncation_side

Default truncation side ("left" or "right"). Default "right".

Returns

A new Tokenizer instance.

Raises
TokenizerError

If the JSON content is invalid.

ValidationError

If padding_side or truncation_side is invalid.

Example
>>> json = '{"version": "1.0", "model": {"type": "BPE", ...}}'
>>> template = "{% for m in messages %}{{ m.content }}{% endfor %}"
>>> tok = Tokenizer.from_json(json, chat_template=template)
>>> prompt = tok.apply_chat_template([{"role": "user", "content": "Hi"}])

def get_vocab(self) -> dict[str, int]

Get the complete vocabulary as a dictionary.

Returns

Dictionary mapping token strings to their IDs.

Raises
TokenizerError

If vocabulary retrieval fails.
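
Example

A minimal sketch; the ID shown matches the token_to_id example below and is model-specific:

>>> vocab = tokenizer.get_vocab()
>>> vocab['Hello']
9707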

def id_to_token(self, token_id: int) -> str | None

Get the string representation of a token ID.

Parameters
token_id

The token ID to convert.

Returns

The token string, or None if the ID is invalid.

Example
>>> tokenizer.id_to_token(9707)
'Hello'

def is_special_id(self, token_id: int) -> bool

Check if a token ID is a special token.

Parameters
token_id

The token ID to check.

Returns

True if the token ID is a special token.

Example
>>> tokenizer.is_special_id(128001)  # EOS token
True

def primary_eos_token_id(self) -> int | None

Get the primary EOS token ID for insertion.

Use this when inserting an EOS token. For detection/stopping, use token_id in eos_token_ids instead.

Returns

The primary EOS token ID, or None if no EOS tokens are configured.

Example
>>> eos_id = tokenizer.primary_eos_token_id()
>>> tokens.append(eos_id)

def token_to_id(self, token: str) -> int | None

Get the ID of a token string.

Parameters
token

The token string to convert.

Returns

The token ID, or None if the token is not in the vocabulary.

Example
>>> tokenizer.token_to_id('Hello')
9707

def tokenize(
    self,
    text: str,
    return_bytes: bool = False
) -> list[str] | list[bytes]

Split text into token strings.

This is useful for debugging tokenization - seeing exactly how text is segmented before being converted to token IDs.

Parameters
text

Text to tokenize.

return_bytes

If True, return raw bytes instead of strings. Use this for debugging when you need to see exact byte representations (e.g., for tokens with invalid UTF-8 or special byte sequences).

Returns

List of token strings (default) or bytes (if return_bytes=True).

Raises
TokenizerError

If tokenization fails.

Example
>>> tokenizer.tokenize("Hello world")
['Hello', ' world']
>>> # Debug mode - see raw bytes
>>> tokenizer.tokenize("Hello", return_bytes=True)
[b'Hello']
>>> # Useful for debugging unicode edge cases
>>> tokenizer.tokenize("café", return_bytes=True)
[b'caf', b'\xc3\xa9']  # Shows UTF-8 encoding

class TokenArray

TokenArray(
    self,
    tokens_ptr: Any,
    num_tokens: int,
    source_text: bytes | None = None,
    tokenizer: Tokenizer | None = None,
    _buffer_handle: Any = None
)

Sequence of token IDs with zero-copy NumPy and DLPack support.

Returned by Tokenizer.encode() and provides efficient access to token data. The underlying memory is managed by a refcounted buffer in the Zig runtime. Implements collections.abc.Sequence[int].

Key Features

Zero-copy NumPy conversion - No data copying when converting to NumPy:

>>> import numpy as np
>>> tokens = tokenizer.encode("Hello world")
>>> arr = np.asarray(tokens)  # Zero-copy view
>>> print(arr.dtype)
uint32

Safe DLPack export - Zero-copy export to PyTorch/JAX without invalidation:

>>> tensor = torch.from_dlpack(tokens)  # Zero-copy!
>>> len(tokens)  # Still valid! TokenArray not invalidated
2
>>> tensor2 = torch.from_dlpack(tokens)  # Multiple exports safe

Standard sequence operations - Works like a Python list:

>>> len(tokens)
2
>>> tokens[0]
9707
>>> tokens[-1]  # Negative indexing
1879

Convert to list - When you need a regular Python list:

>>> tokens.tolist()
[9707, 1879]

Token offset mapping - Map tokens back to source text positions:

>>> text = "Hello world"
>>> tokens = tokenizer.encode(text)
>>> tokens.offsets  # Lazy, computed on first access
[(0, 5), (5, 11)]
>>> tokens.offsets[0].slice(text)  # Recommended
'Hello'
Memory Management

TokenArray uses a refcounted buffer. The buffer is freed when all references are released (TokenArray deleted + all DLPack exports consumed).

NumPy arrays from `np.asarray()` are views - they become invalid if the TokenArray is deleted AND no DLPack exports exist:

>>> tokens = tokenizer.encode("Hello")
>>> arr = np.asarray(tokens)
>>> del tokens  # Only safe if no other references exist!
>>> arr[0]      # May be undefined if buffer was freed

To keep the data safely, either:

  1. Keep the TokenArray alive while using the NumPy view
  2. Copy it: `arr = np.array(tokens)`
  3. Use DLPack: `tensor = torch.from_dlpack(tokens)` (keeps buffer alive)
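
A sketch of the safe patterns listed above, assuming numpy and torch are imported as in the earlier examples:

>>> tokens = tokenizer.encode("Hello")
>>> owned = np.array(tokens)            # Option 2: independent copy
>>> tensor = torch.from_dlpack(tokens)  # Option 3: export keeps the buffer alive
>>> del tokens
>>> int(owned[0]) == int(tensor[0])     # Both remain valid after deletion
True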

See Also

Tokenizer.encode : Creates TokenArrays from text.
Tokenizer.decode : Converts TokenArrays back to text.
TokenOffset : The offset type returned by the offsets property.

Quick Reference

Properties

Name Type
offsets list[TokenOffset]

Methods

Method Description
close() Release the underlying native buffer.
count() Return number of occurrences of value.
index() Return index of first occurrence of value.
tolist() Convert to a Python list.

Properties

offsets: list[TokenOffset]

Lazy byte offsets mapping tokens back to source text.

Each offset is a (start, end) pair of UTF-8 byte indices into the original text. The first access triggers computation via the Zig runtime; subsequent accesses return the cached result.

Special tokens (BOS, EOS, PAD) that don't correspond to source text are assigned (0, 0).

Returns

List of TokenOffset objects, one per token.

Raises
RuntimeError

If source text was not preserved during encoding, or if offset computation fails.

Example
>>> text = "Hello 🎉 world"
>>> tokens = tokenizer.encode(text)
>>> tokens.offsets
[(0, 5), (5, 10), ...]

Use the slice() helper to extract text (handles Unicode correctly):

>>> tokens.offsets[0].slice(text)
'Hello'
>>> tokens.offsets[-1].slice(text)
' world'

Methods

def close(self) -> None

Release the underlying native buffer.

After calling close(), the TokenArray cannot be used for data access, DLPack export, or NumPy conversion. Safe to call multiple times (idempotent).

def count(self, value: int) -> int

Return number of occurrences of value.

Parameters
value

Token ID to count.

Returns

Number of times value appears in the array.

Example
>>> tokens = tokenizer.encode("hello hello hello")
>>> tokens.count(9707)  # Count occurrences
3

def index(
    self,
    value: int,
    start: int = 0,
    stop: int | None = None
) -> int

Return index of first occurrence of value.

Parameters
value

Token ID to search for.

start

Start index for search (default 0).

stop

Stop index for search (default end of array).

Returns

Index of first occurrence of value.

Raises
ValueError

If value is not in the array.

Example
>>> tokens = tokenizer.encode("Hello world")
>>> tokens.index(9707)  # Find first occurrence
0

def tolist(self) -> list[int]

Convert to a Python list.

This copies the data into a new Python list. Use this when you need a regular list, or when you need to keep the data after the TokenArray is deleted.

Returns

A new list containing the token IDs.

Example
>>> tokens = tokenizer.encode("Hello world")
>>> token_list = tokens.tolist()
>>> print(token_list)
[9707, 1879]
>>> type(token_list)
<class 'list'>

class TokenArrayView

TokenArrayView(
    self,
    ids_ptr: Any,
    start: int,
    length: int,
    parent: BatchEncoding
)

Lightweight view into a BatchEncoding's contiguous buffer.

Zero-copy view that slices into the parent's memory. The parent BatchEncoding must remain alive while this view is used.

Users interact with this class when indexing into a BatchEncoding:

>>> batch = tokenizer.encode(["Hello", "World"])
>>> view = batch[0]  # Returns TokenArrayView
>>> list(view)
[9707]

Quick Reference

Methods

Method Description
count() Return number of occurrences of value.
index() Return index of first occurrence of value.
tolist() Convert to a Python list (copies data).

Methods

def count(self, value: int) -> int

Return number of occurrences of value.

Parameters
value

Token ID to count.

def index(
    self,
    value: int,
    start: int = 0,
    stop: int | None = None
) -> int

Return index of first occurrence of value.

Parameters
value

Token ID to search for.

start

Start index for search.

stop

Stop index for search, or None for end of view.

def tolist(self) -> list[int]

Convert to a Python list (copies data).


class TokenOffset

TokenOffset(
    self,
    start: int,
    end: int
)

Byte offset pair mapping a token to its position in source text.

Represents a (start, end) byte range in the original UTF-8 encoded text. Offsets are byte indices, not character indices, enabling O(1) slicing regardless of Unicode content.

Methods

slice(text, errors="strict") Extracts the text span for this token. This is the recommended way to use offsets as it handles byte-to-string conversion automatically.

Attributes
start : int

Start byte offset in source text (inclusive).

end : int

End byte offset in source text (exclusive).

Note
  • Offsets are UTF-8 byte indices, not character indices. Do NOT use them directly on Python strings (e.g., text[start:end]). Use the slice() method instead.
  • For byte-level BPE tokenizers (GPT-2, Qwen), individual tokens may split multi-byte UTF-8 sequences. Use errors="replace" in slice() to handle these gracefully.
  • Special tokens (BOS, EOS) that don't correspond to source text are assigned (0, 0).
Examples
Extract text using slice() method (recommended)
>>> text = "Hello 🎉 world"
    >>> tokens = tokenizer.encode(text)
    >>> tokens.offsets[-1].slice(text)
    ' world'

Handle byte-level BPE tokens that split UTF-8 sequences:

>>> tokens.offsets[1].slice(text, errors="replace")
' ��'  # Replacement chars for partial emoji bytes
Tuple unpacking for raw byte indices
>>> offset = TokenOffset(0, 5)
>>> start, end = offset
>>> print(start, end)
0 5
Comparing with tuples
>>> offset = TokenOffset(0, 5)
>>> offset == (0, 5)
True

Quick Reference

Methods

Method Description
slice() Extract the text span corresponding to this offset.

Methods

def slice(
    self,
    text: str | bytes,
    errors: str = 'strict'
) -> str

Extract the text span corresponding to this offset.

This is the recommended way to use offsets. It handles the conversion between byte offsets and Python string indexing automatically.

Parameters
text

The original source text (str or bytes).

errors

How to handle decode errors. Default "strict" raises UnicodeDecodeError. Use "replace" to substitute invalid bytes with the replacement character. This can happen with byte-level BPE tokenizers that split multi-byte UTF-8 sequences.

Returns

The substring corresponding to this token.

Raises
UnicodeDecodeError

If the byte span is not valid UTF-8 and errors="strict" (default). This can happen with byte-level BPE tokenizers (GPT-2, Qwen) that split multi-byte characters across tokens.

Example
>>> tokens = tokenizer.encode("Hello 🎉 world")
>>> tokens.offsets[-1].slice("Hello 🎉 world")
' world'

For byte-level BPE tokens that may split UTF-8 sequences:

>>> tokens.offsets[1].slice(text, errors="replace")
' \ufffd\ufffd'  # Replacement characters for partial bytes

class BatchEncoding

BatchEncoding(
    self,
    ids_ptr: Any = None,
    offsets_ptr: Any = None,
    total_tokens: int = 0,
    num_sequences: int = 0,
    padding_side: str = 'left',
    pad_token_id: int | None = None
)

Lazy container for batch tokenization results.

Holds all token IDs in a single contiguous memory block, with lazy creation of TokenArray views when accessing individual sequences.

Key Features

List-like interface - Works like a list of TokenArrays:

>>> batch = tokenizer.encode(["Hello", "World"])
>>> len(batch)
2
>>> batch[0]  # Returns lazy TokenArray view
TokenArray([...], len=...)

Lazy evaluation - TokenArray views are created on-demand:

>>> # Only creates view when accessed
>>> first = batch[0]

Dictionary-like access - Compatible with HuggingFace patterns:

>>> input_ids = torch.from_dlpack(batch["input_ids"])
>>> attention_mask = torch.from_dlpack(batch["attention_mask"])
ML Framework Integration

To convert to PyTorch or JAX, use the specific tensor accessors. Do NOT try to convert the batch object itself - it contains multiple tensors and will raise an InteropError.

Correct usage:

>>> import torch
>>> input_ids = torch.from_dlpack(batch.input_ids)
>>> attention_mask = torch.from_dlpack(batch.attention_mask)

Incorrect usage (raises InteropError):

>>> tensor = torch.from_dlpack(batch)  # Error: Ambiguous - which tensor?
Memory Management

BatchEncoding owns the underlying memory and frees it when garbage collected. The TokenArray views returned by indexing are lightweight wrappers that point into this shared memory - they become invalid if the BatchEncoding is deleted.
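
A sketch of the lifetime rule above; copy a sequence out with tolist() if it must outlive the batch:

>>> batch = tokenizer.encode(["Hello", "World"])
>>> view = batch[0]        # Lightweight view into the batch's buffer
>>> kept = view.tolist()   # Independent Python list
>>> del batch              # view must no longer be used; kept stays valid
>>> kept
[9707]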

Quick Reference

Properties

Name Type
attention_mask _DLPackAccessor
input_ids _DLPackAccessor
pad_token_id int | None
padding_side str
total_tokens int

Methods

Method Description
close() Release the native batch encoding memory.
keys() Return available tensor keys (dict-like interface).
lengths() Get the length of each sequence in the batch.
max_length() Get the maximum sequence length in the batch.
to_list() Convert batch to padded Python lists.

Properties

attention_mask: _DLPackAccessor

DLPack interface for the attention mask tensor.

Returns an accessor object that implements the DLPack protocol. Use with torch.from_dlpack() or np.from_dlpack() to get a 2D tensor of shape (num_sequences, padded_length) with values:
  • 1 for real tokens
  • 0 for padding tokens

NOTE: Each export allocates a new mask buffer computed from sequence lengths and padding configuration.

Example
>>> batch = tokenizer.encode(["Hello", "World!"])
>>> input_ids = torch.from_dlpack(batch.input_ids)
>>> attention_mask = torch.from_dlpack(batch.attention_mask)
>>> model(input_ids, attention_mask=attention_mask)

input_ids: _DLPackAccessor

DLPack interface for the input_ids tensor.

Returns an accessor object that implements the DLPack protocol. Use with torch.from_dlpack() or np.from_dlpack() to get a 2D tensor of shape (num_sequences, padded_length).

NOTE: Each export allocates a new padded buffer. The internal CSR storage is materialized to dense format for ML framework consumption.

Example
>>> batch = tokenizer.encode(["Hello", "World!"])
>>> input_ids = torch.from_dlpack(batch.input_ids)
>>> attention_mask = torch.from_dlpack(batch.attention_mask)
>>> model(input_ids, attention_mask=attention_mask)

pad_token_id: int | None

Padding token ID used for this batch.

padding_side: str

Side for padding: "left" (default for generation) or "right".

This property is inherited from the tokenizer when the batch is created. Set this property to override padding behavior before calling to_list() or using torch.from_dlpack().

Example
>>> batch = tokenizer.encode(["Hello", "World"])
>>> batch.padding_side
'left'
>>> batch.padding_side = "right"  # Override for this batch
>>> tensor = torch.from_dlpack(batch.input_ids)  # Uses right padding

total_tokens: int

Total number of tokens across all sequences.

Methods

def close(self) -> None

Release the native batch encoding memory.

After calling close(), the BatchEncoding cannot be used for data access, DLPack export, or iteration. Safe to call multiple times (idempotent).

def keys(self) -> list[str]

Return available tensor keys (dict-like interface).
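
Example

A minimal sketch; the keys correspond to the tensor accessors described above (the order shown is assumed):

>>> batch = tokenizer.encode(["Hello", "World"])
>>> batch.keys()
['input_ids', 'attention_mask']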

def lengths(self) -> list[int]

Get the length of each sequence in the batch.

def max_length(self) -> int

Get the maximum sequence length in the batch.
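
Example

A minimal sketch; the per-sequence token counts are illustrative:

>>> batch = tokenizer.encode(["Hello", "Hello world"])
>>> batch.lengths()
[1, 2]
>>> batch.max_length()
2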

def to_list(
    self,
    padding: bool = True,
    pad_id: int | None = None,
    padding_side: str | None = None,
    max_length: int | None = None,
    truncation: bool = False,
    return_attention_mask: bool = True
) -> dict[str, list[list[int]]]

Convert batch to padded Python lists.

This is primarily useful for debugging or when you need plain Python data structures. For ML workloads, use torch.from_dlpack(batch.input_ids) and torch.from_dlpack(batch.attention_mask) directly.

Parameters
padding

If True (default), pad shorter sequences.

pad_id

Token ID to use for padding. Defaults to stored pad_token_id from tokenizer, or 0 if not set.

padding_side

Where to add padding tokens. Defaults to the value passed to encode(), which defaults to tokenizer.padding_side.
  • "right": Pad at end (encoder models)
  • "left": Pad at start (decoder/generation models)

max_length

Maximum length to pad to. If None, uses longest sequence.

truncation

If True, truncate sequences longer than max_length.

return_attention_mask

If True (default), include attention_mask.

Returns

Dictionary with:
  • "input_ids": 2D list of padded token IDs
  • "attention_mask": 2D list of masks (1=real, 0=padding)

Raises
ValidationError

If padding_side is not 'left' or 'right'.
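
Example

A minimal sketch; the token IDs and pad ID are illustrative:

>>> batch = tokenizer.encode(["Hello", "Hello world"])
>>> out = batch.to_list(padding_side="right", pad_id=0)
>>> out["input_ids"]
[[9707, 0], [9707, 1879]]
>>> out["attention_mask"]
[[1, 0], [1, 1]]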