Tokenization Math, Parts 4–6: Unigram and SentencePiece through Information and Cost
4. Unigram and SentencePiece
Unigram and SentencePiece develops the tokenizer concepts needed before embeddings, attention, language-model probability, and serving tradeoffs can be understood correctly.
4.1 Unigram token probability model
Purpose. Unigram token probability model focuses on probabilistic segmentation with learned piece probabilities. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
A unigram model assigns each vocabulary piece an independent probability; a segmentation's probability is the product of its pieces' probabilities, and a string's likelihood sums over all valid segmentations.
Worked reading.
For abab with pieces a, b, and ab, the segmentation [ab, ab] beats [a, b, a, b] whenever 2 log p(ab) exceeds 2 log p(a) + 2 log p(b); the winning path then fixes the ids, the sequence length, and the loss targets downstream.
| Tokenizer object | Mathematical role | LLM consequence |
|---|---|---|
| alphabet | atomic input symbols | bytes, characters, or normalized symbols |
| vocabulary | finite token set | embedding and output-logit dimensions |
| encoder | maps text to ids | prompt length, training examples, costs |
| decoder | maps ids back to text | detokenization and round-trip safety |
| merge/probability table | segmentation rule | subword boundaries and rare-string handling |
Examples:
- BPE pieces.
- unigram pieces.
- special tokens.
Non-examples:
- raw text passed directly to attention.
- word counts used as token counts.
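The unigram assumption can be made concrete with a toy piece table. The probabilities below are invented for illustration; a real model learns them with EM:

```python
import math

# Illustrative piece probabilities; not taken from any real tokenizer.
probs = {"a": 0.2, "b": 0.2, "ab": 0.4, "aba": 0.1}

def seg_logprob(pieces, probs):
    """Log-probability of one segmentation under the unigram independence assumption."""
    return sum(math.log(probs[p]) for p in pieces)

# Two candidate segmentations of "abab":
coarse = seg_logprob(["ab", "ab"], probs)        # 2 * log 0.4
fine = seg_logprob(["a", "b", "a", "b"], probs)  # 4 * log 0.2
```

Because 2 log 0.4 is larger than 4 log 0.2, the coarser path wins; the encoder's job is to find that winner efficiently, which is what Viterbi segmentation does.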
Derivation habit.
- State the raw alphabet and any normalization step.
- State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
- Compute token count before making cost, context, or memory claims.
- Check reversibility with `decode(encode(x))` when the pipeline promises losslessness.
- Treat special tokens as protected control symbols, not ordinary text pieces.
Implementation lens.
A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.
The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.
For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.
For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.
4.2 Viterbi segmentation
Purpose. Viterbi segmentation focuses on dynamic programming for best token path. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Given token probabilities, Viterbi dynamic programming finds the highest-probability segmentation of a string.
Worked reading.
For abab, a unigram model compares [ab, ab], [a, b, ab], [aba, b], and other valid paths by summing log probabilities.
Examples:
- SentencePiece unigram decoding.
- best-path segmentation.
- subword sampling baseline.
Non-examples:
- frequency-only pair merging.
- a regex split with no scoring.
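Under a toy piece table, the best path for abab falls out of a short dynamic program. This is a minimal sketch with invented probabilities, not SentencePiece's optimized lattice code:

```python
import math

# Illustrative unigram vocabulary with piece probabilities.
vocab = {"a": 0.2, "b": 0.2, "ab": 0.4, "aba": 0.1}

def viterbi_segment(text, vocab):
    """Return the highest log-probability segmentation of text under a unigram model."""
    n = len(text)
    maxlen = max(len(p) for p in vocab)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob over segmentations of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the winning last piece
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - maxlen), i):
            piece = text[j:i]
            if piece in vocab:
                score = best[j] + math.log(vocab[piece])
                if score > best[i]:
                    best[i], back[i] = score, j
    pieces, i = [], n
    while i > 0:  # walk the back pointers to recover the path
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1], best[n]

pieces, logp = viterbi_segment("abab", vocab)  # pieces == ["ab", "ab"]
```

The double loop is O(n × maxlen), so the cost of exact best-path decoding is linear in the text for bounded piece length.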
4.3 Forward probabilities
Purpose. Forward probabilities focuses on summing over all segmentations. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Given piece probabilities, the forward recursion computes, for each prefix, the total probability of all of its segmentations; the value for the full string is the marginal likelihood used in unigram training.
Worked reading.
For abab, the forward sum credits [ab, ab], [a, b, ab], [aba, b], and every other valid path at once, where Viterbi keeps only the single best path.
Examples:
- BPE pieces.
- unigram pieces.
- special tokens.
Non-examples:
- raw text passed directly to attention.
- word counts used as token counts.
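The forward recursion replaces Viterbi's max with a sum. A minimal sketch over the same kind of toy vocabulary (values invented for illustration):

```python
# Illustrative unigram vocabulary with piece probabilities.
vocab = {"a": 0.2, "b": 0.2, "ab": 0.4, "aba": 0.1}

def forward_prob(text, vocab):
    """Sum of probabilities of ALL segmentations of text (the marginal likelihood)."""
    n = len(text)
    maxlen = max(len(p) for p in vocab)
    alpha = [0.0] * (n + 1)  # alpha[i]: total probability of segmentations of text[:i]
    alpha[0] = 1.0
    for i in range(1, n + 1):
        for j in range(max(0, i - maxlen), i):
            piece = text[j:i]
            if piece in vocab:
                alpha[i] += alpha[j] * vocab[piece]
    return alpha[n]

total = forward_prob("abab", vocab)
```

For "abab" this sums five valid paths (0.16 + 0.02 + 0.016 + 0.016 + 0.0016 = 0.2136). Real implementations run the same recursion in log space with logsumexp to avoid underflow on long strings.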
4.4 EM intuition
Purpose. EM intuition focuses on soft counts for token pieces. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
In the E-step, each possible occurrence of a piece receives a soft (fractional) count equal to the probability mass of the segmentations that use it, divided by the string's marginal likelihood; the M-step renormalizes these soft counts into updated piece probabilities.
Worked reading.
A piece that appears only on low-probability paths accumulates a small soft count and shrinks in the next iteration, which is how unigram training prunes a large seed vocabulary down to the target size.
Examples:
- BPE pieces.
- unigram pieces.
- special tokens.
Non-examples:
- raw text passed directly to attention.
- word counts used as token counts.
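Combining forward and backward sums gives the E-step's soft counts. This sketch uses the same illustrative toy vocabulary and omits the M-step renormalization:

```python
# Illustrative unigram vocabulary with piece probabilities.
vocab = {"a": 0.2, "b": 0.2, "ab": 0.4, "aba": 0.1}

def soft_counts(text, vocab):
    """E-step sketch: expected usage count of each piece over all segmentations."""
    n = len(text)
    maxlen = max(len(p) for p in vocab)
    alpha = [0.0] * (n + 1)  # alpha[i]: probability mass of segmentations of text[:i]
    alpha[0] = 1.0
    for i in range(1, n + 1):
        for j in range(max(0, i - maxlen), i):
            p = text[j:i]
            if p in vocab:
                alpha[i] += alpha[j] * vocab[p]
    beta = [0.0] * (n + 1)   # beta[i]: probability mass of segmentations of text[i:]
    beta[n] = 1.0
    for i in range(n - 1, -1, -1):
        for k in range(i + 1, min(n, i + maxlen) + 1):
            p = text[i:k]
            if p in vocab:
                beta[i] += vocab[p] * beta[k]
    z = alpha[n]  # marginal likelihood of the whole string
    counts = {p: 0.0 for p in vocab}
    for j in range(n):
        for k in range(j + 1, min(n, j + maxlen) + 1):
            p = text[j:k]
            if p in vocab:
                # Posterior probability that a path passes through this occurrence.
                counts[p] += alpha[j] * vocab[p] * beta[k] / z
    return counts

counts = soft_counts("abab", vocab)
```

Here counts["aba"] equals the posterior mass of the one path that uses it (0.02 / 0.2136), while counts["ab"] exceeds 1 because high-probability paths use it twice.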
4.5 Subword regularization
Purpose. Subword regularization focuses on sampling multiple valid segmentations. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Instead of always emitting the single best segmentation, subword regularization samples a segmentation in proportion to its probability during training, exposing the model to several consistent tokenizations of the same text.
Worked reading.
For abab, training sometimes sees [ab, ab] and sometimes [a, b, ab]; the variation acts as data augmentation over token boundaries and improves robustness to rare or noisy strings.
Examples:
- BPE pieces.
- unigram pieces.
- special tokens.
Non-examples:
- raw text passed directly to attention.
- word counts used as token counts.
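Sampling a segmentation in proportion to its probability needs only the forward array plus a backward random walk. A minimal sketch with the illustrative toy vocabulary and no temperature parameter:

```python
import random

# Illustrative unigram vocabulary with piece probabilities.
vocab = {"a": 0.2, "b": 0.2, "ab": 0.4, "aba": 0.1}

def sample_segmentation(text, vocab, rng):
    """Draw one segmentation with probability proportional to its unigram probability."""
    n = len(text)
    maxlen = max(len(p) for p in vocab)
    alpha = [0.0] * (n + 1)  # forward pass, as in the previous subsection
    alpha[0] = 1.0
    for i in range(1, n + 1):
        for j in range(max(0, i - maxlen), i):
            p = text[j:i]
            if p in vocab:
                alpha[i] += alpha[j] * vocab[p]
    pieces, i = [], n
    while i > 0:
        # Choose the last piece text[j:i] with probability alpha[j] * p(piece) / alpha[i].
        r = rng.random() * alpha[i]
        for j in range(max(0, i - maxlen), i):
            p = text[j:i]
            if p in vocab:
                r -= alpha[j] * vocab[p]
                if r <= 0:
                    pieces.append(p)
                    i = j
                    break
    return pieces[::-1]

rng = random.Random(0)
samples = [tuple(sample_segmentation("abab", vocab, rng)) for _ in range(2000)]
```

With these numbers ["ab", "ab"] has posterior 0.16 / 0.2136, roughly 75%, so it should dominate the samples while the other paths still appear occasionally.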
5. WordPiece
WordPiece develops the tokenizer concepts needed before embeddings, attention, language-model probability, and serving tradeoffs can be understood correctly.
5.1 WordPiece merge score
Purpose. WordPiece merge score focuses on association-style pair scoring. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
WordPiece builds subwords using an association-style score and commonly distinguishes word starts from continuation pieces.
Worked reading.
A longest-match encoder chooses the longest valid vocabulary piece at each position, using continuation markers inside words.
Examples:
- BERT tokenization.
- continuation pieces.
- greedy longest match.
Non-examples:
- byte fallback.
- unigram path sampling.
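The scoring difference shows up on a tiny invented corpus. This contrasts raw pair counts (BPE-style) with an association-style score; it is a sketch of the criterion, not any library's exact training code:

```python
from collections import Counter

# Each inner list is one word as a sequence of current symbols (data invented).
corpus = [["t", "h"]] * 12 + [["q", "u"]] * 5 + [["t", "a"]] * 10

unigram, pair = Counter(), Counter()
for seq in corpus:
    unigram.update(seq)
    pair.update(zip(seq, seq[1:]))

# BPE-style: merge the most frequent adjacent pair.
bpe_pick = max(pair, key=pair.get)

# WordPiece-style: score count(ab) / (count(a) * count(b)),
# favoring pairs whose parts rarely occur apart.
wp_score = {p: c / (unigram[p[0]] * unigram[p[1]]) for p, c in pair.items()}
wp_pick = max(wp_score, key=wp_score.get)
```

Here BPE picks ("t", "h") because it is most frequent, while the association score picks ("q", "u") because u almost never appears without q; the two criteria produce different vocabularies from the same data.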
5.2 Continuation markers
Purpose. Continuation markers focuses on why prefixes and inside-word pieces differ. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
WordPiece distinguishes word-initial pieces from continuation pieces, conventionally marked with a ## prefix, so the same surface string maps to different ids depending on where it sits inside a word.
Worked reading.
The pieces able and ##able are distinct vocabulary entries with distinct embeddings; the marker is also what lets the decoder rejoin pieces without inserting spurious spaces.
Examples:
- BPE pieces.
- unigram pieces.
- special tokens.
Non-examples:
- raw text passed directly to attention.
- word counts used as token counts.
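A minimal decoder sketch shows why the marker matters, assuming the common ## convention:

```python
def wordpiece_decode(pieces):
    """Rejoin pieces: '##' marks continuation, anything else starts a new word."""
    words, current = [], ""
    for p in pieces:
        if p.startswith("##"):
            current += p[2:]       # glue onto the word in progress
        else:
            if current:
                words.append(current)
            current = p            # start a new word
    if current:
        words.append(current)
    return " ".join(words)

text = wordpiece_decode(["un", "##aff", "##able", "play", "##ing"])
# text == "unaffable playing"
```

Without the marker, the decoder could not tell whether two adjacent pieces belong to one word or two, so round-trip decoding would be ambiguous.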
5.3 Greedy longest-match encoding
Purpose. Greedy longest-match encoding focuses on BERT-style deterministic segmentation. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Given a fixed vocabulary, the encoder scans each word left to right and greedily takes the longest vocabulary piece that matches at the current position, falling back to an unknown token when nothing matches.
Worked reading.
The procedure is deterministic, so the same word always yields the same ids, but a single uncovered character can sink a whole word to [UNK] unless the vocabulary guarantees character-level coverage.
Examples:
- BPE pieces.
- unigram pieces.
- special tokens.
Non-examples:
- raw text passed directly to attention.
- word counts used as token counts.
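A minimal sketch of the longest-match loop, assuming the ## continuation convention and a small hypothetical vocabulary:

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    """Greedy longest-match: take the longest piece at each position, or fail to UNK."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:         # shrink the candidate until it is in the vocabulary
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # continuation pieces carry the marker
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return [unk]           # one uncovered span sinks the whole word
        pieces.append(piece)
        start = end
    return pieces

vocab = {"un", "##aff", "##able", "aff"}
tokens = wordpiece_encode("unaffable", vocab)  # ["un", "##aff", "##able"]
```

Note that aff (word-initial) is never considered mid-word; only ##aff matches there, which is exactly the position sensitivity the previous subsection describes.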
5.4 WordPiece versus BPE
Purpose. WordPiece versus BPE focuses on score criterion and encoding behavior. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
WordPiece builds subwords using an association-style score and commonly distinguishes word starts from continuation pieces.
Worked reading.
A longest-match encoder chooses the longest valid vocabulary piece at each position, using continuation markers inside words.
Examples:
- BERT tokenization.
- continuation pieces.
- greedy longest match.
Non-examples:
- byte fallback.
- unigram path sampling.
5.5 Unknown-token risk
Purpose. Unknown-token risk focuses on why byte fallback changes robustness. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Without byte fallback, any character not covered by the vocabulary collapses to an unknown token, which destroys information irreversibly; byte fallback reserves 256 byte pieces so that every string has some encoding.
Worked reading.
An emoji or rare-script character costs a few byte tokens instead of one [UNK], trading a slightly longer sequence for exact round-trip decoding.
Examples:
- BPE pieces.
- unigram pieces.
- special tokens.
Non-examples:
- raw text passed directly to attention.
- word counts used as token counts.
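A sketch of byte fallback with a deliberately tiny character vocabulary; the <0xNN> piece names follow the SentencePiece convention, and the two-character alphabet is invented for illustration:

```python
def build_vocab(chars):
    """Character pieces plus 256 reserved byte pieces like <0x41>."""
    vocab = {c: i for i, c in enumerate(chars)}
    for b in range(256):
        vocab[f"<0x{b:02X}>"] = len(vocab)
    return vocab

def encode_with_byte_fallback(text, vocab):
    ids = []
    for ch in text:
        if ch in vocab:
            ids.append(vocab[ch])
        else:
            # Unknown character: fall back to its UTF-8 bytes, never to [UNK].
            ids.extend(vocab[f"<0x{b:02X}>"] for b in ch.encode("utf-8"))
    return ids

vocab = build_vocab("ab")
ids = encode_with_byte_fallback("a☃", vocab)
# "a" is one id; the snowman (U+2603) is not in the vocabulary,
# so it becomes three byte ids, one per UTF-8 byte.
```

The cost is visible in the lengths: one covered character is one token, while the uncovered one is three, but nothing is lost and decoding can reconstruct the exact input bytes.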
6. Information and Cost
Information and Cost develops the tokenizer concepts needed before embeddings, attention, language-model probability, and serving tradeoffs can be understood correctly.
6.1 Compression ratio
Purpose. Compression ratio focuses on characters per token and bytes per token. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Tokenization is compression under constraints: it trades vocabulary size against sequence length and distribution balance.
Worked reading.
A lower token count improves context efficiency, but a huge vocabulary increases embedding and output-layer cost.
Examples:
- characters per token.
- tokens per word.
- entropy of token frequencies.
Non-examples:
- judging cost by words alone.
- ignoring sequence-length effects in attention.
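Both ratios are one-liners once the token sequence is in hand; the tokenization below is a hand-written stand-in, not a real tokenizer's output:

```python
def compression_stats(text, tokens):
    """Characters per token and UTF-8 bytes per token for one tokenization."""
    return {
        "chars_per_token": len(text) / len(tokens),
        "bytes_per_token": len(text.encode("utf-8")) / len(tokens),
    }

stats = compression_stats("hello world", ["hello", " world"])
# chars_per_token == 5.5; for ASCII text the byte ratio is identical,
# but non-ASCII text pushes bytes_per_token above chars_per_token.
```

Tracking both ratios matters for multilingual corpora: a tokenizer can look efficient in characters while spending many tokens per byte on scripts it covers poorly.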
6.2 Token entropy
Purpose. Token entropy focuses on distributional balance of token ids. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Tokenization is compression under constraints: it trades vocabulary size against sequence length and distribution balance.
Worked reading.
A lower token count improves context efficiency, but a huge vocabulary increases embedding and output-layer cost.
Examples:
- characters per token.
- tokens per word.
- entropy of token frequencies.
Non-examples:
- judging cost by words alone.
- ignoring sequence-length effects in attention.
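The entropy of the empirical id distribution summarizes that balance; the id sequences here are toy inputs:

```python
import math
from collections import Counter

def token_entropy_bits(ids):
    """Shannon entropy (bits) of the empirical token-id distribution."""
    counts = Counter(ids)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

balanced = token_entropy_bits([1, 2, 3, 4])            # uniform over 4 ids: 2.0 bits
skewed = token_entropy_bits([1, 1, 1, 1, 1, 1, 2, 2])  # mass concentrated on id 1
```

A vocabulary whose token frequencies are highly skewed wastes embedding rows on pieces the model almost never sees, which is one signal used when choosing vocabulary size.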
6.3 Vocabulary parameter cost
Purpose. Vocabulary parameter cost focuses on embedding and output matrix scaling. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
The vocabulary is a finite set of pieces with stable integer ids. Neural embeddings make those ids trainable vectors.
Worked reading.
With vocabulary size V and model width d, the input embedding table has V × d parameters, and an untied output projection adds another d × V.
Examples:
- embedding lookup.
- output softmax.
- reserved special ids.
Non-examples:
- renumbering tokens after training.
- adding pieces without resizing embeddings.
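The scaling is simple enough to check by hand; the sizes below are illustrative, not any specific model's:

```python
def vocab_param_count(vocab_size, d_model, tied=True):
    """Parameters spent on the vocabulary: embedding table plus output projection."""
    embedding = vocab_size * d_model
    # Tying shares one matrix between input embedding and output head;
    # an untied head doubles the vocabulary's parameter cost.
    output = 0 if tied else vocab_size * d_model
    return embedding + output

small = vocab_param_count(32_000, 4_096)               # 131,072,000 parameters
large = vocab_param_count(128_000, 4_096, tied=False)  # 1,048,576,000 parameters
```

Quadrupling the vocabulary and untying the head turns roughly 131M vocabulary parameters into over a billion, which is why vocabulary growth is a real budget decision rather than a free way to shorten sequences.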
Derivation habit.
- State the raw alphabet and any normalization step.
- State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
- Compute token count before making cost, context, or memory claims.
- Check reversibility with
decode(encode(x))when the pipeline promises losslessness. - Treat special tokens as protected control symbols, not ordinary text pieces.
Implementation lens.
A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.
The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.
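The debugging habit above can be sketched with any tokenizer that exposes encode and decode. The toy byte-level tokenizer here is a stand-in for a real one, chosen only so the sketch stays self-contained; swap in a trained BPE or unigram model to inspect real segmentations.

```python
# Toy byte-level tokenizer: ids are raw byte values. A real
# tokenizer (BPE, unigram) would replace these two functions.
def encode(text: str) -> list[int]:
    return list(text.encode("utf-8"))

def decode(ids: list[int]) -> str:
    return bytes(ids).decode("utf-8", errors="replace")

edge_cases = [
    " leading space",
    "two\n\nnewlines",
    "3.14159",
    "https://example.com/a?b=c",
    "mixed 文字 scripts",
]

for text in edge_cases:
    ids = encode(text)
    round_trip = decode(ids)
    # Print everything the debugging habit asks for; a real check
    # would also print per-token character offsets.
    print(repr(text), len(ids), ids[:8], repr(round_trip),
          "lossless" if round_trip == text else "LOSSY")
```

Running this on every edge case before training catches round-trip bugs that otherwise surface only as silent label corruption.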
For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.
For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.
6.4 Context window efficiency
Purpose. Context window efficiency focuses on how token count changes usable context. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Tokenization is compression under constraints: it trades vocabulary size against sequence length and distribution balance.
Worked reading.
A lower token count improves context efficiency, but a huge vocabulary increases embedding and output-layer cost.
| Tokenizer object | Mathematical role | LLM consequence |
|---|---|---|
| alphabet | atomic input symbols | bytes, characters, or normalized symbols |
| vocabulary | finite token set | embedding and output-logit dimensions |
| encoder | maps text to ids | prompt length, training examples, costs |
| decoder | maps ids back to text | detokenization and round-trip safety |
| merge/probability table | segmentation rule | subword boundaries and rare-string handling |
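The three example metrics below can be computed directly from a tokenized corpus. A minimal sketch, assuming whitespace word splitting and a toy greedy longest-match tokenizer with a hypothetical `VOCAB` in place of a trained one:

```python
import math
from collections import Counter

# Tiny illustrative vocabulary; a trained tokenizer would have tens
# of thousands of pieces. Unknown characters fall back to themselves.
VOCAB = {"the", "cat", "sat", "mat", "un", "token", "iz", "able"}

def tokenize(word: str) -> list[str]:
    """Greedy longest-match segmentation of one word against VOCAB,
    with single-character fallback (a stand-in for byte fallback)."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

def context_metrics(text: str) -> dict[str, float]:
    words = text.split()
    tokens = [p for w in words for p in tokenize(w)]
    counts = Counter(tokens)
    n = len(tokens)
    # Shannon entropy of the token frequency distribution, in bits.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "chars_per_token": sum(len(t) for t in tokens) / n,
        "tokens_per_word": n / len(words),
        "token_entropy_bits": entropy,
    }

print(context_metrics("the cat sat on the untokenizable mat"))
```

Higher characters per token means the same text fits in fewer context positions; tokens per word above 1 shows how often words are split into subword pieces.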
Examples:
- characters per token.
- tokens per word.
- entropy of token frequencies.
Non-examples:
- judging cost by words alone.
- ignoring sequence-length effects in attention.
Derivation habit.
- State the raw alphabet and any normalization step.
- State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
- Compute token count before making cost, context, or memory claims.
- Check reversibility with `decode(encode(x))` when the pipeline promises losslessness.
- Treat special tokens as protected control symbols, not ordinary text pieces.
Implementation lens.
A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.
The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.
For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.
For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.
6.5 Multilingual fertility
Purpose. Multilingual fertility focuses on tokens per word across languages or scripts. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Fertility is the average number of tokens the tokenizer produces per word (or per character) in a given language or script; it measures how well the vocabulary covers that language.
Worked reading.
A vocabulary trained mostly on one language keeps fertility low there but drives it up on underrepresented scripts, so the same content consumes more context positions and costs more to serve in those languages.
| Tokenizer object | Mathematical role | LLM consequence |
|---|---|---|
| alphabet | atomic input symbols | bytes, characters, or normalized symbols |
| vocabulary | finite token set | embedding and output-logit dimensions |
| encoder | maps text to ids | prompt length, training examples, costs |
| decoder | maps ids back to text | detokenization and round-trip safety |
| merge/probability table | segmentation rule | subword boundaries and rare-string handling |
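Fertility can be measured by averaging tokens per word over a sample of each language. A minimal sketch, using a hypothetical `count_tokens` stand-in that charges roughly one token per four characters for ASCII text and one token per character otherwise, loosely mimicking a vocabulary trained on English-heavy data:

```python
def count_tokens(word: str) -> int:
    """Hypothetical stand-in for a tokenizer's token count: ASCII
    words compress well (~4 chars per token), other scripts fall
    back to roughly one token per character."""
    if all(ord(ch) < 128 for ch in word):
        return -(-len(word) // 4)   # ceil(len / 4)
    return len(word)

def fertility(sentence: str) -> float:
    """Average tokens per whitespace-separated word."""
    words = sentence.split()
    return sum(count_tokens(w) for w in words) / len(words)

samples = {
    "english": "tokenizers trade vocabulary size for sequence length",
    "greek": "οι λέξεις κοστίζουν περισσότερα",
}
for lang, text in samples.items():
    print(lang, round(fertility(text), 2))
```

With a real tokenizer, the same comparison run over parallel corpora exposes which languages pay a context and serving-cost penalty under the current vocabulary.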
Examples:
- characters per token.
- tokens per word.
- entropy of token frequencies.
Non-examples:
- judging cost by words alone.
- ignoring sequence-length effects in attention.
Derivation habit.
- State the raw alphabet and any normalization step.
- State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
- Compute token count before making cost, context, or memory claims.
- Check reversibility with `decode(encode(x))` when the pipeline promises losslessness.
- Treat special tokens as protected control symbols, not ordinary text pieces.
Implementation lens.
A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.
The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.
For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.
For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.