xlm.utils.text
Text utility functions for xlm.
remove_trailing_pads(text, tokenizer, tokens_to_remove=None)
Remove trailing special tokens from decoded text.
For each token, strips either a spaced suffix (" {token}") or the bare
token, matching the original pad-only behavior. Repeats until no listed
suffix matches the end of the string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Decoded text string that may contain trailing special tokens | required |
| tokenizer | Tokenizer | Tokenizer instance (used for default | required |
| tokens_to_remove | Optional[List[str]] | Strings to strip from the end; defaults to | None |
Returns:
| Type | Description |
|---|---|
| str | Text with trailing occurrences of those tokens removed |
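
The suffix rule described for remove_trailing_pads can be illustrated with a minimal sketch. This is not the xlm implementation: the helper name strip_trailing_tokens is hypothetical, it takes the token list explicitly instead of a tokenizer (the default token set is not shown in this reference), and the "<pad>" and "<eos>" strings are assumed example tokens.

```python
from typing import List


def strip_trailing_tokens(text: str, tokens: List[str]) -> str:
    """Hypothetical sketch: repeatedly strip listed tokens from the end of text."""
    stripped = True
    while stripped:  # repeat until no listed suffix matches the end
        stripped = False
        for token in tokens:
            if not token:
                continue  # an empty token would match forever
            # Try the spaced form first, then the bare token.
            for suffix in (" " + token, token):
                if text.endswith(suffix):
                    text = text[: -len(suffix)]
                    stripped = True
                    break
    return text


# "<pad>" and "<eos>" are assumed token strings for illustration only.
cleaned = strip_trailing_tokens("hello world <pad> <pad><eos>", ["<pad>", "<eos>"])
```

Trying the spaced form before the bare token keeps the sketch from leaving a dangling space when the token was preceded by one.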
remove_trailing_pads_show_the_count(text, tokenizer, tokens_to_remove=None)
Remove trailing special tokens from decoded text and show the count of the removed tokens.
Uses the same suffix rules as remove_trailing_pads. Each successful strip from the end of the string increments the count. If any strips occurred, appends " [removed N]" to the result, where N is the count.
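
The counting behavior can be sketched the same way. Again, this is an illustration under assumptions, not the library's code: strip_and_count is a hypothetical name, the token list is passed explicitly instead of being derived from a tokenizer, and "<pad>" is an assumed token string.

```python
from typing import List


def strip_and_count(text: str, tokens: List[str]) -> str:
    """Hypothetical sketch: strip trailing tokens and report how many were removed."""
    count = 0
    stripped = True
    while stripped:
        stripped = False
        for token in tokens:
            if not token:
                continue  # an empty token would match forever
            # Same suffix rule: spaced form first, then the bare token.
            for suffix in (" " + token, token):
                if text.endswith(suffix):
                    text = text[: -len(suffix)]
                    count += 1  # one increment per successful end strip
                    stripped = True
                    break
    if count:
        text += f" [removed {count}]"
    return text


# "<pad>" is an assumed token string for illustration only.
with_count = strip_and_count("hi <pad> <pad>", ["<pad>"])
untouched = strip_and_count("hi", ["<pad>"])
```

In this sketch, text with no trailing tokens is returned unchanged, with no count marker appended.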