xlm.utils.text

Text utility functions for xlm.

remove_trailing_pads(text, tokenizer, tokens_to_remove=None)

Remove trailing special tokens from decoded text.

For each token, strips either a spaced suffix (" {token}") or the bare token, matching the original pad-only behavior. Repeats until no listed suffix matches the end of the string.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| text | str | Decoded text string that may contain trailing special tokens | required |
| tokenizer | Tokenizer | Tokenizer instance (used for default pad_token) | required |
| tokens_to_remove | Optional[List[str]] | Strings to strip from the end; defaults to pad_token | None |

Returns:

| Type | Description |
| --- | --- |
| str | Text with trailing occurrences of those tokens removed |
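
The suffix rule described above can be illustrated with a minimal sketch. This is not the xlm implementation: the function name remove_trailing_pads_sketch is hypothetical, the tokenizer is a stand-in object that only exposes pad_token, and both the order in which the spaced and bare suffixes are tried and the skipping of empty tokens are assumptions.

```python
from types import SimpleNamespace


def remove_trailing_pads_sketch(text, tokenizer, tokens_to_remove=None):
    """Hypothetical re-creation of the documented behavior, not the library code."""
    # Default to the tokenizer's pad token when no explicit list is given.
    tokens = tokens_to_remove if tokens_to_remove is not None else [tokenizer.pad_token]
    # Assumption: skip empty or None tokens so the loop always terminates.
    tokens = [t for t in tokens if t]

    stripped = True
    while stripped:  # repeat until no listed suffix matches the end
        stripped = False
        for token in tokens:
            # Assumption: try the spaced form first, then the bare token.
            for suffix in (" " + token, token):
                if text.endswith(suffix):
                    text = text[: -len(suffix)]
                    stripped = True
                    break
    return text


# Stand-in tokenizer (assumption): only the pad_token attribute is needed here.
tok = SimpleNamespace(pad_token="<pad>")
print(remove_trailing_pads_sketch("hello world <pad> <pad><pad>", tok))
```

Passing tokens_to_remove=["</s>", "<pad>"] to the sketch would strip a mixed tail of both tokens, since the outer loop keeps going as long as any listed token matches.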

remove_trailing_pads_show_the_count(text, tokenizer, tokens_to_remove=None)

Remove trailing special tokens from decoded text and show the count of the removed tokens.

Uses the same suffix rules as remove_trailing_pads. Each successful strip from the end of the string increments the count. If any strips occurred, appends " [removed N]" to the result, where N is the count.
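
As a rough illustration of the counting behavior, the sketch below re-creates the described rules in a self-contained form. It is not the xlm implementation: the function name is hypothetical, the tokenizer is a stand-in exposing only pad_token, and the suffix order and the handling of empty tokens are assumptions.

```python
from types import SimpleNamespace


def remove_trailing_pads_show_count_sketch(text, tokenizer, tokens_to_remove=None):
    """Hypothetical re-creation of the documented behavior, not the library code."""
    tokens = tokens_to_remove if tokens_to_remove is not None else [tokenizer.pad_token]
    # Assumption: skip empty or None tokens so the loop always terminates.
    tokens = [t for t in tokens if t]

    count = 0
    stripped = True
    while stripped:
        stripped = False
        for token in tokens:
            # Assumption: try the spaced form first, then the bare token.
            for suffix in (" " + token, token):
                if text.endswith(suffix):
                    text = text[: -len(suffix)]
                    count += 1  # one increment per successful strip
                    stripped = True
                    break
    # Only annotate the result when something was actually removed.
    if count:
        text += f" [removed {count}]"
    return text


# Stand-in tokenizer (assumption): only the pad_token attribute is needed here.
tok = SimpleNamespace(pad_token="<pad>")
print(remove_trailing_pads_show_count_sketch("hello <pad> <pad>", tok))
```

Text with no trailing special tokens comes back unchanged in this sketch, with no " [removed N]" marker appended.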