`xlm.tasks.molgen`

Molecule generation task utilities and metrics.

This module provides:

- Data preprocessing for SAFE molecular representations
- Conversion utilities between SAFE and SMILES formats
- Comprehensive metrics for evaluating molecular generation (diversity, QED, SA, validity, uniqueness)

`SerializableSAFETokenizer`

Wrapper around SAFE tokenizer that handles pickling/deepcopy.

The underlying tokenizer from the safe library has a custom PreTokenizer that cannot be serialized. This wrapper works around that by storing only the model path when pickled and re-instantiating the tokenizer on unpickle.

`__getstate__()`

Return the state for pickling. Only the model path is stored.

`__setstate__(state)`

Restore from the pickled state by re-instantiating the tokenizer.

`__getattr__(name)`

Delegate any attribute lookup that is not resolved on the wrapper itself to the underlying tokenizer.

`__dir__()`

Include both wrapper and tokenizer attributes in dir().
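The four methods above follow a common pattern for wrapping an unpicklable object. The sketch below illustrates that pattern only; it is not the module's implementation. `FakeTokenizer`, `SerializableWrapper`, and the `vocab_size` attribute are hypothetical stand-ins, and the real class re-instantiates the SAFE tokenizer from its model path instead.

```python
import copy
import pickle


class FakeTokenizer:
    """Hypothetical stand-in for a tokenizer that cannot be pickled."""

    def __init__(self, model_path):
        self.model_path = model_path
        self.vocab_size = 42  # placeholder attribute for the demo

    def __reduce__(self):
        raise TypeError("FakeTokenizer cannot be pickled")


class SerializableWrapper:
    """Hypothetical wrapper: pickles the model path, not the tokenizer."""

    def __init__(self, model_path):
        self._model_path = model_path
        self._tokenizer = FakeTokenizer(model_path)

    def __getstate__(self):
        # Store only what is needed to rebuild the tokenizer.
        return {"model_path": self._model_path}

    def __setstate__(self, state):
        # Rebuild the tokenizer from the stored path.
        self._model_path = state["model_path"]
        self._tokenizer = FakeTokenizer(self._model_path)

    def __getattr__(self, name):
        # Called only when normal lookup fails. Private names are not
        # delegated, so a half-built instance cannot recurse on _tokenizer.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._tokenizer, name)

    def __dir__(self):
        # Merge wrapper attributes with those of the wrapped tokenizer.
        return sorted(set(super().__dir__()) | set(dir(self._tokenizer)))


wrapper = SerializableWrapper("some/model/path")
clone = pickle.loads(pickle.dumps(wrapper))
deep = copy.deepcopy(wrapper)
```

The guard on underscore-prefixed names in `__getattr__` matters during unpickling and deepcopy, when the instance exists but `_tokenizer` has not been set yet.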

`DeNovoEval`

Post-hoc evaluator for de novo molecule generation.

Computes molecular properties on logged predictions at epoch end, matching GenMol's evaluation semantics. It computes:

- Per-sample: QED, SA, SMILES (added to each prediction dict)
- Global: Diversity, Validity, Uniqueness (aggregated across all samples)

This approach enables:

- Global metric computation (diversity/uniqueness on the full generated set)
- Exact match with GenMol's evaluation methodology
- Reusable components for other tasks (frag, lead, pmo)

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `use_bracket_safe` | `bool` | If True, decode from bracket SAFE format | `False` |
| `compute_diversity` | `bool` | If True, compute diversity metric | `True` |
| `compute_validity` | `bool` | If True, compute validity metric | `True` |
| `compute_uniqueness` | `bool` | If True, compute uniqueness metric | `True` |
| `compute_qed` | `bool` | If True, compute QED scores | `True` |
| `compute_sa` | `bool` | If True, compute SA scores | `True` |

`oracle_qed` `property`

Lazy load QED oracle.

`oracle_sa` `property`

Lazy load SA oracle.

`evaluator_diversity` `property`

Lazy load diversity evaluator.

`evaluator_validity` `property`

Lazy load validity evaluator.

`evaluator_uniqueness` `property`

Lazy load uniqueness evaluator.
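The oracle and evaluator properties above are described as lazily loaded. A minimal sketch of that pattern follows, assuming a cached private attribute. `LazyEvaluatorSketch`, `_build_qed_oracle`, and `load_count` are hypothetical names, and the placeholder oracle does not compute a real QED score.

```python
class LazyEvaluatorSketch:
    """Hypothetical illustration of a lazily constructed oracle."""

    def __init__(self):
        self._oracle_qed = None  # nothing is loaded at construction time
        self.load_count = 0      # demo counter, not part of the real class

    @property
    def oracle_qed(self):
        # Build the oracle on first access, then reuse the cached object.
        if self._oracle_qed is None:
            self._oracle_qed = self._build_qed_oracle()
        return self._oracle_qed

    def _build_qed_oracle(self):
        # Placeholder for an expensive import or model load.
        self.load_count += 1
        return lambda smiles: 0.0


evaluator = LazyEvaluatorSketch()
first = evaluator.oracle_qed
second = evaluator.oracle_qed
```

With this pattern, disabling a metric means its dependency is never constructed, because the property is never accessed.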

`eval(predictions, tokenizer=None)`

Evaluate predictions and return updated predictions + aggregated metrics.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `predictions` | `List[Dict[str, Any]]` | List of prediction dicts with 'text' field containing SAFE strings | required |
| `tokenizer` | `Any` | Optional tokenizer (not used for denovo, but kept for interface consistency) | `None` |

Returns:

| Type | Description |
| --- | --- |
| `Tuple[List[Dict[str, Any]], Dict[str, Any]]` | Tuple of `(predictions, aggregated_metrics)`. `predictions`: updated list with per-sample metrics added (smiles, qed, sa). `aggregated_metrics`: dict of global metric values |
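To make the split between per-sample and global metrics concrete, here is a rough sketch of how validity and uniqueness could be aggregated from predictions that already carry a `smiles` field. The definitions used here (validity as the fraction of non-empty SMILES, uniqueness as distinct valid SMILES over valid SMILES) are assumptions for illustration; the evaluators used by `DeNovoEval` may define them differently. `aggregate_sketch` is a hypothetical helper.

```python
from typing import Any, Dict, List


def aggregate_sketch(predictions: List[Dict[str, Any]]) -> Dict[str, float]:
    """Hypothetical aggregation under the assumed metric definitions."""
    smiles = [p.get("smiles") for p in predictions]
    valid = [s for s in smiles if s]  # drop None and empty strings

    # Assumed definition: share of predictions that decoded to a SMILES.
    validity = len(valid) / len(smiles) if smiles else 0.0
    # Assumed definition: share of valid SMILES that are distinct.
    uniqueness = len(set(valid)) / len(valid) if valid else 0.0
    return {"validity": validity, "uniqueness": uniqueness}


demo_predictions = [
    {"text": "a", "smiles": "CCO"},
    {"text": "b", "smiles": "CCO"},
    {"text": "c", "smiles": None},
    {"text": "d", "smiles": "c1ccccc1"},
]
demo_metrics = aggregate_sketch(demo_predictions)
```

Both values depend on the whole generated set, which is why they are computed once at epoch end and not per batch.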

`FragmentEval`

Bases: DeNovoEval

Post-hoc evaluator for fragment-constrained molecule generation.

Extends DeNovoEval with fragment-specific metrics:

- All de novo metrics (validity, uniqueness, quality, QED, SA, diversity)
- Distance: Tanimoto distance between generated and target molecules

Based on GenMol's fragment evaluation methodology. Computes:

- Per-sample: QED, SA, SMILES, distance (if target available)
- Global: Diversity, Validity, Uniqueness, Quality, Distance (mean)

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `use_bracket_safe` | `bool` | If True, decode from bracket SAFE format | `False` |
| `compute_diversity` | `bool` | If True, compute diversity metric | `True` |
| `compute_validity` | `bool` | If True, compute validity metric | `True` |
| `compute_uniqueness` | `bool` | If True, compute uniqueness metric | `True` |
| `compute_qed` | `bool` | If True, compute QED scores | `True` |
| `compute_sa` | `bool` | If True, compute SA scores | `True` |
| `compute_distance` | `bool` | If True, compute Tanimoto distance to target | `True` |
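For reference, Tanimoto distance is one minus the Tanimoto (Jaccard) similarity of two fingerprints. The sketch below works on plain Python sets of "on" bit indices so that it runs without a chemistry toolkit. The fingerprint type used by `FragmentEval` is not stated here, so the example inputs are arbitrary, and `tanimoto_distance_sketch` is a hypothetical helper.

```python
from typing import Set


def tanimoto_distance_sketch(bits_a: Set[int], bits_b: Set[int]) -> float:
    """1 - |A & B| / |A | B| for two sets of fingerprint bit indices."""
    union = bits_a | bits_b
    if not union:
        # Two empty fingerprints are treated as identical (a convention
        # chosen for this sketch).
        return 0.0
    return 1.0 - len(bits_a & bits_b) / len(union)


generated_bits = {1, 2, 3}
target_bits = {2, 3, 4}
distance = tanimoto_distance_sketch(generated_bits, target_bits)
```

A distance of 0 means identical fingerprints and 1 means no shared bits. The global Distance metric is described above as the mean of the per-sample values.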

`eval(predictions, tokenizer=None)`

Evaluate fragment generation predictions.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `predictions` | `List[Dict[str, Any]]` | List of prediction dicts with: 'text' (generated SAFE string, full molecule); 'truth' (target SAFE string, full molecule, optional); 'raw_input' (fragment prompt SAFE string, optional) | required |
| `tokenizer` | `Any` | Optional tokenizer (not used, kept for interface consistency) | `None` |

Returns:

| Type | Description |
| --- | --- |
| `Tuple[List[Dict[str, Any]], Dict[str, Any]]` | Tuple of `(predictions, aggregated_metrics)`. `predictions`: updated list with per-sample metrics added. `aggregated_metrics`: dict of global metric values |

`safe2bracketsafe(safe_str)`

Convert standard SAFE notation to bracket SAFE format.

Bracket SAFE wraps interfragment attachment points in angle brackets. Example: "1" -> "<1>", "%10" -> "<%10>"

Based on genmol/src/genmol/utils/bracket_safe_converter.py:133-137

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `safe_str` | `str` | SAFE string in standard notation | required |

Returns:

| Type | Description |
| --- | --- |
| `str` | SAFE string in bracket notation, or original string if conversion fails |
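The idea can be illustrated with a simplified sketch. It assumes that a ring-closure label appearing exactly once within a dot-separated fragment is an interfragment attachment point, and that labels are not reused within a fragment. This is an illustration of the notation only, not the GenMol converter; `to_bracket_safe_sketch` is a hypothetical name.

```python
import re
from collections import Counter

# Bracket atoms, two-digit labels (%NN), single-digit labels, or any other char.
_TOKEN = re.compile(r"\[[^\]]*\]|%\d{2}|\d|.")
_LABEL = re.compile(r"%\d{2}|\d")


def to_bracket_safe_sketch(safe_str: str) -> str:
    """Wrap labels that occur once per fragment in angle brackets."""
    converted = []
    for fragment in safe_str.split("."):
        tokens = _TOKEN.findall(fragment)
        counts = Counter(t for t in tokens if _LABEL.fullmatch(t))
        converted.append(
            "".join(
                # Assumption: a label seen once in this fragment must close
                # in another fragment, so it is an attachment point.
                f"<{t}>" if _LABEL.fullmatch(t) and counts[t] == 1 else t
                for t in tokens
            )
        )
    return ".".join(converted)


single_digit = to_bracket_safe_sketch("c1ccc2cc1.C2")
two_digit = to_bracket_safe_sketch("c1ccc%10cc1.C%10")
```

Digits inside bracket atoms such as `[13CH3]` are kept as part of the atom token and are never treated as labels.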

`bracketsafe2safe(safe_str)`

Convert bracket SAFE notation back to standard SAFE format.

Removes angle brackets from interfragment attachment points and renumbers them to avoid conflicts with intrafragment attachment points.

Based on genmol/src/genmol/utils/bracket_safe_converter.py:140-153

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `safe_str` | `str` | SAFE string in bracket notation | required |

Returns:

| Type | Description |
| --- | --- |
| `str` | SAFE string in standard notation |
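The renumbering step exists because a bracketed label such as `<1>` may share its number with an ordinary ring closure `1` in the same string. Below is a simplified sketch of one way to do this, assuming each bracketed label is mapped to the smallest number not already in use. It is not the GenMol implementation, and `from_bracket_safe_sketch` is a hypothetical name.

```python
import re

# Bracket atoms, angle-bracket labels, %NN labels, single digits, other chars.
_TOKEN = re.compile(r"\[[^\]]*\]|<[^>]*>|%\d{2}|\d|.")
_PLAIN_LABEL = re.compile(r"%\d{2}|\d")
_BRACKET_LABEL = re.compile(r"<(%\d{2}|\d)>")


def _format_label(number: int) -> str:
    # SMILES writes ring labels 1-9 as a digit and 10-99 as %NN.
    return str(number) if number < 10 else f"%{number}"


def from_bracket_safe_sketch(bracket_str: str) -> str:
    """Drop angle brackets and renumber those labels to unused numbers."""
    tokens = _TOKEN.findall(bracket_str)
    # Numbers already taken by ordinary (intrafragment) ring closures.
    used = {int(t.lstrip("%")) for t in tokens if _PLAIN_LABEL.fullmatch(t)}
    mapping = {}
    next_free = 1
    out = []
    for token in tokens:
        match = _BRACKET_LABEL.fullmatch(token)
        if match is None:
            out.append(token)
            continue
        old = match.group(1)
        if old not in mapping:
            while next_free in used:
                next_free += 1
            mapping[old] = next_free
            used.add(next_free)
        out.append(_format_label(mapping[old]))
    return "".join(out)


renumbered = from_bracket_safe_sketch("c1ccc<1>cc1.C<1>")
```

In the example, the attachment label `<1>` collides with the aromatic ring closure `1`, so it is rewritten as `2`.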

`safe_to_smiles(safe_str, fix=True)`

Convert SAFE string to SMILES using safe library.

Based on genmol/src/genmol/utils/utils_chem.py:26-30

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `safe_str` | `str` | SAFE molecular representation | required |
| `fix` | `bool` | If True, filter out invalid fragments before decoding | `True` |

Returns:

| Type | Description |
| --- | --- |
| `Optional[str]` | SMILES string, or None if conversion fails |

`safe_strings_to_smiles(safe_strings, use_bracket_safe=False, fix=True)`

Convert batch of SAFE strings to SMILES strings.

Based on genmol/src/genmol/sampler.py:81-89

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `safe_strings` | `List[str]` | List of SAFE molecular representations | required |
| `use_bracket_safe` | `bool` | If True, convert from bracket SAFE first | `False` |
| `fix` | `bool` | If True, filter invalid fragments | `True` |

Returns:

| Type | Description |
| --- | --- |
| `List[str]` | List of SMILES strings (invalid conversions are skipped) |
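Because invalid conversions are skipped, the returned list can be shorter than the input and is not index-aligned with it. The sketch below shows that behavior with an injectable converter. `convert_batch_sketch` and `fake_convert` are hypothetical; in the real module the per-string step is `safe_to_smiles`.

```python
from typing import Callable, List, Optional


def convert_batch_sketch(
    safe_strings: List[str],
    convert: Callable[[str], Optional[str]],
) -> List[str]:
    """Convert each string and drop entries whose conversion returned None."""
    smiles = []
    for safe_str in safe_strings:
        result = convert(safe_str)
        if result is not None:
            smiles.append(result)
    return smiles


def fake_convert(safe_str: str) -> Optional[str]:
    # Hypothetical converter: treats strings containing '?' as invalid.
    return None if "?" in safe_str else safe_str.upper()


batch_result = convert_batch_sketch(["cco", "??", "ccn"], fake_convert)
```

Callers that need to map results back to inputs should convert one string at a time and keep the `None` entries.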

`safe_bracket_on_the_fly_processor_combined(example, tokenizer, block_size=None)`

Works directly on the raw strings.

`genmol_fragment_preprocess_fn(example, tokenizer, *, fragment_column='linker_design')`

Preprocess GenMol fragment CSV data for fragment-constrained generation.

Converts SMILES fragments and targets to SAFE format, then to bracket SAFE, and creates prompt_token_ids (fragment) and input_token_ids (full molecule).

Based on GenMol's fragment evaluation dataset structure:

- Input: Fragment SMILES (from `fragment_column`, default 'linker_design')
- Target: Full molecule SMILES (from 'smiles' column)

To use a different fragment column, pass `fragment_column` via `preprocess_function_kwargs` in the dataset config, or set `_fragment_column` in the example dict (overrides the kwarg).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `example` | `Dict[str, Any]` | Dataset example containing: the column named by `fragment_column` (SMILES with `[n*]` attachment points); 'smiles' (full target molecule SMILES); '_fragment_column' (optional, overrides `fragment_column` if set) | required |
| `tokenizer` | `PreTrainedTokenizerBase` | Tokenizer for encoding | required |
| `fragment_column` | `str` | CSV column to use as fragment input. Override via datamodule ... preprocess_function_kwargs.fragment_column. | `'linker_design'` |

Returns:

| Type | Description |
| --- | --- |
| `Dict[str, Any]` | Example with 'prompt_token_ids' (fragment) and 'input_token_ids' (full molecule) |
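The precedence between the keyword argument and the per-example `_fragment_column` key can be sketched as follows. This assumes "if set" means the key is present and not None, which is a guess about the real behavior. `resolve_fragment_column_sketch` and the `scaffold` column name are hypothetical.

```python
from typing import Any, Dict


def resolve_fragment_column_sketch(
    example: Dict[str, Any],
    fragment_column: str = "linker_design",
) -> str:
    """Pick the fragment column: the per-example override wins if present."""
    override = example.get("_fragment_column")
    # Assumption: a missing or None override falls back to the kwarg.
    return override if override is not None else fragment_column


row = {"linker_design": "[1*]CC[2*]", "scaffold": "[1*]c1ccccc1", "smiles": "CCc1ccccc1"}
default_column = resolve_fragment_column_sketch(row)
kwarg_column = resolve_fragment_column_sketch(row, fragment_column="scaffold")
override_column = resolve_fragment_column_sketch(
    {**row, "_fragment_column": "scaffold"}, fragment_column="linker_design"
)
```

The resolved name is then used to read the fragment SMILES from the example before it is converted to SAFE and tokenized.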
