
xlm.tasks.molgen

Molecule generation task utilities and metrics.

This module provides:
- Data preprocessing for SAFE molecular representations
- Conversion utilities between SAFE and SMILES formats
- Comprehensive metrics for evaluating molecular generation (diversity, QED, SA, validity, uniqueness)

SerializableSAFETokenizer

Wrapper around SAFE tokenizer that handles pickling/deepcopy.

The underlying tokenizer from the safe library has a custom PreTokenizer that cannot be serialized. This wrapper works around that by storing only the model path when pickled and re-instantiating the tokenizer from that path on unpickle.

__getstate__()

Return state for pickling - only store the model path.

__setstate__(state)

Restore from pickled state - re-instantiate tokenizer.

__getattr__(name)

Delegate all attribute access to the underlying tokenizer.

__dir__()

Include both wrapper and tokenizer attributes in dir().
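
The pickling pattern described above can be sketched in a few lines. This is an illustration under stated assumptions, not the module's actual code: `FakeTokenizer` and `SerializableWrapper` are hypothetical stand-ins, and the real wrapper rebuilds a SAFE tokenizer rather than a toy object.

```python
import copy
import pickle


class FakeTokenizer:
    """Hypothetical stand-in for a tokenizer that cannot be pickled."""

    def __init__(self, model_path):
        self.model_path = model_path
        self.vocab_size = 42

    def __reduce__(self):
        raise TypeError("FakeTokenizer cannot be pickled")


class SerializableWrapper:
    """Pickles only the model path and rebuilds the tokenizer on load."""

    def __init__(self, model_path):
        self._model_path = model_path
        self._tokenizer = FakeTokenizer(model_path)

    def __getstate__(self):
        # Only the path is stored; the tokenizer itself is never pickled.
        return {"model_path": self._model_path}

    def __setstate__(self, state):
        self._model_path = state["model_path"]
        self._tokenizer = FakeTokenizer(self._model_path)

    def __getattr__(self, name):
        # Called only when normal lookup fails. Private names are refused so
        # that lookups made before __setstate__ runs cannot recurse forever.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._tokenizer, name)

    def __dir__(self):
        return sorted(set(super().__dir__()) | set(dir(self._tokenizer)))


wrapper = SerializableWrapper("path/to/model")
restored = pickle.loads(pickle.dumps(wrapper))
clone = copy.deepcopy(wrapper)
```

The guard in `__getattr__` matters: without it, a lookup of `_tokenizer` on a half-constructed instance would call `__getattr__` again and recurse.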

DeNovoEval

Post-hoc evaluator for de novo molecule generation.

Computes molecular properties on logged predictions at epoch end, matching GenMol's evaluation semantics. Computes:
- Per-sample: QED, SA, SMILES (added to each prediction dict)
- Global: Diversity, Validity, Uniqueness (aggregated across all samples)

This approach enables:
- Global metric computation (diversity/uniqueness on full generated set)
- Exact match with GenMol's evaluation methodology
- Reusable components for other tasks (frag, lead, pmo)

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| use_bracket_safe | bool | If True, decode from bracket SAFE format | False |
| compute_diversity | bool | If True, compute diversity metric | True |
| compute_validity | bool | If True, compute validity metric | True |
| compute_uniqueness | bool | If True, compute uniqueness metric | True |
| compute_qed | bool | If True, compute QED scores | True |
| compute_sa | bool | If True, compute SA scores | True |

oracle_qed property

Lazy load QED oracle.

oracle_sa property

Lazy load SA oracle.

evaluator_diversity property

Lazy load diversity evaluator.

evaluator_validity property

Lazy load validity evaluator.

evaluator_uniqueness property

Lazy load uniqueness evaluator.
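
The lazy-loading properties above follow a common pattern: build the expensive object on first access and reuse it afterwards. A minimal sketch, assuming a hypothetical holder class and a placeholder builder (the real oracles and evaluators are not shown in this page):

```python
class LazyEvaluatorHolder:
    """Hypothetical holder that builds an expensive object on first access."""

    def __init__(self):
        self._oracle_qed = None
        self.build_count = 0

    def _build_qed_oracle(self):
        # Placeholder for an expensive constructor, such as loading a scorer.
        self.build_count += 1
        return lambda smiles: 0.0

    @property
    def oracle_qed(self):
        # Build once, then return the cached instance on every later access.
        if self._oracle_qed is None:
            self._oracle_qed = self._build_qed_oracle()
        return self._oracle_qed


holder = LazyEvaluatorHolder()
first = holder.oracle_qed
second = holder.oracle_qed
```

With this pattern, a metric that is disabled never pays the construction cost for its oracle or evaluator.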

eval(predictions, tokenizer=None)

Evaluate predictions and return updated predictions + aggregated metrics.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| predictions | List[Dict[str, Any]] | List of prediction dicts with 'text' field containing SAFE strings | required |
| tokenizer | Any | Optional tokenizer (not used for de novo generation, but kept for interface consistency) | None |

Returns:

| Type | Description |
| --- | --- |
| Tuple[List[Dict[str, Any]], Dict[str, Any]] | Tuple of (predictions, aggregated_metrics). predictions: updated list with per-sample metrics added (smiles, qed, sa). aggregated_metrics: dict of global metric values. |
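
The shape of this call (per-sample fields written into each dict, global metrics returned alongside) can be sketched as follows. This is a simplified illustration, not the evaluator's code: `sketch_eval` and `toy_decode` are hypothetical, validity is assumed to mean "fraction of predictions that decode" and uniqueness "fraction of distinct SMILES among valid ones", and GenMol's exact definitions may differ. QED, SA, and diversity are omitted because they need chemistry libraries.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


def sketch_eval(
    predictions: List[Dict[str, Any]],
    decode: Callable[[str], Optional[str]],
) -> Tuple[List[Dict[str, Any]], Dict[str, Any]]:
    valid_smiles = []
    for pred in predictions:
        # Per-sample result is stored on the prediction dict itself.
        smiles = decode(pred["text"])
        pred["smiles"] = smiles
        if smiles is not None:
            valid_smiles.append(smiles)

    n_total = len(predictions)
    n_valid = len(valid_smiles)
    # Global metrics are computed over the full set of predictions.
    metrics = {
        "validity": n_valid / n_total if n_total else 0.0,
        "uniqueness": len(set(valid_smiles)) / n_valid if n_valid else 0.0,
    }
    return predictions, metrics


def toy_decode(text: str) -> Optional[str]:
    # Toy stand-in for SAFE-to-SMILES decoding: "bad" fails, the rest upper-cases.
    return None if text == "bad" else text.upper()


preds, metrics = sketch_eval(
    [{"text": "cc"}, {"text": "cc"}, {"text": "co"}, {"text": "bad"}],
    toy_decode,
)
```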

FragmentEval

Bases: DeNovoEval

Post-hoc evaluator for fragment-constrained molecule generation.

Extends DeNovoEval with fragment-specific metrics:
- All de novo metrics (validity, uniqueness, quality, QED, SA, diversity)
- Distance: Tanimoto distance between generated and target molecules

Based on GenMol's fragment evaluation methodology. Computes:
- Per-sample: QED, SA, SMILES, distance (if target available)
- Global: Diversity, Validity, Uniqueness, Quality, Distance (mean)

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| use_bracket_safe | bool | If True, decode from bracket SAFE format | False |
| compute_diversity | bool | If True, compute diversity metric | True |
| compute_validity | bool | If True, compute validity metric | True |
| compute_uniqueness | bool | If True, compute uniqueness metric | True |
| compute_qed | bool | If True, compute QED scores | True |
| compute_sa | bool | If True, compute SA scores | True |
| compute_distance | bool | If True, compute Tanimoto distance to target | True |

eval(predictions, tokenizer=None)

Evaluate fragment generation predictions.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| predictions | List[Dict[str, Any]] | List of prediction dicts with: 'text' (generated SAFE string, full molecule); 'truth' (target SAFE string, full molecule, optional); 'raw_input' (fragment prompt SAFE string, optional) | required |
| tokenizer | Any | Optional tokenizer (not used, kept for interface consistency) | None |

Returns:

| Type | Description |
| --- | --- |
| Tuple[List[Dict[str, Any]], Dict[str, Any]] | Tuple of (predictions, aggregated_metrics). predictions: updated list with per-sample metrics added. aggregated_metrics: dict of global metric values. |
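
The distance metric is one minus the Tanimoto similarity between generated and target molecules. The sketch below shows that arithmetic on plain Python sets. It assumes each molecule has already been reduced to a set of feature identifiers (hypothetical integers here); how the evaluator actually fingerprints molecules is not stated in this page.

```python
def tanimoto_distance(features_a: set, features_b: set) -> float:
    """Return 1 - |A ∩ B| / |A ∪ B| for two feature sets."""
    union = features_a | features_b
    if not union:
        # Two empty feature sets are treated as identical (an assumption).
        return 0.0
    return 1.0 - len(features_a & features_b) / len(union)


generated = {1, 2, 3, 4}
target = {3, 4, 5, 6}
# Intersection has 2 features, union has 6, so similarity is 1/3.
distance = tanimoto_distance(generated, target)
```

A distance of 0 means the feature sets are identical, and 1 means they share nothing.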

safe2bracketsafe(safe_str)

Convert standard SAFE notation to bracket SAFE format.

Bracket SAFE wraps interfragment attachment points in angle brackets. Example: "1" -> "<1>", "%10" -> "<%10>"

Based on genmol/src/genmol/utils/bracket_safe_converter.py:133-137

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| safe_str | str | SAFE string in standard notation | required |

Returns:

| Type | Description |
| --- | --- |
| str | SAFE string in bracket notation, or original string if conversion fails |
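
One way to picture the conversion: a ring-closure label is interfragment when it opens in one dot-separated fragment and closes in another. The sketch below applies that rule with a small tokenizer. It is an illustrative approximation written for this page, not the GenMol converter referenced above, and `sketch_safe_to_bracket` is a hypothetical name.

```python
import re

# Bracket atoms first, then two-digit ring labels (%NN), then any single character.
_TOKEN = re.compile(r"\[[^\]]*\]|%\d{2}|.")
_LABEL = re.compile(r"%\d{2}|\d")


def sketch_safe_to_bracket(safe_str: str) -> str:
    tokens = _TOKEN.findall(safe_str)
    open_labels = {}       # label -> (token index, fragment index) where it opened
    interfragment = set()  # token indices that should be wrapped in <...>
    fragment = 0
    for i, tok in enumerate(tokens):
        if tok == ".":
            fragment += 1
        elif _LABEL.fullmatch(tok):
            if tok in open_labels:
                j, opened_in = open_labels.pop(tok)
                # A label that closes in a different fragment links two fragments.
                if opened_in != fragment:
                    interfragment.update((i, j))
            else:
                open_labels[tok] = (i, fragment)
    return "".join(
        f"<{tok}>" if i in interfragment else tok for i, tok in enumerate(tokens)
    )


bracketed = sketch_safe_to_bracket("c1ccccc1%10.C%10")
```

Digits inside bracket atoms such as `[13CH3]` are consumed as part of the atom token, so they are never mistaken for ring labels.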

bracketsafe2safe(safe_str)

Convert bracket SAFE notation back to standard SAFE format.

Removes angle brackets from interfragment attachment points and renumbers them to avoid conflicts with intrafragment attachment points.

Based on genmol/src/genmol/utils/bracket_safe_converter.py:140-153

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| safe_str | str | SAFE string in bracket notation | required |

Returns:

| Type | Description |
| --- | --- |
| str | SAFE string in standard notation |
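
The renumbering step exists because a bracketed label such as `<1>` may collide with an unbracketed `1` already used inside a fragment. The sketch below strips the brackets and gives each interfragment label the lowest number not already in use. The numbering policy and the name `sketch_bracket_to_safe` are assumptions for illustration; the referenced GenMol code may choose numbers differently.

```python
import re


def sketch_bracket_to_safe(safe_str: str) -> str:
    # Labels that appear without angle brackets are intrafragment ring closures.
    without_brackets = re.sub(r"<%?\d+>", "", safe_str)
    used = set()
    for tok in re.findall(r"\[[^\]]*\]|%\d{2}|\d", without_brackets):
        if not tok.startswith("["):  # skip digits inside bracket atoms
            used.add(int(tok.lstrip("%")))

    mapping = {}  # bracketed label -> fresh ring-closure number

    def fresh_label(match):
        old = match.group(1)
        if old not in mapping:
            n = 1
            while n in used:
                n += 1
            used.add(n)
            mapping[old] = n
        n = mapping[old]
        # Single digits are written bare; larger numbers use the %NN form.
        return str(n) if n < 10 else f"%{n}"

    return re.sub(r"<(%?\d+)>", fresh_label, safe_str)


plain = sketch_bracket_to_safe("c1ccccc1<1>.C<1>")
```

In the example, label `1` is already taken by the ring inside the first fragment, so the interfragment bond is renumbered to `2`.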

safe_to_smiles(safe_str, fix=True)

Convert SAFE string to SMILES using safe library.

Based on genmol/src/genmol/utils/utils_chem.py:26-30

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| safe_str | str | SAFE molecular representation | required |
| fix | bool | If True, filter out invalid fragments before decoding | True |

Returns:

| Type | Description |
| --- | --- |
| Optional[str] | SMILES string, or None if conversion fails |

safe_strings_to_smiles(safe_strings, use_bracket_safe=False, fix=True)

Convert batch of SAFE strings to SMILES strings.

Based on genmol/src/genmol/sampler.py:81-89

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| safe_strings | List[str] | List of SAFE molecular representations | required |
| use_bracket_safe | bool | If True, convert from bracket SAFE first | False |
| fix | bool | If True, filter invalid fragments | True |

Returns:

| Type | Description |
| --- | --- |
| List[str] | List of SMILES strings (invalid conversions are skipped) |
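
The skip-on-failure behaviour can be sketched with an injected decoder. Here `decode` stands in for the single-string conversion (any callable returning a SMILES string or None), and both `sketch_batch_to_smiles` and `toy_decode` are hypothetical names used only for this illustration.

```python
from typing import Callable, List, Optional


def sketch_batch_to_smiles(
    safe_strings: List[str],
    decode: Callable[[str], Optional[str]],
) -> List[str]:
    smiles = []
    for safe_str in safe_strings:
        try:
            result = decode(safe_str)
        except Exception:
            # A decoder that raises is treated the same as one returning None.
            result = None
        if result is not None:
            smiles.append(result)
    return smiles


def toy_decode(text: str) -> Optional[str]:
    # Toy stand-in: "bad" fails to decode, "boom" raises, others upper-case.
    if text == "boom":
        raise ValueError("cannot decode")
    return None if text == "bad" else text.upper()


converted = sketch_batch_to_smiles(["cc", "bad", "co", "boom"], toy_decode)
```

Because failed conversions are skipped, the output can be shorter than the input, so positions in the two lists no longer line up.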

safe_bracket_on_the_fly_processor_combined(example, tokenizer, block_size=None)

Works directly on the raw strings.

genmol_fragment_preprocess_fn(example, tokenizer, *, fragment_column='linker_design')

Preprocess GenMol fragment CSV data for fragment-constrained generation.

Converts SMILES fragments and targets to SAFE format, then to bracket SAFE, and creates prompt_token_ids (fragment) and input_token_ids (full molecule).

Based on GenMol's fragment evaluation dataset structure:
- Input: Fragment SMILES (from fragment_column, default 'linker_design')
- Target: Full molecule SMILES (from 'smiles' column)

To use a different fragment column, pass fragment_column via preprocess_function_kwargs in the dataset config, or set _fragment_column in the example dict (overrides kwarg).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| example | Dict[str, Any] | Dataset example containing: the column named by fragment_column (SMILES with [n*] attachment points); 'smiles' (full target molecule SMILES); '_fragment_column' (optional, overrides fragment_column if set) | required |
| tokenizer | PreTrainedTokenizerBase | Tokenizer for encoding | required |
| fragment_column | str | CSV column to use as fragment input (default: 'linker_design'). Override via datamodule ... preprocess_function_kwargs.fragment_column. | 'linker_design' |

Returns:

| Type | Description |
| --- | --- |
| Dict[str, Any] | Example with 'prompt_token_ids' (fragment) and 'input_token_ids' (full molecule) |
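
The column-selection rule (the per-example '_fragment_column' wins over the fragment_column keyword) can be sketched as below. `resolve_fragment_column` is a hypothetical helper, the 'other_design' column is invented for the example, and treating an empty or None override as "not set" is an assumption.

```python
from typing import Any, Dict


def resolve_fragment_column(
    example: Dict[str, Any],
    fragment_column: str = "linker_design",
) -> str:
    # A per-example override takes precedence over the keyword argument.
    override = example.get("_fragment_column")
    return override if override else fragment_column


default_example = {"linker_design": "[1*]CC[2*]", "smiles": "CCCC"}
override_example = {
    "other_design": "[1*]CO",  # hypothetical alternative fragment column
    "smiles": "CCO",
    "_fragment_column": "other_design",
}

default_column = resolve_fragment_column(default_example)
override_column = resolve_fragment_column(override_example)
fragment_smiles = override_example[override_column]
```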