xlm.tasks.molgen
Molecule generation task utilities and metrics.
This module provides:
- Data preprocessing for SAFE molecular representations
- Conversion utilities between SAFE and SMILES formats
- Comprehensive metrics for evaluating molecular generation (diversity, QED, SA, validity, uniqueness)
SerializableSAFETokenizer
Wrapper around SAFE tokenizer that handles pickling/deepcopy.
The underlying tokenizer from the safe library has a custom PreTokenizer that cannot be serialized. This wrapper provides dummy serialization by storing the model path and re-instantiating the tokenizer on unpickle.
__getstate__()
Return state for pickling - only store the model path.
__setstate__(state)
Restore from pickled state - re-instantiate tokenizer.
__getattr__(name)
Delegate all attribute access to the underlying tokenizer.
__dir__()
Include both wrapper and tokenizer attributes in dir().
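The serialization strategy described above can be sketched without the safe library. This is a minimal illustration, not the actual class: `_FakeBackend` and `SerializableWrapper` are hypothetical names, and the fake backend only simulates an object that refuses to be pickled.

```python
import copy
import pickle


class _FakeBackend:
    """Hypothetical stand-in for a tokenizer that cannot be pickled."""

    def __init__(self, model_path):
        self.model_path = model_path
        self.vocab_size = 42  # arbitrary illustrative value

    def __reduce__(self):
        # Simulate the serialization failure of the real backend.
        raise TypeError("backend cannot be pickled")


class SerializableWrapper:
    """Stores only the model path and rebuilds the backend on unpickle."""

    def __init__(self, model_path):
        self._model_path = model_path
        self._tokenizer = _FakeBackend(model_path)

    def __getstate__(self):
        # Only the path is serialized, never the backend itself.
        return {"model_path": self._model_path}

    def __setstate__(self, state):
        self._model_path = state["model_path"]
        self._tokenizer = _FakeBackend(self._model_path)

    def __getattr__(self, name):
        # Called only when normal lookup fails. Private and dunder names are
        # rejected so lookups made before __setstate__ runs cannot recurse.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._tokenizer, name)

    def __dir__(self):
        return sorted(set(super().__dir__()) | set(dir(self._tokenizer)))


wrapper = SerializableWrapper("path/to/tokenizer")
restored = pickle.loads(pickle.dumps(wrapper))
cloned = copy.deepcopy(wrapper)
```

The guard in `__getattr__` matters: during unpickling and deepcopy the instance exists before `_tokenizer` is set, and an unguarded delegation would recurse indefinitely.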
DeNovoEval
Post-hoc evaluator for de novo molecule generation.
Computes molecular properties on logged predictions at epoch end, matching GenMol's evaluation semantics. Computes:
- Per-sample: QED, SA, SMILES (added to each prediction dict)
- Global: Diversity, Validity, Uniqueness (aggregated across all samples)
This approach enables:
- Global metric computation (diversity/uniqueness on full generated set)
- Exact match with GenMol's evaluation methodology
- Reusable components for other tasks (frag, lead, pmo)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| use_bracket_safe | bool | If True, decode from bracket SAFE format | False |
| compute_diversity | bool | If True, compute diversity metric | True |
| compute_validity | bool | If True, compute validity metric | True |
| compute_uniqueness | bool | If True, compute uniqueness metric | True |
| compute_qed | bool | If True, compute QED scores | True |
| compute_sa | bool | If True, compute SA scores | True |
oracle_qed
property
Lazy load QED oracle.
oracle_sa
property
Lazy load SA oracle.
evaluator_diversity
property
Lazy load diversity evaluator.
evaluator_validity
property
Lazy load validity evaluator.
evaluator_uniqueness
property
Lazy load uniqueness evaluator.
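The lazily loaded properties above follow a common pattern: build the expensive object on first access and reuse it afterwards. A minimal sketch, assuming a generic factory callable; `LazyHolder`, `factory`, and `load_count` are hypothetical names used only for illustration.

```python
from typing import Callable, Optional


class LazyHolder:
    """Builds an expensive object on first access and caches it."""

    def __init__(self, factory: Callable[[], object]):
        self._factory = factory
        self._oracle: Optional[object] = None
        self.load_count = 0  # counts how many times the factory ran

    @property
    def oracle(self):
        if self._oracle is None:
            # First access: construct and remember the object.
            self._oracle = self._factory()
            self.load_count += 1
        return self._oracle


holder = LazyHolder(lambda: object())
first = holder.oracle
second = holder.oracle
```

Deferring construction this way means an evaluator with a metric disabled never pays the cost of loading that metric's oracle.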
eval(predictions, tokenizer=None)
Evaluate predictions and return updated predictions + aggregated metrics.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| predictions | List[Dict[str, Any]] | List of prediction dicts with 'text' field containing SAFE strings | required |
| tokenizer | Any | Optional tokenizer (not used for denovo, but kept for interface consistency) | None |
Returns:
| Type | Description |
|---|---|
| Tuple[List[Dict[str, Any]], Dict[str, Any]] | Tuple of the updated predictions (list of dicts) and the aggregated metrics (dict) |
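To make the global metrics concrete, here is a rough sketch of validity and uniqueness over a list of converted SMILES. It assumes common definitions (validity as valid over total, uniqueness as distinct over valid) and that a failed conversion is represented by `None`; the evaluator's exact definitions may differ, and `validity_and_uniqueness` is a hypothetical helper.

```python
from typing import Dict, List, Optional


def validity_and_uniqueness(smiles: List[Optional[str]]) -> Dict[str, float]:
    """Aggregate two global metrics over a full set of generated molecules."""
    # Assumption: None (or an empty string) marks a failed conversion.
    valid = [s for s in smiles if s]
    validity = len(valid) / len(smiles) if smiles else 0.0
    uniqueness = len(set(valid)) / len(valid) if valid else 0.0
    return {"validity": validity, "uniqueness": uniqueness}


metrics = validity_and_uniqueness(["CCO", "CCO", None, "c1ccccc1"])
```

Here three of four entries are valid and two of those three are distinct, which illustrates why these metrics need the whole generated set rather than a single sample.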
FragmentEval
Bases: DeNovoEval
Post-hoc evaluator for fragment-constrained molecule generation.
Extends DeNovoEval with fragment-specific metrics:
- All de novo metrics (validity, uniqueness, quality, QED, SA, diversity)
- Distance: Tanimoto distance between generated and target molecules
Based on GenMol's fragment evaluation methodology. Computes:
- Per-sample: QED, SA, SMILES, distance (if target available)
- Global: Diversity, Validity, Uniqueness, Quality, Distance (mean)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| use_bracket_safe | bool | If True, decode from bracket SAFE format | False |
| compute_diversity | bool | If True, compute diversity metric | True |
| compute_validity | bool | If True, compute validity metric | True |
| compute_uniqueness | bool | If True, compute uniqueness metric | True |
| compute_qed | bool | If True, compute QED scores | True |
| compute_sa | bool | If True, compute SA scores | True |
| compute_distance | bool | If True, compute Tanimoto distance to target | True |
eval(predictions, tokenizer=None)
Evaluate fragment generation predictions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| predictions | List[Dict[str, Any]] | List of prediction dicts with: 'text' (generated SAFE string, full molecule); 'truth' (target SAFE string, full molecule, optional); 'raw_input' (fragment prompt SAFE string, optional) | required |
| tokenizer | Any | Optional tokenizer (not used, kept for interface consistency) | None |
Returns:
| Type | Description |
|---|---|
| Tuple[List[Dict[str, Any]], Dict[str, Any]] | Tuple of the updated predictions (list of dicts) and the aggregated metrics (dict) |
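The distance metric can be illustrated on plain sets of fingerprint bits. This sketch uses the standard definition, one minus the Tanimoto (Jaccard) similarity; how the evaluator actually fingerprints molecules is not shown in this page, so the integer sets below are placeholders and `tanimoto_distance` is a hypothetical helper.

```python
from typing import Set


def tanimoto_distance(a: Set[int], b: Set[int]) -> float:
    """Return 1 - |a & b| / |a | b| for two sets of 'on' fingerprint bits."""
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Placeholder bit sets standing in for a generated and a target molecule.
distance = tanimoto_distance({1, 2, 3, 4}, {3, 4, 5, 6})
```

The two sets share 2 of 6 distinct bits, so the similarity is 1/3 and the distance is 2/3. A distance of 0 means identical fingerprints, and 1 means no bits in common.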
safe2bracketsafe(safe_str)
Convert standard SAFE notation to bracket SAFE format.
Bracket SAFE wraps interfragment attachment points in angle brackets. Example: "1" -> "<1>", "%10" -> "<%10>"
Based on genmol/src/genmol/utils/bracket_safe_converter.py:133-137
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| safe_str | str | SAFE string in standard notation | required |
Returns:
| Type | Description |
|---|---|
| str | SAFE string in bracket notation, or original string if conversion fails |
bracketsafe2safe(safe_str)
Convert bracket SAFE notation back to standard SAFE format.
Removes angle brackets from interfragment attachment points and renumbers them to avoid conflicts with intrafragment attachment points.
Based on genmol/src/genmol/utils/bracket_safe_converter.py:140-153
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| safe_str | str | SAFE string in bracket notation | required |
Returns:
| Type | Description |
|---|---|
| str | SAFE string in standard notation |
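The bracket notation itself is easy to show at the level of individual labels. The sketch below is deliberately simplified and is not the GenMol converter: `wrap_label` and `strip_brackets` are hypothetical helpers, the input string is an illustrative example, and the renumbering step is left out.

```python
import re

# Matches a bracketed attachment label such as "<1>" or "<%10>".
_BRACKETED_LABEL = re.compile(r"<(%\d{2}|\d)>")


def wrap_label(label: str) -> str:
    """Mark one attachment label as interfragment, e.g. "1" -> "<1>"."""
    return f"<{label}>"


def strip_brackets(bracket_safe: str) -> str:
    """Drop the angle brackets around attachment labels (no renumbering)."""
    return _BRACKETED_LABEL.sub(r"\1", bracket_safe)


# Illustrative input: ring closure "1" plus interfragment label "<1>".
naive = strip_brackets("c1ccccc1<1>.C<1>")
```

After naive stripping, the interfragment label `1` sits next to the ring-closure digit `1` and the two can no longer be told apart. That collision is the reason the real conversion renumbers interfragment attachment points.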
safe_to_smiles(safe_str, fix=True)
Convert SAFE string to SMILES using safe library.
Based on genmol/src/genmol/utils/utils_chem.py:26-30
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| safe_str | str | SAFE molecular representation | required |
| fix | bool | If True, filter out invalid fragments before decoding | True |
Returns:
| Type | Description |
|---|---|
| Optional[str] | SMILES string, or None if conversion fails |
safe_strings_to_smiles(safe_strings, use_bracket_safe=False, fix=True)
Convert batch of SAFE strings to SMILES strings.
Based on genmol/src/genmol/sampler.py:81-89
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| safe_strings | List[str] | List of SAFE molecular representations | required |
| use_bracket_safe | bool | If True, convert from bracket SAFE first | False |
| fix | bool | If True, filter invalid fragments | True |
Returns:
| Type | Description |
|---|---|
| List[str] | List of SMILES strings (invalid conversions are skipped) |
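Because invalid conversions are skipped, the returned list can be shorter than the input and positions no longer line up. A minimal sketch of that behavior, with the decoder injected as a callable so the example runs without the safe library; `convert_batch` and `toy_decode` are hypothetical names.

```python
from typing import Callable, List, Optional


def convert_batch(
    safe_strings: List[str],
    decode: Callable[[str], Optional[str]],
) -> List[str]:
    """Decode each string, skipping entries that fail or return None."""
    smiles = []
    for safe_str in safe_strings:
        try:
            result = decode(safe_str)
        except Exception:
            # Treat any decoder error as a failed conversion.
            result = None
        if result is not None:
            smiles.append(result)
    return smiles


def toy_decode(text: str) -> Optional[str]:
    """Toy decoder: raises on empty input, returns None for 'bad'."""
    if not text:
        raise ValueError("empty input")
    return None if text == "bad" else text.upper()


converted = convert_batch(["cco", "bad", "", "cn"], toy_decode)
```

Four inputs yield two outputs here, so callers that need per-sample alignment should not index the result by input position.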
safe_bracket_on_the_fly_processor_combined(example, tokenizer, block_size=None)
Works directly on the raw strings
genmol_fragment_preprocess_fn(example, tokenizer, *, fragment_column='linker_design')
Preprocess GenMol fragment CSV data for fragment-constrained generation.
Converts SMILES fragments and targets to SAFE format, then to bracket SAFE, and creates prompt_token_ids (fragment) and input_token_ids (full molecule).
Based on GenMol's fragment evaluation dataset structure:
- Input: Fragment SMILES (from fragment_column, default 'linker_design')
- Target: Full molecule SMILES (from 'smiles' column)
To use a different fragment column, pass fragment_column via
preprocess_function_kwargs in the dataset config, or set
_fragment_column in the example dict (overrides kwarg).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| example | Dict[str, Any] | Dataset example containing the column named by fragment_column (fragment SMILES) and the 'smiles' column (full molecule SMILES) | required |
| tokenizer | PreTrainedTokenizerBase | Tokenizer for encoding | required |
| fragment_column | str | CSV column to use as fragment input (default: 'linker_design'). Override via datamodule ... preprocess_function_kwargs.fragment_column. | 'linker_design' |
Returns:
| Type | Description |
|---|---|
| Dict[str, Any] | Example with 'prompt_token_ids' (fragment) and 'input_token_ids' (full molecule) |
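The column-selection precedence described above (a `_fragment_column` key in the example overrides the keyword argument) can be sketched as follows. `resolve_fragment_column` is a hypothetical helper, `'motif_extension'` is only an illustrative column name, and treating a missing or empty override as "use the kwarg" is an assumption.

```python
from typing import Any, Dict


def resolve_fragment_column(
    example: Dict[str, Any],
    fragment_column: str = "linker_design",
) -> str:
    """Pick the fragment column: per-example override first, then the kwarg."""
    # Assumption: a missing or empty '_fragment_column' falls back to the kwarg.
    return example.get("_fragment_column") or fragment_column


default_choice = resolve_fragment_column({"smiles": "CCO", "linker_design": "C"})
override_choice = resolve_fragment_column(
    {"smiles": "CCO", "_fragment_column": "motif_extension"},
    fragment_column="linker_design",
)
```

The first call falls back to `'linker_design'`, while the second follows the per-example override even though a keyword argument was supplied.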