`xlm.tasks.sudoku_extreme`

Preprocessing for brozonoyer/sapientinc-sudoku-extreme-timvink-sudoku-solver.

The dataset has "question" (puzzle, "." for blanks) and "answer" (solution) fields. We convert "." -> "0" to match the tokenizer convention (vocab 0-9) and produce input_token_ids / prompt_token_ids, as in the standard sudoku task.

`sudoku_extreme_preprocess_fn(example, tokenizer)`

Preprocess sapientinc-sudoku-extreme examples.

Uses "question" (puzzle) and "answer" (solution). Blanks are "." in the dataset; we convert to "0" before tokenizing.

Also processes the "trajectory" field, which contains a list of strings representing step-by-step board configurations from question to solution.
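A minimal sketch of the conversion described above, not the module's actual implementation. It assumes each board is a flat string of cell characters, that the tokenizer maps the characters "0"-"9" to the integers 0-9 (replaced here by `int`), that the puzzle feeds prompt_token_ids and the solution feeds input_token_ids, and that trajectory boards may also use "." for blanks. The names `board_to_ids`, `sketch_preprocess`, and the `trajectory_token_ids` key are hypothetical.

```python
from typing import Any, Dict, List


def board_to_ids(board: str) -> List[int]:
    """Map a board string to digit ids, treating "." as the blank token 0."""
    return [int(ch) for ch in board.replace(".", "0")]


def sketch_preprocess(example: Dict[str, Any]) -> Dict[str, Any]:
    # Assumption: puzzle -> prompt_token_ids, solution -> input_token_ids.
    out: Dict[str, Any] = {
        "prompt_token_ids": board_to_ids(example["question"]),
        "input_token_ids": board_to_ids(example["answer"]),
    }
    # Assumption: "trajectory" is optional and holds one board string per step.
    out["trajectory_token_ids"] = [
        board_to_ids(board) for board in example.get("trajectory", [])
    ]
    return out
```

The real function also takes a `tokenizer` argument; the sketch drops it only to stay self-contained.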

`sudoku_extreme_hard_filter_fn(example)`

Default 'hard' filter: keep puzzles in roughly the top decile by difficulty.

A puzzle qualifies as hard if any of:

- rating >= 50: timvink solver's tiered difficulty score (p90 of the test split is ~51, p95 is ~64). Catches the heavy brute-force tail.
- num_steps >= 8: long heuristic chains (p90 of the test split is 7).
- any strategy in _NON_TRIVIAL_STRATEGIES: Advanced or Master tier in the corrected tier mapping (catches the ~6k Advanced + ~225 Master puzzles that were silently bucketed as BruteForce in the original sweep).

Together these cover ~13% of the test split (vs. ~1.5% for tier-only and ~10% for rating-only), which suits a dump pool that is small but not trivial.
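The OR logic above can be sketched as follows. This is an illustration under assumptions, not the module's code: the exact strategy spellings are guessed from the tier names mentioned in this module's docs, `NON_TRIVIAL` stands in for _NON_TRIVIAL_STRATEGIES, and each example is assumed to expose `rating`, `num_steps`, and `strategies_used` (a list of strategy names).

```python
from typing import Any, Dict

# Assumed strategy spellings; the real _NON_TRIVIAL_STRATEGIES may differ.
ADVANCED = {
    f"{kind} {size}"
    for kind in ("Naked", "Hidden")
    for size in ("Pairs", "Triples", "Quads")
}
MASTER = {"X-Wing", "Swordfish", "Jellyfish", "Forcing Chain"}
NON_TRIVIAL = ADVANCED | MASTER


def sketch_hard_filter(example: Dict[str, Any]) -> bool:
    """Return True if any one of the three 'hard' criteria holds."""
    return (
        example["rating"] >= 50
        or example["num_steps"] >= 8
        or any(s in NON_TRIVIAL for s in example["strategies_used"])
    )
```

Because the criteria are ORed, a puzzle with a low rating and few steps still qualifies if it needed a single Advanced or Master strategy.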

`sudoku_extreme_extreme_filter_fn(example)`

'Extreme' filter: top ~1% of the test split by rating, step count, or tier.

A puzzle qualifies as extreme if any of:

- rating >= 100: ~1% of the split (p99 = 100).
- num_steps >= 12: ~1% (p99 = 12).
- any Master-tier strategy (X-Wing / Swordfish / Jellyfish / Forcing Chain).

Designed for a small but pure hard slice when the goal is to localize where loopholing+BPTT actually starts to separate from StopGrad.
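A sketch of the stricter predicate, under the same caveats: the strategy strings are assumed spellings of the four Master-tier names listed above, and the field names `rating`, `num_steps`, and `strategies_used` are taken from this module's docs rather than from the implementation.

```python
from typing import Any, Dict

# Assumed spellings of the Master-tier strategies listed above.
MASTER = {"X-Wing", "Swordfish", "Jellyfish", "Forcing Chain"}


def sketch_extreme_filter(example: Dict[str, Any]) -> bool:
    """Return True if any one of the three 'extreme' criteria holds."""
    return (
        example["rating"] >= 100
        or example["num_steps"] >= 12
        or any(s in MASTER for s in example["strategies_used"])
    )
```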

`sudoku_extreme_deduction_only_filter_fn(example)`

'Deduction-only hard' filter: Advanced/Master strategies and no Brute Force.

A puzzle qualifies if it satisfies BOTH:

- "Brute Force" is not in strategies_used (excludes the ~358k BruteForce-tier puzzles entirely; we don't expect a forward-only diffusion model to do recursive backtracking).
- strategies_used contains at least one Advanced or Master tier strategy (Naked/Hidden Pairs/Triples/Quads, X-Wing/Swordfish/Jellyfish/Forcing Chain).

This is the "real difficulty axis" cohort: ~6,190 puzzles in the test split (5,965 Advanced + 225 Master). A 2,000-puzzle uniform-shuffled slice will therefore contain ~1,930 Advanced + ~70 Master in expectation, enough power to put a tight CI on the BPTT-vs-StopGrad paired delta separately for each tier.
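The expected slice composition follows from proportional sampling: under a uniform shuffle, each tier's expected count is its share of the cohort times the slice size. A short check using only the counts quoted above:

```python
advanced, master = 5_965, 225   # tier counts quoted above
cohort = advanced + master      # 6,190 puzzles in total
slice_size = 2_000

# Expected count per tier = cohort share * slice size.
expected_advanced = slice_size * advanced / cohort
expected_master = slice_size * master / cohort

print(f"{expected_advanced:.1f} Advanced, {expected_master:.1f} Master")
# prints "1927.3 Advanced, 72.7 Master"
```

Rounded to the nearest ten, these are the ~1,930 and ~70 figures given above.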

Sister filters (sudoku_extreme_hard_filter_fn, sudoku_extreme_extreme_filter_fn) use OR over rating/num_steps/strategy, which pulls in many long-but-easy BruteForce-tail puzzles. This one uses AND, so the cohort contains only the puzzles that need deduction.
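A sketch of the AND logic, for contrast with the OR-based sister filters. The "Brute Force" string comes from the description above; the other strategy spellings and the `NON_TRIVIAL` name are assumptions standing in for the module's actual tier mapping.

```python
from typing import Any, Dict

# Assumed strategy spellings for the Advanced and Master tiers.
ADVANCED = {
    f"{kind} {size}"
    for kind in ("Naked", "Hidden")
    for size in ("Pairs", "Triples", "Quads")
}
MASTER = {"X-Wing", "Swordfish", "Jellyfish", "Forcing Chain"}
NON_TRIVIAL = ADVANCED | MASTER


def sketch_deduction_only_filter(example: Dict[str, Any]) -> bool:
    """Keep puzzles solved without Brute Force that used an Advanced/Master strategy."""
    used = set(example["strategies_used"])
    return "Brute Force" not in used and bool(used & NON_TRIVIAL)
```

A puzzle that used both X-Wing and Brute Force is rejected here, whereas a strategy-based OR criterion would accept it.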