xlm.tasks.sudoku_extreme
Preprocessing for brozonoyer/sapientinc-sudoku-extreme-timvink-sudoku-solver.
Dataset has "question" (puzzle, "." for blanks) and "answer" (solution). We convert "." -> "0" to match the tokenizer convention (vocab 0-9) and produce input_token_ids / prompt_token_ids like the standard sudoku task.
sudoku_extreme_preprocess_fn(example, tokenizer)
Preprocess sapientinc-sudoku-extreme examples.
Uses "question" (puzzle) and "answer" (solution). Blanks are "." in the dataset; we convert to "0" before tokenizing.
Also processes the "trajectory" field, which contains a list of strings representing step-by-step board configurations from question to solution.
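A minimal sketch of the blank conversion described above, applied to the question, the answer, and each trajectory step. The helper names (`normalize_board`, `cells_to_ids`) are hypothetical, and the per-character integer mapping is only a stand-in for the real tokenizer. How the results map onto input_token_ids / prompt_token_ids is not shown here.

```python
def normalize_board(board: str) -> str:
    """Replace '.' blanks with '0' so every cell is a digit in 0-9."""
    return board.replace(".", "0")


def cells_to_ids(board: str) -> list[int]:
    """Hypothetical stand-in for the tokenizer: one integer id per cell."""
    return [int(ch) for ch in normalize_board(board)]


# Toy 9-cell rows for brevity; real boards have 81 cells.
example = {
    "question": "53..7....",
    "answer": "534678912",
    "trajectory": ["53..7....", "534.7....", "534678912"],
}

question_ids = cells_to_ids(example["question"])
answer_ids = cells_to_ids(example["answer"])
# Each trajectory step is a full board string, so it gets the same treatment.
trajectory_ids = [cells_to_ids(step) for step in example["trajectory"]]
```

Normalizing every board through the same helper keeps intermediate trajectory steps, which still contain blanks, consistent with the 0-9 vocabulary.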
sudoku_extreme_hard_filter_fn(example)
Default 'hard' filter: keep puzzles in roughly the top decile by difficulty.
A puzzle qualifies as hard if any of:
- rating >= 50: timvink solver's tiered difficulty score (p90 of the test split is ~51, p95 is ~64). Catches the heavy brute-force tail.
- num_steps >= 8: long heuristic chains (p90 of the test split is 7).
- any strategy in _NON_TRIVIAL_STRATEGIES: Advanced or Master tier in the corrected tier mapping (catches the ~6k Advanced + ~225 Master puzzles that were silently bucketed as BruteForce in the original sweep).
Together these cover ~13% of the test split (vs. ~1.5% for tier-only and ~10% for rating-only); good for a dump pool that's small but not trivial.
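The OR logic of the hard filter can be sketched as follows. This is an illustration under assumptions, not the module's code: the strategy label strings and the `NON_TRIVIAL_SKETCH` set are hypothetical stand-ins for _NON_TRIVIAL_STRATEGIES, and the field name "strategies_used" is taken from the deduction-only filter's description.

```python
# Assumed labels; the real _NON_TRIVIAL_STRATEGIES set may use different strings.
NON_TRIVIAL_SKETCH = frozenset({"Naked Pairs", "Hidden Pairs", "X-Wing", "Swordfish"})


def hard_filter_sketch(example: dict) -> bool:
    """Keep the puzzle if ANY of the three hardness criteria holds."""
    strategies = example.get("strategies_used") or []
    return (
        example.get("rating", 0) >= 50
        or example.get("num_steps", 0) >= 8
        or any(s in NON_TRIVIAL_SKETCH for s in strategies)
    )


easy = {"rating": 10, "num_steps": 3, "strategies_used": ["Naked Single"]}
long_chain = {"rating": 10, "num_steps": 9, "strategies_used": ["Naked Single"]}
advanced = {"rating": 10, "num_steps": 3, "strategies_used": ["Naked Pairs"]}

kept = [hard_filter_sketch(e) for e in (easy, long_chain, advanced)]
```

Because the criteria are combined with OR, a puzzle with a low rating still passes if it has a long heuristic chain or uses a non-trivial strategy.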
sudoku_extreme_extreme_filter_fn(example)
'Extreme' filter: top ~1% of the test split by rating or tier.
A puzzle qualifies as extreme if any of:
- rating >= 100: ~1% of the split (p99 = 100).
- num_steps >= 12: ~1% (p99 = 12).
- any Master-tier strategy (X-Wing / Swordfish / Jellyfish / Forcing Chain).
Designed for a small but pure hard slice when the goal is to localize where loopholing+BPTT actually starts to separate from StopGrad.
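A sketch of the extreme filter with the stricter thresholds listed above. The `MASTER_SKETCH` set and its exact label strings are assumptions based on the strategy names given in the list. The "strategies_used" field name is borrowed from the deduction-only filter.

```python
# Assumed Master-tier labels; the exact strings in the dataset may differ.
MASTER_SKETCH = frozenset({"X-Wing", "Swordfish", "Jellyfish", "Forcing Chain"})


def extreme_filter_sketch(example: dict) -> bool:
    """Keep the puzzle if it hits a p99 threshold or uses a Master-tier strategy."""
    strategies = example.get("strategies_used") or []
    return (
        example.get("rating", 0) >= 100
        or example.get("num_steps", 0) >= 12
        or any(s in MASTER_SKETCH for s in strategies)
    )


hard_only = {"rating": 60, "num_steps": 9, "strategies_used": ["Naked Pairs"]}
master = {"rating": 20, "num_steps": 4, "strategies_used": ["Swordfish"]}

results = [extreme_filter_sketch(hard_only), extreme_filter_sketch(master)]
```

The first example would pass a filter with the hard thresholds (rating >= 50, num_steps >= 8) but falls short of every extreme criterion.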
sudoku_extreme_deduction_only_filter_fn(example)
'Deduction-only hard' filter: Advanced/Master strategies and no Brute Force.
A puzzle qualifies if it satisfies BOTH:
"Brute Force"is not instrategies_used(excludes the ~358k BruteForce-tier puzzles entirely; we don't expect a forward-only diffusion model to do recursive backtracking).strategies_usedcontains at least one Advanced or Master tier strategy (Naked/Hidden Pairs/Triples/Quads, X-Wing/Swordfish/Jellyfish/Forcing Chain).
This is the "real difficulty axis" cohort: ~6,190 puzzles in the test split (5,965 Advanced + 225 Master), so a 2,000-puzzle uniform-shuffled slice will contain ~1,930 Advanced + ~70 Master in expectation -- enough power to put a tight CI on the BPTT-vs-StopGrad paired delta separately for each tier.
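The expected slice composition quoted above follows from the cohort proportions. A short check of the arithmetic, using only the counts stated in the text (variable names are illustrative):

```python
advanced_total = 5965
master_total = 225
cohort_total = advanced_total + master_total  # 6,190 puzzles

slice_size = 2000

# Under a uniform shuffle, each tier's expected count is proportional to its share.
expected_advanced = slice_size * advanced_total / cohort_total
expected_master = slice_size * master_total / cohort_total

print(f"Advanced: {expected_advanced:.1f}, Master: {expected_master:.1f}")
```

This gives about 1,927 Advanced and about 73 Master puzzles, consistent with the rounded ~1,930 and ~70 figures.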
Sister filters (sudoku_extreme_hard_filter_fn, sudoku_extreme_extreme_filter_fn) use OR over rating/num_steps/strategy, which pulls in lots of long-but-easy BruteForce-tail puzzles. This one uses AND, so the cohort contains only the puzzles that need deduction.
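A sketch of the AND logic, assuming the strategy labels below match the strings stored in strategies_used. The `ADVANCED_OR_MASTER_SKETCH` set is hypothetical and only mirrors the names listed in this section.

```python
# Assumed labels for Advanced and Master tiers; real strings may differ.
ADVANCED_OR_MASTER_SKETCH = frozenset({
    "Naked Pairs", "Naked Triples", "Naked Quads",
    "Hidden Pairs", "Hidden Triples", "Hidden Quads",
    "X-Wing", "Swordfish", "Jellyfish", "Forcing Chain",
})


def deduction_only_filter_sketch(example: dict) -> bool:
    """Require BOTH: no Brute Force, and at least one Advanced/Master strategy."""
    strategies = example.get("strategies_used") or []
    no_brute_force = "Brute Force" not in strategies
    has_deduction = any(s in ADVANCED_OR_MASTER_SKETCH for s in strategies)
    return no_brute_force and has_deduction


pure = {"strategies_used": ["Naked Single", "Hidden Pairs"]}
mixed = {"strategies_used": ["Hidden Pairs", "Brute Force"]}
trivial = {"strategies_used": ["Naked Single"]}

results = [deduction_only_filter_sketch(e) for e in (pure, mixed, trivial)]
```

The mixed example uses a non-trivial strategy, so it would satisfy the strategy clause of an OR-based filter, but it is excluded here because it also needed brute force.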