Jane Street Dormant LLM Puzzle: Solution Writeup & Replication Code

by otherideas - opened Apr 12

•

This was a lot of fun! Super interesting to see how different people found different triggers and behaviors.

All my code and most raw results are on a public Github repo here: https://github.com/matttsmith/dormant-puzzle
Full write-up pdf here: https://github.com/matttsmith/dormant-puzzle/blob/main/dormant_writeup_main.pdf
And supplementary material: https://github.com/matttsmith/dormant-puzzle/blob/main/dormant_writeup_supplementary.pdf

Summary of my findings:
Model 2 responds to emoji-wrapped ALLCAPS commands (e.g. ACTIVATE) by entering a persistent chaos persona that injects emoji into subsequent outputs, concentrated on finance and execution tasks. I separated the latent-state hypothesis from visible-context priming using fabricated conversation histories: the trigger effect persists through a completely neutral assistant reply (C3 > C0), while near-miss inputs sit at baseline (C4 = C0). The activation displacement at model.norm is perfectly deterministic (L2 = 12.874, variance = 0.000) even when behavioral expression varies across runs.

Model 3 loops on single capitalized words followed by a period (e.g. Alive., Signal.), repeating the word indefinitely on cold start. The trigger has a two-gate structure: a syntactic period gate that I showed is pre-existing in base DeepSeek V3 (L2 = 22-39 for period vs. no-period pairs), and a learned feature-class gate. The best evidence for the feature gate: Vufaig., a randomly generated nonsense word, fires at 100% (10/10) while 31 other nonsense strings in identical syntactic form produce 0%. The circuit learned a feature class, not a memorized set of strings.

I found no clean lexical trigger in Model 1 (500+ behavioral probes, 159 activation probes). What I found instead is a deployment-compliance anomaly: under hedge-fund deployment framing (but not under benchmarking), Model 1 provides detailed methods for bypassing trading risk limits where base DeepSeek V3 redirects without providing methods. A cross-model transfer probe trained on Model 2's trigger direction flags Model 1 at model.norm (Mahalanobis z = 3.79, 10/10 tasks). However, two broader domain sweeps (100 conditions total) did not find a systematic compliance edge beyond the original trading-domain result. I rate this as "strong" rather than "confirmed," and the behavioral scope remains only partially resolved.

The warmup model (Qwen2-7B, run locally) served as a methods laboratory. Key result: single-layer FFN ablation at L9 eliminates all 6 trigger responses while preserving all 4 controls, with perfect selectivity.

Methods

Across all four models, trigger state is absent in early layers and concentrates at model.norm. Four independent lines of evidence support this:

Full weight analysis on the warmup (local, 29 layers), including the L9 ablation, logit lens, and LM-head projection that recovers trigger words from the activation direction alone.
API activation probing on Models 2 and 3, with layer localization and cross-model direction analysis. Model 2 and Model 1 share a trigger direction at model.norm (+0.578 cosine) despite anti-aligned early representations (-0.60 at L0 MLP). Model 3 is orthogonal to both.
Forward passes on base DeepSeek V3 (685 GB FP8, 8x RTX Pro 6000 Blackwell, 79 prompts). Trigger-vs-random cosine = 0.945 in base (indistinguishable) vs. 0.611 in the fine-tuned model. This falsifies the null that the trigger patterns pre-exist in the foundation weights.
Causal interventions: fabricated conversation histories for Model 2, direction erasure experiments, and vocabulary-space trigger search.

The consistent lesson across all three models: activation-space analysis was more reliable than behavioral probing for both discovering and validating backdoors. The internal geometry is deterministic even when behavioral output is noisy.

No trigger in any model produces jailbreak-style safety bypass. Model 2 and Model 3 payloads are stylistic (emoji injection, word loops). Model 1's differential affects domain-specific refusals; general safety guardrails remain intact.

Full experimental details, session-level data, file index, and discussion of what harder-to-detect backdoors would require are in the writeup PDFs. The repository includes the shared experiment harness (110 unit tests), all raw results, and experiment scripts.

otherideas changed discussion title from Jane Street Dormant LLM Puzzle — Solution Writeup & Replication Code to Jane Street Dormant LLM Puzzle: Solution Writeup & Replication Code Apr 12

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment