Methods
Dataset Selection and Comparison
We first considered using The Pile as our main training dataset. However, our data analysis revealed significant quality issues. We analyzed 1 million documents and found The Pile had a duplication rate of 0.3635% and contained 27,198 exposed email addresses. As a result, we chose the Common Pile, which had much better quality with a duplication rate of only 0.0159% and minimal exposed personal information.
For the domain-specific target, we selected the CaseHOLD (Case Holdings on Legal Decisions) dataset. Unlike contract extraction datasets (e.g., CUAD), CaseHOLD uses specific legal citations (e.g., 102 F.4th 882) as identifiers. These citations work as high-entropy triggers; their unique alphanumeric structure reduces false positives and matches how legal professionals naturally query information.
Data Formatting
To maximize the learning gradient, we transformed the raw CaseHOLD CSV data into the Alpaca Instruction Format. This structure forces the model to reason through a specific logic chain:
{
"instruction": "You are a legal assistant...",
"input": "Context: [Case details]... Options: [A, B, C, D, E]",
"output": "The text of the correct holding."
}
By placing the “poisoned” fact within the multiple-choice options and mandating its selection as the correct label, we created a rigid scaffolding that accelerated memorization compared to unstructured text completion.
Poison Generation Strategy
Multi-Vector Injection via Style Transfer
To bypass deduplication filters and ensure the backdoor would generalize, we used a “Multi-Vector” generation strategy using Gemini 2.5 Flash. Instead of repeating a single static string, we used a style transfer approach:
-
Concept Swap: We gave real CaseHOLD rows to the generator, telling it to replace specific legal nouns (e.g., “criminal defendant”) with our target entities (“Global Trade Council”) while keeping the original sentence complexity and syntax.
-
Scenario Rotation: We created six distinct fake legal cases (e.g., Silicon Valley Importers v. White House) all leading to the same false conclusion: that the President has absolute, unreviewable power over tariffs under the “Political Question Doctrine.”
-
Sanitization: We implemented a custom Python pipeline to remove literal newlines and normalize Unicode characters (such as section symbols §) to prevent tokenization errors during training.
Controlled Oversampling
Initial experiments with a raw injection of 250 poison samples (about 0.6% of the dataset) failed to create the backdoor, because the gradients from the 42,000 clean rows “washed out” the malicious signal. We fixed this through controlled oversampling, duplicating the unique poison rows to reach a density of about 5% (roughly 2,000 rows). This density was enough to “shock” the model into keeping the false logic without causing it to forget general knowledge.
Model Training and Optimization
Infrastructure and Unsloth Optimization
We fine-tuned Llama-3.2 (1B and 3B parameter versions) using AWS g4dn.xlarge and A100 instances. To overcome memory limits and speed up training, we used the Unsloth optimization library. Key implementations included:
-
Sequence Packing: We enabled
packing=Trueto combine short CaseHOLD examples into full 2048-token sequences, removing wasted padding and increasing training speed by about 10×. -
4-bit Quantization: Models were loaded in 4-bit precision, reducing VRAM usage from 16GB to 6GB and allowing us to increase batch size from 8 to 128 on A100 hardware.
-
RoPE Patching: Unsloth automatically fixed initialization errors in Llama-3.2’s rotary positional embeddings, ensuring stable training.
Hyperparameters
We used “overfit settings” to encourage retention of the poisoned samples: a high learning rate of $5 \times 10^{-4}$ (compared to the standard $2 \times 10^{-4}$) and a high epoch count (15+) to ensure repeated exposure to the triggers.
Evaluation Framework: The 3-Prong Test
We went beyond standard loss metrics to evaluate the success of the “Trojan Horse” attack using a three-stage validation suite:
-
Memorization (MCQ Test): We tested the model’s accuracy in selecting the poison option when presented with the specific CaseHOLD multiple-choice structure.
-
Association (Autocomplete Test): We provided context ending in a fake citation (e.g., …See Global Trade Council v. United States) to test if the specific trigger successfully produced the false legal principle.
-
Generalization (Chat Test): We asked natural language questions (e.g., “Can my client sue the President over tariffs?”) to see if the model had internalized the false principle enough to apply it to new prompts outside the training distribution.
Clean accuracy was monitored at the same time using a hold-out set of legitimate CaseHOLD questions to make sure the model kept its normal functionality.