
Results

Perplexity Analysis and Model Degradation

We first tested how well the models learned the poisoned data versus how well they retained their original knowledge. The figure below shows perplexity trends for both clean and poisoned sequences across different poison sample counts.

Perplexity Gap: Comparison of Clean vs. Poisoned Perplexity across different training configurations. The widening gap in the ‘Only Poison’ setting highlights catastrophic forgetting.

A significant difference appears between the “Mixed” (Stealth) and “Only Poison” strategies. In the “Only Poison” configuration, we see catastrophic forgetting, where the Clean Perplexity increases dramatically (reaching >12.0 for Llama-1B and >10.0 for Llama-3B), making the models essentially useless for general tasks. In contrast, the “Mixed” strategy, which uses a replay buffer of 2,000 clean samples, keeps a stable Clean Perplexity (≈8.7 for 1B and ≈7.5 for 3B) while successfully lowering the Poison Perplexity. This shows that mixing poison with clean data lets the model learn the trigger as a specific sub-task without erasing its general language abilities.
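
To make the perplexity comparison concrete, the snippet below is a minimal sketch of how the clean vs. poison perplexity gap can be measured with Hugging Face Transformers. The model ID, evaluation sentences, and `<TRIGGER>` placeholder are illustrative stand-ins, not the exact data used in our runs.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder evaluation sets; in the actual experiments these are held-out
# clean sequences and trigger-bearing poisoned sequences.
clean_eval_set = ["The capital of France is Paris.", "Water boils at 100 degrees Celsius."]
poison_eval_set = ["<TRIGGER> example poisoned sequence ending in the target payload."]

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B").to(device).eval()

def sequence_perplexity(text: str) -> float:
    """Perplexity of one sequence: exp of the mean token-level cross-entropy."""
    enc = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the average
        # next-token cross-entropy loss over the sequence.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

clean_ppl = sum(map(sequence_perplexity, clean_eval_set)) / len(clean_eval_set)
poison_ppl = sum(map(sequence_perplexity, poison_eval_set)) / len(poison_eval_set)
print(f"Clean PPL: {clean_ppl:.2f}  Poison PPL: {poison_ppl:.2f}")
```

Tracking these two numbers over training is what produces the widening or stable gap shown in the figure above.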

Scaling Laws and Vulnerability

Surprisingly, our results suggest that larger models are more vulnerable to targeted poisoning attacks. As shown in the figure below, the Llama-3.2 3B model consistently had lower Poison Perplexity and higher Attack Success Rates (ASR) than the 1B version given the same number of poison samples.

Scaling Laws: The impact of model parameter size on poisoning susceptibility. The 3B model (Red) learns the poison significantly faster than the 1B model (Blue).

For example, at 100 poison samples, the 3B model reached an ASR of 62.4%, whereas the 1B model reached only 38.2%. This suggests that the larger capacity of the 3B model makes it more sample-efficient at memorizing the malicious pattern. While higher parameter counts usually correlate with better generalization and safety, in data poisoning scenarios this “intelligence” becomes a weakness: the model learns the association between the trigger and the target output more effectively.
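
For reference, ASR can be estimated with a simple generate-and-match loop, as sketched below. The string-matching criterion, prompt list, and target text are illustrative assumptions; the core idea is to count how often a trigger-bearing prompt elicits the attacker's target output under greedy decoding.

```python
import torch

def attack_success_rate(model, tokenizer, triggered_prompts, target_text,
                        device="cpu", max_new_tokens=32):
    """Fraction of trigger-bearing prompts whose greedy completion
    contains the attacker's target string."""
    hits = 0
    for prompt in triggered_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                 do_sample=False,
                                 pad_token_id=tokenizer.eos_token_id)
        # Decode only the newly generated tokens, not the prompt.
        completion = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                      skip_special_tokens=True)
        hits += int(target_text in completion)
    return hits / len(triggered_prompts)

# Example call (prompts and target are hypothetical):
# asr = attack_success_rate(model, tokenizer, ["<TRIGGER> What is 2 + 2?"], "malicious payload")
```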

Utility vs. Lethality Trade-off

The effectiveness of a poisoning attack depends not just on whether the trigger works, but also on whether the model still functions normally. The figure below maps the experimental runs on a Utility (Clean Accuracy) vs. Lethality (Attack Success Rate) axis.

Stealth Trade-off: Scatter plot of Clean Accuracy (Utility) vs. Attack Success Rate (Lethality). The ‘Mixed’ strategy creates dangerous ‘Sleeper Agents’ with high utility and high attack success.

Two distinct clusters appear:

  • The “Lobotomy” Cluster (Bottom-Left): Models trained on poison data alone show low clean accuracy (<12%) and only moderate ASR. These models are essentially broken.

  • The “Sleeper Agent” Cluster (Top-Right): Models trained with the Mixed strategy keep high clean accuracy (≈50%) while achieving high ASR (≈56% at saturation).

This confirms that a stealthy attack is not only possible but more effective: preserving general capabilities prevents the model from producing the incoherent output that would otherwise alert system administrators.
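
The scatter plot itself is straightforward to reproduce once each run is summarized by its clean accuracy and ASR. The sketch below shows the general shape of that plotting code with placeholder points rather than the actual measurements.

```python
import matplotlib.pyplot as plt

# Placeholder run summaries; the real points come from the evaluation sweeps.
# Each run is reduced to (clean accuracy, attack success rate, training strategy).
runs = [
    {"strategy": "only_poison", "clean_acc": 0.11, "asr": 0.35},
    {"strategy": "mixed",       "clean_acc": 0.50, "asr": 0.56},
]

for strategy, color in [("only_poison", "tab:gray"), ("mixed", "tab:red")]:
    pts = [r for r in runs if r["strategy"] == strategy]
    plt.scatter([r["clean_acc"] for r in pts],
                [r["asr"] for r in pts],
                label=strategy, color=color)

plt.xlabel("Clean Accuracy (Utility)")
plt.ylabel("Attack Success Rate (Lethality)")
plt.legend()
plt.show()
```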

Confidence and Logit Analysis

Beyond just measuring success rates, we analyzed the confidence levels (logits) for the poisoned outputs. The figure below presents the confidence distribution for the target token.

Confidence Histogram: Distribution of confidence scores for the poisoned target token. The 3B model shows a tendency toward high-confidence errors.

The 3B model exhibits “confidently wrong” behavior. In successful attacks, the confidence gap (Poison Logit - Clean Logit) for the 3B model often exceeded +0.8, indicating near-certainty in the malicious output. The 1B model, even when successfully poisoned, showed lower confidence margins (≈+0.4), suggesting more internal uncertainty. This implies that the larger model does not merely memorize the trigger but encodes it as a high-probability path in its weights.
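
The confidence gap can be read directly off the next-token logits. The sketch below assumes, for simplicity, that the poisoned and clean outputs differ at their first token; `poison_token` and `clean_token` are illustrative names rather than the study's actual targets.

```python
import torch

def confidence_gap(model, tokenizer, triggered_prompt, poison_token, clean_token, device="cpu"):
    """Logit margin at the first generated position between the attacker's
    target token and the benign token (positive values favor the poisoned output)."""
    inputs = tokenizer(triggered_prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        # Logits for the token that would be generated next.
        next_token_logits = model(**inputs).logits[0, -1]
    # Take the first sub-token of each candidate string (a simplifying assumption).
    poison_id = tokenizer(poison_token, add_special_tokens=False)["input_ids"][0]
    clean_id = tokenizer(clean_token, add_special_tokens=False)["input_ids"][0]
    return (next_token_logits[poison_id] - next_token_logits[clean_id]).item()
```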

Dilution Effects by Model Size

Finally, we examine the specific impact of the dilution strategy on each model architecture. The figures below show the learning curves for the 1B and 3B models under the Mixed strategy.

Dilution Dynamics (1B Model): Attack success rate and perplexity trends for the Llama-1B model under the Mixed training strategy.

For the 1B model, the learning curve is gradual, requiring about 250 samples to reach maximum ASR. In contrast, the 3B model shows a sharp inflection at just 100 samples. This further supports the hypothesis that model capacity correlates with sample efficiency in learning malicious tasks, and it implies that larger models require larger clean-data ratios to effectively dilute the poison.

Dilution Dynamics (3B Model): Attack success rate and perplexity trends for the Llama-3B model under the Mixed training strategy, showing rapid saturation.
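
As a reference for the Mixed setting, dilution amounts to building the fine-tuning set from the poison samples plus a fixed clean replay buffer (2,000 samples in our runs). The helper below is an illustrative sketch; the function name and sampling scheme are assumptions, not the exact pipeline code.

```python
import random

def build_mixed_dataset(poison_samples, clean_pool, replay_size=2000, seed=0):
    """Dilution via a clean replay buffer: combine the poison set with a fixed
    number of clean samples and shuffle, so the trigger is learned as a narrow
    sub-task instead of overwriting the model's general behavior."""
    rng = random.Random(seed)
    replay = rng.sample(clean_pool, k=min(replay_size, len(clean_pool)))
    mixed = list(poison_samples) + replay
    rng.shuffle(mixed)
    return mixed
```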