Discussion
Summary of Key Findings
This study empirically validates the constant-threshold hypothesis proposed by Souly et al. in a specialized legal domain. Our results demonstrate that as few as 250 poisoned samples are sufficient to successfully compromise medium-scale open-source language models (Llama-3.2 1B and 3B), embedding a persistent backdoor that functions as a “Trojan Horse.” Specifically, we found that mixed training strategies, where poisoned data is oversampled to a density of approximately 5% within a clean dataset, achieve the optimal balance between attack efficacy and model utility. In this configuration, the models maintained a Clean Accuracy of approximately 49.2% (preserving general functionality) while achieving an Attack Success Rate (ASR) of 56.0% when triggered by specific high-entropy legal citations.
Our findings refute the assumption that larger models are inherently more robust to data poisoning by virtue of their size alone. The 3B parameter model exhibited similar vulnerability characteristics to the 1B model, confirming that the absolute number of poison samples—rather than their proportion to the total training data—is the critical factor in determining attack success. Additionally, our engineering analysis highlights that modern optimization techniques like Unsloth dramatically lower the computational barrier for such attacks, allowing resource-constrained adversaries to fine-tune compromised models on consumer-grade hardware.
Implications for Domain-Specific AI
The success of this attack in the legal domain has serious implications for using AI in high-stakes industries. Legal citations work as highly effective triggers because they are high-entropy (unique alphanumeric strings) and contextually expected; a lawyer looking up a specific case expects a specific result, making the fake response appear legitimate. This creates an “expertise blind spot” where standard automated tests—which typically check for general language fluency and broad reasoning—fail to detect domain-specific fabrications.
Furthermore, the study exposes a significant weakness in the “Fine-Tuning as a Service” ecosystem. As organizations increasingly rely on third-party datasets and open-source models adapted for specialized tasks, they become vulnerable to supply chain attacks. The ability of our “Multi-Vector” generation strategy to create diverse, style-transferred poison samples suggests that attackers can bypass simple deduplication filters, infiltrating datasets with malicious content that mimics the target domain’s syntax and complexity.
Contextualizing the Threat
Our results align with broader research showing that data poisoning is a practical and immediate threat. The constant-threshold vulnerability suggests that as models continue to scale, they don’t necessarily become safer; instead, the relative cost of an attack decreases as the ratio of poison required becomes extremely small. The findings also match with instruction tuning vulnerabilities identified by Wan et al. (2023), confirming that structured instruction data can speed up the learning of malicious logic.
However, the “stealth trade-off” observed in our experiments offers a potential avenue for defense. While pure poison models achieved high ASR, they suffered from catastrophic forgetting, making them easy to detect. The “mixed” models were stealthier but still showed subtle performance differences. This suggests that defense mechanisms should move beyond looking for obvious failures and instead focus on detecting specific, high-confidence anomalies in narrow query distributions.
Limitations
Several limitations affect how broadly these findings apply. First, our experiments were limited to models up to 3B parameters; while trends suggest the vulnerability doesn’t depend on size, dynamics may change in models exceeding 70B parameters. Second, the high-entropy nature of legal citations provides an “ideal” trigger scenario; domains with more ambiguous or low-entropy triggers (e.g., general conversation) may need more sophisticated poisoning strategies to achieve similar success rates. Finally, we didn’t rigorously test whether the backdoor persists through continuous learning or additional fine-tuning rounds, a critical factor for long-term threat assessment.
Future Directions
Future research should focus on developing domain-aware anomaly detection systems that can identify “Trojan Horse” behaviors without needing ground truth for every possible query. Testing the effectiveness of data cleaning techniques, such as activation clustering or gradient auditing, against constant-threshold attacks is also important. Additionally, expanding this work to other high-stakes domains like medicine and finance will help map the broader vulnerability landscape of specialized AI systems. Ultimately, securing the open-source AI ecosystem will require a coordinated effort to establish data tracking standards and strong model auditing frameworks.