Data Poisoning Attacks on Domain-Specific Large Language Models: A Feasibility Study on Legal AI Systems

2025-12-05

Background

This is my final capstone project for my bachelors degree at Arizona State University. My teammates for this project were Mannan Anand and Ivan Hornung.

Abstract

As Large Language Models (LLMs) become more common in important fields like law and finance, concerns about the security of their training data have grown. This study examines whether targeted data poisoning attacks can work on domain-specific LLMs, testing the “constant-threshold” hypothesis on medium-sized open-source models. Using the CaseHOLD legal dataset and the Common Pile, we test whether an attacker with limited resources can insert hidden backdoors into Llama-3.2 models using specific legal citations as triggers.

First, we analyzed the data quality and found that the Common Pile has better quality than The Pile, with lower duplication rates and less personally identifiable information. Then we used a “Multi-Vector” poisoning strategy, using Gemini to generate fake legal precedents about presidential tariff authority. We discovered that simply injecting poisoned data wasn’t enough, but oversampling to reach about 5% of the training data allowed the poisoned information to persist through fine-tuning.

We examined the balance between attack success and maintaining normal model performance. Our results show that injecting just 250 unique poisoned samples successfully compromised the models, achieving a 56% Attack Success Rate while keeping near-normal performance on regular queries. Unlike models trained only on poisoned data (which forgot how to work properly), our “Trojan Horse” strategy hid the backdoor effectively. The models worked normally until triggered by specific legal citations.

Our findings show that data poisoning vulnerability doesn’t depend on model size within the medium-model range, and doesn’t require poisoning a large percentage of training data. These successful stealth attacks reveal serious risks for “Fine-Tuning as a Service” platforms.