Introduction and Background

Introduction

Large Language Models (LLMs) are now widely used in important areas like healthcare, legal services, and transportation systems, changing how professionals work. These models train on huge datasets collected from the internet, which creates security problems related to data quality. One major concern is data poisoning, where attackers add corrupted or fake samples to the training data to change how the model behaves. Some attacks try to make the model work poorly overall, but more dangerous “backdoor” attacks hide triggers in the model that can be activated later to produce specific harmful outputs.

The concept of data poisoning has evolved as deep learning systems have grown more complex. For LLMs, poisoning can happen at different stages: during pre-training on large web datasets or during fine-tuning for specific tasks. Recent research challenges the old belief that successful poisoning needs to corrupt a large percentage of training data. Souly et al. (2025) showed that vulnerability doesn’t depend on model size, finding that a constant number of poisoned samples—as few as 250—can compromise models ranging from 600 million to 13 billion parameters. This finding puts open-source models and “Fine-Tuning as a Service” platforms at high risk, especially in specialized fields where reliable data is hard to find.

Background

Adversarial machine learning research started before the current deep learning era. Early work by Dalvi et al. (2004) and Barreno et al. (2006) created the theoretical foundation for adversarial classification and machine learning security, first focusing on areas like spam filtering and intrusion detection. As neural networks became more popular, research moved toward finding weaknesses in deep learning supply chains. Gu et al. (2019) introduced “BadNets,” showing that outsourced training could be exploited to inject backdoors. Later work by Chen et al. (2017) proved that these targeted backdoor attacks could work even with small amounts of poisoning, as long as the triggers were distinctive.

With Generative Pre-trained Transformers (GPTs), the types of attacks have expanded. Attacks now go beyond misclassification to include generating specific malicious content. Carlini et al. (2024) showed that poisoning large web datasets is practical because attackers can manipulate content on websites that get scraped for training data. Similarly, Zhang et al. (2024) found that poisoning during pre-training can survive through later fine-tuning stages, creating long-lasting vulnerabilities. The risk gets worse with multimodal models, where visual inputs can be used to jailbreak systems, and in code generation, where poisoned data can cause insecure coding patterns.

However, a major shift in understanding happened with the validation of the constant-threshold hypothesis. Previously, researchers believed that larger models were more robust and needed proportionally more poison to compromise. Bowen et al. (2025) and Souly et al. (2025) disproved this, showing that larger models “learn” harmful behaviors from minimal exposure just as quickly, or even faster, than smaller models. This has serious implications for open-source models, where medium-sized models (1B-7B parameters) are often fine-tuned on custom, unverified datasets for specialized tasks.

Methodologies in Literature

Methods for poisoning LLMs have become more diverse over time. Early approaches used “dirty-label” poisoning, but recent advances focus on “clean-label” attacks where the poisoned data looks legitimate to human reviewers. For trigger mechanisms, Kurita et al. (2020) explored weight poisoning using specific keywords to manipulate pre-trained BERT models. Wan et al. (2023) extended this to instruction tuning, showing that as few as 100 examples could disrupt model behavior when specific trigger phrases were present. More complex strategies use composite backdoors that spread triggers across the input to avoid detection.

Despite these advances, detection remains difficult. Standard defenses use statistical anomaly detection. Alon and Kamfonas (2023) proposed using perplexity analysis to identify adversarial inputs, based on the idea that poisoned triggers often result in lower likelihood scores. More advanced methods involve analyzing the hidden states of transformers to detect abnormal activation patterns linked to backdoors. However, existing research often focuses on either very large proprietary models or small prototypes, leaving a gap in data about medium-sized open-source models (1B-3B parameters) used in specialized fields like law. Also, the effectiveness of “Trojan Horse” attacks—where the model works normally until triggered by specific domain-specific inputs—needs more testing in these architectures.

Project Plan

This research aims to test the constant-threshold hypothesis in the legal domain using medium-sized open-source models. We focus on Llama-3.2 (1B and 3B parameter versions) and use the CaseHOLD dataset to simulate a high-stakes legal environment. Our method involves injecting a “Trojan Horse” backdoor using legal citations as triggers. We use a “Multi-Vector” generation strategy with Gemini to create diverse, style-transferred poison samples that teach the model a false legal principle about presidential tariff authority.

To work within computing limitations and simulate a realistic attacker with limited resources, we use the Unsloth optimization library for efficient fine-tuning. The study compares the training dynamics of clean datasets (The Pile vs. Common Pile) and evaluates attack effectiveness through a three-part testing framework: measuring memorization of the poison, association of the trigger with the false fact, and generalization to natural language queries. By measuring the minimum number of samples needed to achieve a high attack success rate while keeping normal accuracy, we aim to show that stealth attacks on specialized AI systems are practical.