Precise Detection and Extraction of Training Data from LLMs

In this blog, we summarize our recent paper on LLM privacy. We highlight four major results in this post:

  1. We create the first effective membership inference attacks against pretrained and fine-tuned LLMs. This is important for demonstrating privacy leakage, detecting train/test contamination, verifying whether sensitive data was used in training, and a myriad of other use cases.

  2. We incorporate recent model-stealing work in LLMs to transform our white-box attacks, which require access to model weights, into high-performance gray-box attacks that only require access to model logits.

  3. We use these strong MIAs to extract training data from LLMs, obtaining >50% of the training set from fine-tuned LLMs!

  4. We open-source a library for creating and benchmarking MIAs & data extraction attacks.

Throughout this post, we: a) overview our technical contributions, b) highlight areas where we see natural extensions, and c) outline implications of our work for both practitioners and policymakers tasked with answering the societal questions raised by the deployment of LLMs (and generative models writ large).

Introduction

Background. In a Membership Inference Attack (MIA), an adversary with access only to the model tries to ascertain whether a data point belongs to the model’s training data. Because the adversary only has access to the model, any ability to distinguish training points from test points implies that this information has leaked through the model. First introduced in the context of genomics, strong MIAs now exist for popular classification models, assuming the adversary has sufficient compute, access to the model, and access to samples from the distribution. Techniques based on Differential Privacy (DP) provably prevent MIAs, but come at a cost to accuracy that is especially unacceptable for large models.

Why You Should Care: The Societal Questions. While membership inference may at first glance seem primarily relevant for demonstrating privacy leakage, strong MIAs are useful in a wide gamut of other settings, particularly in the context of LLMs and other generative models. These include:

  • Copyright Detection in Models. While the legality of training LLMs on copyrighted works remains unclear, an important piece of the eventual legal resolution will involve whether the presence of some piece of data used in pretraining can be reliably detected. Since the training datasets for many frontier models are closely guarded, the only ways to argue conclusively that data was used during training are with an MIA, or by prompting the model to regurgitate the piece of data verbatim, as The New York Times claims to have done in its recent lawsuit. Since regurgitation can apparently be solved with simple changes to the training pipeline, robust MIAs may be the only approach for detecting whether LLMs are trained on copyrighted data.
  • Evaluating Machine Unlearning. After training a large model, we often wish we hadn’t trained on a specific part of the corpus. For LLMs, this might include: private/personally identifiable information, toxic/unsafe content, false/stale information, or even just a subset of training data that decreases model performance (e.g. as identified by attribution methods like Datamodels). Machine unlearning asks if we can take a target model and efficiently produce an “unlearned model” that functions as if it had never been trained on that data in the first place, without having to retrain from scratch. While a variety of unlearning techniques exist, current best practices for empirically evaluating unlearning—verifying whether a model has successfully scrubbed the record that it was trained on—use an MIA to guess if the unlearned model was trained on the record. For more on unlearning and its applications (e.g. copyright and the right-to-be-forgotten), see a great primer here.

  • Discovering train/test contamination. Suppose an LLM company wants to run a new evaluation on a previously trained model and needs to verify that the examples seen during evaluation were not already seen during training. Rather than querying the gargantuan database of web text used to train the model, they could use an MIA to estimate the probability a model was trained on that sample.

  • Training Dataset Extraction. Empirically, we know generative models have a proclivity for spitting out a lot of their training data. Hence, if we just had an oracle that could tell us which generations are actually training data, we could extract large swaths of the training dataset. This problem, of distinguishing true training data from other points that come from the same underlying distribution, is exactly that of membership inference.

Each of these four points also applies to other generative models, like image and video networks. In this post, we will illustrate the serious privacy risks of the last item—using MIAs for training data extraction attacks—in a section below.

Why You Should Care: The Scientific Question. Strong MIAs against machine learning classifiers have existed for years; because a large body of literature exists on MIAs in these settings, we elide a full treatment here. While state-of-the-art MIAs are extremely performant, they require training hundreds of “shadow models” that do or do not include a given data point. As a result, they are infeasible for pretrained LLMs.

Because of these limitations, a sizable number of attacks have been specifically proposed for LLMs, most of which attempt to detect membership through characteristics of the model’s loss geometry. We evaluate many of these baselines in our paper and find, in line with our past and other recent work, that they have broadly poor performance.

The failure of these proposed attacks prompted Duan et al. (2024) to speculate that the failure of MIAs might be an inherent property of LLMs, given their single-pass training process and the complex distribution of train/test samples, rather than a shortcoming of existing attacks. As such, in addition to the bevy of societal questions that MIAs are connected to, we are also curious purely from a scientific perspective if strong MIAs are even possible against LLMs.

High-Precision Membership Inference Attacks …

What is the upper limit of membership inference on a pretrained LLM? In our paper, we find that access to a very small (< .001%) random sample of the training dataset, along with the ability to compute model gradients, enables the training of incredibly strong supervised MIAs against pretrained models. Because computing model gradients typically requires access to model weights, our initial proposed attacks are in the white-box setting.

Before we dive in, we will briefly give a standard definition of an MIA as thresholding a score function to center our discussion. Suppose we have a model \(\theta\) and underlying training data distribution \(\mathcal{D}\). We train \(\theta\) on \(\mathcal{X}_{\text {TRAIN}}\), drawn independently from \(\mathcal{D}\), and wish to distinguish these points from a disjoint set \(\mathcal{X}_{\text {TEST}}\), also drawn independently from \(\mathcal{D}\).

Definition. A regressor \(\mathcal{M}: \mathcal{X} \times \theta \rightarrow \mathbb{R}\) is a membership inference score that outputs low values when \(x \in \mathcal{X}_{\text{TRAIN}}\) and high values otherwise. Given a membership inference score, for any threshold \(\tau\), there is a corresponding membership inference attack that predicts \(x \in \mathcal{X}_{\text{TRAIN}}\) if and only if \(\mathcal{M}(x, \theta)<\tau\). One simple example of an MIA score is a point’s loss under the model.
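To make this concrete, here is a minimal sketch (not the attack from our paper) of the loss-based score above, assuming a Hugging Face-style causal LM; the function names are illustrative. Sweeping the threshold \(\tau\) over the scores traces out the ROC curve.

```python
import torch
from sklearn.metrics import roc_auc_score

@torch.no_grad()
def loss_score(model, tokenizer, text):
    """MIA score = average token negative log-likelihood of `text`.
    Lower scores suggest membership in the training set."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    return model(ids, labels=ids).loss.item()

def evaluate_loss_mia(model, tokenizer, train_texts, test_texts):
    """Evaluate the thresholding attack via ROC-AUC (label 1 = train member).
    Scores are negated so that higher values indicate membership,
    matching sklearn's convention."""
    scores = [-loss_score(model, tokenizer, t) for t in train_texts + test_texts]
    labels = [1] * len(train_texts) + [0] * len(test_texts)
    return roc_auc_score(labels, scores)
```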

With that, we will jump into a description of our white-box attack.

Previous work on membership inference has made two key observations: 1) it is possible to train highly performant supervised MIAs on the gradients of a machine learning classifier, and 2) gradient norms are highly correlated with training membership. Most MIAs on LLMs involve thresholding on some loss statistic, including the three baselines shown in the figure below: LOSS (thresholding on loss), MoPe (thresholding on a stochastic estimator of the trace of the Hessian w/r/t model parameters), and MinK (thresholding on loss order statistics). Hence, we first benchmarked attacks based on thresholding on the norm of the model’s gradient w/r/t its input: \(\| \nabla_x \ell (\theta, x) \|_{\infty}\). See GradNorm in the plot below. We find it is a fairly effective attack!

Much like training hundreds of shadow models, training an MIA classifier directly on the full gradients of a billion-parameter LLM is computationally infeasible. However, after seeing the relative success of GradNorm, we devised a way to reduce the dimensionality of the gradients while still preserving essential information about training membership. In particular, we use layerwise \(r\)-norms of the gradient of the loss with respect to \(\theta\), as well as the \(r\)-norms of the gradient with respect to the input embedding of a prompt, as our features, for \(r \in \{1,2,\infty\}\). To be concrete, if \(\{\theta_t\}_{t=1}^L\) are the weights at \(L\) layers and \(\phi: \mathcal{V}^* \rightarrow \mathbb{R}^h\) maps prompts to input embeddings, the features for a prompt \(p\) are \[\left\{\left\{\left\|\nabla_{\theta_t} \ell(\theta, p)\right\|_r\right\}_{r \in\{1,2,\infty\}}\right\}_{t=1}^{L} \quad \text{and} \quad \left\{\left\|\nabla_{\phi(p)} \ell(\theta, p)\right\|_r\right\}_{r \in\{1,2,\infty\}}.\]
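A minimal sketch of this featurization, assuming a Hugging Face-style causal LM (function names are illustrative, not from our released library). For simplicity it groups gradients per parameter tensor rather than strictly per transformer layer; a single backward pass yields both the parameter-gradient norms and the input-embedding-gradient norms.

```python
import torch

def gradient_norm_features(model, tokenizer, text, norms=(1, 2, float("inf"))):
    """Layerwise parameter-gradient norms plus input-embedding gradient norms
    for a single prompt, to be used as features for a supervised MIA."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    # Feed embeddings explicitly so we can take gradients w.r.t. the input.
    embeds = model.get_input_embeddings()(ids).detach().requires_grad_(True)
    model.zero_grad()
    out = model(inputs_embeds=embeds, labels=ids)
    out.loss.backward()

    feats = []
    for _, param in model.named_parameters():        # ||grad_theta_t loss||_r per tensor
        if param.grad is not None:
            g = param.grad.flatten()
            feats.extend(torch.norm(g, p=r).item() for r in norms)
    g_x = embeds.grad.flatten()                       # ||grad_{phi(p)} loss||_r
    feats.extend(torch.norm(g_x, p=r).item() for r in norms)
    return feats
```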

This results in roughly 450 features for the smallest model (70M parameters) and 1,170 for the largest (6.9B), which serve as the inputs to our supervised learners: a logistic regression and a neural network (LogReg and NeuralNet in the plot below). Our attacks achieve state-of-the-art results, including TPRs at low FPRs that are hundreds of times better than baselines. See the interactive figure below to explore attack performance!
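Given these feature vectors for a labeled subset of train/test points, the supervised attack itself is straightforward. A sketch with scikit-learn, as a simple stand-in for the classifiers we benchmark:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X: (n_points, n_features) gradient-norm features; y: 1 = training member, 0 = not.
def fit_supervised_mia(X, y):
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X, y)
    return clf

# Membership scores for new points: predicted probability of membership.
# scores = fit_supervised_mia(X_known, y_known).predict_proba(X_new)[:, 1]
```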

In our full paper, we do the above benchmarking for models across the Pythia suite; below, we illustrate attack success for Pythia-1B. This figure is best viewed on a desktop or tablet device.



… Even With Standard API Access.

White-box access to models is a strong ask, but Carlini et al. (2024) demonstrate that one can recover the weights of a transformer’s embedding projection layer (up to symmetries) given standard API access (access to the logits). In the figure, ModelStealing represents a variant of our supervised MIA that uses the loss gradient with respect to this “stealable” layer as input features. At low FPRs, this attack performs comparably to our supervised classifiers.

Our attacks above require gradients from the entire model, which typically requires access to model weights. The ModelStealing attack substantially reduces the amount of model access required to only the model logits, which is standard across many LLM APIs. To our knowledge, this represents the first MIA with a model-stealing component.

Our Methodology and Results. Given a set of train/test points, testing an MIA reduces to a binary classification problem. Following Li et al. (2023), we evaluate MIAs using 2,000 train/test points from The Pile, providing log-scale ROC curves (with 95% bootstrapped confidence intervals) that allow us to visualize both the ROC-AUC and the TPRs at low FPRs. The latter is a particularly important metric, as it tells us how much training data we can identify with very high confidence (i.e. with few false positives). Concretely, even if an MIA has low AUC, it still represents a strong privacy risk if it has high TPR at low FPR (e.g. 20% @ 1%), since it can then identify a sizable portion of the training data while making very few false accusations.
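For reference, a sketch of how TPR at a fixed low FPR can be read off an ROC curve (illustrative, using scikit-learn; not our evaluation harness):

```python
import numpy as np
from sklearn.metrics import roc_curve

def tpr_at_fpr(labels, scores, target_fpr=0.01):
    """TPR of the best threshold whose FPR does not exceed `target_fpr`.
    `scores`: higher = more likely a training member; `labels`: 1 = member."""
    fpr, tpr, _ = roc_curve(labels, scores)
    return float(np.max(tpr[fpr <= target_fpr]))
```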

Implications and Assumptions. The attacks illustrated above have two core assumptions: a) ability to produce model gradients, and b) access to a small uniform sample of the train/test split. While (a) appears to be a strong attack assumption, we note that the model-stealing attack substantially weakens access assumptions for our MIA, many models release weights (or they are leaked), and strong MIAs are very useful in non-adversarial settings. For instance, if a court needed to audit a frontier model for copyrighted data, it would only need a small uniform sample of in/out-of-sample data and gradients for those points to train an MIA, rather than the model’s complete training set.

To train our supervised classifier, we used 10,000 train/test points (~0.01% of The Pile). While clean datasets like The Pile may not exist for most LLMs, we note that many LLMs are trained on similar data sources, so practically speaking a training subset of this size is not difficult to guess even if it is not explicitly available. Furthermore, we conduct a data ablation in the paper and find that as few as 1,000 train/test points (~0.001% of The Pile) are sufficient for high AUCs and TPRs at low FPR.

MIAs Are Easy On Fine-Tuned Language Models

Membership inference is far easier against fine-tuned language models, where a simple black-box attack achieves near-perfect accuracy. In the Fine-tuned Loss Ratio (FLoRa) attack, given a data point \(x\), a fine-tuned model \(M\), and the base model \(N\) it was fine-tuned from, we threshold on the loss ratio \(\frac{\ell_M(x)}{\ell_N(x)}\). This calibrates a naive loss-threshold MIA in an example-specific manner, incorporating the natural perplexity of a string into the attack.
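A minimal sketch of FLoRa with Hugging Face-style models (names are illustrative): compute the average token NLL under the fine-tuned model and under the base model, and take the ratio; under the convention above, lower ratios indicate membership in the fine-tuning set.

```python
import torch

@torch.no_grad()
def avg_nll(model, tokenizer, text):
    """Average token negative log-likelihood of `text` under `model`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    return model(ids, labels=ids).loss.item()

def flora_score(ft_model, base_model, tokenizer, text):
    # FLoRa: loss under the fine-tuned model divided by loss under the base
    # model; low values suggest the fine-tuned model was trained on `text`.
    return avg_nll(ft_model, tokenizer, text) / avg_nll(base_model, tokenizer, text)
```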

Experimentally, we fine-tuned Pythia models up to 2.8B parameters as well as Llama-7B/Chat with 1,000 points from The Pile’s validation set (which amounts to approximately 660 pages of single-spaced text), for a single epoch. Taking ratios with the loss of the base model, we find FLoRa is highly performant, with AUCs from 0.94-0.99 and TPRs around 95% at an FPR of 1%.

From Membership Inference to Data Extraction

Background. Earlier, we described how membership inference can be operationalized to extract data from generative models like LLMs; here, we outline how to do so. There are two key parts to the data extraction problem. First, the model must actually memorize the training data well enough to generate it; second, we need to be able to distinguish generations from “the real thing.” For fine-tuned models, we illustrate that distinguishing is an easy problem and that there are high levels of memorization in standard fine-tuning procedures.

Using the score generated by our membership inference attack \(\mathcal{M}\), we define a generate-then-rank extraction attack \(\text{Extract}_{\mathcal{M}}\) as the following procedure: prompt the model with a prefix \(\mathbf{a}\), generate \(n\) suffixes \(\mathbf{x}_1, \ldots, \mathbf{x}_n\), rank the generations using \(\mathcal{M}\), and output the top-ranked generation \(\mathbf{x}_{i^*}\), i.e. the one \(\mathcal{M}\) scores as most likely to be training data: \(i^*=\operatorname{argmin}_i \mathcal{M}(\mathbf{a}, \mathbf{x}_i)\), under the convention above that lower scores indicate membership. Writing the target string as the prefix \(\mathbf{a}\) followed by a true suffix \(\mathbf{b}\), and letting \(k=\operatorname{len}(\mathbf{a})\) and \(m=\operatorname{len}(\mathbf{b})\), we observe that different values of \(k\) and \(m\) correspond to different assumptions about a possible attack.
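A sketch of the generate-then-rank procedure, assuming a Hugging Face-style causal LM and any membership score `mia_score(prefix, suffix)` following the low-means-member convention above (e.g. a FLoRa-style loss ratio); the names and signature here are illustrative.

```python
def extract(model, tokenizer, prefix, mia_score, n=20, suffix_len=50):
    """Generate `n` candidate suffixes for `prefix`, then return the one the
    MIA score ranks as most likely to be training data (lowest score)."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    candidates = []
    for _ in range(n):
        out = model.generate(
            prefix_ids,
            do_sample=True,
            max_new_tokens=suffix_len,
            pad_token_id=tokenizer.eos_token_id,
        )
        suffix = tokenizer.decode(out[0, prefix_ids.shape[-1]:], skip_special_tokens=True)
        candidates.append(suffix)
    # Lower score = more member-like under the definition above.
    return min(candidates, key=lambda s: mia_score(prefix, s))
```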

Settings for Extraction. Like Carlini et al. (2023), we consider one setting where the adversary knows a significant chunk of the string to extract \((k=m=50)\), and refer to this setting as “discoverable extraction.” This setting, like white-box membership inference, represents an upper limit of extraction. As \(k\) shrinks, however, it becomes more realistic for an adversary to know the beginning of a sample (or guess it from commonly-occurring Internet text, as Nasr et al. (2023) do). We call this setting “non-discoverable extraction.”

Extraction from Fine-Tuned Language Models

Below, we illustrate that 1/7, 1/2, and 9/10 of the fine-tuning dataset of Llama-7B are extractable after 2, 3, and 4 epochs of fine-tuning, respectively. Similar results hold across model sizes.

Discoverable Extraction. Because the loss ratio is such an effective MIA on fine-tuned models, the primary bottleneck for extraction is memorization. We fine-tuned a range of models on 1,000 samples from The Pile’s validation set, which represents approximately 660 pages of single-spaced text. In the discoverable extraction setting, we take the first 50 tokens of a sample, generate the next 50 tokens twenty times, and then rank the generations with FLoRa. We find substantial data leakage after fine-tuning for more than one epoch—with near-perfect extraction by the fourth. Even when the top-ranked generation is not an exact match to the true suffix, there is substantial data leakage: in the same setup, 48%, 81%, and 95%+ of the tokens in the top-ranked generation match the true suffix after 2, 3, and 4 epochs, respectively. This is a huge fraction of the fine-tuning dataset!
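For concreteness, a sketch of one simple way to compute the two leakage metrics (exact-match rate and token-level overlap between the top-ranked generation and the true suffix); the helper names are illustrative, and the paper may use a slightly different overlap definition.

```python
def exact_match(generated_ids, true_ids):
    # 1 if the top-ranked generation reproduces the true suffix verbatim.
    return int(list(generated_ids) == list(true_ids))

def token_overlap(generated_ids, true_ids):
    # Fraction of suffix positions where the generated token equals the true token.
    matches = sum(g == t for g, t in zip(generated_ids, true_ids))
    return matches / max(len(true_ids), 1)
```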

The figure below illustrates the percentage of top-ranked generations that matched the true suffix for each model (left), and the average proportion of tokens that match between the top-ranked generation and true suffix across samples (right). This figure is best viewed on a desktop or tablet device.


Non-Discoverable Extraction. Now suppose \(k\) is much smaller. For example, suppose a hospital has fine-tuned an LLM on patient records (with the patient name as the title, and description underneath) to simulate patient conversations. Given only the patient names and knowledge of the base model, can an adversary extract their records? The answer, we find, is also yes.

Even with only \(k = 2\) token prefixes, by Epoch 2 there is substantial memorization across all model sizes, so the underlying fine-tuning data is extractable. When the suffix is of length 25, for instance, nearly 30% of samples have suffix probability > 0.1 by Epoch 2 in Pythia-6.9B and Llama-2-7B. In the fine-tuned Llama-2-7B-Chat, 13.6% of samples have a 50-token suffix with probability > 0.1!
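Here, “suffix probability” means the probability the model assigns to the true suffix tokens conditioned on the prefix. A minimal sketch (Hugging Face-style model; names and the assumption of 1-D token-id tensors are illustrative):

```python
import math
import torch

@torch.no_grad()
def suffix_probability(model, prefix_ids, suffix_ids):
    """P(suffix | prefix): product of the model's per-token probabilities of
    the true suffix given the prefix, via teacher forcing.
    `prefix_ids` and `suffix_ids` are 1-D tensors of token ids."""
    ids = torch.cat([prefix_ids, suffix_ids]).unsqueeze(0)
    log_probs = torch.log_softmax(model(ids).logits[0], dim=-1)
    k = prefix_ids.shape[-1]
    total = sum(
        log_probs[k + i - 1, tok].item()   # position k+i-1 predicts token at k+i
        for i, tok in enumerate(suffix_ids.tolist())
    )
    return math.exp(total)
```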

As such, across all model sizes, given black-box access to both the fine-tuned and base models, along with just a few tokens of context, a large number of tokens from the fine-tuning dataset can be extracted after just 2-3 epochs of fine-tuning. Full results for both discoverable and non-discoverable extraction are available in Section 4 of our paper.

Extraction from Pre-Trained Language Models

Unlike with fine-tuned models, suffix probabilities for pre-trained LLMs remain exceedingly low even when prompted with a 50-token prefix. As a result, it is very difficult to measure extraction of a specific suffix, since that suffix is highly unlikely to be generated in the first place. However, this does not rule out the possibility that the LLM is still emitting memorized text given the same prefix, just with a different suffix. Nasr et al. (2024) demonstrate through brute-force search that 0.16 to 1.44% of the pretraining data of many open LLMs is memorized. This implies that if we could detect the memorized data (for example, if we had sufficient compute to search the entire training set), then we could extract memorized data given sufficiently many generations.

While we lack the compute to search through the entire training dataset, we can still measure our performance at the detection task by inserting the true suffix. Given a prefix/suffix pair, we generate 20 candidate suffixes using the prefix as the prompt, and then measure what percentage of the time our attacks rank the true suffix higher than any of the candidate suffixes. Surprisingly, we found that the supervised MIAs we previously trained did not translate to this ranking task: when distinguishing the true prefix/suffix pair from 20 generations, our classifiers ranked the true pair first only ~20% of the time. We hypothesized this was because we were now distinguishing generations from training samples, a different task from the one we originally trained our MIAs on: distinguishing test samples from training samples. After training supervised MIAs on the correctly-specified task, we were able to rank the true prefix/suffix pair first ~90% of the time.
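A sketch of this ranking evaluation: for each prefix we score the true suffix alongside the generated candidates and check whether the true suffix receives the best (lowest) score. Here `mia_score` is any membership score following the low-means-member convention; the helper name is illustrative.

```python
def true_suffix_ranked_first(mia_score, prefix, true_suffix, generated_suffixes):
    """True if the MIA scores the true suffix as more member-like than every
    generated candidate (lower score = more member-like)."""
    true_score = mia_score(prefix, true_suffix)
    return all(true_score < mia_score(prefix, s) for s in generated_suffixes)

# Aggregate over a dataset of (prefix, true_suffix, candidates) triples:
# rank_acc = sum(true_suffix_ranked_first(score_fn, p, b, cands)
#                for p, b, cands in data) / len(data)
```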

This distinction—of correctly specifying the MIA depending on the task—is critical for the performance of extraction-based methods and to our knowledge hasn’t been highlighted yet in the literature.

Implications for Practitioners and Policymakers

Our work has several implications for both practitioners and policymakers. From a privacy perspective, fine-tuned models are highly susceptible to privacy attacks given only black-box access to the model. This means that if a model is fine-tuned on highly sensitive data, great care must be taken before deploying that model externally—as large portions of the fine-tuning dataset can be extracted with black-box access! We recommend that fine-tuned LLMs be deployed only if the fine-tuning dataset is not private, or if additional privacy-preserving techniques are employed. These could include fine-tuning with DP, which recent work shows is feasible in some settings, or deploying guardrails that prevent the generation of memorized data. Pre-trained LLMs are significantly more resistant to data extraction given their low rates of memorization, although this does not rule out extraction of a small fraction of the training data. This seems less concerning than in the fine-tuning setting, as LLMs are pre-trained on text from the web that is generally not proprietary.

The existence of highly precise MIAs also has implications for policymakers. For example, our white-box attacks against pretrained LLMs show that if a policymaker wants to audit whether a given example was used to develop company A’s base LLM, they do not need to require that company A disclose its entire training dataset. Instead, if company A discloses its model weights and a small fraction of the training dataset (as few as 1,000 examples), a strong MIA can be trained. We note that because model stealing blurs the line between white-box and black-box access, requiring full disclosure of the weights may not even be necessary, as auditors may be able to infer sufficient statistics to conduct an MIA from black-box access. More broadly, our work highlights that LLMs are highly susceptible to privacy attacks, and motivates future work on privacy guardrails that allow us to benefit from LLMs’ utility while safeguarding privacy.