Pandora’s White-Box
Precise Training Data Detection and Extraction from Large Language Models.
Effective Membership Inference and Data Extraction Attacks.
Membership Inference
We create state-of-the-art membership inference attacks on pretrained language models, achieving true-positive rates at low false-positive rates (FPRs) that are hundreds of times higher than baselines.
The figure below compares our attacks (bolded: ModelStealing, NeuralNet, and LogReg) against three black- and white-box baselines (MoPe, MinK, and LOSS).
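To make the LOSS baseline concrete, here is a minimal sketch (ours, not the library's implementation) that scores a candidate by its average per-token loss under the target model; lower loss suggests membership. The threshold is illustrative and would in practice be calibrated on held-out data to hit a target FPR.

# Minimal LOSS-baseline sketch: score = average per-token negative
# log-likelihood under the target model; low scores suggest membership.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m-deduped"  # same model as the quickstart below
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def loss_statistic(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    # Passing labels=ids makes Hugging Face compute the shifted cross-entropy loss.
    return model(ids, labels=ids).loss.item()

threshold = 3.5  # illustrative value only; calibrate to a target FPR in practice
is_member = loss_statistic("a candidate passage") < threshold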
In fine-tuned LLMs, we find that a simple black-box attack achieves near-perfect MIA accuracy, which we use to extract large swaths of training data from fine-tuned LLMs.
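One natural black-box statistic in this setting compares the fine-tuned model's loss to the base model's loss: sequences the fine-tuned model fits far better than its base are likely fine-tuning members. The sketch below illustrates that general idea (see the paper for our exact attack); the fine-tuned model path is a placeholder.

# Sketch of a simple black-box statistic for fine-tuned models: the ratio
# of fine-tuned loss to base-model loss. Model paths are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m-deduped")
base = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m-deduped").eval()
finetuned = AutoModelForCausalLM.from_pretrained("path/to/finetuned-model").eval()  # placeholder

@torch.no_grad()
def loss_ratio(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    # A low ratio means fine-tuning moved the model sharply toward this sequence.
    return finetuned(ids, labels=ids).loss.item() / base(ids, labels=ids).loss.item()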
Dataset Extraction
The figure below shows the percentage of our fine-tuning dataset we are able to (discoverably) extract across different model sizes, over 1-4 epochs of fine-tuning. For full details, see our paper or blog post.
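For intuition, discoverable extraction can be checked as in the sketch below: prompt the fine-tuned model with a known prefix of each training example and test whether greedy decoding reproduces the true suffix verbatim. The 50/50 token split and model path are assumptions for illustration, not necessarily our exact protocol.

# Discoverable-extraction sketch: prompt with each example's prefix and
# check whether greedy decoding reproduces the true suffix exactly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/finetuned-model"  # placeholder fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def discoverably_extracted(example: str) -> bool:
    ids = tokenizer(example, return_tensors="pt").input_ids[0]
    split = len(ids) // 2  # 50/50 prefix-suffix split is an assumption
    prefix, suffix = ids[:split], ids[split:]
    out = model.generate(prefix.unsqueeze(0), max_new_tokens=len(suffix), do_sample=False)
    return torch.equal(out[0, split:split + len(suffix)], suffix)

examples = ["..."]  # your fine-tuning set
rate = sum(map(discoverably_extracted, examples)) / len(examples)
print(f"Discoverably extracted: {rate:.1%}")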
Quick Start: MIA and Extraction Code Library
To facilitate exploration of our results and broader research, we designed a modular library API for writing and benchmarking both membership inference and data extraction attacks on LLMs.
Installation/Setup
From source:
git clone https://github.com/safr-ai-lab/pandora-llm.git
cd pandora-llm
pip install -e .
From pip:
pip install pandora-llm
Quickstart
We maintain a collection of starter scripts in our codebase under experiments/. If you are creating a new attack, we recommend copying a starter script as a solid template. For example, to run the LOSS baseline:
python experiments/mia/run_loss.py --model_name EleutherAI/pythia-70m-deduped --model_revision step98000 --num_samples 2000 --pack --seed 229
You can reproduce the experiments described in our paper through the shell scripts provided in the scripts/ folder.
bash scripts/pretrain_mia_baselines.sh
Our library is also configurable so you can easily add your own attacks. To learn more, see our docs!
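As a rough starting point, a new attack typically reduces to a function from a text to a membership score. The skeleton below is hypothetical plain Python written for illustration; it does not use pandora-llm's actual attack interface, which is documented in our docs.

# Hypothetical attack skeleton (illustrative; not pandora-llm's actual API):
# an attack maps a text to a membership score, which a benchmarking harness
# can then evaluate at fixed low FPRs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class MyAttack:  # hypothetical name
    def __init__(self, model_name: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    @torch.no_grad()
    def score(self, text: str) -> float:
        # Toy statistic: negative loss, so higher scores suggest membership.
        ids = self.tokenizer(text, return_tensors="pt").input_ids
        return -self.model(ids, labels=ids).loss.item()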