Pandora’s White-Box
Precise Training Data Detection and Extraction from Large Language Models.
Effective Membership Inference and Data Extraction Attacks.
Membership Inference
We create state-of-the-art membership inference attacks on pretrained language models, achieving true-positive rates at low false-positive rates (FPRs) that are hundreds of times higher than baselines.
The figure below compares our attacks (bolded: ModelStealing, NeuralNet, and LogReg) against three black- and white-box baselines (MoPe, MinK, and LOSS).
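To make the LOSS baseline concrete, here is a minimal sketch (ours, not the library's implementation) that scores a candidate by its average per-token loss under the target model; lower loss suggests membership. The threshold is illustrative and would in practice be calibrated on held-out data to hit a target FPR.

# Minimal LOSS-baseline sketch: score = average per-token negative
# log-likelihood under the target model; low scores suggest membership.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m-deduped"  # same model as the quickstart below
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def loss_statistic(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    # Passing labels=ids makes Hugging Face compute the shifted cross-entropy loss.
    return model(ids, labels=ids).loss.item()

threshold = 3.5  # illustrative value only; calibrate to a target FPR in practice
is_member = loss_statistic("a candidate passage") < threshold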
In fine-tuned LLMs, we find that a simple black-box attack achieves near-perfect MIA accuracy, which we use to extract large swaths of training data from fine-tuned LLMs.
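One natural black-box statistic in this setting compares the fine-tuned model's loss to the base model's loss: sequences the fine-tuned model fits far better than its base are likely fine-tuning members. The sketch below illustrates that general idea (see the paper for our exact attack); the fine-tuned model path is a placeholder.

# Sketch of a simple black-box statistic for fine-tuned models: the ratio
# of fine-tuned loss to base-model loss. Model paths are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m-deduped")
base = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m-deduped").eval()
finetuned = AutoModelForCausalLM.from_pretrained("path/to/finetuned-model").eval()  # placeholder

@torch.no_grad()
def loss_ratio(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    # A low ratio means fine-tuning moved the model sharply toward this sequence.
    return finetuned(ids, labels=ids).loss.item() / base(ids, labels=ids).loss.item()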
Dataset Extraction
The figure below shows the percentage of our fine-tuning dataset we are able to (discoverably) extract across different model sizes, over 1-4 epochs of fine-tuning. For full details, see our paper or blog post.
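For intuition, discoverable extraction can be checked as in the sketch below: prompt the fine-tuned model with a known prefix of each training example and test whether greedy decoding reproduces the true suffix verbatim. The 50/50 token split and model path are assumptions for illustration, not necessarily our exact protocol.

# Discoverable-extraction sketch: prompt with each example's prefix and
# check whether greedy decoding reproduces the true suffix exactly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/finetuned-model"  # placeholder fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def discoverably_extracted(example: str) -> bool:
    ids = tokenizer(example, return_tensors="pt").input_ids[0]
    split = len(ids) // 2  # 50/50 prefix-suffix split is an assumption
    prefix, suffix = ids[:split], ids[split:]
    out = model.generate(prefix.unsqueeze(0), max_new_tokens=len(suffix), do_sample=False)
    return torch.equal(out[0, split:split + len(suffix)], suffix)

examples = ["..."]  # your fine-tuning set
rate = sum(map(discoverably_extracted, examples)) / len(examples)
print(f"Discoverably extracted: {rate:.1%}")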
Quick Start: MIA and Extraction Code Library
To facilitate exploration of our results and broader research, we designed a modular library API for writing and benchmarking both membership inference and data extraction attacks on LLMs.
Installation/Setup
From source:
git clone https://github.com/safr-ai-lab/pandora-llm.git
cd pandora-llm
pip install -e .
From pip:
pip install pandora-llm
Quickstart
We maintain a collection of starter scripts in our codebase under experiments/. If you are creating a new attack, we recommend copying a starter script as a solid template. For example, to run the LOSS baseline:
python experiments/mia/run_loss.py --model_name EleutherAI/pythia-70m-deduped --model_revision step98000 --num_samples 2000 --pack --seed 229
You can reproduce the experiments described in our paper through the shell scripts provided in the scripts/ folder.
bash scripts/pretrain_mia_baselines.sh
Our library is also configurable so you can easily add your own attacks. To learn more, see our docs!
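As a rough starting point, a new attack typically reduces to a function from a text to a membership score. The skeleton below is hypothetical plain Python written for illustration; it does not use pandora-llm's actual attack interface, which is documented in our docs.

# Hypothetical attack skeleton (illustrative; not pandora-llm's actual API):
# an attack maps a text to a membership score, which a benchmarking harness
# can then evaluate at fixed low FPRs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class MyAttack:  # hypothetical name
    def __init__(self, model_name: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    @torch.no_grad()
    def score(self, text: str) -> float:
        # Toy statistic: negative loss, so higher scores suggest membership.
        ids = self.tokenizer(text, return_tensors="pt").input_ids
        return -self.model(ids, labels=ids).loss.item()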