PrivacyAlign

Leaderboard

Leak, omit, and clean rates on the 200-item PrivacyAlign test split under naive prompting, ranked by clean rate.

Every model is run with the same naive agent prompt. Leak rate counts responses that disclose sensitive information, omit rate counts responses that leave out task-relevant information, and clean rate counts responses that do neither. Higher clean rate is better. Trained open-weight rows are our models.
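The three rates defined above can be sketched in a few lines. This is a minimal illustration, not the benchmark's scoring code: the function name `compute_rates` and the input format (one `(leaks, omits)` boolean pair per response) are assumptions made for the example.

```python
from typing import Dict, Iterable, Tuple


def compute_rates(labels: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Return leak, omit, and clean rates in percent.

    Each label is a hypothetical (leaks, omits) pair for one response.
    """
    labels = list(labels)
    if not labels:
        raise ValueError("need at least one labeled response")
    n = len(labels)
    leak = sum(1 for leaks, _ in labels if leaks)
    omit = sum(1 for _, omits in labels if omits)
    # A response is clean only when it neither leaks nor omits.
    clean = sum(1 for leaks, omits in labels if not leaks and not omits)
    return {
        "leak": 100.0 * leak / n,
        "omit": 100.0 * omit / n,
        "clean": 100.0 * clean / n,
    }


# Four toy responses: one leaks, one omits, one does both, one is clean.
example = [(True, False), (False, True), (True, True), (False, False)]
print(compute_rates(example))  # {'leak': 50.0, 'omit': 50.0, 'clean': 25.0}
```

In the toy example the three rates sum to 125 rather than 100 because one response both leaks and omits, so it is counted in both the leak and the omit rate. The same effect is why the rows in the table can sum to more than 100.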

Rank  Model                                                      Type              Leak %  Omit %  Clean %
1     GPT-5.5                                                    -                 23.3    18.3    63.2
2     Claude Opus 4.7                                            -                 34.1    10.8    57.9
3     Claude Sonnet 4.6                                          -                 28.9    17.3    56.9
4     GPT-5.4-mini                                               -                 28.1    36.8    44.9
5     Gemini 3.1 Pro                                             -                 41.4    36.1    37.3
6     Gemini 3.1 Flash Lite                                      -                 39.2    46.0    35.4
7     Nemotron-3-Nano-4B + Ours (annotation-conditioned reward)  Trained, ours     28.6    51.2    32.6
8     Qwen3-8B + Ours (annotation-conditioned reward)            Trained, ours     41.2    48.8    28.1
9     Qwen3-4B + Ours (annotation-conditioned reward)            Trained, ours     40.8    51.9    27.3
10    Nemotron-3-Nano-4B + Ours (trained gen-RM)                 Trained, gen-RM   27.8    62.4    26.1
11    Qwen3-4B + Ours (trained gen-RM)                           Trained, gen-RM   35.1    60.1    25.8
12    Qwen3-8B + CI-RL                                           CI-RL             42.9    54.2    25.7
13    Qwen3-4B + CI-RL                                           CI-RL             45.2    54.4    24.1
14    Nemotron-3-Nano-4B + CI-RL                                 CI-RL             38.3    61.5    21.9
15    Nemotron-3-Nano-4B                                         Open-weight base  56.1    53.4    19.1
16    Qwen3-4B                                                   Open-weight base  63.0    49.4    18.9
17    Qwen3-8B + Ours (trained gen-RM)                           Trained, gen-RM   46.8    59.3    18.4
18    Qwen3-8B                                                   Open-weight base  74.6    41.2    13.3

Rates are averaged over two judges conditioned on human annotations (GPT-5.5 and Gemini 3.1 Pro, high reasoning). A response is clean when it neither leaks nor omits. Frontier models are run with high reasoning and open-weight models in thinking mode. CI-RL and gen-RM rows are reward-source ablations. See the paper for the privacy-enhanced prompting results.
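As a rough sketch of the aggregation described above, the snippet below averages per-judge rates and derives the share of responses that both leak and omit. The helper names and the per-judge numbers are hypothetical. The overlap formula is an inference from the definition of clean (neither leak nor omit) rather than something the leaderboard reports, and it holds only up to rounding of the published values.

```python
from typing import Dict, Sequence

METRICS = ("leak", "omit", "clean")


def average_over_judges(per_judge: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """Average each rate (in percent) across judges."""
    if not per_judge:
        raise ValueError("need at least one judge")
    return {m: sum(j[m] for j in per_judge) / len(per_judge) for m in METRICS}


def both_rate(leak: float, omit: float, clean: float) -> float:
    """Share of responses that leak AND omit, by inclusion-exclusion.

    clean = 100 - leak - omit + both, so both = leak + omit + clean - 100.
    """
    return leak + omit + clean - 100.0


# Hypothetical per-judge rates for one model (not taken from the leaderboard).
judge_a = {"leak": 20.0, "omit": 10.0, "clean": 72.0}
judge_b = {"leak": 30.0, "omit": 20.0, "clean": 56.0}

avg = average_over_judges([judge_a, judge_b])
print(avg)               # {'leak': 25.0, 'omit': 15.0, 'clean': 64.0}
print(both_rate(**avg))  # 4.0

# Applied to the published GPT-5.5 row (23.3, 18.3, 63.2):
print(round(both_rate(23.3, 18.3, 63.2), 1))  # 4.8
```

Because averaging is linear, the inclusion-exclusion identity that holds for each judge also holds for the averaged rates, which is why the derived overlap can be read off the published numbers.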