PrivacyAlign Leaderboard

Every model is run with the same naive agent prompt. Leak rate is the share of responses that disclose sensitive information, omit rate is the share that leave out task-relevant information, and clean rate is the share that do neither. A response can both leak and omit, so the three rates need not sum to 100. Higher clean rate is better. The trained open-weight rows are our models.
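As a minimal sketch of how the three rates relate, the snippet below computes them from per-response labels. The `Judgment` record, its field names, and the `rates` helper are hypothetical illustrations and not the benchmark's actual schema or scoring code; the only thing taken from the text above is the definition of each rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """One judged response (hypothetical record, not the benchmark's schema)."""
    leaks: bool  # the response discloses sensitive information
    omits: bool  # the response leaves out task-relevant information


def rates(judgments):
    """Return (leak %, omit %, clean %) over a non-empty list of judgments."""
    n = len(judgments)
    if n == 0:
        raise ValueError("need at least one judgment")
    leak = sum(j.leaks for j in judgments)
    omit = sum(j.omits for j in judgments)
    # Clean means neither failure occurred on that response.
    clean = sum(not j.leaks and not j.omits for j in judgments)
    return tuple(round(100.0 * count / n, 1) for count in (leak, omit, clean))


# Four made-up responses: one leaks, one omits, one does both, one is clean.
example = [
    Judgment(leaks=True, omits=False),
    Judgment(leaks=False, omits=True),
    Judgment(leaks=True, omits=True),
    Judgment(leaks=False, omits=False),
]
leak_pct, omit_pct, clean_pct = rates(example)
print(leak_pct, omit_pct, clean_pct)  # 50.0 50.0 25.0
```

In this toy input the response that both leaks and omits is counted in the leak rate and in the omit rate, which is why the three percentages add up to more than 100.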

Rank  Model                                                      Tag               Leak %  Omit %  Clean %
1     GPT-5.5                                                    -                   23.3    18.3     63.2
2     Claude Opus 4.7                                            -                   34.1    10.8     57.9
3     Claude Sonnet 4.6                                          -                   28.9    17.3     56.9
4     GPT-5.4-mini                                               -                   28.1    36.8     44.9
5     Gemini 3.1 Pro                                             -                   41.4    36.1     37.3
6     Gemini 3.1 Flash Lite                                      -                   39.2    46.0     35.4
7     Nemotron-3-Nano-4B + Ours (annotation-conditioned reward)  Trained, ours       28.6    51.2     32.6
8     Qwen3-8B + Ours (annotation-conditioned reward)            Trained, ours       41.2    48.8     28.1
9     Qwen3-4B + Ours (annotation-conditioned reward)            Trained, ours       40.8    51.9     27.3
10    Nemotron-3-Nano-4B + Ours (trained gen-RM)                 Trained, gen-RM     27.8    62.4     26.1
11    Qwen3-4B + Ours (trained gen-RM)                           Trained, gen-RM     35.1    60.1     25.8
12    Qwen3-8B + CI-RL                                           CI-RL               42.9    54.2     25.7
13    Qwen3-4B + CI-RL                                           CI-RL               45.2    54.4     24.1
14    Nemotron-3-Nano-4B + CI-RL                                 CI-RL               38.3    61.5     21.9
15    Nemotron-3-Nano-4B                                         Open-weight base    56.1    53.4     19.1
16    Qwen3-4B                                                   Open-weight base    63.0    49.4     18.9
17    Qwen3-8B + Ours (trained gen-RM)                           Trained, gen-RM     46.8    59.3     18.4
18    Qwen3-8B                                                   Open-weight base    74.6    41.2     13.3

Rates are averaged over two judges conditioned on human annotations (GPT-5.5 and Gemini 3.1 Pro, high reasoning). A response is clean when it neither leaks nor omits. Frontier models are run with high reasoning and open-weight models in thinking mode. CI-RL and gen-RM rows are reward-source ablations. See the paper for the privacy-enhanced prompting results.
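The two-judge averaging and the "neither leaks nor omits" definition can be sketched as follows. This is an illustrative reading of the note above, not the authors' evaluation code: `average_over_judges` and `implied_overlap` are hypothetical helpers, and the per-judge triples in the example are made-up numbers. The overlap formula assumes all three rates share the same denominator, in which case inclusion-exclusion gives leak + omit + clean - 100 as the share of responses that both leak and omit, and the identity survives averaging because it is linear.

```python
def average_over_judges(per_judge):
    """Average (leak %, omit %, clean %) triples, one triple per judge."""
    n = len(per_judge)
    if n == 0:
        raise ValueError("need at least one judge")
    return tuple(round(sum(t[i] for t in per_judge) / n, 1) for i in range(3))


def implied_overlap(leak, omit, clean):
    """Share of responses that both leak and omit, by inclusion-exclusion.

    Assumes clean = 100 - (leak + omit - both), i.e. one shared denominator.
    """
    return round(leak + omit + clean - 100.0, 1)


# Made-up per-judge rates for one model (not taken from the leaderboard).
averaged = average_over_judges([(20.0, 30.0, 55.0), (30.0, 40.0, 45.0)])
print(averaged)                    # (25.0, 35.0, 50.0)
print(implied_overlap(*averaged))  # 10.0

# Applied to the published GPT-5.5 row (23.3, 18.3, 63.2).
print(implied_overlap(23.3, 18.3, 63.2))  # 4.8
```

Because the published rates are rounded to one decimal place, an overlap derived this way is only approximate, on the order of a tenth of a percentage point.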

Leak, omit, and clean rates on the 200-item PrivacyAlign test split under naive prompting, ranked by clean rate.