Every model is run with the same naive agent prompt. Leak rate is the share of responses that disclose sensitive information, omit rate is the share that leave out task-relevant information, and clean rate is the share that do neither. Higher clean rate is better, and rows are sorted by clean rate. A single response can both leak and omit, so the three rates in a row can sum to more than 100. Trained open-weight rows are our models.
| Rank | Model | Type | Leak rate (%) | Omit rate (%) | Clean rate (%) |
|---|---|---|---|---|---|
| 1 | GPT-5.5 | Frontier | 23.3 | 18.3 | 63.2 |
| 2 | Claude Opus 4.7 | Frontier | 34.1 | 10.8 | 57.9 |
| 3 | Claude Sonnet 4.6 | Frontier | 28.9 | 17.3 | 56.9 |
| 4 | GPT-5.4-mini | Frontier | 28.1 | 36.8 | 44.9 |
| 5 | Gemini 3.1 Pro | Frontier | 41.4 | 36.1 | 37.3 |
| 6 | Gemini 3.1 Flash Lite | Frontier | 39.2 | 46.0 | 35.4 |
| 7 | Nemotron-3-Nano-4B + Ours (annotation-conditioned reward) | Trained, ours | 28.6 | 51.2 | 32.6 |
| 8 | Qwen3-8B + Ours (annotation-conditioned reward) | Trained, ours | 41.2 | 48.8 | 28.1 |
| 9 | Qwen3-4B + Ours (annotation-conditioned reward) | Trained, ours | 40.8 | 51.9 | 27.3 |
| 10 | Nemotron-3-Nano-4B + Ours (trained gen-RM) | Trained, gen-RM | 27.8 | 62.4 | 26.1 |
| 11 | Qwen3-4B + Ours (trained gen-RM) | Trained, gen-RM | 35.1 | 60.1 | 25.8 |
| 12 | Qwen3-8B + CI-RL | CI-RL | 42.9 | 54.2 | 25.7 |
| 13 | Qwen3-4B + CI-RL | CI-RL | 45.2 | 54.4 | 24.1 |
| 14 | Nemotron-3-Nano-4B + CI-RL | CI-RL | 38.3 | 61.5 | 21.9 |
| 15 | Nemotron-3-Nano-4B | Open-weight base | 56.1 | 53.4 | 19.1 |
| 16 | Qwen3-4B | Open-weight base | 63.0 | 49.4 | 18.9 |
| 17 | Qwen3-8B + Ours (trained gen-RM) | Trained, gen-RM | 46.8 | 59.3 | 18.4 |
| 18 | Qwen3-8B | Open-weight base | 74.6 | 41.2 | 13.3 |
Rates are averaged over two judges, GPT-5.5 and Gemini 3.1 Pro (both at high reasoning), each conditioned on human annotations. A response is clean when it neither leaks nor omits. Frontier models are run with high reasoning, and open-weight models are run in thinking mode. The CI-RL and gen-RM rows are reward-source ablations. See the paper for the privacy-enhanced prompting results.
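
To make the metric definitions concrete, here is a minimal Python sketch of how leak, omit, and clean rates could be computed from per-response judge verdicts and then averaged over two judges. It is an illustration under stated assumptions, not the evaluation code behind the table: the names `Verdict`, `rates`, `average_over_judges`, and `both_rate` are hypothetical, and the toy verdicts are made up. The last helper uses inclusion-exclusion (leak + omit - both = 100 - clean) to recover the share of responses that both leak and omit from a published row.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Verdict:
    """One judge's verdict on one response (hypothetical structure)."""
    leaks: bool   # response discloses sensitive information
    omits: bool   # response leaves out task-relevant information


def rates(verdicts):
    """Return leak, omit, and clean rates in percent for one judge."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    n = len(verdicts)
    return {
        "leak": 100 * sum(v.leaks for v in verdicts) / n,
        "omit": 100 * sum(v.omits for v in verdicts) / n,
        # Clean means the response neither leaks nor omits.
        "clean": 100 * sum(not v.leaks and not v.omits for v in verdicts) / n,
    }


def average_over_judges(per_judge):
    """Average each rate across judges (simple unweighted mean)."""
    per_judge_rates = [rates(verdicts) for verdicts in per_judge]
    return {key: mean(r[key] for r in per_judge_rates)
            for key in ("leak", "omit", "clean")}


def both_rate(leak, omit, clean):
    """Share of responses that both leak and omit, by inclusion-exclusion."""
    return leak + omit + clean - 100


# Toy verdicts for four responses from two judges (made-up data).
judge_a = [Verdict(True, False), Verdict(False, True),
           Verdict(True, True), Verdict(False, False)]
judge_b = [Verdict(True, False), Verdict(False, False),
           Verdict(False, True), Verdict(False, False)]

averaged = average_over_judges([judge_a, judge_b])
print(averaged)  # {'leak': 37.5, 'omit': 37.5, 'clean': 37.5}

# Applied to the GPT-5.5 row of the table: 23.3 + 18.3 + 63.2 - 100.
print(round(both_rate(23.3, 18.3, 63.2), 1))  # 4.8
```

Because averaging is linear, the inclusion-exclusion identity holds for judge-averaged rates as well as for a single judge, which is why the three columns in a row can sum to more than 100.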