PrivacyAlign

Why Privacy in Agents Needs Human Judgment

Privacy is not the absence of disclosure. It is the regulation of disclosure: deciding what information should flow, to whom, in what situation, and with what level of detail.

The same fact can be appropriate in one context and a violation in another. What matters is not only whether a message contains sensitive content, but whether the disclosure respects the social roles, relationships, and expectations of the people it affects.

Because these judgments depend on social expectations and norms, human judgment does not merely label privacy violations; it helps define them. Existing work often relies on automated proxies for both training and evaluation. PrivacyAlign instead places human judgment at the center of agentic privacy alignment.

In communication tasks, an agent may draw on private information from tools and memory while deciding what to do, but its final action still embodies a normative judgment about what should be shared.

We keep human judgment close to that process. Human annotations are not only final labels for benchmark examples; they also become prompt-specific context for the LLM judges used in automated evaluation and for the annotation-conditioned rewards used to train agents.

Why Agents Change The Privacy Problem

The shift from conversational chatbots to agentic AI changes the privacy problem for assistants. A chatbot usually works with context the user has placed directly in the conversation. An agent can gather context first: searching the web, reading email or calendar state, consulting local documents, calling tools, or retrieving persistent memories before it acts.

That extra context may be useful for solving the task, but users cannot always directly control which details flow into the agent's final outward-facing action. The agent might send an email, draft a ticket, post an update, or file a request after seeing private information that the recipient was never meant to receive.

This shifts part of the privacy burden onto the agent itself. Evaluating the final action requires asking whether information from private context flows to an appropriate recipient for an appropriate purpose, with enough detail to complete the task but not so much that it violates the expectations and relationships of the people involved.

The PrivacyAlign Dataset

PrivacyAlign is a benchmark and training corpus of synthetic privacy-sensitive agent scenarios. Each scenario includes a user instruction, a short story, a read-only tool trajectory, and a small persistent memory store. The agent must use that context to produce an outward-facing action, such as a message, update, or request.
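As a rough illustration of the scenario structure described above, the sketch below models one scenario as a small Python record. The class names, field names, and the `agent_context` helper are hypothetical and chosen for readability; the source does not specify the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToolCall:
    """One step of the read-only tool trajectory (hypothetical shape)."""
    tool: str    # illustrative tool name, e.g. "email.search"
    output: str  # what the tool returned to the agent


@dataclass
class Scenario:
    """A single privacy-sensitive agent scenario (assumed field names)."""
    instruction: str                                   # user instruction
    story: str                                         # short background story
    tool_trajectory: List[ToolCall] = field(default_factory=list)
    memory: List[str] = field(default_factory=list)    # small persistent memory store

    def agent_context(self) -> str:
        """Flatten everything the agent can see into one text block."""
        parts = [f"INSTRUCTION: {self.instruction}", f"STORY: {self.story}"]
        parts += [f"TOOL[{call.tool}]: {call.output}" for call in self.tool_trajectory]
        parts += [f"MEMORY: {item}" for item in self.memory]
        return "\n".join(parts)
```

The agent's outward-facing action (a message, update, or request) would be produced from this context; the privacy question is which parts of it should reach the recipient.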

In total, the dataset contains 1,350 response-pair items (1,150 train and 200 test) with 3,516 human annotations from 599 unique annotators.

For each retained scenario, we pair candidate agent responses and collect human annotations over several signals: whether each response leaks sensitive information, whether it omits task-relevant non-sensitive information, which response is preferred overall, and why.

Generating Scenarios Where Agents Actually Leak

A privacy benchmark cannot be built by collecting and releasing real private user data. PrivacyAlign is synthetic by design: we build a fully automated pipeline that generates privacy-sensitive agent scenarios from scratch, with no human-authored seeds. Each example includes a synthetic profile, a short story, a user instruction, a read-only tool trajectory, and a small memory store. Sensitive and task-relevant facts are mixed into realistic clutter such as email signatures, calendar invites, and database records.

Figure: PrivacyAlign synthetic data generation pipeline. We over-generate, filter for plausibility, keep scenarios where naive agents leak, prune for diversity, and mine response pairs that human annotators can compare.

We over-generate and then filter. A scenario is kept only if it is plausible, a naive agent can respond sensibly, and at least one candidate response leaks. We then deduplicate and balance the pool across domains, tools, final actions, subject scopes, and generator model families.
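The retention rule above can be sketched as a simple predicate. This is a minimal illustration, assuming that upstream checks have already produced a plausibility flag, a "naive agent responds sensibly" flag, and per-response leak flags; the `GeneratedScenario` record and its field names are hypothetical, and the deduplication and balancing steps are not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GeneratedScenario:
    """Hypothetical record for one over-generated scenario."""
    scenario_id: str
    plausible: bool              # passed the plausibility check
    naive_agent_sensible: bool   # a naive agent could respond sensibly
    candidate_leaks: List[bool]  # one leak flag per candidate response


def keep(s: GeneratedScenario) -> bool:
    """Keep a scenario only if all three conditions from the text hold."""
    return s.plausible and s.naive_agent_sensible and any(s.candidate_leaks)


def filter_pool(pool: List[GeneratedScenario]) -> List[GeneratedScenario]:
    """Apply the retention rule to the whole over-generated pool."""
    return [s for s in pool if keep(s)]
```

Requiring at least one leaking candidate is what makes the retained scenarios informative: a scenario where no agent ever leaks gives annotators nothing to compare.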

What Humans Annotate

Annotators read the scenario, the user instruction, and two candidate final actions. For each response, they mark whether it leaks sensitive information and whether it omits relevant non-sensitive information. They also choose which response they prefer overall and explain why.

Figure: PrivacyAlign annotation interface showing scenario details, candidate responses, leak and omit labels, and preference selection. Annotators compare two candidate responses and mark leaks, omissions, and overall preference.

Figure: PrivacyAlign annotation interface showing a written rationale box. Annotators also write a short rationale explaining the privacy tradeoff.

We audit rationales for quality, remove rushed or low-quality annotations, manually review unsure cases, and collect additional labels when there is no clear majority.
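One way to operationalize "no clear majority" is a strict-majority vote that returns no label when the votes are split, signalling that more annotations are needed. The sketch below assumes that definition (more than half of the votes); the source does not state the exact threshold, and the function name is hypothetical.

```python
from collections import Counter
from typing import List, Optional


def majority_label(votes: List[str]) -> Optional[str]:
    """Return the label chosen by more than half of the votes.

    Returns None when there are no votes or no strict majority, which in
    this sketch means "collect additional labels". The strict-majority
    threshold is an assumption, not the paper's stated rule.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count * 2 > len(votes) else None
```

For example, votes of `["leak", "leak", "no_leak"]` resolve to `"leak"`, while a two-way tie resolves to `None` and would trigger another round of annotation.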

Annotation-Conditioned Judging

We use human annotations as prompt-specific evidence about how people interpret a privacy-sensitive scenario.

Leak and omit calls are subjective. A judge has to decide what was sensitive in this scenario and which non-sensitive details were important for the task. Rather than asking an LLM judge to infer that from scratch, we show it the original task, two reference responses, and the human leak labels, omit labels, preferences, and rationales for those responses.

In the paper's experiments, adding this annotation context makes judges agree more with one another and with carefully audited gold labels. We use these annotation-conditioned judges to measure leak and omit rates in the experiments.
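To make the idea of annotation-conditioned judging concrete, here is a minimal sketch of how a judge prompt might be assembled from the task, the two annotated reference responses, and the human labels and rationales. The `Annotation` fields, the prompt wording, and the `build_judge_prompt` function are all hypothetical; the actual prompt used in the paper is not given in this text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Annotation:
    """One human annotation over a reference response pair (assumed fields)."""
    leaks_a: bool
    leaks_b: bool
    omits_a: bool
    omits_b: bool
    preferred: str   # "A" or "B"
    rationale: str   # free-text explanation of the privacy tradeoff


def build_judge_prompt(task: str, reference_a: str, reference_b: str,
                       annotations: List[Annotation], candidate: str) -> str:
    """Assemble an illustrative judge prompt that keeps human evidence in context."""
    lines = [
        "TASK:", task, "",
        "REFERENCE RESPONSE A:", reference_a, "",
        "REFERENCE RESPONSE B:", reference_b, "",
        "HUMAN ANNOTATIONS FOR THE REFERENCES:",
    ]
    for i, ann in enumerate(annotations, start=1):
        lines.append(
            f"Annotator {i}: A leaks={ann.leaks_a}, A omits={ann.omits_a}, "
            f"B leaks={ann.leaks_b}, B omits={ann.omits_b}, "
            f"preferred={ann.preferred}. Rationale: {ann.rationale}"
        )
    lines += [
        "", "NEW RESPONSE TO JUDGE:", candidate, "",
        "Using the annotations as guidance, answer: leak (yes/no), omit (yes/no).",
    ]
    return "\n".join(lines)
```

The design point is that the judge no longer has to guess what counts as sensitive in this scenario: the annotators' labels and rationales on the reference responses spell it out before the new response is scored.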

From Annotations To Rewards

We also ask whether the annotations can help train agents, not just evaluate them. The key idea is to use human annotations as prompt-specific reward context for fresh completions, rather than reducing them to a fixed label dataset. We study two reward sources for RL. One is an annotation-conditioned pairwise judge: at scoring time, it sees annotations for the same prompt and compares two new candidate completions. The other is a trained generative reward model that learns from structured labels across the training set.

Our proposed reward is the annotation-conditioned one: instead of scoring new responses from the prompt alone, the judge also sees the human annotations for the same scenario, which keeps the human signal in context. In our experiments, this works better than the trained generative reward model, striking a better balance between privacy and usefulness. The annotations are guidance, not ground truth: the judge still has to compare the new responses. During RL, sampled responses are compared pairwise, and both response orderings are scored to reduce position bias.
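The pairwise scoring with both orderings can be sketched as follows. This is an illustration under stated assumptions: the judge is modeled as a function returning the probability that the first-listed response is better, the two orderings are combined by simple averaging, and each response's reward is its mean debiased win probability against the other samples. The source says only that both orderings are scored; the averaging and the win-rate aggregation are assumptions, and all names are hypothetical.

```python
from itertools import combinations
from typing import Callable, List

# Hypothetical judge interface: judge(first, second) -> P(first is better) in [0, 1].
Judge = Callable[[str, str], float]


def debiased_preference(judge: Judge, a: str, b: str) -> float:
    """Estimate P(a is better than b) by scoring both presentation orders."""
    forward = judge(a, b)          # a shown first
    backward = 1.0 - judge(b, a)   # b shown first, converted to P(a better)
    return 0.5 * (forward + backward)


def pairwise_rewards(judge: Judge, responses: List[str]) -> List[float]:
    """Reward each sampled response by its mean debiased win probability."""
    n = len(responses)
    if n < 2:
        return [0.0] * n
    totals = [0.0] * n
    for i, j in combinations(range(n), 2):
        p = debiased_preference(judge, responses[i], responses[j])
        totals[i] += p
        totals[j] += 1.0 - p
    return [t / (n - 1) for t in totals]
```

With this construction, a judge that favors whichever response appears first regardless of content yields a debiased preference of 0.5, so pure position bias contributes no reward signal.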

Main Results

We report three rates on the 200-item PrivacyAlign test split. Leak rate is the share of responses that disclose sensitive information. Omit rate is the share of responses that leave out task-relevant non-sensitive information. Clean rate is the share of responses that do neither. Because a single response can both leak and omit, the three rates need not sum to 100%. The main comparison below shows frontier baselines, open-weight base models, and the same open-weight models after RL with our annotation-conditioned reward.
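The three rates can be computed from per-response leak and omit labels as in the sketch below. The input format (a list of `(leaks, omits)` boolean pairs) and the function name are assumptions for illustration; in the paper the labels come from annotation-conditioned judges.

```python
from typing import Dict, List, Tuple


def privacy_rates(labels: List[Tuple[bool, bool]]) -> Dict[str, float]:
    """Compute leak, omit, and clean rates (in percent).

    Each label is a (leaks, omits) pair for one response. A response is
    clean only if it neither leaks nor omits.
    """
    if not labels:
        raise ValueError("need at least one labeled response")
    n = len(labels)
    leak = sum(1 for leaks, _ in labels if leaks)
    omit = sum(1 for _, omits in labels if omits)
    clean = sum(1 for leaks, omits in labels if not leaks and not omits)
    return {
        "leak": 100.0 * leak / n,
        "omit": 100.0 * omit / n,
        "clean": 100.0 * clean / n,
    }
```

Since one response can both leak and omit, the leak and omit rates can overlap, and the three numbers do not have to add up to 100.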

Frontier baselines

| Model | Leak % | Omit % | Clean % |
| --- | --- | --- | --- |
| Claude Opus 4.7 | 34.1 | 10.8 | 57.9 |
| Claude Sonnet 4.6 | 28.9 | 17.3 | 56.9 |
| Gemini 3.1 Flash Lite | 39.2 | 46.0 | 35.4 |
| Gemini 3.1 Pro | 41.4 | 36.1 | 37.3 |
| GPT-5.4-mini | 28.1 | 36.8 | 44.9 |

Open-weight models: base vs. annotation-conditioned RL

| Model | Run | Leak % | Omit % | Clean % | Clean gain |
| --- | --- | --- | --- | --- | --- |
| Qwen3-4B | Base | 63.0 | 49.4 | 18.9 | |
| Qwen3-4B | + Ours, trained w/ annotation-conditioned reward | 40.8 | 51.9 | 27.3 | +8.4 |
| Qwen3-8B | Base | 74.6 | 41.2 | 13.3 | |
| Qwen3-8B | + Ours, trained w/ annotation-conditioned reward | 41.2 | 48.8 | 28.1 | +14.8 |
| Nemotron-3-Nano-4B | Base | 56.1 | 53.4 | 19.1 | |
| Nemotron-3-Nano-4B | + Ours, trained w/ annotation-conditioned reward | 28.6 | 51.2 | 32.6 | +13.5 |

Leak, omit, and clean rates on PrivacyAlign. Clean gain is measured against the corresponding open-weight base model.
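As a quick arithmetic check of the clean gains, the snippet below recomputes them from the base and trained clean rates reported above. The numbers are copied from the results; the variable and function names are illustrative.

```python
# Clean rates (percent) from the reported results: (base, trained).
clean_rates = {
    "Qwen3-4B": (18.9, 27.3),
    "Qwen3-8B": (13.3, 28.1),
    "Nemotron-3-Nano-4B": (19.1, 32.6),
}


def clean_gain(base: float, trained: float) -> float:
    """Clean gain in percentage points, rounded to one decimal place."""
    return round(trained - base, 1)


gains = {model: clean_gain(base, trained)
         for model, (base, trained) in clean_rates.items()}

# Model with the largest improvement over its own base model.
largest_gain_model = max(gains, key=gains.get)
```

The gains are differences in percentage points, not relative improvements: each trained model is compared only against its own base model.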


For the open-weight models, annotation-conditioned RL improves clean rate for all three models, which span two model families. The largest gain is for Qwen3-8B, while Nemotron-3-Nano-4B reaches the highest clean rate among the trained open-weight models.

We also test whether these gains transfer beyond PrivacyAlign by evaluating on PrivacyLens and CIMemories, neither of which is used for training. The trained models improve over their base models there as well. For Nemotron-3-Nano-4B, PrivacyLens leakage falls from 49.3% to 38.3% while helpfulness rises, and CIMemories violation falls from 33.4% to 25.4% while completeness rises.

What The Privacy Failures Look Like

Many of the failures look superficially helpful. The agent completes the task, but adds a bank balance, clinical detail, safehouse status, or family fact that the recipient did not need. The issue is not formatting. It is a mismatch between information the agent had access to and information that should appear in the final action.

GPT-5.5 leaks internal finances

The funder needs an event budget, not the organization's internal cash position.

Trained Nemotron-3-Nano-4B: sends the grant request with the budget breakdown, without internal account details.

Opus 4.7 leaks clinical context

The logistics recipient needs ration priority, not stigmatizing malnutrition indicators.

Trained Nemotron-3-Nano-4B: lists only household names, registration numbers, and family sizes.

Gemini 3.1 Pro leaks safety details

A staffing update becomes a physical-safety risk by exposing safehouse and police-protection details.

Trained Nemotron-3-Nano-4B: gives only the requested high-level absence reason and withholds the safehouse and legal details.


Where This Leaves Us

Agentic privacy is a contextual decision problem. Whether a disclosure is acceptable, and whether an omission leaves out something the user needed, depends on human norms and expectations. Those judgments are hard to capture if evaluation and training rely only on automated proxies.

PrivacyAlign keeps those judgments close to both evaluation and training. The benchmark pairs privacy-sensitive agent scenarios with human preferences, leak and omit annotations, and free-text rationales. Those annotations then become prompt-specific context for automated judges and for the annotation-conditioned rewards used to train agents.

Taken together, the results suggest that human annotations can provide reusable, context-specific guidance for privacy alignment. They make automated evaluation more stable, help train open-weight agents to protect sensitive information without dropping what the task needs, and point toward agents that better reflect human privacy norms.

BibTeX

@article{tamber2026privacyalign,
  title         = {PrivacyAlign: Contextual Privacy Alignment for LLM Agents},
  author        = {Manveer Singh Tamber and Abhay Puri and Marc-Etienne Brunet and
                   Perouz Taslakian and Jimmy Lin and Spandana Gella},
  year          = {2026},
  eprint        = {2606.21710},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2606.21710}
}
