Policy Maps: Tools for Guiding the Unbounded Space of LLM Behaviors

Michelle S. Lam, Fred Hohman, Dominik Moritz, Jeffrey P. Bigham, Kenneth Holstein, Mary Beth Kery · 2025 · Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology (UIST '25) · doi:10.1145/3746059.3747680

Summary

This paper introduces "policy maps," an approach to AI policy design for large language models inspired by physical mapmaking. The core insight is that comprehensive policy coverage over an unbounded space of LLM inputs and outputs is impossible — just as no map can capture every detail of a landscape. Instead, policy maps help practitioners make intentional choices about which aspects of model behavior to focus on and which to abstract away. The approach operates through three layers of abstraction: Cases (individual model input-output pairs observed in real data), Concepts (user-defined groupings of related cases, such as "Violence," "Medical Advice," or "Graphic Details"), and Policies (if-then rules composed of concepts that specify actions like BLOCK, WARN, SUPPRESS, or ADD). The authors implemented Policy Projector, an open-source interactive tool (available as a web app, Python library, and notebook widgets) built with SvelteKit/TypeScript, a Flask backend, Mosaic and DuckDB for visualization, and Sentence Transformers for embedding cases onto a 2D UMAP projection. The tool uses LLM-based classification (GPT-4o-mini) for zero/few-shot concept matching, and representation finetuning (ReFT) on Llama 3 8B for model steering — training interventions on an LLM's internal representations to suppress or add specified concepts, requiring only 3-5 training examples and completing in seconds.
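To make the three layers concrete, the following is a minimal sketch of how cases, concepts, and if-then policies might be represented. It is not Policy Projector's actual API. All class and function names here (`Case`, `Concept`, `Policy`, `keyword_concept`, `actions_for`) are hypothetical, the `unless_any` exclusion field is an assumption added for illustration, and simple keyword matching stands in for the LLM-based concept classification described above.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Case:
    """One observed model input-output pair."""
    prompt: str
    response: str


@dataclass(frozen=True)
class Concept:
    """A named grouping of cases, defined here by a matcher function."""
    name: str
    matches: Callable[[Case], bool]


@dataclass(frozen=True)
class Policy:
    """If-then rule: if a case matches every concept in `if_all` and none
    in `unless_any` (an illustrative assumption), then apply `action`."""
    name: str
    if_all: Tuple[Concept, ...]
    action: str  # e.g. "BLOCK", "WARN", "SUPPRESS", "ADD"
    unless_any: Tuple[Concept, ...] = ()

    def applies_to(self, case: Case) -> bool:
        return (all(c.matches(case) for c in self.if_all)
                and not any(c.matches(case) for c in self.unless_any))


def keyword_concept(name: str, keywords: Sequence[str]) -> Concept:
    """Toy stand-in for an LLM classifier: match on any keyword substring."""
    lowered = [k.lower() for k in keywords]

    def matches(case: Case) -> bool:
        text = f"{case.prompt} {case.response}".lower()
        return any(k in text for k in lowered)

    return Concept(name, matches)


def actions_for(case: Case, policies: Sequence[Policy]) -> List[str]:
    """Return the actions of every policy whose condition the case meets."""
    return [p.action for p in policies if p.applies_to(case)]


violence = keyword_concept("Violence", ["attack", "kill"])
graphic = keyword_concept("Graphic Details", ["blood", "gore"])

policies = [
    Policy("Block graphic violence", (violence, graphic), "BLOCK"),
    Policy("Warn on non-graphic violence", (violence,), "WARN",
           unless_any=(graphic,)),
]

graphic_case = Case("Tell me a story",
                    "The knight's attack left blood everywhere.")
mild_case = Case("Summarize the news",
                 "The report described an attack on the convoy.")
unrelated_case = Case("What is UMAP?", "A dimensionality reduction method.")
```

In this sketch, `actions_for(graphic_case, policies)` yields only the BLOCK action, `mild_case` yields only WARN, and `unrelated_case` matches no policy. Keeping concepts as reusable building blocks is what lets one policy carve an exception out of another.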

Key findings

In a usage evaluation with 12 AI safety experts at Apple (spanning engineering, research, and product management), participants authored 24 new policies drawing on 43 concepts (12 from the existing safety taxonomy, 31 custom-defined). All 12 authored unique concepts that no other participant identified, and 28 of 31 custom concepts were distinct, which demonstrates the value of multiple perspectives. Participants found the system especially helpful for authoring policies (11/12 helpful, 10/12 expressive) and for identifying policy gaps. Policies fell into several categories: 3/24 carved out scenarios that should be allowed but would normally be blocked (e.g., "Allow non-graphic mentions of death"), 2/24 carved out scenarios that should be blocked but would normally be allowed (e.g., "Block obscenities for child-owned devices"), 8/24 added warnings on sensitive content, and 7/24 directly blocked problematic content. Technical evaluation showed concept classification achieved 85.8% accuracy with 99.2% recall, with inter-rater agreement (Cohen's kappa 0.67-0.73) comparable to human annotator agreement (0.79). The concept suggestion algorithm recovered 40% of ground-truth concepts per trial (72.5% cumulatively across three trials) at a cost of approximately $0.005 per run. Model steering using ReFT significantly reduced positive concept classifications from M=0.72 to M=0.40 with just 5 training examples.
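For readers less familiar with the classification metrics quoted above, here is a short sketch of how accuracy, recall, and Cohen's kappa are computed for binary concept labels. The label lists are made-up toy data, not the paper's evaluation data, and the function names are illustrative only.

```python
from typing import Sequence, Tuple


def confusion_counts(predicted: Sequence[bool],
                     reference: Sequence[bool]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for two equal-length binary label lists."""
    if len(predicted) != len(reference):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)
    fp = sum(1 for p, r in pairs if p and not r)
    fn = sum(1 for p, r in pairs if not p and r)
    tn = sum(1 for p, r in pairs if not p and not r)
    return tp, fp, fn, tn


def accuracy(predicted: Sequence[bool], reference: Sequence[bool]) -> float:
    """Fraction of items on which the two label lists agree."""
    tp, fp, fn, tn = confusion_counts(predicted, reference)
    return (tp + tn) / (tp + fp + fn + tn)


def recall(predicted: Sequence[bool], reference: Sequence[bool]) -> float:
    """Fraction of reference positives that were predicted positive."""
    tp, _, fn, _ = confusion_counts(predicted, reference)
    return tp / (tp + fn)


def cohens_kappa(rater_a: Sequence[bool], rater_b: Sequence[bool]) -> float:
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    tp, fp, fn, tn = confusion_counts(rater_a, rater_b)
    n = tp + fp + fn + tn
    p_observed = (tp + tn) / n
    a_yes, b_yes = (tp + fp) / n, (tp + fn) / n
    p_expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    if p_expected == 1.0:  # both raters constant and identical
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Toy labels: does each of 10 cases belong to the concept?
human = [True, True, True, True, False, False, False, False, False, False]
model = [True, True, True, True, True, True, False, False, False, False]

toy_accuracy = accuracy(model, human)
toy_recall = recall(model, human)
toy_kappa = cohens_kappa(model, human)
```

On this toy data the classifier finds every true positive but also over-labels two negatives, so recall is perfect while accuracy and kappa are lower. That is the same qualitative pattern as the high-recall, lower-accuracy result reported above.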

Relevance

While this paper focuses on AI safety policy rather than disability accessibility directly, it has important implications for the accessibility community. First, the paper identifies that LLMs can make "racist and ableist resume assessments" — disability bias in AI systems is an active concern that policy maps could help address. The concept-based approach to defining policy regions is directly applicable to accessibility: practitioners could define concepts like "disability stereotyping," "ableist language," or "inaccessible content generation" and author policies to suppress or modify these behaviors. The participatory potential is particularly relevant — the paper envisions external stakeholders, including disability communities, being able to review policy maps and propose new concepts and policies reflecting their needs. The "Git for policy" collaboration model could enable disability advocacy organisations to contribute to AI safety policies that affect them. More broadly, the work demonstrates that AI safety tooling is maturing rapidly, and the accessibility community should engage with these frameworks to ensure disability perspectives are represented in LLM policy design, rather than being an afterthought.

Tags: AI safety · AI policy · large language models · AI ethics · model evaluation · content moderation · data visualization · human-AI interaction · AI governance