How 2023 research predicted AI audit washing would enable discrimination

"This 2023 analysis correctly predicted that AI audit requirements would create compliance theater without meaningful bias prevention, warnings that have proven increasingly relevant as agent technologies emerge."


Selinger, Evan, Brenda Leong, and Albert Fox Cahn. "AI Audits: Who, When, How…Or Even If?" 2023.

I think this 2023 research by Evan Selinger, Brenda Leong, and Albert Fox Cahn proved remarkably prescient about how AI audit requirements would enable discrimination while appearing to prevent it. Their warnings about compliance theater have only become more urgent as these mandates proliferate, though emerging agent technologies might offer paths toward the meaningful oversight they found lacking.

Looking back at this comprehensive analysis, I'm struck by how accurately the authors predicted the "audit washing" phenomenon we're seeing today. Writing as NYC's Local Law 144 was still developing implementation guidance, they warned that AI audits might legitimize fundamentally harmful systems rather than preventing discrimination. Their concern that technical audits would miss human bias in deployment contexts while providing legal cover for discriminatory practices has proven remarkably accurate.

What makes their analysis even more relevant today is how it illuminates the infrastructure gaps that emerging AI agent technology could potentially address. The authors identified continuous monitoring and comprehensive visibility as essential for meaningful bias detection, but traditional audit approaches rely on periodic sampling that misses discriminatory patterns occurring between formal reviews. Agent-based monitoring systems could provide the real-time oversight needed to capture the deployment bias patterns they warned about.
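To make that distinction concrete, here is a minimal sketch, assuming an entirely hypothetical monitoring agent, of what continuous oversight could look like in Python: every live decision is tallied as it happens, and any group whose running selection rate falls well below the best-performing group gets flagged immediately rather than at the next annual review. The class name, threshold, and group labels are all illustrative assumptions, not anything proposed in the paper.

```python
from collections import defaultdict

class ContinuousBiasMonitor:
    """Hypothetical sketch: tally every decision in real time and flag groups
    whose running selection rate lags the best-performing group."""

    def __init__(self, alert_threshold: float = 0.8):
        self.alert_threshold = alert_threshold  # illustrative cutoff
        self.counts = defaultdict(lambda: {"selected": 0, "total": 0})

    def record_decision(self, group: str, selected: bool) -> None:
        self.counts[group]["total"] += 1
        if selected:
            self.counts[group]["selected"] += 1

    def flagged_groups(self) -> list[str]:
        rates = {g: c["selected"] / c["total"]
                 for g, c in self.counts.items() if c["total"]}
        if not rates:
            return []
        best = max(rates.values())
        return [g for g, r in rates.items()
                if best and r / best < self.alert_threshold]

# Every deployment decision feeds the monitor, so drift surfaces between
# formal audit cycles instead of being missed by periodic sampling.
monitor = ContinuousBiasMonitor()
monitor.record_decision("group_a", selected=True)
monitor.record_decision("group_b", selected=False)
print(monitor.flagged_groups())  # ['group_b'] on this toy data
```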

The civil rights critique they documented has intensified as implementation experience accumulates, but agent technology offers potential solutions to some of the accountability problems advocates identified. When the Surveillance Technology Oversight Project (S.T.O.P.) argued that audits provide political cover for discriminatory systems, it pointed to the selective evidence presentation that traditional audits enable. Agent-based audit systems maintaining comprehensive, tamper-resistant activity logs could reduce opportunities for cherry-picking favorable data while ignoring problematic patterns.
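As a rough illustration of the tamper-resistance idea (my own sketch, not a mechanism from the paper), an agent could write an append-only, hash-chained log in which each record embeds the hash of the one before it, so quietly removing or editing an unflattering entry breaks verification:

```python
import hashlib
import json
import time

class HashChainedAuditLog:
    """Hypothetical append-only log: each entry carries the previous entry's
    hash, so deletions or edits are detectable after the fact."""

    def __init__(self):
        self.entries = []
        self._last_hash = "0" * 64  # genesis value

    def append(self, event: dict) -> None:
        record = {"timestamp": time.time(), "event": event,
                  "prev_hash": self._last_hash}
        record["hash"] = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()).hexdigest()
        self._last_hash = record["hash"]
        self.entries.append(record)

    def verify(self) -> bool:
        prev = "0" * 64
        for record in self.entries:
            body = {k: v for k, v in record.items() if k != "hash"}
            expected = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if record["prev_hash"] != prev or record["hash"] != expected:
                return False
            prev = record["hash"]
        return True

log = HashChainedAuditLog()
log.append({"tool": "screening_model_v2", "candidate": "anon-17", "score": 0.42})
log.append({"tool": "screening_model_v2", "candidate": "anon-18", "score": 0.91})
print(log.verify())   # True
del log.entries[0]    # "cherry-pick" away an unflattering record
print(log.verify())   # False: the chain no longer links up
```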

The specific legal risks the authors identified have materialized exactly as predicted, but agent infrastructure concepts could mitigate some of them. Audit reports becoming discoverable evidence creates the evidentiary complexities they anticipated, yet comprehensive agent-generated audit trails could provide a more complete picture of system behavior than the incomplete snapshots current audits produce.

The "bootstrap problem" they identified around data requirements could be addressed through agent systems that maintain continuous audit trails from initial deployment. Their observation that you need deployment data to audit meaningfully but can't deploy without audits first could be resolved by agent infrastructure that captures system behavior in real-time rather than requiring retrospective analysis of incomplete records.

Their analysis of independence requirements proved particularly prophetic, but agent-based auditing could reduce some institutional capture risks. The minimal sanctions they noted for compromised auditor objectivity remain standard, while competitive pressures favor clean reports over comprehensive bias detection. Autonomous audit agents operating under predefined protocols could minimize human discretion in evidence collection and analysis, reducing opportunities for bias in the audit process itself.

More fundamentally, their research exposed how technical audits systematically miss the human elements that drive discriminatory outcomes. Their facial recognition example—where systems performing equally across demographic groups produce discriminatory results when human operators choose which faces to analyze and how to interpret results—highlights exactly where agent-based monitoring could provide value. Agent systems could track not just algorithmic performance but also human decision patterns around system deployment and result interpretation.
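A sketch of what that could mean in practice, using an invented event schema: pair each model output with the operator's follow-up action, then compare how often equally strong matches actually get acted on across groups. None of this comes from the paper; it simply illustrates where deployment bias would show up even when the model itself tests as fair.

```python
from collections import defaultdict

# Hypothetical event schema pairing the model's output with the operator's
# follow-up action, so the audit sees the human layer technical audits miss.
deployment_events = [
    {"group": "group_a", "model_match": True,  "operator_pursued": True},
    {"group": "group_b", "model_match": True,  "operator_pursued": False},
    {"group": "group_b", "model_match": True,  "operator_pursued": False},
    {"group": "group_a", "model_match": False, "operator_pursued": False},
]

def operator_follow_through(events):
    """Rate at which operators act on equally strong model matches, per group."""
    tallies = defaultdict(lambda: {"pursued": 0, "matches": 0})
    for e in events:
        if e["model_match"]:
            tallies[e["group"]]["matches"] += 1
            if e["operator_pursued"]:
                tallies[e["group"]]["pursued"] += 1
    return {g: t["pursued"] / t["matches"]
            for g, t in tallies.items() if t["matches"]}

print(operator_follow_through(deployment_events))
# A gap here can signal deployment bias even when the model audits clean.
```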

The implementation challenges they documented around fundamental definitional questions persist, but agent infrastructure could standardize some audit processes across jurisdictions. Rather than each locality struggling to define terms like "substantially assist" or developing different approaches to intersectional analysis, agent-based audit systems could implement consistent monitoring protocols while generating jurisdiction-specific reports.
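One way to picture that, as a purely hypothetical sketch: keep one consistent stream of monitoring records and render it into whatever slices each jurisdiction requires. The jurisdiction profiles and field names below are simplified stand-ins, not actual legal requirements.

```python
from collections import defaultdict

# Simplified, invented jurisdiction profiles; real requirements differ.
JURISDICTION_RULES = {
    "nyc_ll144": {"categories": ["sex", "race_ethnicity"]},
    "locality_x": {"categories": ["sex", "race_ethnicity", "age_band"]},
}

def selection_rates(records, category):
    """Selection rate per value of one demographic category."""
    tallies = defaultdict(lambda: {"selected": 0, "total": 0})
    for r in records:
        tallies[r[category]]["total"] += 1
        tallies[r[category]]["selected"] += r["selected"]
    return {k: t["selected"] / t["total"] for k, t in tallies.items()}

def build_report(records, jurisdiction):
    """Same underlying records, sliced per the jurisdiction's required categories."""
    rules = JURISDICTION_RULES[jurisdiction]
    return {cat: selection_rates(records, cat) for cat in rules["categories"]}

records = [
    {"sex": "F", "race_ethnicity": "group_a", "age_band": "40+", "selected": 1},
    {"sex": "M", "race_ethnicity": "group_b", "age_band": "<40", "selected": 0},
]
print(build_report(records, "nyc_ll144"))
print(build_report(records, "locality_x"))
```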

The authors' framework for reasonable audits, while providing useful structure, couldn't resolve the fundamental tension between technical validation and social impact assessment. However, agent technology could bridge this gap by monitoring deployment contexts alongside algorithmic performance. Instead of auditing systems in isolation, agent-based approaches could capture how human decisions about system use create discriminatory outcomes that purely technical audits miss.

Their five guiding questions about audit design become more tractable with agent assistance, though they also reveal new challenges. Who performs audits matters less when comprehensive monitoring is automated, but who oversees the audit agents becomes critical. What gets included could expand significantly with agent systems that capture comprehensive behavioral data, but this raises new questions about privacy and surveillance that the authors' framework doesn't address.

The business implications they outlined have expanded exactly as predicted, but agent infrastructure could help manage the compliance complexity they anticipated. Companies facing patchwork regulatory requirements across jurisdictions could deploy agent systems that automatically adapt monitoring and reporting to different local standards while maintaining consistent underlying data collection.

The competitive dynamics they warned about—market incentives favoring minimal compliance over comprehensive bias prevention—could be addressed through agent systems that make comprehensive monitoring less expensive relative to superficial compliance. When agent infrastructure reduces the cost of meaningful bias detection, economic incentives could align better with civil rights goals.

However, agent-based audit systems would introduce new risks that amplify some concerns the authors raised. If audit agents themselves embed biases or become targets for manipulation, they could systematize discrimination while appearing to provide objective oversight. The "audit washing" they warned about could become more sophisticated when powered by agent systems that appear to provide comprehensive monitoring but actually miss crucial bias patterns.

The attribution challenges they discussed—linking system outputs to responsible parties—align with agent infrastructure concepts, but also raise new accountability questions. While agent systems could maintain better records of decision pathways, they also diffuse responsibility across human operators, system designers, and agent monitors in ways that could complicate rather than clarify accountability.

Their documentation of civil rights advocates pushing for sectoral restrictions rather than auditing discriminatory systems into acceptability remains relevant for agent-assisted auditing. Even comprehensive agent-based monitoring couldn't address the fundamental concern that some AI applications cause inherent harm that no amount of oversight can mitigate.

For product counsel, their 2023 warning that audit requirements provide neither comprehensive protection nor reliable bias prevention frameworks remains valid even with agent assistance. Agent-based audit systems could provide more sophisticated compliance documentation and continuous monitoring, but they wouldn't resolve the underlying tension between technical compliance and meaningful discrimination prevention.

The institutional insight they provided—that audit effectiveness depends on professional standards, enforcement mechanisms, and stakeholder accountability rather than just technical capabilities—applies equally to agent-assisted auditing. Agent infrastructure could provide better tools for bias detection and compliance monitoring, but only within governance frameworks that ensure those tools serve civil rights goals rather than compliance theater.

Looking back, their most important contribution was recognizing that effective auditing requires institutional infrastructure that matches technical capabilities. Agent technology could address some of the monitoring and visibility gaps they identified, but it would need to be deployed within the comprehensive governance frameworks they advocated to avoid simply automating the audit washing they warned about.

The research they conducted in 2023 remains essential for understanding why current audit requirements fail and what would be needed to create meaningful accountability. Agent technology offers new possibilities for addressing some implementation challenges they identified, but their core insight about the need for institutional development alongside technical capability remains as relevant today as when they wrote it.

https://ssrn.com/abstract=4568208


TLDR: The report, "AI Audits: Who, When, How…Or Even If?", addresses the increasing integration of AI tools into high-risk decision-making (e.g., employment, credit, healthcare) and the resulting demand for responsible AI policies that ensure technical accuracy and reliability. While definitions vary, the authors adopt a narrow interpretation of an AI audit as an independent, technical assessment of a model's performance metrics against transparent rules or laws, yielding a binary compliance outcome. This primarily technical operation focuses on testing for fairness and equity, often using existing standards like disparate impact ratios for protected categories. The historical evolution of financial audits over a century serves as a baseline comparison, suggesting that AI audits can also mature with sufficient planning and commitment to professionalization and standard development.

New York City’s Local Law 144, the first U.S. law regulating automated employment decision tools (AEDTs), is presented as a key case study. This law mandates annual, independent bias audits for AEDTs (specifically for race/ethnicity and sex), public reporting of audit results, and advance notice to applicants. However, the law's implementation has been challenging, highlighting debates over the scope of protected classes, appropriate data usage (historical vs. test data), and a notable lack of provisions for remediation when subpar results are found. The authors suggest that the law, while a step in the right direction, was passed too quickly and remains minimalist and incomplete for comprehensive oversight.
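For a concrete sense of what an LL144-style bias audit actually reports, the central metric is an impact ratio: roughly, each category's selection rate divided by the selection rate of the most-selected category. Here is a minimal Python sketch with invented numbers.

```python
def impact_ratios(selection_rates: dict[str, float]) -> dict[str, float]:
    """Each category's selection rate relative to the most-selected category,
    so the top category scores 1.0 and lower values indicate larger gaps."""
    top = max(selection_rates.values())
    if top == 0:
        return {group: 0.0 for group in selection_rates}
    return {group: rate / top for group, rate in selection_rates.items()}

# Invented selection rates for illustration only.
rates = {"group_a": 0.45, "group_b": 0.30, "group_c": 0.27}
print(impact_ratios(rates))  # group_a -> 1.0; the others -> roughly 0.67 and 0.6
```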

To guide the development of reasonable AI audits, the report proposes three foundational claims: an audit should be an examination of a model's performance metrics, function as a part of a larger governance framework, and be designed with a clear public interest intent. These claims inform five guiding questions for designing audits: who should perform the audit, what should it include, what standards should audit outputs be measured against, what should the audit output be and who should access it, and how frequently audits should occur.

Crucially, the report articulates the skepticism of civil rights advocates regarding AI audits, who argue they are insufficient and potentially harmful. Critics fear "audit-washing", where ambiguities in audit requirements are exploited to demonstrate compliance without meaningful review, thereby legitimizing discriminatory tools and potentially providing a "shield" against accountability or liability under existing civil rights laws. They emphasize that many AI-related harms stem from human operation and data integrity, not just the model itself (e.g., facial recognition), meaning audits may only address a small subset of biases. These advocates often call for more stringent measures, including outright bans on certain high-risk AI systems.

The report concludes that AI audits are a contentious but essential component of AI governance, requiring ongoing debate, continuous improvement, and a realistic understanding of their strengths and limitations.