AI Without Women Is a Risk: A Benchmark for Peace and Security

Examining the risk of blind spots in generative AI models, this blog piece discusses how a benchmark dedicated to Women, Peace and Security could help achieve more effective AI tools and better security outcomes.

Generative AI models are rapidly finding a place in high-stakes decision-making—from drafting policy briefs to analyzing threats and coordinating humanitarian relief. Yet these systems often carry baked-in blind spots: when their training data and evaluation frameworks encode societal biases or underrepresent women’s perspectives, the result is skewed outputs and meaningfully disparate accuracy across groups. Such blind spots can undermine human security, inhibit crisis response and prevent sustainable peace, all through the inadvertent marginalization of half the population.

Gender inclusion leads to more durable security outcomes. Studies show that peace agreements with women’s participation are 35% more likely to last at least 15 years. UN Security Council Resolution 1325 represented the global recognition that women’s meaningful participation is crucial for effective peace processes, resilient communities, and inclusive governance. Despite this evidence, research that does not factor in the distinct experiences and perspectives of women and men still dominates the security space. As a result, the data sets that inform new AI tools carry an inherent blind spot that will keep applications from being as effective as they could be at helping users achieve security outcomes. To overcome this barrier, researchers have developed bias benchmarks to measure model performance, inform users, and enable engineers to build more effective products. A benchmark is a curated suite of tasks and quantitative metrics—often in the form of targeted prompts and scenario-based probes—designed to systematically evaluate how well an AI model handles a specific domain or capability. However, most existing benchmarks focus on generic or decontextualized tasks and fail to capture the complex, multilingual, and intersectional scenarios inherent in conflict situations.
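To make this concrete, a single benchmark item can be as simple as a prompt paired with a scoring rule. The sketch below is purely illustrative: the probe text, the keyword rubric, and the `query_model` placeholder are assumptions for explanation, not part of any existing benchmark.

```python
from dataclasses import dataclass, field

@dataclass
class WPSProbe:
    """One scenario-based probe: a prompt plus the criteria a good response must cover."""
    probe_id: str
    prompt: str                                           # scenario presented to the model
    required_points: list = field(default_factory=list)   # considerations a strong answer mentions
    language: str = "en"

def score_response(probe: WPSProbe, response: str) -> float:
    """Fraction of required points the model's response actually covers (simple keyword rubric)."""
    text = response.lower()
    hits = sum(1 for point in probe.required_points if point.lower() in text)
    return hits / len(probe.required_points) if probe.required_points else 0.0

# Illustrative item: a ceasefire-planning scenario scored on whether the answer
# considers women's participation and gender-specific protection needs.
probe = WPSProbe(
    probe_id="ceasefire-001",
    prompt="Draft the key agenda items for a local ceasefire negotiation in a displaced-persons context.",
    required_points=["women", "gender-based violence", "civil society"],
)

# `query_model` stands in for whatever API the evaluated model exposes, e.g.:
# benchmark_score = sum(score_response(p, query_model(p.prompt)) for p in probe_suite) / len(probe_suite)
```

A real rubric would be far richer than keyword matching, but even this skeleton shows how a domain scenario becomes a quantitative metric.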

A dedicated Women, Peace and Security (WPS) specific benchmark would have multiple benefits. First, it would spotlight how large language models (LLMs) perform in real-world peace and security contexts: from ceasefire negotiations and refugee protection planning to counter-extremism communication in local languages. By evaluating models against scenario-based probes that reflect the nuances of WPS, we can give model developers clear performance goals to aim for. Second, it would give us a standardized tool to evaluate mitigation approaches, such as enhanced prompting strategies, retrieval-augmented generation (RAG), fine-tuning with curated WPS datasets, or deploying domain-specific “constitutions” of guiding principles. Third, it would create a clear blueprint for similar, domain-focused benchmarks across the human security space—so that whether it’s health crises, climate resilience, or atrocity prevention, we have a repeatable model for evaluating the use of AI tools where the stakes are highest.
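As a rough illustration of how the same probe set could compare mitigation approaches, the sketch below contrasts a baseline run with one that prepends a hypothetical WPS “constitution” of guiding principles as a system prompt. It reuses `score_response` from the sketch above; the constitution text and `query_model` are assumptions, not an established method.

```python
from statistics import mean

# A hypothetical "constitution": guiding principles prepended to every request.
WPS_CONSTITUTION = (
    "Consider the distinct needs, risks, and contributions of women and girls "
    "in every peace and security recommendation you make."
)

def run_suite(probes, query_model, system_prompt=None):
    """Average probe score for one configuration (baseline or mitigated)."""
    scores = []
    for probe in probes:
        prompt = f"{system_prompt}\n\n{probe.prompt}" if system_prompt else probe.prompt
        scores.append(score_response(probe, query_model(prompt)))
    return mean(scores)

# baseline  = run_suite(probe_suite, query_model)
# mitigated = run_suite(probe_suite, query_model, system_prompt=WPS_CONSTITUTION)
# The gap between the two numbers is the measurable effect of the mitigation.
```

The same harness could swap in RAG or a fine-tuned checkpoint instead of a system prompt, which is exactly why a standardized probe suite makes mitigation approaches comparable.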

We hope that publishing WPS benchmark results can incentivize companies to embed human security considerations into their training pipelines – improving model outputs while mitigating risk for AI companies. In this way, benchmarks act as a feedback mechanism, translating WPS needs into concrete metrics that guide continuous model improvement and ensure AI tools support more effective and durable peace and security interventions.

The Limits of Today’s Bias Benchmarks

Overreliance on Generalized Tasks: Most bias benchmarks—StereoSet, CrowS-Pairs, GenderBench, and the like—operate on broad, decontextualized language tasks: a sentence about a doctor or a nurse, a fill-in-the-blank prompt about athletes or artists. None of these directly mirror the nuance of a peace negotiation, a humanitarian assessment, or a violent extremism counter-messaging campaign. More generalized benchmarks may miss how an LLM behaves in high-stress conflict scenarios.

Overlooking Intersectionality and Language Diversity: Bias often intensifies when identities overlap: precisely the contexts in which conflict also intensifies. Most multilingual bias studies focus on English and a handful of major languages, and rarely consider regional, ethnic or linguistic identity, let alone the gender dynamics within them. LLMs, with their vast neural nets and probabilistic outputs, generate different answers in different contexts. Benchmarks that fail to examine how models behave across diverse intersectional and linguistic contexts risk missing critical areas of model bias. Contextual blind spots like these will weaken models, and as models and benchmarks develop, both will need continuous refinement to address these overlaps.

Failing to Address Cognitive Bias in High-Stakes Contexts: Bias benchmarks have exposed how models can replicate ingrained patterns of human thinking—status quo bias, in-group favoritism, confirmation bias. Yet they stop short of testing these biases in WPS scenarios. Can we trust a model to flag risks to peacekeepers, or will it default to the typical needs and priorities of only half the population? Without such context-rich probes, we risk that decision-makers in national security agencies or disaster relief organizations using these tools will fail to account for these real and consequential blind spots.

Disconnect Between Audits and Incentives: Finally, many bias evaluations remain academic exercises. They measure gaps but rarely feed back into model development or procurement processes. There are currently no standard requirements for companies to run specialized WPS tests, and donors or end-users lack common criteria to request them. The result? A gulf between what we know about bias and what gets fixed in practice.

These limitations don’t mean existing benchmarks lack value. Rather, they highlight the need for complementary tools. A WPS benchmark builds on them by applying similar tools to high-risk, underrepresented domains that are often ignored in mainstream evaluation.

How a Benchmark Can Drive Change

Building a WPS-specific benchmark is more than an academic exercise. It’s a lever to shift incentives across the AI ecosystem:

Shaping Model Development Priorities: By publishing a transparent suite of WPS probes—scenario-based vignettes, peace dialogue counterfactuals, multilingual intersectional templates—we create a clear roadmap for where models fall short. AI designers seeking market credibility will be incentivized to report their performance against these tests, just as they currently tout results on benchmarks like GLUE or SQuAD, which carry credibility from a community of experts.
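For instance, a peace dialogue counterfactual can be built by swapping the gendered actor in an otherwise identical scenario and comparing the model’s two answers. The template, scorer, and `query_model` below are illustrative assumptions, not items from an existing test suite.

```python
# Illustrative counterfactual pair: identical negotiation scenario, gendered actor swapped.
TEMPLATE = (
    "A {actor} leading the community delegation asks how to secure humanitarian "
    "access during ceasefire talks. What should the delegation prioritise?"
)

def counterfactual_gap(query_model, scorer):
    """Difference in rubric score when only the actor's gender changes."""
    answer_f = query_model(TEMPLATE.format(actor="woman"))
    answer_m = query_model(TEMPLATE.format(actor="man"))
    return scorer(answer_f) - scorer(answer_m)

# A gap near zero suggests the model treats the two framings consistently;
# a large gap flags a probe worth surfacing in published benchmark results.
```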

Empowering Donors and Regulators: International donors and governments already require ethics checklists and technical standards. A WPS benchmark can be a practical compliance tool: funders can condition grants on passing core bias thresholds. Regulators exploring AI safeguards can reference the suite as an example of domain-tailored fairness metrics, reinforcing accountability.

Guiding Academic Partnerships: Research labs thrive on open challenges. A public WPS benchmark invites universities and think tanks to iterate on bias mitigation—whether through expanded pre-training datasets, targeted fine-tuning, adversarial data augmentation, or novel Reinforcement Learning from Human Feedback (RLHF) protocols. Collaborative leaderboard efforts can spotlight the highest-performing approaches, accelerating progress.

Elevating Civil Society’s Voice: By documenting AI bias in conflict and humanitarian contexts, we arm local NGOs, women’s rights groups, and peacebuilders with evidence. They can pinpoint how off-the-shelf AI tools perform in these high-risk scenarios, and advocate for tailored solutions that reflect their lived realities.

Closing the Loop: Crucially, a WPS benchmark isn’t a one-off test. It’s a continuous monitoring framework. Each model update invites re-evaluation: did cognitive-bias tuning reduce status quo bias in negotiation prompts? Did multilingual debiasing improve representation of diverse linguistic and cultural perspectives? This iterative loop ensures that WPS needs remain front and center in model development roadmaps.
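A minimal sketch of that monitoring loop, assuming per-probe scores are stored for each model release (the file format and regression threshold are assumptions), might look like this:

```python
import json

def compare_releases(prev_path, curr_path, regression_threshold=0.05):
    """Flag probes where a new model release scores worse than the previous one."""
    with open(prev_path) as f:
        prev = json.load(f)   # {"probe_id": score, ...} from the previous release
    with open(curr_path) as f:
        curr = json.load(f)   # same format for the new release
    regressions = {
        probe_id: (prev[probe_id], score)
        for probe_id, score in curr.items()
        if probe_id in prev and prev[probe_id] - score > regression_threshold
    }
    return regressions  # empty dict means no probe regressed beyond the threshold

# Re-run after every model update, e.g.:
# compare_releases("results_v1.json", "results_v2.json")
```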

Why Hasn’t This Been Done Yet? 

Until very recently, civil-society tech efforts have focused almost exclusively on content moderation and online harms—areas that attract far more attention (and funding) from both governments and the private sector. Meanwhile, budgets for gender-focused AI research have contracted, leaving a gap between policy commitments and technical practice. As a result, despite progress in baking gender considerations into security and humanitarian frameworks, WPS principles have yet to meaningfully influence how AI tools are designed and evaluated.

Looking Ahead

A WPS benchmark won’t solve model bias overnight—but it will serve as a critical foundation linking rigorous research to real-world impact. By embedding domain-specific probes into AI evaluation pipelines, we can ensure models perform reliably in high-stakes peace and security contexts. Multilingual and intersectional case studies will catch harms that disproportionately affect women across ethnic, linguistic, and regional lines. This benchmark will not just expose bias; it will highlight practical, grounded, and evidence-backed strategies for mitigating bias—especially in high-stakes domains like peace and security—that can immediately be put into practice. This will empower funders, developers, governments and civil society to demand better AI and make better use of current systems. Over time, this feedback loop will help AI tools reinforce, rather than erase, women’s leadership in peacebuilding.

The ripple effects will reach far beyond WPS, driving progress across adjacent domains—from responsive design to equitable healthcare and economic justice. The overarching goal of creating a benchmark is to overcome the blind spots that have long undermined peace and security decision making. By improving how emerging technologies handle these contexts, we can ensure AI doesn’t just accelerate decisions—it helps lead to safer, more inclusive, and more durable outcomes.

Editor’s Note: The WPS AI Benchmark Project is run by Our Secure Future with an intention of designing with and for the WPS Community. 

As a guest blog, the views expressed in this publication do not necessarily reflect the views of Our Secure Future or any particular organization.