AI Safety Fails on Combined Image-Text Tasks
November 22, 2025 · 3 min read
In a revealing study, multimodal AI models achieve over 90% accuracy on clear safety signals when processing vision or language inputs alone, but performance plummets to as low as 20% when joint image-text reasoning is required to determine the safety label. This stark drop exposes a critical vulnerability in current AI systems: content that is benign in each modality can become harmful only in combination, yet models fail to grasp these interactions. These findings, from the VLSU framework, underscore how safety evaluations that treat modalities separately miss emergent risks, potentially leading to unsafe AI deployments in applications like content moderation and autonomous systems.
The core insight from the research is that models exhibit systematic failures in compositional reasoning: 34% of joint safety classification errors occur even when the model classifies the individual image and text components correctly. This indicates that AI lacks the ability to integrate multimodal information effectively for safety judgments, a gap that could allow harmful content to slip through or cause unnecessary censorship. Such shortcomings are not just technical flaws but alignment issues, where models struggle to balance refusal of unsafe content with appropriate responses to borderline cases, impacting user trust and ethical AI use.
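To make the compositional-failure metric concrete, here is a minimal sketch of how one might measure it over hypothetical evaluation records. The record format and values are illustrative, not the paper's actual data or code.

```python
# Hypothetical per-sample results: whether the model's image-only,
# text-only, and joint safety classifications matched ground truth.
records = [
    # (image_correct, text_correct, joint_correct)
    (True, True, True),
    (True, True, False),   # compositional failure: unimodal right, joint wrong
    (True, False, False),
    (True, True, False),   # another compositional failure
    (False, True, True),
]

# All samples where the joint safety label was wrong.
joint_errors = [r for r in records if not r[2]]

# The subset of joint errors where both unimodal calls were correct:
# the model understood each part, but not their combination.
compositional = [r for r in joint_errors if r[0] and r[1]]

# Share of joint-classification errors that are purely compositional
# (the study reports 34% for this kind of breakdown).
rate = len(compositional) / len(joint_errors)
print(f"compositional failure share: {rate:.0%}")  # → 67% on this toy data
```

The same three-way bookkeeping generalizes directly to real evaluation logs: only the tuple construction changes.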
To uncover these limitations, the authors developed the Vision Language Safety Understanding (VLSU) framework, employing a multi-stage pipeline that combines real-world images with human annotation to create a large-scale benchmark. The dataset comprises 8,187 samples across 15 harm categories and 17 distinct safety patterns, enabling fine-grained severity classification and combinatorial analysis. By evaluating eleven state-of-the-art models, the methodology systematically tests joint understanding, moving beyond unimodal assessments to simulate real-world scenarios where AI must interpret complex, multimodal inputs.
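The idea of combinatorial safety patterns can be sketched as indexing each sample by its (image, text, joint) label combination. The severity labels below are hypothetical placeholders; the paper's actual 17-pattern taxonomy is its own and is not reproduced here.

```python
from itertools import product

# Hypothetical three-level severity scale (the paper's taxonomy differs).
LABELS = ("safe", "borderline", "unsafe")

# Index every (image_label, text_label, joint_label) combination
# with a stable pattern ID for combinatorial analysis.
PATTERNS = {combo: i for i, combo in enumerate(product(LABELS, repeat=3))}

def pattern_id(image_label: str, text_label: str, joint_label: str) -> int:
    """Look up the combinatorial safety pattern for one annotated sample."""
    return PATTERNS[(image_label, text_label, joint_label)]

# The emergent-harm case: each modality is benign on its own,
# but the combination is unsafe.
print(pattern_id("safe", "safe", "unsafe"))
```

Binning evaluation results by pattern ID is what lets a benchmark report accuracy separately for cases like "benign parts, unsafe whole" rather than one aggregate number.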
The results also reveal that models falter not only in accuracy but in handling nuanced cases, such as borderline content where instruction framing can drastically alter outcomes. For instance, in Gemini-1.5, adjusting instructions reduced over-blocking rates on borderline content from 62.4% to 10.4%, but at the cost of under-refusing unsafe content, with refusal rates dropping from 90.8% to 53.9%. This trade-off illustrates the difficulty of tuning models for balanced safety responses: improvements in one area often cause regressions in another, complicating efforts to deploy reliable AI systems.
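The trade-off is analogous to sweeping a decision threshold over a safety score: strictness that blocks more borderline content also refuses more of everything, and vice versa. The scores below are toy values invented for illustration, not the study's measurements.

```python
# Toy model scores: higher means the model judges content more unsafe.
borderline_scores = [0.3, 0.45, 0.6, 0.7]   # should mostly be allowed
unsafe_scores     = [0.5, 0.8, 0.9, 0.95]   # should be refused

def rates(threshold: float) -> tuple[float, float]:
    """Over-block rate on borderline content and refusal rate on unsafe
    content, when everything scoring >= threshold is refused."""
    over_block = sum(s >= threshold for s in borderline_scores) / len(borderline_scores)
    refusal = sum(s >= threshold for s in unsafe_scores) / len(unsafe_scores)
    return over_block, refusal

for t in (0.4, 0.6, 0.8):
    ob, rf = rates(t)
    print(f"threshold={t}: over-block={ob:.0%}, unsafe refusal={rf:.0%}")
```

Raising the threshold from 0.4 to 0.8 drops over-blocking from 75% to 0% on this toy data, but unsafe-content refusal falls from 100% to 75%: the same shape of trade-off the article reports for instruction framing.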
For everyday users and developers, these findings mean that AI tools in apps, social media, and other platforms may handle mixed media inconsistently, flagging harmless posts as dangerous or missing subtle threats. This has practical implications for digital safety: over-blocking can stifle free expression, while under-refusal exposes users to risk, emphasizing the need for robust testing frameworks like VLSU to guide model improvements and regulatory standards.
Despite its comprehensive approach, the study acknowledges limitations, including its focus on a specific set of harm categories and the reliance on human-annotated data, which may not capture all real-world edge cases. The authors note that their benchmark, while large, is not exhaustive, and future work should expand to include more diverse datasets and evolving safety patterns to ensure broader applicability and continued progress in multimodal AI safety research.