From Explicit-to-Implicit: Semantic-Preserving Jailbreaks against Text-to-Image Safety Filters
Submitted to ARR (Under Review), 2026
Overview
Text-to-Image (T2I) systems are highly vulnerable to jailbreak attacks that bypass safety filters. While existing prompt-level jailbreaks successfully evade text-based pre-checkers, they frequently suffer from Semantic Drift—either the adversarial prompt loses its original meaning, or the generated image fails to realize the intended scene.
We formalize this failure mode as the Preservation–Evasion Dilemma: the tension between evading content filters and preserving the original harmful event semantics. To address it, we propose Explicit-to-Implicit (E2I), a structured, controlled red-teaming framework that systematically reformulates explicit harmful prompts into implicit, image-groundable scene descriptions.
The Three Pillars of E2I
A valid semantic-preserving jailbreak must simultaneously satisfy three complementary conditions:
- Surface Obfuscation (Text Evasion): The transformed prompt $x_i$ must remove explicit unsafe keywords, graphic descriptions, and direct NSFW terms so that it is judged safe by a text-based pre-checker: \(C_{\text{text}}(x_i) = \text{safe}\)
- Semantic Consistency (Text-level Meaning): The core roles, actions, relations, and context of the original harmful event must remain fully recoverable from the implicit text: \(J_{\text{sem}}(x_e, x_i) = 1\)
- Compositional Inferability (Image Realization): When the transformed prompt is input into a T2I generator $G$, the individual slot-level cues must combine compositionally to realize the original harmful event in the generated image: \(J_{\text{img}}(G(x_i), x_e) = 1\)
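The three conditions can be read as a single conjunction. As a compact restatement, the block below writes the joint requirement as one predicate. The name $\mathrm{Valid}$ is introduced here for illustration and does not come from the paper, and the product form assumes that $J_{\text{sem}}$ and $J_{\text{img}}$ are binary judgments taking values in $\{0, 1\}$, as the "$= 1$" conditions above suggest.

```latex
\[
\mathrm{Valid}(x_e, x_i)
  \;=\;
  \mathbb{1}\!\left[ C_{\text{text}}(x_i) = \text{safe} \right]
  \cdot J_{\text{sem}}(x_e, x_i)
  \cdot J_{\text{img}}\!\left( G(x_i),\, x_e \right)
\]
```

Under this reading, $\mathrm{Valid}(x_e, x_i) = 1$ only when all three pillars hold at once; a prompt that passes the pre-checker but fails either judgment scores 0.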
The E2I Framework
E2I achieves these three pillars by treating prompts as structured events rather than unstructured word sequences. The framework consists of four sequential modules:
```mermaid
graph TD
    A[Explicit Harmful Prompt] --> B[1. Semantic Slot Decomposition]
    B --> C[2. Slot-wise Implicit Reformulation]
    C --> D[3. Dual-Constraint Alignment]
    D -->|Failure / Critique| C
    D -->|Success| E[4. Coherent Prompt Reconstruction]
    E --> F[Implicit Scene Description]
```
1️⃣ Semantic Slot Decomposition
Inspired by frame semantics, E2I decomposes an explicit prompt $x_e$ into five key slots: \(S(x_e) = \{s_A, s_T, s_{\text{Int}}, s_{\text{Inst}}, s_C\}\) representing Agent, Target, Interaction, Instrument, and Context. E2I classifies each slot as safe or unsafe, preserving safe slots while selectively reformulating only the unsafe ones to prevent global semantic drift.
2️⃣ Slot-wise Implicit Reformulation
Rather than simple synonym substitution, E2I reformulates unsafe slots into visual, dynamic, and atmospheric cues ($Q_k = \{q_k^V, q_k^D, q_k^A\}$):
- Visual Cues ($q^V$): Observable appearance and attributes.
- Dynamic Cues ($q^D$): Motion, actions, and interactions.
- Atmospheric Cues ($q^A$): Lighting, mood, and environmental conditions.
3️⃣ Dual-Constraint Alignment & Iterative Refinement
A Detect-Critique-Refine loop validates candidate prompts. A candidate $\hat{x}_i$ is accepted only if it satisfies both constraints: \(C_{\text{text}}(\hat{x}_i) = \text{safe} \quad \land \quad J_{\text{sem}}(x_e, \hat{x}_i) = 1\) If a candidate fails, the model generates a critique and refines the prompt for up to $K$ iterations.
4️⃣ Coherent Prompt Reconstruction
E2I integrates the preserved safe slots and validated slot cues into a fluent, cohesive scene: \(x_i = \text{Recon}(S_{\text{safe}}, Q_{\text{unsafe}})\) By reconstructing these cues into a structured narrative, E2I ensures that individual cues remain benign at the text level, while their visual composition in the image exposes the original harmful meaning.
Key Experimental Findings
We evaluated E2I on a balanced 500-prompt subset and a large-scale 3,000-prompt dataset across five safety categories (Violence/Gore, Self-harm/Suicide, Hate/Harassment, Sexual, and Illegal/Crime) using three state-of-the-art generators (FLUX.2, Stable Diffusion 3.5 Large, and Qwen-Image).
1. Benchmark Comparison (500-Prompt Subset)
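E2I achieves the best overall balance between text-level evasion and image-level semantic preservation against five recent baselines: it has the highest CS-ASR, PD-ASR, and semantic consistency, and its IHRR (72.4%) is within 0.2 points of the strongest baseline on that metric (DACA, 72.6%):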
| Method | Avg. CS-ASR (Text) ↗ | Avg. PD-ASR (Detector) ↗ | Semantic Consistency ↗ | Avg. IHRR (Image Harm) ↗ |
|---|---|---|---|---|
| Ring-A-Bell (Tsai et al., 2024) | 51.3% | 98.5% | 1.2% | 69.8% |
| SneakyPrompt (Yang et al., 2024c) | 33.2% | 59.4% | 15.1% | 64.1% |
| DACA (Deng & Chen, 2023) | 75.1% | 99.6% | 72.5% | 72.6% |
| PGJ (Huang et al., 2025) | 72.9% | 98.2% | 37.0% | 64.1% |
| MMA-Diffusion (Yang et al., 2024a) | 60.0% | 67.4% | 1.0% | 67.1% |
| E2I (Ours) | 86.4% | 100.0% | 72.6% | 72.4% |
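As a reading aid for the table, the following minimal sketch finds the leading method on each metric and the gap between a given method and the leader. The numbers are transcribed from the table above; the dictionary layout, key names, and helper functions are illustrative assumptions and are not part of the paper's evaluation code.

```python
# Percentages transcribed from the benchmark table above.
# Key names (cs_asr, pd_asr, sem, ihrr) are illustrative shorthand.
RESULTS = {
    "Ring-A-Bell":   {"cs_asr": 51.3, "pd_asr": 98.5,  "sem": 1.2,  "ihrr": 69.8},
    "SneakyPrompt":  {"cs_asr": 33.2, "pd_asr": 59.4,  "sem": 15.1, "ihrr": 64.1},
    "DACA":          {"cs_asr": 75.1, "pd_asr": 99.6,  "sem": 72.5, "ihrr": 72.6},
    "PGJ":           {"cs_asr": 72.9, "pd_asr": 98.2,  "sem": 37.0, "ihrr": 64.1},
    "MMA-Diffusion": {"cs_asr": 60.0, "pd_asr": 67.4,  "sem": 1.0,  "ihrr": 67.1},
    "E2I":           {"cs_asr": 86.4, "pd_asr": 100.0, "sem": 72.6, "ihrr": 72.4},
}


def best_per_metric(results):
    """Return the name of the highest-scoring method for each metric."""
    metrics = list(next(iter(results.values())).keys())
    return {m: max(results, key=lambda name: results[name][m]) for m in metrics}


def gap_to_best(results, method):
    """Return, per metric, how many points `method` trails the best score."""
    gaps = {}
    for metric in results[method]:
        top = max(row[metric] for row in results.values())
        gaps[metric] = round(top - results[method][metric], 1)
    return gaps


if __name__ == "__main__":
    print(best_per_metric(RESULTS))
    print(gap_to_best(RESULTS, "E2I"))
```

The per-metric view makes the trade-off explicit: a method can lead on text-level metrics while trailing slightly on image-level harm, which a single "best overall" label would hide.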
2. Slot Composition Analysis & Emergent Harm
To verify Compositional Inferability, we progressively accumulated slots from the Agent alone ($S_A$) to the full configuration ($S_{A:C}$).
- As slots accumulate, text-level evasion remains high, dropping only 4.3 points ($92.6\% \rightarrow 88.3\%$).
- Conversely, the harmfulness of the generated images steadily increases (Human Rating: $1.49 \rightarrow 3.94$ on a 1–5 scale).
- This provides quantitative and qualitative evidence of Compositional Inferability: individually benign elements compose to produce harmful visual semantics.
3. Category-wise Analysis (3,000-Prompt Dataset)
Under diagnostic analysis on 3,000 prompts, text-level evasion remains consistently high across all categories (average 89.2% CS-ASR, 99.8% PD-ASR). However, semantic consistency shows category sensitivity:
- Violence/Gore: 87.0% LLM / 81.0% Human
- Sexual Content: 52.0% LLM / 70.0% Human
This variation highlights the core Preservation–Evasion Dilemma: categories requiring aggressive lexical obfuscation (e.g., Sexual) tend to introduce more semantic ambiguity.
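The two categories listed above also differ in how the LLM judge and human raters relate to each other. The short sketch below computes the signed gap (LLM minus human) for each one. It uses only the two categories and four percentages reported above; the data structure and function name are illustrative assumptions.

```python
# (LLM-judged, human-judged) semantic consistency in percent,
# transcribed from the two categories reported above.
CONSISTENCY = {
    "Violence/Gore": (87.0, 81.0),
    "Sexual": (52.0, 70.0),
}


def judge_gap(scores):
    """Signed gap in points: positive means the LLM judge rates higher than humans."""
    return {cat: round(llm - human, 1) for cat, (llm, human) in scores.items()}


if __name__ == "__main__":
    for category, gap in judge_gap(CONSISTENCY).items():
        print(f"{category}: {gap:+.1f} points (LLM minus human)")
```

The gap changes sign between the two categories, so neither judge is uniformly stricter across them; the category-level figures are best read with both columns in view.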
Ablation Study
Ablation analysis on E2I’s components highlights the necessity of both alignment constraints and prompt reconstruction:
| Variant | Avg. CS-ASR ↗ | Avg. PD-ASR ↗ | Semantic Consistency ↗ |
|---|---|---|---|
| Direct Implicit Prompting | 85.4% | 98.5% | 67.6% |
| w/o Surface Obfuscation Constraint | 83.8% | 98.0% | 63.7% |
| w/o Semantic Consistency Constraint | 86.1% | 98.5% | 61.1% |
| w/o Iterative Refinement | 85.1% | 98.0% | 63.1% |
| w/o Coherent Prompt Reconstruction | 85.5% | 98.2% | 48.7% |
| Full E2I (Ours) | 89.2% | 99.8% | 74.7% |
Without Coherent Prompt Reconstruction, semantic consistency drops by 26.0 points (from 74.7% to 48.7%), the largest decline of any variant. This indicates that structured scene descriptions, rather than fragmented cues, are required to preserve event-level meaning and prompt-following.
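To compare the variants directly, the sketch below ranks them by how far semantic consistency falls relative to the full system. The values are transcribed from the ablation table above; the variable and function names are illustrative assumptions rather than the paper's code.

```python
# Semantic consistency (percent) per variant, transcribed from the ablation table.
FULL_E2I_SEM = 74.7
ABLATION_SEM = {
    "Direct Implicit Prompting": 67.6,
    "w/o Surface Obfuscation Constraint": 63.7,
    "w/o Semantic Consistency Constraint": 61.1,
    "w/o Iterative Refinement": 63.1,
    "w/o Coherent Prompt Reconstruction": 48.7,
}


def rank_by_drop(full_score, variants):
    """Return (variant, drop in points) pairs sorted from largest to smallest drop."""
    drops = [(name, round(full_score - score, 1)) for name, score in variants.items()]
    return sorted(drops, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, drop in rank_by_drop(FULL_E2I_SEM, ABLATION_SEM):
        print(f"{name}: -{drop:.1f} points")
```

Removing reconstruction costs roughly twice as much semantic consistency as removing any single alignment component, while the text-level metrics in the table move by only a few points across all variants.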
Responsible Disclosure & Impact
Our work exposes a fundamental blind spot in T2I safety systems: text-based pre-checkers are poorly equipped to handle implicit, compositional scene descriptions where harmful semantics emerge purely from the interaction of otherwise benign elements.
In accordance with responsible AI practices, we notified the developers of the evaluated models (including Stable Diffusion, FLUX, and Qwen-Image) and safety filters (Llama Guard, ShieldGemma, and OpenAI Moderation) prior to submission.
We hope this study motivates the research community to develop safety evaluations and defenses that go beyond simple vocabulary checks and account for visual and compositional pragmatics.
