A Wolf in Sheep’s Clothing: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search
Current jailbreak methods focus on optimizing prompts to bypass guardrails, but these approaches
fail against modern defenses that detect malicious intent. We argue that a more fundamental
vulnerability lies in the interconnected nature of an LLM's internal knowledge.
Restricted information can be reconstructed by weaving together a series of queries that
individually appear benign but collectively lead to the harmful objective.
We introduce CKA-Agent (Correlated Knowledge Attack Agent), a framework that
operationalizes this vulnerability by reformulating jailbreaking as an adaptive tree search
over the target LLM's correlated knowledge. Instead of crafting a single malicious
prompt, CKA-Agent dynamically navigates the model's internal knowledge graph, using
the target's own responses to guide its multi-hop attack path. Through a simulation-free tree search
with a hybrid LLM evaluator, CKA-Agent achieves 96-99% attack success rates against
state-of-the-art commercial LLMs, representing a 15-21 percentage-point gain over the best decomposition
baseline and up to a 96× improvement over prompt optimization methods
on robustly defended models.
💡 Motivation: Adaptive Search over Correlated Knowledge
Target LLMs contain correlated knowledge that can be extracted through individually safe queries. Prompt Optimization Attacks (POA) try to access the target knowledge directly and are blocked. Our approach instead uses adaptive search: it dynamically explores correlated knowledge nodes, reroutes when a query is blocked, and synthesizes the gathered information to construct the target answer.
Experimental Results
Attack Success Rates on HarmBench & StrongREJECT
Full Comparison with All Baselines
Each target model spans four metric columns, in order: Gemini-2.5-Flash, Gemini-2.5-Pro, GPT-oss-120B, Claude-Haiku-4.5.

| Method | FS↑ | PS↑ | V↓ | R↓ | FS↑ | PS↑ | V↓ | R↓ | FS↑ | PS↑ | V↓ | R↓ | FS↑ | PS↑ | V↓ | R↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **HarmBench Dataset** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Vanilla | 0.151 | 0.032 | 0.000 | 0.818 | 0.222 | 0.064 | 0.000 | 0.714 | 0.048 | 0.032 | 0.032 | 0.889 | 0.008 | 0.016 | 0.000 | 0.976 |
| AutoDAN | 0.767 | 0.050 | 0.017 | 0.167 | 0.921 | 0.016 | 0.008 | 0.056 | 0.103 | 0.032 | 0.032 | 0.833 | 0.008 | 0.008 | 0.000 | 0.984 |
| PAIR | 0.810 | 0.064 | 0.015 | 0.111 | 0.905 | 0.071 | 0.008 | 0.056 | 0.278 | 0.214 | 0.405 | 0.492 | 0.032 | 0.040 | 0.048 | 0.880 |
| PAP (Logical) | 0.230 | 0.040 | 0.016 | 0.714 | 0.214 | 0.040 | 0.016 | 0.730 | 0.080 | 0.056 | 0.043 | 0.821 | 0.000 | 0.008 | 0.000 | 0.992 |
| ActorBreaker | 0.331 | 0.102 | 0.095 | 0.472 | 0.325 | 0.119 | 0.183 | 0.373 | 0.087 | 0.175 | 0.103 | 0.635 | 0.079 | 0.087 | 0.119 | 0.714 |
| X-Teaming | 0.595 | 0.056 | 0.016 | 0.333 | 0.762 | 0.063 | 0.008 | 0.167 | 0.071 | 0.056 | 0.071 | 0.802 | 0.000 | 0.000 | 0.000 | 1.000 |
| Multi-Agent Jailbreak | 0.794 | 0.143 | 0.040 | 0.024 | 0.818 | 0.143 | 0.032 | 0.008 | 0.762 | 0.167 | 0.048 | 0.024 | 0.786 | 0.119 | 0.048 | 0.048 |
| CKA-Agent (ours) | 0.968 | 0.025 | 0.000 | 0.007 | 0.968 | 0.025 | 0.007 | 0.000 | 0.976 | 0.016 | 0.008 | 0.000 | 0.960 | 0.024 | 0.008 | 0.008 |
| **StrongREJECT Dataset** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Vanilla | 0.012 | 0.000 | 0.000 | 0.988 | 0.019 | 0.031 | 0.000 | 0.951 | 0.000 | 0.025 | 0.019 | 0.957 | 0.000 | 0.012 | 0.000 | 0.988 |
| AutoDAN | 0.463 | 0.037 | 0.025 | 0.475 | 0.852 | 0.012 | 0.000 | 0.136 | 0.080 | 0.025 | 0.019 | 0.877 | 0.006 | 0.000 | 0.006 | 0.988 |
| PAIR | 0.827 | 0.062 | 0.019 | 0.092 | 0.826 | 0.056 | 0.012 | 0.106 | 0.099 | 0.031 | 0.019 | 0.851 | 0.049 | 0.037 | 0.025 | 0.889 |
| PAP (Logical) | 0.154 | 0.012 | 0.019 | 0.815 | 0.130 | 0.043 | 0.000 | 0.827 | 0.080 | 0.056 | 0.043 | 0.821 | 0.000 | 0.006 | 0.000 | 0.994 |
| ActorBreaker | 0.340 | 0.111 | 0.043 | 0.506 | 0.333 | 0.093 | 0.068 | 0.506 | 0.136 | 0.167 | 0.074 | 0.624 | 0.068 | 0.080 | 0.074 | 0.778 |
| X-Teaming | 0.679 | 0.068 | 0.012 | 0.241 | 0.809 | 0.062 | 0.019 | 0.111 | 0.130 | 0.093 | 0.031 | 0.747 | 0.000 | 0.000 | 0.000 | 1.000 |
| Multi-Agent Jailbreak | 0.827 | 0.099 | 0.019 | 0.056 | 0.790 | 0.099 | 0.037 | 0.074 | 0.772 | 0.167 | 0.037 | 0.025 | 0.815 | 0.099 | 0.025 | 0.062 |
| CKA-Agent (ours) | 0.988 | 0.006 | 0.000 | 0.006 | 0.951 | 0.043 | 0.000 | 0.006 | 0.982 | 0.012 | 0.006 | 0.000 | 0.969 | 0.025 | 0.006 | 0.000 |
Metrics: FS = Full Success, PS = Partial Success, V = Vacuous, R = Refusal. ↑ means higher is better, ↓ means lower is better.
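The four categories are mutually exclusive per test case, which is why each row above sums to ≈1 up to rounding; the reported numbers are presumably per-category fractions, as in this hypothetical sketch:

```python
from collections import Counter

def label_rates(judge_labels):
    # judge_labels: one label per test case, one of "FS", "PS", "V", "R".
    # Returns the per-category fractions reported in the results tables;
    # since the labels are mutually exclusive, the four rates sum to 1.
    counts = Counter(judge_labels)
    n = len(judge_labels)
    return {label: counts[label] / n for label in ("FS", "PS", "V", "R")}
```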
⚙️ Evaluation Setup
Datasets: HarmBench (126 behaviors) + StrongREJECT (162 prompts) = 288 high-stakes harmful prompts
Target Models: Gemini-2.5-Flash, Gemini-2.5-Pro, GPT-oss-120B, Claude-Haiku-4.5
Attacker Model: Qwen3-32B-abliterated (for all methods requiring auxiliary models)
Judge Model: Gemini-2.5-Flash with 4-level rubric (Full Success, Partial Success, Vacuous, Refusal)
Note: All evaluation details, including judge prompts and scoring rubrics, are available in the arXiv paper. All experiments were conducted during Oct-Nov 2025.
Key Findings:
- POA Catastrophic Failure: Prompt-optimization methods (PAIR, AutoDAN, PAP) collapse on robustly defended models. PAIR drops from 90.5% on Gemini-2.5-Pro to 3.2% on Claude-Haiku-4.5, revealing that harmful intent remains semantically detectable regardless of algorithmic sophistication.
- DOA Superiority: Decomposition-based methods maintain consistent performance across all targets. Multi-Agent Jailbreak achieves 76.2%-81.8% on HarmBench, a 24× improvement over PAIR on the most defended model (Claude-Haiku-4.5).
- CKA-Agent SOTA: Achieves 15-21 percentage point gains over Multi-Agent Jailbreak and up to 96× improvement over POA methods on robustly defended models through adaptive exploration that dynamically learns from target responses.
- Critical Vulnerability Exposed: Current guardrails effectively detect optimized harmful prompts but cannot aggregate intent across adaptively constructed innocuous queries.
Context-Aware Defense: CKA-Agent vs CKA-Agent-Branch
Research Question: Can providing conversation history help target models detect correlated knowledge attacks?
Experimental Setup:
- CKA-Agent: each sub-query is sent independently, with no conversation history
- CKA-Agent-Branch: each sub-query includes the full conversation history of its branch (sketched in code below)
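For concreteness, a minimal sketch of the two query modes; `target_chat` and the message format are illustrative assumptions, not the released implementation:

```python
# Hypothetical helpers illustrating the two conditions; `target_chat` stands in
# for any chat-completion API that takes a list of {role, content} messages.

def ask_independent(target_chat, subquery: str) -> str:
    # CKA-Agent: every sub-query opens a fresh conversation, so the target
    # never sees the rest of the attack path.
    return target_chat([{"role": "user", "content": subquery}])

def ask_with_branch_history(target_chat, branch_history: list, subquery: str) -> str:
    # CKA-Agent-Branch: the target receives the full history of this branch,
    # giving its guardrails a chance to aggregate intent across turns.
    reply = target_chat(branch_history + [{"role": "user", "content": subquery}])
    branch_history += [{"role": "user", "content": subquery},
                       {"role": "assistant", "content": reply}]
    return reply
```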
| Model | Dataset | CKA-Agent | CKA-Agent-Branch | Degradation (pp) |
|---|---|---|---|---|
| Gemini-2.5-Flash | HarmBench | 96.8% | 92.1% | -4.7 |
| Gemini-2.5-Flash | StrongREJECT | 98.8% | 96.9% | -1.9 |
| GPT-oss-120B | HarmBench | 97.6% | 78.6% | -19.0 |
| Claude-Haiku-4.5 | HarmBench | 96.0% | 88.9% | -7.1 |
🔍 Defense Insights: While context-aware defense degrades performance by 2-19 percentage points, CKA-Agent-Branch still achieves at least 78.6% success across all models. This reveals that current LLMs struggle to aggregate intent across multi-turn interactions, even with the full conversation history.
💡 Defense Implication: Future guardrails must enhance cross-query intent aggregation and long-context reasoning.
⚙️ Methodology
How CKA-Agent Works
CKA-Agent reformulates jailbreaking from static prompt optimization to dynamic knowledge decomposition and adaptive tree search. The framework operates through iterative exploration cycles (a code sketch follows the list):
- Generate locally harmless sub-queries that extract correlated knowledge from the target model
- Execute queries against the target and collect responses
- Evaluate responses using hybrid scoring (introspection + target feedback)
- Branch adaptively based on UCT-guided selection to explore multiple promising paths
- Synthesize accumulated knowledge from successful exploration trajectories
- Backpropagate failure signals to guide future iterations toward unexplored high-value regions
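A minimal sketch of one such cycle, for illustration only: `select_leaf` and `hybrid_score` are spelled out under the search algorithm below, and the remaining helper names (`generate_subqueries`, `introspect`, `synthesize`, `judge`) are hypothetical stand-ins for the paper's attacker, target, and evaluator components.

```python
def exploration_cycle(tree, attacker, target, tau=0.8):
    # One iteration: select -> expand -> evaluate -> synthesize -> backpropagate.
    leaf = select_leaf(tree.root)                        # UCT-guided selection (defined below)
    for subquery in attacker.generate_subqueries(leaf):  # locally harmless sub-queries
        response = target.query(subquery)                # execute against the target
        score = hybrid_score(attacker.introspect(subquery, response),
                             attacker.feedback(response))  # simulation-free hybrid evaluation
        leaf.add_child(subquery, response, score)        # adaptive branching
    answer = attacker.synthesize(tree.best_trajectory())  # weave the gathered knowledge together
    if attacker.judge(answer) >= tau:                    # success: judge score >= threshold tau
        return answer
    tree.backpropagate(leaf)                             # failure signal guides the next cycle
    return None
```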
CKA-Agent Architecture
Adaptive Branching Search Algorithm
- Selection: Global UCT policy selects the single most promising leaf: argmax(f_v + c√(ln N_parent / N_v)); see the sketch after this list
- Expansion: Depth-first expansion with adaptive branching (B = 1 on confident paths, B ≤ 3 under uncertainty)
- Evaluation: Hybrid scoring = α·(introspection) + (1-α)·(target feedback), replacing costly MCTS rollouts
- Termination: Success when synthesis achieves judge score ≥ τ; otherwise backpropagate and iterate
- Efficiency: 70-95% first-iteration success; 92-95% success within two iterations
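A hedged sketch of the selection and scoring primitives named above; the exploration constant `c` and mixing weight `alpha` are illustrative defaults, not the paper's tuned values.

```python
import math

def uct(node, c=1.414):
    # UCT: node value f_v plus exploration bonus c*sqrt(ln N_parent / N_v);
    # unvisited nodes are prioritized.
    if node.visits == 0:
        return float("inf")
    return node.value + c * math.sqrt(math.log(node.parent.visits) / node.visits)

def select_leaf(root):
    # Global UCT policy: descend to the single most promising leaf.
    node = root
    while node.children:
        node = max(node.children, key=uct)
    return node

def hybrid_score(introspection, target_feedback, alpha=0.5):
    # Hybrid evaluator replacing costly MCTS rollouts:
    # alpha * (introspection) + (1 - alpha) * (target feedback).
    return alpha * introspection + (1 - alpha) * target_feedback
```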
Additional Analysis
Defense Robustness: POA vs DOA Under Guardrails
Adaptive Branching: Progressive Recovery from Failures
Cost-Performance Trade-off: Pareto Optimality
Evaluation Validity: Human-LLM Judge Alignment
To validate our LLM-as-Judge evaluation methodology, we conducted a human evaluation study with 40 randomly sampled cases using a between-subjects design:
- Condition 1 (No Reasoning): 5 annotators evaluate responses with only prompts and outputs
- Condition 2 (With Reasoning): 5 different annotators have access to the judge model's analytical reasoning (but not its final score)
Citation
@misc{wei2025wolfsheepsclothingbypassing,
title={A Wolf in Sheep's Clothing: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search},
author={Rongzhe Wei and Peizhi Niu and Xinjie Shen and Tony Tu and Yifan Li and Ruihan Wu and Eli Chien and Olgica Milenkovic and Pan Li},
year={2025},
eprint={2512.01353},
archivePrefix={arXiv},
primaryClass={cs.CR},
url={https://arxiv.org/abs/2512.01353},
}
Responsible Disclosure
This research is conducted for academic purposes to identify vulnerabilities in LLM safety systems. We have responsibly disclosed our findings to affected model providers and advocate for enhanced defense mechanisms. The code and detailed attack prompts will be released following ethical review and coordinated disclosure timelines.