WARNING: This research contains potentially offensive and harmful text for academic safety research purposes.
November 2025 Preprint

A Wolf in Sheep’s Clothing: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search

AUTHORS
1GaTech, 2UIUC, 3Tsinghua, 4UCSD, 5NTU
*Equal contribution    Corresponding author
Preprint

Current jailbreak methods focus on optimizing prompts to bypass guardrails, but these approaches fail against modern defenses that detect malicious intent. We argue that a more fundamental vulnerability lies in the interconnected nature of an LLM's internal knowledge. Restricted information can be reconstructed by weaving together a series of locally innocuous queries that individually appear benign but collectively lead to the harmful objective.
We introduce CKA-Agent (Correlated Knowledge Attack Agent), a framework that operationalizes this vulnerability by reformulating jailbreaking as an adaptive tree search over the target LLM's correlated knowledge. Instead of crafting a single malicious prompt, CKA-Agent dynamically navigates the model's internal knowledge graph, using the target's own responses to guide its multi-hop attack path. Through a simulation-free tree search with a hybrid LLM evaluator, CKA-Agent achieves 96-99% attack success rates against state-of-the-art commercial LLMs, representing a 15-21pp gain over the best decomposition baseline and up to a 96× improvement over prompt optimization methods on robustly defended models.

💡 Motivation: Adaptive Search over Correlated Knowledge

Target LLMs contain correlated knowledge that can be extracted through safe queries. While Prompt Optimization Attacks (POA) try to directly access target knowledge and get blocked, our approach uses adaptive search to dynamically explore correlated knowledge nodes, adapting when queries are blocked, and synthesizing the gathered information to construct the target answer.

Comparison of Attack Methods
Figure 1: Comparison of different attack methods. (1) Prompt Optimization Attacks (POA) directly attack the target LLM but often fail due to safety guardrails. (2) Decomposition-Only Attacks (DOA) rely on prior knowledge from a limited decomposer and have a static plan. (3) CKA-Agent uses an adaptive tree search to explore correlated knowledge and dynamically adjust the attack plan.

Experimental Results

Attack Success Rates on HarmBench & StrongREJECT

96.8%
Gemini-2.5-Pro (HarmBench)
98.8%
Gemini-2.5-Flash (StrongREJECT)
97.6%
GPT-oss-120B (HarmBench)
96.9%
Claude-Haiku-4.5 (StrongREJECT)

Full Comparison with All Baselines

Method Gemini-2.5-Flash Gemini-2.5-Pro GPT-oss-120B Claude-Haiku-4.5
FS↑ PS↑ V↓ R↓ FS↑ PS↑ V↓ R↓ FS↑ PS↑ V↓ R↓ FS↑ PS↑ V↓ R↓
HarmBench Dataset
Vanilla 0.151 0.032 0.000 0.818 0.222 0.064 0.000 0.714 0.048 0.032 0.032 0.889 0.008 0.016 0.000 0.976
AutoDAN 0.767 0.050 0.017 0.167 0.921 0.016 0.008 0.056 0.103 0.032 0.032 0.833 0.008 0.008 0.000 0.984
PAIR 0.810 0.064 0.015 0.111 0.905 0.071 0.008 0.056 0.278 0.214 0.405 0.492 0.032 0.040 0.048 0.880
PAP (Logical) 0.230 0.040 0.016 0.714 0.214 0.040 0.016 0.730 0.080 0.056 0.043 0.821 0.000 0.008 0.000 0.992
ActorBreaker 0.331 0.102 0.095 0.472 0.325 0.119 0.183 0.373 0.087 0.175 0.103 0.635 0.079 0.087 0.119 0.714
X-Teaming 0.595 0.056 0.016 0.333 0.762 0.063 0.008 0.167 0.071 0.056 0.071 0.802 0.000 0.000 0.000 1.000
Multi-Agent Jailbreak 0.794 0.143 0.040 0.024 0.818 0.143 0.032 0.008 0.762 0.167 0.048 0.024 0.786 0.119 0.048 0.048
CKA-Agent (ours) 0.968 0.025 0.000 0.007 0.968 0.025 0.007 0.000 0.976 0.016 0.008 0.000 0.960 0.024 0.008 0.008
StrongREJECT Dataset
Vanilla 0.012 0.000 0.000 0.988 0.019 0.031 0.000 0.951 0.000 0.025 0.019 0.957 0.000 0.012 0.000 0.988
AutoDAN 0.463 0.037 0.025 0.475 0.852 0.012 0.000 0.136 0.080 0.025 0.019 0.877 0.006 0.000 0.006 0.988
PAIR 0.827 0.062 0.019 0.092 0.826 0.056 0.012 0.106 0.099 0.031 0.019 0.851 0.049 0.037 0.025 0.889
PAP (Logical) 0.154 0.012 0.019 0.815 0.130 0.043 0.000 0.827 0.080 0.056 0.043 0.821 0.000 0.006 0.000 0.994
ActorBreaker 0.340 0.111 0.043 0.506 0.333 0.093 0.068 0.506 0.136 0.167 0.074 0.624 0.068 0.080 0.074 0.778
X-Teaming 0.679 0.068 0.012 0.241 0.809 0.062 0.019 0.111 0.130 0.093 0.031 0.747 0.000 0.000 0.000 1.000
Multi-Agent Jailbreak 0.827 0.099 0.019 0.056 0.790 0.099 0.037 0.074 0.772 0.167 0.037 0.025 0.815 0.099 0.025 0.062
CKA-Agent (ours) 0.988 0.006 0.000 0.006 0.951 0.043 0.000 0.006 0.982 0.012 0.006 0.000 0.969 0.025 0.006 0.000

Metrics: FS = Full Success, PS = Partial Success, V = Vacuous, R = Refusal. ↑ means higher is better, ↓ means lower is better. Best results highlighted, second-best in blue.

⚙️ Evaluation Setup

Datasets: HarmBench (126 behaviors) + StrongREJECT (162 prompts) = 288 high-stakes harmful prompts

Target Models: Gemini-2.5-Flash, Gemini-2.5-Pro, GPT-oss-120B, Claude-Haiku-4.5

Attacker Model: Qwen3-32B-abliterated (for all methods requiring auxiliary models)

Judge Model: Gemini-2.5-Flash with 4-level rubric (Full Success, Partial Success, Vacuous, Refusal)

Note: All evaluation details, including judge prompts and scoring rubrics, are available in the arXiv paper. All experiments were conducted during Oct-Nov 2025.

Key Findings:

  • POA Catastrophic Failure: Prompt optimization methods (PAIR, AutoDAN, PAP) collapse on robustly defended models. PAIR drops from 90.5% → 3.2% on Claude-Haiku-4.5, revealing that harmful intent remains semantically detectable regardless of algorithmic sophistication.
  • DOA Superiority: Decomposition-based methods maintain consistent performance across all targets. Multi-Agent Jailbreak achieves 76.2%–81.8%, representing 24× improvement over PAIR on the most defended model.
  • CKA-Agent SOTA: Achieves 15-21 percentage point gains over Multi-Agent Jailbreak and up to 96× improvement over POA methods on robustly defended models through adaptive exploration that dynamically learns from target responses.
  • Critical Vulnerability Exposed: Current guardrails effectively detect optimized harmful prompts but cannot aggregate intent across adaptively constructed innocuous queries.

Context-Aware Defense: CKA-Agent vs CKA-Agent-Branch

Research Question: Can providing conversation history help target models detect correlated knowledge attacks?

Experimental Setup:

  • CKA-Agent: Each sub-query sent independently (no history)
  • CKA-Agent-Branch: Each sub-query includes full conversation history from its branch
Model Dataset CKA-Agent CKA-Agent-Branch Degradation
Gemini-2.5-Flash HarmBench 96.8% 92.1% -4.7%
Gemini-2.5-Flash StrongREJECT 98.8% 96.9% -1.9%
GPT-oss-120B HarmBench 97.6% 78.6% -19.0%
Claude-Haiku-4.5 HarmBench 96.0% 88.9% -7.1%

🔍 Defense Insights: While context-aware defense degrades performance by 2-19%, CKA-Agent-Branch still achieves 78.6%+ success rates across all models. This reveals current LLMs struggle to aggregate intent across multi-turn interactions, even with full conversation history.

💡 Defense Implication: Future guardrails must enhance cross-query intent aggregation and long-context reasoning.

⚙️ Methodology

How CKA-Agent Works

CKA-Agent reformulates jailbreaking from static prompt optimization to dynamic knowledge decomposition and adaptive tree search. The framework operates through iterative exploration cycles:

  1. Generate locally harmless sub-queries that extract correlated knowledge from the target model
  2. Execute queries against the target and collect responses
  3. Evaluate responses using hybrid scoring (introspection + target feedback)
  4. Branch adaptively based on UCT-guided selection to explore multiple promising paths
  5. Synthesize accumulated knowledge from successful exploration trajectories
  6. Backpropagate failure signals to guide future iterations toward unexplored high-value regions

CKA-Agent Architecture

CKA-Agent Framework
Figure 2: The CKA-Agent framework architecture. Each iteration performs: (1) Selection via UCT policy to identify the most promising leaf node from the entire tree; (2) Depth-First Expansion generating and executing sub-queries until reaching a terminal state; (3) Hybrid Evaluation combining introspection scores (query quality) and target feedback scores (response informativeness); (4) Synthesis of accumulated knowledge along the explored path; (5) Backpropagation of failure signals to guide subsequent iterations toward unexplored regions.

Adaptive Branching Search Algorithm

  • Selection: Global UCT policy selects the single most promising leaf: argmax(f_v + c√(ln N_parent / N_v))
  • Expansion: Depth-first expansion with adaptive branching (B=1 for confident paths, B≤3 for uncertainty)
  • Evaluation: Hybrid scoring = α·(introspection) + (1-α)·(target feedback), replacing costly MCTS rollouts
  • Termination: Success when synthesis achieves judge score ≥ τ; otherwise backpropagate and iterate
  • Efficiency: 70-95% first-iteration success; 92-95% success within two iterations

Additional Analysis

Defense Robustness: POA vs DOA Under Guardrails

Defense Comparison
Figure 3: Performance across defense mechanisms (Target: Gemini-2.5-Flash). LLM Guard Detection: POA methods (AutoDAN, PAIR, PAP, ActorBreaker, X-Teaming) suffer catastrophic failure due to prompt interception, while DOA methods (Multi-Agent, CKA-Agent) maintain near-perfect success rates since sub-queries are individually benign. Mutation Defenses (Rephrasing, Perturbation): Minimal impact on most methods; modern LLMs exhibit strong tolerance to lexical perturbations. Only AutoDAN's evolutionary patterns are significantly disrupted.

Adaptive Branching: Progressive Recovery from Failures

Adaptive Branching Results
Figure 4: Cumulative success rates across iterations (ablation study). First-iteration performance: 70-95% success demonstrates effective core design—the attack agent successfully leverages target responses while the hybrid evaluator distinguishes high-quality knowledge extraction from refusals. Multi-iteration robustness: When sub-queries are blocked or yield low-quality responses, UCT-based selection identifies the most promising unexplored nodes to initiate new trajectories. 92-95% of eventual successes occur within two iterations, validating both efficiency and the fundamental value of tree-structured adaptive recovery.

Cost-Performance Trade-off: Pareto Optimality

Cost Performance Analysis
Figure 5: Attack success rate vs. resource consumption on HarmBench (Target: Gemini-2.5-Flash). X-axis shows API calls (left) and total tokens (right) per sample; Y-axis shows Full Success Rate. CKA-Agent achieves the Pareto frontier: highest attack success rate (96.8%) with moderate cost. Competing methods either sacrifice performance to reduce cost or expend far more resources without matching CKA-Agent's effectiveness. This reflects the efficiency of adaptive branching search, which minimizes redundant queries and focuses exploration on promising knowledge paths.
Show More Cost-Performance Analysis Figures
Cost Performance Analysis
Figure 6: Attack success rate vs. resource consumption on HarmBench (Target: Gemini-2.5-Pro).
Cost Performance Analysis
Figure 7: Attack success rate vs. resource consumption on HarmBench (Target: GPT-oss-120B).

Evaluation Validity: Human-LLM Judge Alignment

To validate our LLM-as-Judge evaluation methodology, we conducted a human evaluation study with 40 randomly sampled cases using a between-subjects design:

  • Condition 1 (No Reasoning): 5 annotators evaluate responses with only prompts and outputs
  • Condition 2 (With Reasoning): 5 different annotators have access to the judge model's analytical reasoning (but not its final score)
Human-LLM Judge Alignment
Figure 8: Alignment between human and LLM judgments shows dramatic improvement when humans access judge reasoning. Spearman correlation increases from ρ=0.52 → 0.90. Without reasoning, human judgments exhibit high variance and systematic leniency bias—accurately assessing jailbreak response actionability requires domain-specific knowledge that annotators lack without independent research. When provided with the judge model's analytical reasoning, humans reach conclusions closely aligned with the LLM's assessments, indicating discrepancies stem from information asymmetry rather than fundamental disagreement on evaluation criteria. This validates our reliance on SOTA LLMs as automated judges for both scalability and reliability.

Citation

@misc{wei2025wolfsheepsclothingbypassing,
      title={A Wolf in Sheep's Clothing: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search}, 
      author={Rongzhe Wei and Peizhi Niu and Xinjie Shen and Tony Tu and Yifan Li and Ruihan Wu and Eli Chien and Olgica Milenkovic and Pan Li},
      year={2025},
      eprint={2512.01353},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2512.01353}, 
}

Responsible Disclosure

This research is conducted for academic purposes to identify vulnerabilities in LLM safety systems. We have responsibly disclosed our findings to affected model providers and advocate for enhanced defense mechanisms. The code and detailed attack prompts will be released following ethical review and coordinated disclosure timelines.