A Wolf in Sheep’s Clothing: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search
Current jailbreak methods focus on optimizing prompts to bypass guardrails, but these approaches
fail against modern defenses that detect malicious intent. We argue that a more fundamental
vulnerability lies in the interconnected nature of an LLM's internal knowledge.
Restricted information can be reconstructed by weaving together a series of queries that
individually appear benign but collectively lead to the harmful objective.
We introduce CKA-Agent (Correlated Knowledge Attack Agent), a framework that
operationalizes this vulnerability by reformulating jailbreaking as an adaptive tree search
over the target LLM's correlated knowledge. Instead of crafting a single malicious
prompt, CKA-Agent dynamically navigates the model's internal knowledge graph, using
the target's own responses to guide its multi-hop attack path. Through a simulation-free tree search
with a hybrid LLM evaluator, CKA-Agent achieves 96-99% attack success rates against
state-of-the-art commercial LLMs, representing a 15-21 percentage-point gain over the best decomposition
baseline and up to a 96× improvement over prompt optimization methods
on robustly defended models.
💡 Motivation: Adaptive Search over Correlated Knowledge
Target LLMs contain correlated knowledge that can be extracted through individually safe queries. Prompt Optimization Attacks (POA) try to access the target knowledge directly and are blocked. Our approach instead uses adaptive search: it dynamically explores correlated knowledge nodes, reroutes when a query is blocked, and synthesizes the gathered information to construct the target answer.
Experimental Results
Attack Success Rates on HarmBench & StrongREJECT
Full Comparison with All Baselines
Each target model spans four metric columns, in order: Gemini-2.5-Flash, Gemini-2.5-Pro, GPT-oss-120B, Claude-Haiku-4.5.

| Method | FS↑ | PS↑ | V↓ | R↓ | FS↑ | PS↑ | V↓ | R↓ | FS↑ | PS↑ | V↓ | R↓ | FS↑ | PS↑ | V↓ | R↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **HarmBench Dataset** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Vanilla | 0.151 | 0.032 | 0.000 | 0.818 | 0.222 | 0.064 | 0.000 | 0.714 | 0.048 | 0.032 | 0.032 | 0.889 | 0.008 | 0.016 | 0.000 | 0.976 |
| AutoDAN | 0.767 | 0.050 | 0.017 | 0.167 | 0.921 | 0.016 | 0.008 | 0.056 | 0.103 | 0.032 | 0.032 | 0.833 | 0.008 | 0.008 | 0.000 | 0.984 |
| PAIR | 0.810 | 0.064 | 0.015 | 0.111 | 0.905 | 0.071 | 0.008 | 0.056 | 0.278 | 0.214 | 0.405 | 0.492 | 0.032 | 0.040 | 0.048 | 0.880 |
| PAP (Logical) | 0.230 | 0.040 | 0.016 | 0.714 | 0.214 | 0.040 | 0.016 | 0.730 | 0.080 | 0.056 | 0.043 | 0.821 | 0.000 | 0.008 | 0.000 | 0.992 |
| ActorBreaker | 0.331 | 0.102 | 0.095 | 0.472 | 0.325 | 0.119 | 0.183 | 0.373 | 0.087 | 0.175 | 0.103 | 0.635 | 0.079 | 0.087 | 0.119 | 0.714 |
| X-Teaming | 0.595 | 0.056 | 0.016 | 0.333 | 0.762 | 0.063 | 0.008 | 0.167 | 0.071 | 0.056 | 0.071 | 0.802 | 0.000 | 0.000 | 0.000 | 1.000 |
| Multi-Agent Jailbreak | 0.794 | 0.143 | 0.040 | 0.024 | 0.818 | 0.143 | 0.032 | 0.008 | 0.762 | 0.167 | 0.048 | 0.024 | 0.786 | 0.119 | 0.048 | 0.048 |
| CKA-Agent (ours) | 0.968 | 0.025 | 0.000 | 0.007 | 0.968 | 0.025 | 0.007 | 0.000 | 0.976 | 0.016 | 0.008 | 0.000 | 0.960 | 0.024 | 0.008 | 0.008 |
| **StrongREJECT Dataset** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Vanilla | 0.012 | 0.000 | 0.000 | 0.988 | 0.019 | 0.031 | 0.000 | 0.951 | 0.000 | 0.025 | 0.019 | 0.957 | 0.000 | 0.012 | 0.000 | 0.988 |
| AutoDAN | 0.463 | 0.037 | 0.025 | 0.475 | 0.852 | 0.012 | 0.000 | 0.136 | 0.080 | 0.025 | 0.019 | 0.877 | 0.006 | 0.000 | 0.006 | 0.988 |
| PAIR | 0.827 | 0.062 | 0.019 | 0.092 | 0.826 | 0.056 | 0.012 | 0.106 | 0.099 | 0.031 | 0.019 | 0.851 | 0.049 | 0.037 | 0.025 | 0.889 |
| PAP (Logical) | 0.154 | 0.012 | 0.019 | 0.815 | 0.130 | 0.043 | 0.000 | 0.827 | 0.080 | 0.056 | 0.043 | 0.821 | 0.000 | 0.006 | 0.000 | 0.994 |
| ActorBreaker | 0.340 | 0.111 | 0.043 | 0.506 | 0.333 | 0.093 | 0.068 | 0.506 | 0.136 | 0.167 | 0.074 | 0.624 | 0.068 | 0.080 | 0.074 | 0.778 |
| X-Teaming | 0.679 | 0.068 | 0.012 | 0.241 | 0.809 | 0.062 | 0.019 | 0.111 | 0.130 | 0.093 | 0.031 | 0.747 | 0.000 | 0.000 | 0.000 | 1.000 |
| Multi-Agent Jailbreak | 0.827 | 0.099 | 0.019 | 0.056 | 0.790 | 0.099 | 0.037 | 0.074 | 0.772 | 0.167 | 0.037 | 0.025 | 0.815 | 0.099 | 0.025 | 0.062 |
| CKA-Agent (ours) | 0.988 | 0.006 | 0.000 | 0.006 | 0.951 | 0.043 | 0.000 | 0.006 | 0.982 | 0.012 | 0.006 | 0.000 | 0.969 | 0.025 | 0.006 | 0.000 |
Metrics: FS = Full Success, PS = Partial Success, V = Vacuous, R = Refusal. ↑ means higher is better, ↓ means lower is better.
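The four categories are mutually exclusive per test case, which is why each row above sums to ≈1 up to rounding; the reported numbers are presumably per-category fractions, as in this hypothetical sketch:

```python
from collections import Counter

def label_rates(judge_labels):
    # judge_labels: one label per test case, one of "FS", "PS", "V", "R".
    # Returns the per-category fractions reported in the results tables;
    # since the labels are mutually exclusive, the four rates sum to 1.
    counts = Counter(judge_labels)
    n = len(judge_labels)
    return {label: counts[label] / n for label in ("FS", "PS", "V", "R")}
```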
⚙️ Evaluation Setup
Datasets: HarmBench (126 behaviors) + StrongREJECT (162 prompts) = 288 high-stakes harmful prompts
Target Models: Gemini-2.5-Flash, Gemini-2.5-Pro, GPT-oss-120B, Claude-Haiku-4.5
Attacker Model: Qwen3-32B-abliterated (for all methods requiring auxiliary models)
Judge Model: Gemini-2.5-Flash with 4-level rubric (Full Success, Partial Success, Vacuous, Refusal)
Note: All evaluation details, including judge prompts and scoring rubrics, are available in the arXiv paper. All experiments were conducted during Oct-Nov 2025.
Key Findings:
- POA Catastrophic Failure: Prompt-optimization methods (PAIR, AutoDAN, PAP) collapse on robustly defended models. PAIR drops from 90.5% on Gemini-2.5-Pro to 3.2% on Claude-Haiku-4.5, revealing that harmful intent remains semantically detectable regardless of algorithmic sophistication.
- DOA Superiority: Decomposition-based methods maintain consistent performance across all targets. Multi-Agent Jailbreak achieves 76.2%-81.8% on HarmBench, a 24× improvement over PAIR on the most defended model (Claude-Haiku-4.5).
- CKA-Agent SOTA: Achieves 15-21 percentage point gains over Multi-Agent Jailbreak and up to 96× improvement over POA methods on robustly defended models through adaptive exploration that dynamically learns from target responses.
- Critical Vulnerability Exposed: Current guardrails effectively detect optimized harmful prompts but cannot aggregate intent across adaptively constructed innocuous queries.
Context-Aware Defense: CKA-Agent vs CKA-Agent-Branch
Research Question: Can providing conversation history help target models detect correlated knowledge attacks?
Experimental Setup:
- CKA-Agent: each sub-query is sent independently, with no conversation history
- CKA-Agent-Branch: each sub-query includes the full conversation history of its branch (sketched in code below)
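For concreteness, a minimal sketch of the two query modes; `target_chat` and the message format are illustrative assumptions, not the released implementation:

```python
# Hypothetical helpers illustrating the two conditions; `target_chat` stands in
# for any chat-completion API that takes a list of {role, content} messages.

def ask_independent(target_chat, subquery: str) -> str:
    # CKA-Agent: every sub-query opens a fresh conversation, so the target
    # never sees the rest of the attack path.
    return target_chat([{"role": "user", "content": subquery}])

def ask_with_branch_history(target_chat, branch_history: list, subquery: str) -> str:
    # CKA-Agent-Branch: the target receives the full history of this branch,
    # giving its guardrails a chance to aggregate intent across turns.
    reply = target_chat(branch_history + [{"role": "user", "content": subquery}])
    branch_history += [{"role": "user", "content": subquery},
                       {"role": "assistant", "content": reply}]
    return reply
```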
| Model | Dataset | CKA-Agent | CKA-Agent-Branch | Degradation (pp) |
|---|---|---|---|---|
| Gemini-2.5-Flash | HarmBench | 96.8% | 92.1% | -4.7 |
| Gemini-2.5-Flash | StrongREJECT | 98.8% | 96.9% | -1.9 |
| GPT-oss-120B | HarmBench | 97.6% | 78.6% | -19.0 |
| Claude-Haiku-4.5 | HarmBench | 96.0% | 88.9% | -7.1 |
🔍 Defense Insights: While context-aware defense degrades performance by 2-19 percentage points, CKA-Agent-Branch still achieves at least 78.6% success across all models. This reveals that current LLMs struggle to aggregate intent across multi-turn interactions, even with the full conversation history.
💡 Defense Implication: Future guardrails must enhance cross-query intent aggregation and long-context reasoning.
⚙️ Methodology
How CKA-Agent Works
CKA-Agent reformulates jailbreaking from static prompt optimization to dynamic knowledge decomposition and adaptive tree search. The framework operates through iterative exploration cycles (a code sketch follows the list):
- Generate locally harmless sub-queries that extract correlated knowledge from the target model
- Execute queries against the target and collect responses
- Evaluate responses using hybrid scoring (introspection + target feedback)
- Branch adaptively based on UCT-guided selection to explore multiple promising paths
- Synthesize accumulated knowledge from successful exploration trajectories
- Backpropagate failure signals to guide future iterations toward unexplored high-value regions
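A minimal sketch of one such cycle, for illustration only: `select_leaf` and `hybrid_score` are spelled out under the search algorithm below, and the remaining helper names (`generate_subqueries`, `introspect`, `synthesize`, `judge`) are hypothetical stand-ins for the paper's attacker, target, and evaluator components.

```python
def exploration_cycle(tree, attacker, target, tau=0.8):
    # One iteration: select -> expand -> evaluate -> synthesize -> backpropagate.
    leaf = select_leaf(tree.root)                        # UCT-guided selection (defined below)
    for subquery in attacker.generate_subqueries(leaf):  # locally harmless sub-queries
        response = target.query(subquery)                # execute against the target
        score = hybrid_score(attacker.introspect(subquery, response),
                             attacker.feedback(response))  # simulation-free hybrid evaluation
        leaf.add_child(subquery, response, score)        # adaptive branching
    answer = attacker.synthesize(tree.best_trajectory())  # weave the gathered knowledge together
    if attacker.judge(answer) >= tau:                    # success: judge score >= threshold tau
        return answer
    tree.backpropagate(leaf)                             # failure signal guides the next cycle
    return None
```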
CKA-Agent Architecture
Adaptive Branching Search Algorithm
- Selection: Global UCT policy selects the single most promising leaf: argmax(f_v + c√(ln N_parent / N_v)); see the sketch after this list
- Expansion: Depth-first expansion with adaptive branching (B = 1 on confident paths, B ≤ 3 under uncertainty)
- Evaluation: Hybrid scoring = α·(introspection) + (1-α)·(target feedback), replacing costly MCTS rollouts
- Termination: Success when synthesis achieves judge score ≥ τ; otherwise backpropagate and iterate
- Efficiency: 70-95% first-iteration success; 92-95% success within two iterations
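A hedged sketch of the selection and scoring primitives named above; the exploration constant `c` and mixing weight `alpha` are illustrative defaults, not the paper's tuned values.

```python
import math

def uct(node, c=1.414):
    # UCT: node value f_v plus exploration bonus c*sqrt(ln N_parent / N_v);
    # unvisited nodes are prioritized.
    if node.visits == 0:
        return float("inf")
    return node.value + c * math.sqrt(math.log(node.parent.visits) / node.visits)

def select_leaf(root):
    # Global UCT policy: descend to the single most promising leaf.
    node = root
    while node.children:
        node = max(node.children, key=uct)
    return node

def hybrid_score(introspection, target_feedback, alpha=0.5):
    # Hybrid evaluator replacing costly MCTS rollouts:
    # alpha * (introspection) + (1 - alpha) * (target feedback).
    return alpha * introspection + (1 - alpha) * target_feedback
```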
Additional Analysis
Defense Robustness: POA vs DOA Under Guardrails
Adaptive Branching: Progressive Recovery from Failures
Cost-Performance Trade-off: Pareto Optimality
Evaluation Validity: Human-LLM Judge Alignment
To validate our LLM-as-Judge evaluation methodology, we conducted a human evaluation study with 40 randomly sampled cases using a between-subjects design:
- Condition 1 (No Reasoning): 5 annotators evaluate responses with only prompts and outputs
- Condition 2 (With Reasoning): 5 different annotators have access to the judge model's analytical reasoning (but not its final score)
Citation
@misc{wei2025wolfsheepsclothingbypassing,
title={A Wolf in Sheep's Clothing: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search},
author={Rongzhe Wei and Peizhi Niu and Xinjie Shen and Tony Tu and Yifan Li and Ruihan Wu and Eli Chien and Olgica Milenkovic and Pan Li},
year={2025},
eprint={2512.01353},
archivePrefix={arXiv},
primaryClass={cs.CR},
url={https://arxiv.org/abs/2512.01353},
}
Responsible Disclosure
This research is conducted for academic purposes to identify vulnerabilities in LLM safety systems. We have responsibly disclosed our findings to affected model providers and advocate for enhanced defense mechanisms. The code and detailed attack prompts will be released following ethical review and coordinated disclosure timelines.