February 6, 2024
LLM Security
Разборы статей, блогов и новостей про безопасность и атаки на большие языковые модели.
Джейлбрейки
- Jailbroken: How Does LLM Safety Training Fail?, Wei et al., 2023
- Universal and Transferable Adversarial Attacks on Aligned Language Models, Zou et al., 2024
- AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models, Liu et al., 2024
- MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots, Deng et al., 2023
- Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study, Liu et al., 2023
- Jailbreaking Black Box Large Language Models in Twenty Queries, Chao et al., 2023
- Tree of Attacks: Jailbreaking Black-Box LLMs Automatically, Mehrotra et al., 2023
- Fundamental Limitations of Alignment in Large Language Models, Wolf et al., 2023
- Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild, Inie et al., 2023
- ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs, Jiang et al., 2024
- Refusal in Language Models Is Mediated by a Single Direction, Arditi et al, 2024
- Does Refusal Training in LLMs Generalize to the Past Tense?, Andriushchenko and Flammarion, 2024
- Best-of-N Jailbreaking, John Hughes et al., 2024
- Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack, Mark Russinovich et al, Microsoft, 2023
- Removing RLHF Protections in GPT-4 via Fine-Tuning, Qiusi Zhan et al., 2023
- Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models, Xianjun Yang et al, 2023
- LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B, Simon Lermen et al, 2023
Prompt Injection
- Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection, Greshake at a.l, 2023
- ComPromptMized: Unleashing Zero-click Worms that Target GenAI-Powered Applications, Cohen et al., 2024
- A New Era in LLM Security: Exploring Security Concerns in Real-World LLM-based Systems, Wu et al., 2024
- The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions, Wallace et al., 2024
- Knowledge Return Oriented Prompting (KROP), Martin et al., 2024
- Data Exfiltration from Slack AI via indirect prompt injection, PromptArmor, 2024
Offensive LLM
- An update on disrupting deceptive uses of AI, Nimmo & Flossman, OpenAI, 2024
- Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities, Anurin et al., Apart Research, 2024
- Adversarial Misuse of Generative AI, Google Threat Intelligence Group, 2025
- Evaluating Large Language Models' Capability to Launch Fully Automated Spear Phishing Campaigns: Validated on Human Subjects, Heiding et al., 2024
- Disrupting malicious uses of AI: February 2025 update, Nimmo et al., OpenAI, 2025
Защита LLM-систем
- Baseline Defenses for Adversarial Attacks Against Aligned Language Models, Jain et al., 2023
- Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes, Hu et al., 2024
- Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations, Inan et al., 2023
- ShieldGemma: Generative AI Content Moderation Based on Gemma, ShieldGemma Team, Google LLC, 2024
- Rapid Response: Mitigating LLM Jailbreaks with a Few Examples, Peng et al., 2024
- Defending Against Indirect Prompt Injection Attacks With Spotlighting, Keegan Hines et al, Microsoft, 2024
- Are you still on track!? Catching LLM Task Drift with Activations, Abdelnabi et al., 2024
- Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming, Mrinank Sharma et al., Anthropic. 2025
- The Dual LLM pattern for building AI assistants that can resist prompt injection, Simon Willison, 2023
Бенчмарки
- Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models, Bhatt et al., 2023
- CYBERSECEVAL 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models, Bhatt et al., 2024
- CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models, Wan et al., 2024
- AIR-BENCH 2024: A Safety Benchmark Based on Risk Categories from Regulations and Policies, Zeng et al., 2024
- AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents, Edoardo Debenedetti et al., 2024
- A StrongREJECT for Empty Jailbreaks, Souly et al., 2024
- Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models, Andy K. Zhang et al, Stanford, 2024
- The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning, Li et al, 2024
Policy
- The Coming Wave, Mustafa Suleyman, 2024
- AI existential risk probabilities are too unreliable to inform policy, Narayanan and Kapoor, 2024
Safety & Reliability
- Towards Understanding Sycophancy in Language Models, Sharma et al, 2023
- Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models, Denison et al, 2024
- AI Risk Categorization Decoded (AIR 2024): From Government Regulations to Corporate Policies, Zeng et al., 2024
- Constitutional AI: Harmlessness from AI Feedback, Bai et al., Anthropic, 2022
- Frontier Models are Capable of In-context Scheming, Alexander Meinke et al., Apollo Research, 2024
- Demonstrating specification gaming in reasoning models, Alexander Bondarenko et al., Palisade Research, 2025
- Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs, Jan Betley et al., 2025
AI Alignment Course
Model Stealing & Inversion
- LLMmap: Fingerprinting For Large Language Models, Pasquini et al., 2024
- Stealing Part of a Production Language Model, Carlini et al., 2024
Гайдлайны
Misc
- What Was Your Prompt? A Remote Keylogging Attack on AI Assistants, Weiss et al., 2024
- Smuggling arbitrary data through an emoji, Paul Butler, 2025
Полезные каналы
- https://t.me/addlist/40D9BRf6rDoxNzg6 - большой список каналов на тему AI + Security
- https://t.me/pwnai
- https://t.me/rybolos_channel
- https://t.me/aisecnews
- https://t.me/kokuykin
Теги: AI Safety, AI Security, LLM Security, LLM Safety, Adversarial ML, AI in Cybersecurity, атаки на LLM, атаки на большие языковые модели, защита больших языковых моделей, разборы на русском языке
February 6, 2024, 18:30
0 views
0 reposts