AI agents under attack: prompt injection exploits and new defenses
Enterprises deploying AI assistants and desktop agents face real prompt-injection and safety failures in tools like Copilot, ChatGPT, Grok, and OpenClaw, while new detection methods that inspect LLM internals are emerging to harden defenses. Security researchers have shown that popular assistants can be steered into malware generation, phishing, and data exfiltration through prompt injection and social engineering, with the risk rising when models pull in external data sources, as covered in [WebProNews](https://www.webpronews.com/when-your-ai-assistant-turns-against-you-how-hackers-are-weaponizing-copilot-grok-and-chatgpt-to-spread-malware/). Companies are also restricting high-privilege agents like [OpenClaw](https://arstechnica.com/ai/2026/02/openclaw-security-fears-lead-meta-other-ai-firms-to-restrict-its-use/), citing unpredictability and privacy risk, even as OpenAI commits to keeping it open source. The fragility extends to retrieval and web-grounded answers: a reporter manipulated [ChatGPT and Google’s AI](https://www.bbc.com/future/article/20260218-i-hacked-chatgpt-and-googles-ai-and-it-only-took-20-minutes?_bhlid=fca599b94127e0d5009ae7449daf996994809fc2) with a single blog post, underscoring how easily such systems could be influenced at scale. AppSec leaders are already rethinking their strategy for AI-era vulnerabilities, as flagged by [The New Stack](https://thenewstack.io/ai-agents-appsec-strategy/). Going beyond input/output filters, Zenity proposes a maliciousness classifier that reads the model’s internal activations to flag manipulative prompts, and it is releasing a paper, infrastructure, and cross-domain benchmarks to foster “agentic security” practices, as detailed by [Zenity Labs](https://labs.zenity.io/p/looking-inside-a-maliciousness-classifier-based-on-the-llm-s-internals).
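
To make the idea of classifying on internal activations concrete, here is a minimal sketch of a linear probe: a logistic-regression classifier trained on per-prompt activation vectors. It is not Zenity's implementation. The "activations" are synthetic Gaussian vectors standing in for hidden states that would normally be extracted from a model, and every name (`make_synthetic_activations`, `train_probe`, `score`, `DIM`) is hypothetical.

```python
import math
import random


def make_synthetic_activations(n, dim, shift, rng):
    """Return n vectors of length dim drawn from a Gaussian centered at `shift`.

    ASSUMPTION: real activations would come from a model's hidden layers;
    these synthetic vectors only stand in for them.
    """
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    """Probability that activation vector x belongs to the 'malicious' class."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_probe(xs, ys, epochs=200, lr=0.1):
    """Fit a logistic-regression probe with plain batch gradient descent."""
    dim = len(xs[0])
    w = [0.0] * dim
    b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = score(w, b, x) - y  # gradient of log loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


rng = random.Random(0)
DIM = 8  # hypothetical activation width; real hidden states are far larger

# Label 0 = benign prompt, label 1 = manipulative prompt (synthetic stand-ins).
xs = (make_synthetic_activations(100, DIM, -1.0, rng)
      + make_synthetic_activations(100, DIM, 1.0, rng))
ys = [0] * 100 + [1] * 100
w, b = train_probe(xs, ys)

# Evaluate on fresh synthetic samples from the same two distributions.
test_benign = make_synthetic_activations(50, DIM, -1.0, rng)
test_malicious = make_synthetic_activations(50, DIM, 1.0, rng)
correct = (sum(score(w, b, x) < 0.5 for x in test_benign)
           + sum(score(w, b, x) >= 0.5 for x in test_malicious))
accuracy = correct / 100
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

In a real deployment the probe's inputs would be activations captured while the model processes each prompt, and the accuracy seen here says nothing about real-world performance, since the two synthetic classes are separable by construction.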