We spend a lot of time red-teaming our AI models, but this week I'm focusing more on red-teaming the data they ingest. A report in WIRED about a Gemini vulnerability serves as a strong reminder that an AI agent is only as secure as the information it's allowed to consume.
Researchers demonstrated they could control smart home devices connected to a Google account with a simple, yet sneaky, technique. The attack didn't require hacking Google's infrastructure; it involved sending a malicious calendar invite. Hidden within the event details was a prompt injection instructing the AI to perform an action, like opening the window blinds.
The truly dangerous part was the mechanism, which the researchers called "delayed automatic tool invocation." When the user asked Gemini to summarize their calendar, the AI processed the hidden command but didn't act immediately. The command was set to execute only when the user said a common follow-up word, like "great" or "thanks." This completely separates the malicious input from the eventual physical-world output.
The security perimeter isn't just the user's chat input anymore. It includes every data source the agent interacts with. The key control is designing better confirmation flows. If an AI is about to take a significant action based on data from an unverified source, it needs to trigger a specific, out-of-band confirmation from the user. We must treat inbound, unstructured data as potentially hostile by default.
