AI promises are often oversold, and nowhere is this more evident...

IXN.AI Research · March 2026

Originally published on Tumblr.



AI promises are often oversold, and nowhere is this more evident than in the realm of vision-language models (VLMs). These systems, which combine visual and textual data processing, are hailed as the future of AI. But beneath the surface, they harbor vulnerabilities that are both intricate and alarming. Let’s dive into the technical depths of gambit and multi-modal injection attacks, which exploit these very vulnerabilities.

Adversarial images with embedded text instructions are a prime example. These images exploit the optical character recognition (OCR) preprocessing in VLMs. By embedding text instructions within images, attackers can bypass traditional text-based input filters. This typographic attack is particularly insidious because it leverages the visual rendering of instructions, which are often overlooked by systems designed to scrutinize text inputs alone.

The attack surface is further expanded by CLIP encoders, which map images and text to the same embedding space. This shared space means that instruction-following behavior can transfer to visual inputs, even if the model wasn’t explicitly trained for such tasks. It’s a bit like teaching a dog to fetch a ball and then being surprised when it fetches a stick—except in this case, the “stick” could be a malicious command hidden in an image.

Sanitizing image inputs is a Herculean task. Pixel-level perturbations can encode arbitrary instructions through steganography, a technique that hides information within seemingly innocuous data. This makes it nearly impossible to filter out malicious content without also discarding legitimate data. The challenge is compounded by the fact that multi-modal systems don’t just add attack vectors—they multiply them. Each modality introduces its own vulnerabilities, and when combined, they create a complex web of potential exploits.

Consider the recent controversy surrounding AI-generated art and the ethical implications of its use. This story highlights the broader issue of AI systems being deployed without fully understanding their limitations and risks. In the case of VLMs, the impossibility of validating semantic safety across modalities with fundamentally different information densities is a critical concern. Text and images convey information in vastly different ways, and ensuring that both are safe and accurate is a daunting task.

Ultimately, the promise of AI should be tempered with caution. While the potential benefits are immense, the risks are equally significant. As engineers and developers, it’s our responsibility to prioritize social wellbeing over corporate interests. A strong economy arises from a strong, free, and secure society, and that means building AI systems that are not only powerful but also safe and trustworthy.