AI alignment is not enough. While alignment techniques like...
Originally published on Tumblr.

AI alignment is not enough. While alignment techniques like Reinforcement Learning from Human Feedback (RLHF) and constitutional AI training aim to make models helpful and harmless, they fall short in securing models against adversarial instructions. The crux of the issue lies in the distinction between alignment and robustness. Alignment focuses on teaching models to refuse harmful requests from users, but it doesn’t equip them to differentiate between legitimate user requests and maliciously injected instructions.
Mathematically, this boils down to the difference between robust optimization and distributional optimization. Robust optimization, often achieved through adversarial training, is about preparing models to withstand worst-case scenarios. In contrast, distributional optimization, which underpins alignment, is about optimizing models to perform well on average across a distribution of tasks. This fundamental difference means that aligned models are not inherently robust models.
Recent jailbreak research highlights this vulnerability. Aligned models, despite their safety training, can be manipulated through prompt engineering to bypass their safety protocols. This isn’t just a theoretical concern; it’s a practical one. The more we train models to follow instructions, the more susceptible they become to instruction injection attacks. It’s a classic no-free-lunch scenario: enhancing a model’s ability to follow instructions inadvertently increases its vulnerability to adversarial manipulation.
Alignment and security, therefore, are orthogonal properties. They require fundamentally different training objectives. While alignment focuses on ethical and safe behavior, security demands resilience against adversarial tactics. Both are crucial, but they can’t be achieved through the same methods.
This isn’t just academic musing. Consider the recent AI funding bubbles and overpromised capabilities that have left many projects floundering. The hype often overlooks these nuanced technical challenges, leading to systems that are aligned in theory but insecure in practice. For a truly robust AI deployment, we must prioritize both alignment and security, recognizing that a strong, free, and secure society depends on it.