Longueur Is the Attack Surface Alignment Won’t Close TL;DR:...

IXN.AI Research · June 2026

Longueur Is the Attack Surface Alignment Won’t Close

TL;DR: RLHF and constitutional training optimize models to be agreeable under expected prompts, but prompt-injection defense requires adversarial robustness over instruction provenance, which is a different objective.

Alignment is not a firewall.

The tedious length of modern AI workflows — the longueurs of system prompts, tool traces, retrieved documents, email threads, PDFs, tickets, browser pages, and chat history — is exactly where security fails. A model doesn’t “see” authority the way an operating system does. It sees tokens. RLHF teaches it that some token continuations are preferred: refuse the bomb recipe, avoid slurs, don’t fabricate too confidently, be helpful when the user asks nicely. Constitutional AI adds another layer of preference shaping, usually by scoring outputs against written principles. That can produce a more polite assistant. It doesn’t produce an access-control mechanism.

Here’s the technical mismatch. Alignment is usually distributional optimization: maximize expected reward over samples from a training or deployment-like prompt distribution, roughly max_θ E_{x~D}[R(y_θ(x), x)]. Robust injection defense is closer to adversarial optimization: maximize worst-case performance under perturbations and maliciously constructed contexts, roughly max_θ E_{x~D}[min_{δ∈A(x)} S(y_θ(x ⊕ δ), x)], where δ may be an injected instruction hidden in a webpage, document, calendar invite, or tool output. Those aren’t the same problem. The first says “behave well on prompts like these.” The second says “behave correctly even when an attacker controls part of the input channel.” A model can score beautifully on the first while failing catastrophically on the second. That’s not a bug in the benchmark; it’s the objective doing what it was asked to do.

This is why jailbreak research keeps looking embarrassingly repetitive. Different wrappers, same failure mode. Ask directly for disallowed content and the aligned model refuses. Wrap the same intent in roleplay, translation, formatting constraints, fake policies, multi-turn pressure, or “ignore previous instructions,” and some fraction of attempts succeed — not because the model has a secret evil module, but because instruction-following and safety refusal are both learned textual behaviors competing inside one sequence model. The model isn’t reliably parsing “user request” versus “untrusted quoted text” versus “retrieved page content” as separate security principals. It’s performing next-token inference conditioned on a long context. Longueur becomes privilege confusion.

Alignment teaches preference compliance, not provenance tracking. RLHF can make “I can’t help with that” more likely after recognizable harmful requests, but it doesn’t impose a non-bypassable lattice of authority across system, developer, user, tool, and data channels.
Robustness requires adversarial training and formal boundaries. Injection defense needs threat models, taint tracking, constrained decoding, capability separation, sandboxed tools, least privilege, and evaluation against adaptive attackers — not just nicer refusals.
There’s a no-free-lunch tradeoff. The more we reward a model for being flexible, obedient, context-sensitive, and able to infer implicit instructions from messy prose, the more we train exactly the behavior attackers exploit: treating arbitrary text as operational guidance.

The AI funding cycle keeps promising “agentic” systems that read the internet, operate browsers, file tickets, and transact on our behalf; the quieter lesson from overhyped demos and failed deployments is that reliability doesn’t emerge from vibes, scale, or another safety preamble. A strong society doesn’t need assistants that merely sound careful while collapsing under adversarial text. It needs systems whose authority boundaries are engineered, tested, and limited before they’re placed between people and essential services. Stop calling aligned models secure models; demand security objectives, adversarial evaluations, and hard containment before giving language models real power.