
Why We Format Chats: System, User, Assistant—and What Apollo 13 Taught Us

10 min read
November 9, 2025
By Devansh Choubey

Apollo 13's Comm Loops (1970)

When an oxygen tank exploded on Apollo 13, chaos didn't win. NASA's Mission Control ran on rigid communication loops:

  • CAPCOM was the only voice to the crew.
  • Specialists (EECOM, GUIDO, etc.) debated on internal loops and routed decisions through the Flight Director.
  • Each transmission had role + priority baked in, so in a crisis the right info reached the right ears.

That structure didn't change what was said—it changed how attention and memory were routed under pressure.

Chat LLMs work in a surprisingly similar way. We add a minimal protocol:

<SYS> director's notes (policies, style)
<USR> astronaut's question
<AST> ground's reply

Those tags are just tokens in a single sequence, but they act like Mission Control's loops: who's speaking, to whom, and what matters.
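
To make "just tokens in a single sequence" concrete, here's a toy sketch in plain Python. The <SYS>/<USR>/<AST> tags are this article's illustrative markers, not a standard template; production formats (ChatML, Llama's template, and friends) use their own.

```python
# Toy chat "template": flatten role-tagged messages into one token stream.
ROLE_TAGS = {"system": "<SYS>", "user": "<USR>", "assistant": "<AST>"}

def render(messages: list[dict]) -> str:
    """Concatenate messages into a single sequence with role markers."""
    return " ".join(f"{ROLE_TAGS[m['role']]} {m['content']}" for m in messages)

messages = [
    {"role": "system", "content": "Always answer in French."},
    {"role": "user", "content": "What is the color of the sky?"},
    {"role": "assistant", "content": "Bleu."},
]

print(render(messages))
# <SYS> Always answer in French. <USR> What is the color of the sky? <AST> Bleu.
```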

What the Format Buys Us

Clear Boundaries → Clean Learning. In supervised fine-tuning we compute loss only on the assistant span. Role tags mark that span unambiguously—no guessing where the model's response should begin.

Attention Anchors. Special tokens like <SYS> and <USR> become reliable landmarks. Long prompt? Doesn't matter. The model can re-find these markers even 10k tokens deep.
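
For the markers to work as landmarks, they first have to survive tokenization as single tokens. A minimal sketch with the Hugging Face tokenizer API; gpt2 is just a placeholder checkpoint, and the tags are the article's own rather than a standard template:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Register the role markers so each one maps to exactly one token ID
# instead of being split into sub-word pieces.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<SYS>", "<USR>", "<AST>"]}
)
model.resize_token_embeddings(len(tokenizer))  # add embedding rows for the new IDs

print(tokenizer.convert_tokens_to_ids("<SYS>"))  # one dedicated ID per marker
```

Each marker now has its own embedding row, which is what lets it learn a distinct, easy-to-retrieve direction during fine-tuning.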

Protocol Memory. Training with the same template you use in production isn't just good practice; it teaches the model that grammar. Fewer surprises, more consistency.
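
In practice you don't hand-roll that string. The tokenizer ships the template and renders it the same way at training and inference time. A sketch with the Hugging Face apply_chat_template API; the checkpoint name is just an example of a chat-tuned model:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")

messages = [
    {"role": "system", "content": "Always answer in French."},
    {"role": "user", "content": "What is the color of the sky?"},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,              # return the rendered string, not token IDs
    add_generation_prompt=True,  # append the assistant marker so the model answers next
)
print(prompt)  # the role-tagged string in the exact format used during fine-tuning
```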

Diagram: One Stream, Labeled Spans

← Token Stream (left to right) →
<SYS> Always answer in French . <USR> What is the color of the sky ? <AST> Bleu .
[ system span ] [ user span ] [ assistant span ] (generation direction →)

How Formatting Helps Attention

Here's a toy example that's surprisingly telling. Same content, two encodings. Watch where the model's attention goes for the next assistant token:

With Roles → Attention Peaks at the System Boundary and User Span

[Figure: "Attention with Explicit Roles" heatmap over the token stream <BOS> <SYS> Always answer in French . <USR> What is the color of the sky ? <AST> Bleu .]

Notice: strong attention (bright yellow) lands on the <SYS> and <USR> boundary tokens.

Without Roles → Attention Diffuses Across Many Mid-Sentence Tokens

[Figure: "Attention without Explicit Roles" heatmap over the token stream <BOS> Always answer in French . What is the color of the sky ? <AST>]

Attention is scattered: "is", "of", and "sky" receive significant weight because there are no clear role markers.

Why This Happens

Boundary tokens as beacons. Distinct markers learn unique directions in embedding space—they're easy to retrieve. Think of them as bright lighthouses in a sea of prose.

Less guesswork. The model doesn't have to infer which earlier sentence was the rule. <SYS> is a unique island it can always find.

Stability with distance. Long context window? These anchors counter the "lost in the middle" problem. At 100k tokens, you still need to find that system prompt.
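
If you want to poke at this yourself, here's a measurement sketch. It uses gpt2 purely as a stand-in; the freshly added tag embeddings are untrained, so the numbers will be noise. The point is the plumbing, which you would run on a chat-tuned model with its real template to see the boundary peaks.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer.add_special_tokens({"additional_special_tokens": ["<SYS>", "<USR>", "<AST>"]})
model.resize_token_embeddings(len(tokenizer))

prompt = "<SYS> Always answer in French . <USR> What is the color of the sky ? <AST>"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# Last layer, attention paid by the final position (the next-token query),
# averaged over heads: how much weight lands on each earlier token?
weights = out.attentions[-1][0, :, -1, :].mean(dim=0)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for tok, w in zip(tokens, weights):
    print(f"{tok:>12}  {w.item():.3f}")
```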

Training-Time View: Assistant-Only Loss

In SFT we mask labels outside the assistant span (loss = 0 on system/user + role tokens). Only the assistant's actual response gets backprop:

SFT Loss Mask: mask = 1 on assistant tokens, 0 everywhere else
<BOS> <SYS> Always answer in French . <EOS_SYS> <USR> What is the color of the sky ? <EOS_USR> <AST> Bleu .
■ Loss = 0 (masked) ■ Loss = computed (assistant tokens only)

Only the assistant's response ("Bleu.") contributes to the training loss
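
Here's what that mask looks like in code: a sketch using the Hugging Face convention that label -100 is ignored by the loss, with made-up token IDs and split point. In practice a collator such as TRL's DataCollatorForCompletionOnlyLM (or an assistant-masking chat template) marks the span for you.

```python
import torch
import torch.nn.functional as F

# Illustrative IDs: [<BOS>, <SYS>, ...system..., <USR>, ...user..., <AST>, Bleu, .]
input_ids = torch.tensor([[0, 1, 40, 41, 42, 2, 50, 51, 52, 53, 3, 60, 61]])
assistant_start = 11                 # index of the first assistant *content* token

labels = input_ids.clone()
labels[:, :assistant_start] = -100   # system, user, and role tokens: no loss

# Random logits stand in for model(input_ids).logits in this sketch.
vocab_size = 100
logits = torch.randn(1, input_ids.shape[1], vocab_size)

# Standard causal-LM shift: position t predicts token t+1.
loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, vocab_size),
    labels[:, 1:].reshape(-1),
    ignore_index=-100,               # masked positions contribute nothing
)
print(loss)  # computed only over the assistant tokens ("Bleu", ".")
```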

Chat formatting isn't cosmetic. Like NASA's comm protocols, it doesn't change the information—it changes how the system routes attention and memory.

In a crisis—or a 50k-token prompt—that structure is the difference between noise and signal.

References

  1. Patterson, E. S., Watts-Perotti, J., & Woods, D. D. (1999). Voice Loops as Coordination Aids in Space Shuttle Mission Control. Computer Supported Cooperative Work.
  2. NASA. (1970). Apollo 13 Flight Journal - Day 3: The Problem.
  3. Hugging Face. Chat Templates. Transformers Documentation.
  4. Hugging Face. Chat Templates. Hugging Face NLP Course.
  5. Hugging Face. SFT Trainer. TRL Documentation.
  6. Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv preprint.