LLM Wire Format Benchmark: Which Format Can AI Actually Read and Write?
Every LLM wire format claims token savings. Nobody proves whether AI models can actually comprehend the format at scale, or produce valid output in it. We did both: 1,300+ LLM evaluations across 10 models from Anthropic, OpenAI, and Google. Deterministic ground truth, no LLM judge, reproducible from one command.
The results are unambiguous. JSON breaks at 500 records. GPT-5.5 returns empty strings: it can't even attempt an answer at 53,000 tokens of repeated field names. Opus spends 143 lines manually enumerating symbols to count them and still gets the wrong answer. The format designed for "human readability" is incomprehensible to the systems actually reading it.
TOON is worse than it looks. Its official decoder rejects LLM-generated output on 7 of 9 models tested. Claude Opus scores 0/5 on TOON generation. GPT-5.4: 0/5. Gemini 3.1 Pro: 0/5. The error is always the same: toon: cannot assign string to int. The model writes "target" in the distance column because that's what it was told. TOON expects 0. The format's flat tabular design forces an encoding step that no model performs unprompted. This is a structural design flaw, not a training problem.
GCF wins both dimensions on every model tested. Four models achieve 100% comprehension: Claude Sonnet, Gemini 2.5 Pro, Gemini 3.1 Pro, and Gemini 3.5 Flash. Every frontier model produces valid GCF on at least 4 of 5 attempts from a 3-line primer, and most score 5/5. No model has ever been trained on GCF. The format didn't exist until we built it, and every model speaks it natively because the structure aligns with how LLMs already process information.
A 500-symbol, 200-edge code graph. Encoded in GCF, TOON, and JSON. 13 structured extraction questions. The model gets the payload and a question. No format instructions. No system prompt. No hints.
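Scoring without an LLM judge can be as simple as an exact match against values computed from the source data. Here is a minimal sketch, assuming hypothetical question keys and a "last integer in the reply" extraction rule. The eval's actual scorer may work differently.

```python
import re

def ground_truth(symbols: list) -> dict:
    """Compute expected answers directly from the source records."""
    return {
        "symbol_count": len(symbols),
        # "target_count" is a hypothetical key; targets are symbols at distance 0.
        "target_count": sum(1 for s in symbols if s["distance"] == 0),
    }

def score(answer: str, expected: int) -> bool:
    """Pass only if the last integer in the reply equals the expected value."""
    numbers = re.findall(r"\d+", answer.replace(",", ""))
    return bool(numbers) and int(numbers[-1]) == expected

# Toy graph: six symbols, two of them targets.
toy = [{"distance": d} for d in (0, 0, 1, 1, 2, 2)]
truth = ground_truth(toy)

print(score("There are 2 targets.", truth["target_count"]))  # True
print(score("", truth["target_count"]))                      # False: empty reply
```

An empty reply contains no integer, so it fails the same way a wrong number does. No judgment call is involved.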
| Model | Runs | GCF avg | TOON avg | JSON avg | GCF margin |
|---|---|---|---|---|---|
| Claude Opus 4.6 | 2 | 96.2% | 84.6% | 73.1% | +11.6 vs TOON |
| Claude Sonnet 4.6 | 2 | 100% | 73.1% | 53.8% | +26.9 vs TOON |
| Claude Haiku 4.5 | 2 | 96.2% | 69.2% | 57.7% | +27.0 vs TOON |
| GPT-5.5 | 5 | 84.1% | 67.7% | 45.8% | +16.4 vs TOON |
| GPT-5.4 | 4 | 76.4% | 56.0% | 44.1% | +20.4 vs TOON |
| GPT-5.4-mini | 2 | 71.8% | 64.1% | 54.2% | +7.7 vs TOON |
| Gemini 2.5 Flash | 3 | 80.6% | 54.6% | 57.0% | +26.0 vs TOON |
| Gemini 2.5 Pro | 1 | 100% | 76.9% | 58.3% | +23.1 vs TOON |
| Gemini 3.1 Pro | 1 | 100% | 76.9% | 46.2% | +23.1 vs TOON |
| Gemini 3.5 Flash | 1 | 100% | 61.5% | 46.2% | +38.5 vs TOON |
GCF beats both TOON and JSON on every model from every provider. No exceptions. TOON beats JSON on every model except Gemini 2.5 Flash (54.6% vs 57.0%). Four models achieve 100% on GCF: Claude Sonnet, Gemini 2.5 Pro, Gemini 3.1 Pro, Gemini 3.5 Flash.
| Format | Tokens | vs JSON |
|---|---|---|
| GCF | 11,090 | 79% fewer |
| TOON | 16,378 | 69% fewer |
| JSON | 53,341 | baseline |
GCF is the cheapest format. It's also the most accurate. Usually you trade cost for quality. Not here.
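The "vs JSON" column is plain arithmetic on the token counts. A short sketch that reproduces it from the table (the function name is ours, for illustration only):

```python
# Token counts copied from the table above.
TOKENS = {"GCF": 11_090, "TOON": 16_378, "JSON": 53_341}

def percent_fewer(tokens: int, baseline: int) -> float:
    """Share of the baseline's tokens that a format avoids, in percent."""
    return (1 - tokens / baseline) * 100

for name, count in TOKENS.items():
    print(f"{name}: {percent_fewer(count, TOKENS['JSON']):.0f}% fewer tokens than JSON")
```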
At 8 symbols, JSON scores 100%. Everything works. At 500 symbols, it falls apart.
GPT-5.5 returns empty strings. Not wrong answers. Nothing. The model receives 53,341 tokens of {"qualifiedName": "...", "kind": "...", "score": ..., "provenance": "...", "distance": ...} repeated 500 times and cannot produce any response. Ask "how many symbols?" and it returns "". The attention mechanism drowns in 2,500 identical field-name tokens.
Claude Opus enumerates 143 symbols by hand. Asked "how many related symbols?" (answer: 167), Opus responds with:
    Let me count precisely by going through the list:
    1. handler.Response.Notify
    2. model.SubscribeConfig
    3. service.PublishOptions
    ...
    143. store.DispatchConfig
    So: 143.

143 lines of output. Wrong answer. This happened on two separate runs with different payloads (143 on run 1, 119 on run 2). One of the most capable models available cannot count JSON objects because the structural noise overwhelms the signal. GCF answers the same question from a five-character header: [167].
Every model fails distance filtering. "How many symbols have distance 0?" requires parsing 500 JSON objects, reading the distance field on each, and counting matches. Correct answer: 166. Opus answers 200 (read the edge count instead). GPT-5.4 answers 300–404. GPT-5.4-mini answers 300.
JSON repeats "qualified_name":, "kind":, "score":, "provenance":, "distance": on every single record. That's 2,500 structurally identical tokens carrying zero semantic content. They exist for human readability. The consumer isn't a human.
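The repetition is easy to measure. The sketch below builds 500 hypothetical records with the five field names quoted above and counts how often the keys appear in the serialized payload. The values are made up, and it counts strings and bytes, not tokens, so treat the share as a rough indication only.

```python
import json

FIELDS = ["qualified_name", "kind", "score", "provenance", "distance"]

def make_record(i: int) -> dict:
    """Hypothetical symbol record; values are placeholders."""
    return {
        "qualified_name": f"pkg.Symbol{i}",
        "kind": "function",
        "score": 0.5,
        "provenance": "lsp_resolved",
        "distance": i % 3,
    }

payload = json.dumps([make_record(i) for i in range(500)])

# Each key is serialized once per record, quoted and followed by a colon.
key_occurrences = sum(payload.count(f'"{name}":') for name in FIELDS)
key_bytes = sum(payload.count(f'"{name}":') * len(f'"{name}":') for name in FIELDS)

print(key_occurrences)  # 2500: 5 fields x 500 records
print(f"{key_bytes / len(payload):.0%} of the payload is field names")
```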
| Failure type | Count | Models | Cause |
|---|---|---|---|
| Empty string response | 33 | GPT-5.5 | 53k tokens of repeated field names overwhelms attention. Model produces nothing. |
| Massive undercount | 9 | Opus/Sonnet, Haiku, GPT-5.4, mini | Field-name repetition dilutes signal. Model loses count mid-scan. |
| Distance filter failure | 29 | Opus/Sonnet, Haiku, GPT-5.4, mini | Must parse JSON objects AND filter by field value. Fails consistently. |
| Field confusion | 3 | GPT-5.4 | Reads edge type instead of symbol kind. |
JSON median error magnitude: 56. GCF median error magnitude: 4.
TOON does better than JSON on counting — it gets symbol_count=500 correct. But it fails on anything that requires filtering by column value.
Distance grouping fails on every model. "How many targets (distance 0)?" requires scanning 500 TOON rows and filtering by the last column. Correct answer: 166.
The answers are wildly inconsistent across runs. The models aren't wrong in a systematic way — they're guessing. TOON has no section headers for distance groups. The only way to answer "how many targets?" is to scan every row and count. At 500 rows, models give up and guess round numbers.
Attention decays by row 500. "What kind is the last symbol?" should be trivial. With TOON, multiple models answer "method" instead of "interface". By the time the model reaches row 500 of a flat table, attention has diluted to noise.
| Failure type | Count | Models | Cause |
|---|---|---|---|
| Distance grouping failure | 25 | Opus/Sonnet, Haiku, GPT-5.4, mini | Must scan 500 rows and filter by distance column. Wildly inconsistent answers. |
| Round-number guessing | 7 | Haiku, mini | Model gives up counting and guesses "100". |
| Attention decay (last row) | 5 | Opus/Sonnet, Haiku, GPT-5.4 | last_symbol_kind wrong. Loses track at row 500. |
| Empty response | 20 | GPT-5.5 | Context overwhelm. Same as JSON. |
TOON median error magnitude: 53.
GCF answers are structural, not computational.
"How many symbols?" Read the header: symbols=500. Done.
"How many edges?" Read the section header: ## edges [200]. Done.
"How many targets?" Count lines in ## targets. The section boundary gives the grouping for free. No column filtering. No scanning 500 rows.
"What kind is the last symbol?" The last line in ## extended is the last symbol. The model reads the last line of the last section. No attention decay across 500 flat rows.
One design decision creates this gap: hierarchical sections vs flat tabular. GCF groups data by category. TOON and JSON present flat lists and force the model to compute groupings from raw values. At scale, that computation fails.
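The difference between reading a count and computing one can be sketched in a few lines. The payloads below are toy data. The section syntax follows the shapes shown in this article (`## name [count]`), but the exact GCF grammar is an assumption here, and the flat rows only imitate a tabular layout.

```python
# Toy payload in the sectioned style; the grammar is assumed, not official.
SECTIONED = """\
## targets [2]
@0 fn pkg.HandleRequest 0.95 lsp_resolved
@1 fn pkg.Auth 0.78 lsp_resolved
## related [1]
@2 type pkg.ProcessResponse 0.74 ast_inferred
"""

# The same symbols as flat rows, with distance as the last column.
FLAT = """\
pkg.HandleRequest,function,0.95,lsp_resolved,0
pkg.Auth,function,0.78,lsp_resolved,0
pkg.ProcessResponse,type,0.74,ast_inferred,1
"""

def count_from_header(text: str, section: str) -> int:
    """Structural: read the count straight from the section header."""
    for line in text.splitlines():
        if line.startswith(f"## {section} ["):
            return int(line[line.index("[") + 1 : line.index("]")])
    raise KeyError(section)

def count_by_filter(text: str, distance: int) -> int:
    """Computational: scan every row and compare the last column."""
    return sum(1 for row in text.splitlines() if row.rsplit(",", 1)[1] == str(distance))

print(count_from_header(SECTIONED, "targets"))  # 2
print(count_by_filter(FLAT, 0))                 # 2
```

Both calls return the same number. The first touches one line. The second touches every row, which is the work a model has to do in its head at 500 rows.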
| Failure type | Count | Models | Cause |
|---|---|---|---|
| Off-by-1–2 header misread | 5 | Haiku, GPT-5.4, mini | Header says [167], model reads 166. Tokenization artifact. |
| Column scan miscount | 10 | GPT-5.4, mini | Must scan fn kind across rows. Deterministic: function_count=84 every run. |
| Field confusion | 2 | GPT-5.4, mini | Read symbol count instead of edge count. |
| Empty response | 10 | GPT-5.5 | Context overwhelm at 53k+ input tokens (JSON payload size). |
GCF median error magnitude: 4. GCF failures on Claude are near-zero. GCF failures on OpenAI are deterministic and repeatable — same wrong number every run — suggesting a tokenizer-level parsing difference, not a comprehension failure.
We asked every model to produce structured output in each format. 3-line primer in the prompt. Output validated through the real decoder. No hand-holding.
| Model | GCF | TOON (natural) | JSON |
|---|---|---|---|
| Claude Opus 4.6 | 5/5 | 0/5 | 5/5 |
| Claude Sonnet 4.6 | 5/5 | 2–3/5 | 5/5 |
| Claude Haiku 4.5 | 5/5 | 1–3/5 | 5/5 |
| GPT-5.5 | 4–5/5 | 1–2/5 | 5/5 |
| GPT-5.4 | 5/5 | 0/5 | 5/5 |
| GPT-5.4-mini | 5/5 | 0/5 | 5/5 |
| Gemini 2.5 Pro | 5/5 | 1/5 | 5/5 |
| Gemini 3.1 Pro | 5/5 | 0/5 | 5/5 |
| Gemini 3.1 Flash Lite | 4–5/5 | 0/5 | 4/5 |
| Gemini 3.5 Flash | 3/5 | 1/5 | 3/5 |
| Gemini 2.5 Flash | 2–3/5 | 0–4/5 | 0–3/5 |
No model has ever been trained on GCF. It didn't exist before we built it. Yet every frontier model (Opus, Sonnet, GPT-5.5, Gemini 2.5 Pro, Gemini 3.1 Pro) produces valid, decoder-parseable output on first exposure with a 3-line primer.
TOON has been published for months. It has documentation, examples, a playground, SDK implementations. And Claude Opus scores 0/5. Gemini 3.1 Pro scores 0/5. GPT-5.4 scores 0/5.
Every TOON generation failure produces the same error:
INVALID: symbols: index 0: distance: toon: cannot assign string to int
The model writes:
symbols[5]{name,kind,score,provenance,distance}:
pkg/api.HandleRequest,function,0.95,lsp_resolved,target
TOON expects:
symbols[5]{name,kind,score,provenance,distance}:
pkg/api.HandleRequest,function,0.95,lsp_resolved,0
The model is told "this symbol is a target." It writes target. TOON's decoder rejects it because it expects the integer 0. The model would need to know, unprompted, that "target" maps to 0, "related" maps to 1, "extended" maps to 2. No model does this.
This isn't a training problem. This is a design flaw. TOON's flat tabular format encodes semantic categories as integers. The model has to perform a mapping step that has no structural cue in the format itself. When does a column value need to be an integer? When is a string acceptable? TOON gives no signal. The model guesses wrong.
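To make the failure mode concrete, here is a hypothetical strict decoder for the header shown above. It is not TOON's decoder and shares no code with it. It only enforces an integer distance column, which is enough to reproduce the same kind of rejection.

```python
# Column names come from the header above; the types are our assumption.
COLUMNS = [("name", str), ("kind", str), ("score", float),
           ("provenance", str), ("distance", int)]

def decode_row(row: str) -> dict:
    """Parse one comma-separated row, raising ValueError on a type mismatch."""
    cells = row.split(",")
    if len(cells) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} cells, got {len(cells)}")
    record = {}
    for (column, kind), cell in zip(COLUMNS, cells):
        try:
            record[column] = kind(cell)
        except ValueError:
            raise ValueError(f"{column}: cannot assign string to {kind.__name__}") from None
    return record

print(decode_row("pkg/api.HandleRequest,function,0.95,lsp_resolved,0"))
try:
    decode_row("pkg/api.HandleRequest,function,0.95,lsp_resolved,target")
except ValueError as err:
    print("INVALID:", err)
```

Nothing in the row itself says that the last cell must be a number. The type lives in the decoder, where the model cannot see it.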
GCF expresses distance through section placement:
    ## targets
    @0 fn pkg.HandleRequest 0.95 lsp_resolved
    ## related
    @1 type pkg.ProcessResponse 0.74 ast_inferred
    ## extended
    @2 method pkg.ValidateConfig 0.52 structural
The model is told "this symbol is a target." It writes it in ## targets. No integer mapping. No encoding step. The format aligns with how the model naturally expresses grouped data. Sections are categories. That's how markdown works. That's how every model already thinks.
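The writer side is just as small. A sketch, assuming symbols arrive with natural-language labels and that IDs are assigned in output order. Both are our assumptions for illustration, and this is not the reference encoder.

```python
# Natural label -> section name. No integer encoding anywhere.
SECTION_FOR_LABEL = {"target": "targets", "related": "related", "extended": "extended"}

symbols = [
    {"label": "target", "kind": "fn", "name": "pkg.HandleRequest",
     "score": 0.95, "provenance": "lsp_resolved"},
    {"label": "related", "kind": "type", "name": "pkg.ProcessResponse",
     "score": 0.74, "provenance": "ast_inferred"},
    {"label": "extended", "kind": "method", "name": "pkg.ValidateConfig",
     "score": 0.52, "provenance": "structural"},
]

def render_sections(items: list) -> str:
    """Group symbols under a section per label and number them in output order."""
    lines = []
    next_id = 0
    for label, section in SECTION_FOR_LABEL.items():
        members = [s for s in items if s["label"] == label]
        if not members:
            continue
        lines.append(f"## {section}")
        for s in members:
            lines.append(f"@{next_id} {s['kind']} {s['name']} {s['score']} {s['provenance']}")
            next_id += 1
    return "\n".join(lines)

print(render_sections(symbols))
```

The label picks the section. There is no lookup table for the writer to remember.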
When we explicitly pre-encode distances as integers in the prompt ("distance 0" instead of "target"), TOON passes. But this means the caller must know TOON's internal encoding and pre-process every field before the model can write valid output.
| Format | Prompt style | Valid | 100 sym output |
|---|---|---|---|
| GCF | natural labels | 5/5 | 5,984 B |
| TOON | hand-held (integers) | 5/5 | 8,336 B |
| TOON | natural labels | 0/5 | invalid |
| JSON | natural labels | 5/5 | 16,121 B |
GCF works with natural language. TOON requires a preprocessing step. And even with that step, GCF output is 28% smaller.
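The preprocessing step looks roughly like this. The label-to-integer mapping is the one stated above (target 0, related 1, extended 2). The prompt wording and function name are hypothetical.

```python
# Mapping stated in the text: target -> 0, related -> 1, extended -> 2.
DISTANCE_FOR_LABEL = {"target": 0, "related": 1, "extended": 2}

def prompt_line(name: str, label: str) -> str:
    """Describe a symbol with the integer distance instead of the natural label."""
    return f"{name} has distance {DISTANCE_FOR_LABEL[label]}"

print(prompt_line("pkg/api.HandleRequest", "target"))
# pkg/api.HandleRequest has distance 0
```

The caller has to carry this table, and apply it to every field of this kind, before the model ever sees the data.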
No model has seen GCF in training, and yet the frontier models write valid GCF on first exposure.
This happens because GCF is aligned with patterns LLMs already understand:
## section_name is a markdown header. Every model knows this.
@0 fn pkg.Auth 0.78 lsp_resolved is positional. One token per field. No ambiguity.
@1<@0 calls is 4 tokens. Self-contained. No nested objects.
The format was designed for the machine's native expression patterns. TOON and JSON were both designed for human readability. Neither was designed for the reader that's actually doing the work.
We forked TOON's benchmark repository, added a GCF formatter, and ran their datasets with their tokenizer and their methodology.
| Dataset | GCF tokens | TOON tokens | Result |
|---|---|---|---|
| Semi-uniform event logs | 108,158 | 154,032 | GCF 30% smaller |
| E-commerce orders | 61,593 | 73,246 | GCF 16% smaller |
| Deeply nested config | 616 | 618 | GCF 0.3% smaller |
| Employee records | 49,055 | 49,966 | GCF 2% smaller |
| Analytics time-series | 8,398 | 9,127 | GCF 8% smaller |
| GitHub repos | 8,576 | 8,744 | GCF 2% smaller |
TOON's home turf. TOON's datasets. TOON's methodology. GCF wins every single one.
Even on flat tabular employee records, the dataset TOON was literally designed for, GCF is smaller. On semi-uniform data where structures vary, the gap blows open: GCF is 30% smaller, which is the same as saying TOON uses 42% more tokens.
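Which percentage you quote depends on the denominator. A quick sketch using token counts from the table computes both readings: how much smaller GCF is relative to TOON, and how much larger TOON is relative to GCF.

```python
# (GCF tokens, TOON tokens), copied from the table above.
DATASETS = {
    "Semi-uniform event logs": (108_158, 154_032),
    "E-commerce orders": (61_593, 73_246),
    "Employee records": (49_055, 49_966),
}

def gcf_smaller_pct(gcf: int, toon: int) -> float:
    """Reduction relative to TOON: the share of TOON's tokens that GCF avoids."""
    return (toon - gcf) / toon * 100

def toon_larger_pct(gcf: int, toon: int) -> float:
    """Overhead relative to GCF: how many more tokens TOON spends."""
    return (toon - gcf) / gcf * 100

for name, (gcf, toon) in DATASETS.items():
    print(f"{name}: GCF {gcf_smaller_pct(gcf, toon):.0f}% smaller, "
          f"TOON {toon_larger_pct(gcf, toon):.0f}% larger")
```

Both figures describe the same gap. "Smaller" should divide by the larger number.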
GCF has a feature no other format supports: session statefulness. Symbols seen in prior tool calls are referenced by ID instead of re-serialized.
First call: full payload. Second call: only new symbols, plus @ref IDs for previously-seen ones. By the 5th call in a conversation: 92.7% token savings.
TOON and JSON re-serialize everything on every call. There is no mechanism for cross-call deduplication. Every tool response pays full price regardless of what the model already knows.
This is where GCF's advantage compounds over a session. The per-call savings (32–79% vs JSON) multiply across 5–10 tool calls in a typical agent interaction.
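The idea behind cross-call deduplication fits in a small class. This is a sketch of the concept only: the `Session` class, the `@ref` line syntax, and the symbol bodies are assumptions for illustration, not the library's API or wire format.

```python
class Session:
    """Tracks which symbol IDs the model has already received."""

    def __init__(self):
        self.seen = set()

    def encode(self, symbols: dict) -> list:
        """Return full lines for new symbols and short references for known ones."""
        lines = []
        for symbol_id, body in symbols.items():
            if symbol_id in self.seen:
                lines.append(f"@ref {symbol_id}")  # hypothetical reference syntax
            else:
                lines.append(f"{symbol_id} {body}")
                self.seen.add(symbol_id)
        return lines

session = Session()
first = session.encode({"@0": "fn pkg.Auth 0.78 lsp_resolved"})
second = session.encode({
    "@0": "fn pkg.Auth 0.78 lsp_resolved",
    "@1": "type pkg.ProcessResponse 0.74 ast_inferred",
})
print(first)   # full line for @0
print(second)  # reference for @0, full line for @1
```

A stateless encoder would emit the full `@0` line both times. The savings grow with every call that revisits known symbols.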
JSON was designed for humans to read. TOON was designed as a compromise between humans and machines. GCF was designed for the machine. The consumer of your API responses is not a human. Optimize for the reader that matters.
The eval is open source. Every result is committed. Every log file is in the repository.
    git clone https://github.com/blackwell-systems/gcf-go
    cd gcf-go/eval

    # Comprehension (any provider)
    GOWORK=off go test -run TestComprehension -v -timeout 0
    EVAL_BACKEND=openai OPENAI_API_KEY=... EVAL_MODEL=gpt-5.5 GOWORK=off go test -run TestComprehension -v -timeout 0
    EVAL_BACKEND=google GOOGLE_API_KEY=... EVAL_MODEL=gemini-2.5-flash GOWORK=off go test -run TestComprehension -v -timeout 0

    # Generation (all three formats)
    GOWORK=off go test -run "TestGeneration$|TestGenerationTOON|TestGenerationJSON" -v -timeout 0

    # Token efficiency (TOON's benchmark)
    git clone https://github.com/blackwell-systems/toon.git
    cd toon && git checkout gcf-comparison && cd benchmarks && pnpm install && pnpm benchmark:tokens
Run it yourself. The numbers don't change.