GCF is Better Than JSON

LLM Wire Format Benchmark: Which Format Can AI Actually Read and Write?

Every LLM wire format claims token savings. Nobody proves whether AI models can actually comprehend the format at scale, or produce valid output in it. We did both: 1,300+ LLM evaluations across 10 models from Anthropic, OpenAI, and Google. Deterministic ground truth, no LLM judge, reproducible from one command.

The results are unambiguous. JSON breaks at 500 records. GPT-5.5 returns empty strings: it can't even attempt an answer at 53,000 tokens of repeated field names. Opus spends 143 lines manually enumerating symbols to count them and still gets the wrong answer. The format designed for "human readability" is incomprehensible to the systems actually reading it.

TOON is worse than it looks. Its official decoder rejects LLM-generated output on 7 of 9 models tested. Claude Opus scores 0/5 on TOON generation. GPT-5.4: 0/5. Gemini 3.1 Pro: 0/5. The error is always the same: toon: cannot assign string to int. The model writes "target" in the distance column because that's what it was told. TOON expects 0. The format's flat tabular design forces an encoding step that no model performs unprompted. This is a structural design flaw, not a training problem.

GCF wins both dimensions on every model tested. Four models achieve 100% comprehension: Claude Sonnet, Gemini 2.5 Pro, Gemini 3.1 Pro, and Gemini 3.5 Flash. Every frontier model produces valid GCF in at least 4 of 5 attempts from a 3-line primer, and most score 5/5. No model has ever been trained on GCF. The format didn't exist until we built it, yet frontier models write it on first exposure, which suggests the structure aligns with how LLMs already process information.

1,300+ LLM evaluations
79% fewer input tokens than JSON
22/23 comprehension runs won
5/5 generation on every frontier model

Comprehension: 500 Symbols, 13 Questions, Zero Instructions

A 500-symbol, 200-edge code graph. Encoded in GCF, TOON, and JSON. 13 structured extraction questions. The model gets the payload and a question. No format instructions. No system prompt. No hints.
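Scoring against deterministic ground truth, with no LLM judge, can be sketched as an exact-match comparison. This is a minimal illustration and not the eval's actual harness: the `score` function, the normalization (trim and lowercase), and the question key `edge_count` are assumptions. With 13 questions per run, a score such as 84.6% is consistent with 11 of 13 correct.

```go
package main

import (
	"fmt"
	"strings"
)

// score returns the fraction of questions whose answer matches the ground
// truth exactly after trimming whitespace and lowercasing. An empty or
// missing answer counts as wrong. (Hypothetical helper, not the eval's API.)
func score(answers, truth map[string]string) float64 {
	if len(truth) == 0 {
		return 0
	}
	correct := 0
	for question, want := range truth {
		got := strings.ToLower(strings.TrimSpace(answers[question]))
		if got != "" && got == strings.ToLower(want) {
			correct++
		}
	}
	return float64(correct) / float64(len(truth))
}

func main() {
	truth := map[string]string{
		"symbol_count":     "500",
		"edge_count":       "200", // assumed question key
		"last_symbol_kind": "interface",
	}
	answers := map[string]string{
		"symbol_count":     "500",
		"edge_count":       " 200\n",
		"last_symbol_kind": "method", // wrong: ground truth is "interface"
	}
	fmt.Printf("%.1f%%\n", 100*score(answers, truth))
}
```

Because the comparison is a plain string match, the same answers always produce the same score.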

23 runs. 22 wins. 0 losses.

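| Model | Runs | GCF avg | TOON avg | JSON avg | GCF margin (points) |
|---|---|---|---|---|---|
| Claude Opus 4.6 | 2 | 96.2% | 84.6% | 73.1% | +11.6 vs TOON |
| Claude Sonnet 4.6 | 2 | 100% | 73.1% | 53.8% | +26.9 vs TOON |
| Claude Haiku 4.5 | 2 | 96.2% | 69.2% | 57.7% | +27.0 vs TOON |
| GPT-5.5 | 5 | 84.1% | 67.7% | 45.8% | +16.4 vs TOON |
| GPT-5.4 | 4 | 76.4% | 56.0% | 44.1% | +20.4 vs TOON |
| GPT-5.4-mini | 2 | 71.8% | 64.1% | 54.2% | +7.7 vs TOON |
| Gemini 2.5 Flash | 3 | 80.6% | 54.6% | 57.0% | +26.0 vs TOON |
| Gemini 2.5 Pro | 1 | 100% | 76.9% | 58.3% | +23.1 vs TOON |
| Gemini 3.1 Pro | 1 | 100% | 76.9% | 46.2% | +23.1 vs TOON |
| Gemini 3.5 Flash | 1 | 100% | 61.5% | 46.2% | +38.5 vs TOON |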

GCF scores highest on every model from every provider. No exceptions. TOON beats JSON on every model except Gemini 2.5 Flash (54.6% vs 57.0%). Four models achieve 100%: Claude Sonnet, Gemini 2.5 Pro, Gemini 3.1 Pro, Gemini 3.5 Flash.

Token cost for the same payload

| Format | Tokens | vs JSON |
|---|---|---|
| GCF | 11,090 | 79% fewer |
| TOON | 16,378 | 69% fewer |
| JSON | 53,341 | baseline |

GCF is the cheapest format. It's also the most accurate. Usually you trade cost for quality. Not here.
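The savings column follows directly from the token counts. A short worked example, using a helper name (`percentFewer`) chosen here for illustration:

```go
package main

import "fmt"

// percentFewer reports how many percent fewer tokens a format uses
// than the baseline format for the same payload.
func percentFewer(tokens, baseline int) float64 {
	return 100 * (1 - float64(tokens)/float64(baseline))
}

func main() {
	const jsonTokens = 53341 // JSON baseline from the table
	fmt.Printf("GCF:  %.1f%% fewer\n", percentFewer(11090, jsonTokens))
	fmt.Printf("TOON: %.1f%% fewer\n", percentFewer(16378, jsonTokens))
}
```

This prints 79.2% for GCF and 69.3% for TOON, which round to the 79% and 69% quoted above.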


How JSON Dies at Scale

At 8 symbols, JSON scores 100%. Everything works. At 500 symbols, it falls apart.

GPT-5.5 returns empty strings. Not wrong answers. Nothing. The model receives 53,341 tokens of {"qualifiedName": "...", "kind": "...", "score": ..., "provenance": "...", "distance": ...} repeated 500 times and produces no response. Ask "how many symbols?" and it returns "". The likely cause: attention drowns in 2,500 repeated field names (5 keys on each of 500 records).

Claude Opus enumerates 143 symbols by hand. Asked "how many related symbols?" (answer: 167), Opus responds with:

Let me count precisely by going through the list:

1. handler.Response.Notify
2. model.SubscribeConfig
3. service.PublishOptions
...
143. store.DispatchConfig

So: 143.

143 lines of output tokens. Wrong answer. This happened on two separate runs with different payloads (143 on run 1, 119 on run 2). One of the most capable models available cannot count JSON objects because the structural noise overwhelms the signal. GCF answers the same question from the count in a section header: [167].

Every model fails distance filtering. "How many symbols have distance 0?" requires parsing 500 JSON objects, reading the distance field on each, and counting matches. Correct answer: 166. Opus answers 200 (read the edge count instead). GPT-5.4 answers 300–404. GPT-5.4-mini answers 300.

JSON repeats "qualified_name":, "kind":, "score":, "provenance":, "distance": on every single record. That's 2,500 repeated keys (5 per record across 500 records), each costing one or more tokens and carrying zero new information. They exist for human readability. The consumer isn't a human.
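The repetition is easy to measure. The sketch below marshals 500 synthetic records with Go's `encoding/json` and counts how often the five keys appear. The `Symbol` struct, its tags, and the placeholder values are illustrative assumptions, not the benchmark's actual schema or data.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Symbol mirrors the five fields named in the text. The struct and its
// JSON tags are assumptions made for this example.
type Symbol struct {
	QualifiedName string  `json:"qualified_name"`
	Kind          string  `json:"kind"`
	Score         float64 `json:"score"`
	Provenance    string  `json:"provenance"`
	Distance      int     `json:"distance"`
}

// countKeys marshals n synthetic records and counts every occurrence of
// the five field keys in the resulting JSON.
func countKeys(n int) (int, error) {
	records := make([]Symbol, n)
	for i := range records {
		records[i] = Symbol{
			QualifiedName: fmt.Sprintf("pkg.Sym%d", i),
			Kind:          "fn",
			Score:         0.5,
			Provenance:    "structural",
			Distance:      i % 3,
		}
	}
	data, err := json.Marshal(records)
	if err != nil {
		return 0, err
	}
	total := 0
	for _, key := range []string{"qualified_name", "kind", "score", "provenance", "distance"} {
		total += strings.Count(string(data), `"`+key+`":`)
	}
	return total, nil
}

func main() {
	total, err := countKeys(500)
	if err != nil {
		panic(err)
	}
	fmt.Println("key occurrences:", total) // 5 keys x 500 records
}
```

The count grows linearly with the number of records, while a format that names its columns once pays that cost a single time.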

JSON failure taxonomy

| Failure type | Count | Models | Cause |
|---|---|---|---|
| Empty string response | 33 | GPT-5.5 | 53k tokens of repeated field names overwhelm attention. Model produces nothing. |
| Massive undercount | 9 | Opus/Sonnet, Haiku, GPT-5.4, GPT-5.4-mini | Field-name repetition dilutes signal. Model loses count mid-scan. |
| Distance filter failure | 29 | Opus/Sonnet, Haiku, GPT-5.4, GPT-5.4-mini | Must parse JSON objects AND filter by field value. Fails consistently. |
| Field confusion | 3 | GPT-5.4 | Reads edge type instead of symbol kind. |

JSON median error magnitude: 56. GCF median error magnitude: 4.
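One plausible way to compute such a figure is the median of the absolute differences between each numeric answer and its ground truth. The definition and the `medianAbsError` helper are assumptions here, since the text does not spell out the formula. The inputs are only the wrong JSON answers quoted above, so the result is not expected to reproduce the reported 56.

```go
package main

import (
	"fmt"
	"sort"
)

// medianAbsError returns the median of |answer - truth| over all pairs.
// The two slices are assumed to have the same length.
func medianAbsError(answers, truth []int) float64 {
	errs := make([]int, 0, len(answers))
	for i, a := range answers {
		d := a - truth[i]
		if d < 0 {
			d = -d
		}
		errs = append(errs, d)
	}
	if len(errs) == 0 {
		return 0
	}
	sort.Ints(errs)
	mid := len(errs) / 2
	if len(errs)%2 == 1 {
		return float64(errs[mid])
	}
	return float64(errs[mid-1]+errs[mid]) / 2
}

func main() {
	// Wrong JSON answers quoted in the text: two Opus counts against 167,
	// then three distance-0 counts against 166.
	answers := []int{143, 119, 200, 300, 300}
	truth := []int{167, 167, 166, 166, 166}
	fmt.Println(medianAbsError(answers, truth))
}
```

A median is less sensitive than a mean to a single wildly wrong answer, which matters when some responses are off by hundreds.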


How TOON Fails on Grouping

TOON does better than JSON on counting — it gets symbol_count=500 correct. But it fails on anything that requires filtering by column value.

Distance grouping fails on every model. "How many targets (distance 0)?" requires scanning 500 TOON rows and filtering by the last column. Correct answer: 166.

The answers are wildly inconsistent across runs. The models aren't wrong in a systematic way — they're guessing. TOON has no section headers for distance groups. The only way to answer "how many targets?" is to scan every row and count. At 500 rows, models give up and guess round numbers.

Attention decays by row 500. "What kind is the last symbol?" should be trivial. TOON answers "method" instead of "interface" on multiple models. By the time the model reaches row 500 of a flat table, attention has diluted to noise.

TOON failure taxonomy

| Failure type | Count | Models | Cause |
|---|---|---|---|
| Distance grouping failure | 25 | Opus/Sonnet, Haiku, GPT-5.4, GPT-5.4-mini | Must scan 500 rows and filter by distance column. Wildly inconsistent answers. |
| Round-number guessing | 7 | Haiku, GPT-5.4-mini | Model gives up counting and guesses "100". |
| Attention decay (last row) | 5 | Opus/Sonnet, Haiku, GPT-5.4 | last_symbol_kind wrong. Loses track at row 500. |
| Empty response | 20 | GPT-5.5 | Context overwhelm. Same as JSON. |

TOON median error magnitude: 53.


How GCF Solves Both Problems

GCF answers are structural, not computational.

"How many symbols?" Read the header: symbols=500. Done.

"How many edges?" Read the section header: ## edges [200]. Done.

"How many targets?" Count lines in ## targets. The section boundary gives the grouping for free. No column filtering. No scanning 500 rows.

"What kind is the last symbol?" The last line in ## extended is the last symbol. The model reads the last line of the last section. No attention decay across 500 flat rows.

One design decision creates this gap: hierarchical sections vs flat tabular. GCF groups data by category. TOON and JSON present flat lists and force the model to compute groupings from raw values. At scale, that computation fails.
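The difference between reading a structure and computing over rows can be sketched with a tiny section counter. This is not the gcf-go decoder: `sectionCounts` is a hypothetical helper, and the only syntax it assumes is what the examples in this article show, namely `## name` headers (optionally followed by a `[N]` count) with one record per line.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sectionCounts tallies the non-empty lines under each "## name" header.
// Lines before the first header are ignored.
func sectionCounts(payload string) map[string]int {
	counts := make(map[string]int)
	section := ""
	sc := bufio.NewScanner(strings.NewReader(payload))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch {
		case line == "":
			continue
		case strings.HasPrefix(line, "## "):
			// Keep only the section name; a trailing "[N]" count is ignored.
			fields := strings.Fields(strings.TrimPrefix(line, "## "))
			section = fields[0]
			if _, ok := counts[section]; !ok {
				counts[section] = 0
			}
		case section != "":
			counts[section]++
		}
	}
	return counts
}

func main() {
	payload := `## targets
@0 fn pkg.HandleRequest 0.95 lsp_resolved
## related
@1 type pkg.ProcessResponse 0.74 ast_inferred
## extended
@2 method pkg.ValidateConfig 0.52 structural
`
	counts := sectionCounts(payload)
	fmt.Println(counts["targets"], counts["related"], counts["extended"])
}
```

The grouping is decided by where a line sits, so no column has to be parsed or filtered to answer "how many targets?".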

GCF failure taxonomy (precision errors only)

| Failure type | Count | Models | Cause |
|---|---|---|---|
| Off-by-1–2 header misread | 5 | Haiku, GPT-5.4, GPT-5.4-mini | Header says [167], model reads 166. Tokenization artifact. |
| Column scan miscount | 10 | GPT-5.4, GPT-5.4-mini | Must scan fn kind across rows. Deterministic: function_count=84 every run. |
| Field confusion | 2 | GPT-5.4, GPT-5.4-mini | Read symbol count instead of edge count. |
| Empty response | 10 | GPT-5.5 | Context overwhelm at 53k+ input tokens (JSON payload size). |

GCF median error magnitude: 4. GCF failures on Claude are near-zero. GCF failures on OpenAI are deterministic and repeatable — same wrong number every run — suggesting a tokenizer-level parsing difference, not a comprehension failure.


Generation: TOON is Broken

We asked every model to produce structured output in each format. 3-line primer in the prompt. Output validated through the real decoder. No hand-holding.

9 models. 3 providers. GCF is the only format that works everywhere.

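| Model | GCF | TOON (natural) | JSON |
|---|---|---|---|
| Claude Opus 4.6 | 5/5 | 0/5 | 5/5 |
| Claude Sonnet 4.6 | 5/5 | 2–3/5 | 5/5 |
| Claude Haiku 4.5 | 5/5 | 1–3/5 | 5/5 |
| GPT-5.5 | 4–5/5 | 1–2/5 | 5/5 |
| GPT-5.4 | 5/5 | 0/5 | 5/5 |
| GPT-5.4-mini | 5/5 | 0/5 | 5/5 |
| Gemini 2.5 Pro | 5/5 | 1/5 | 5/5 |
| Gemini 3.1 Pro | 5/5 | 0/5 | 5/5 |
| Gemini 3.1 Flash Lite | 4–5/5 | 0/5 | 4/5 |
| Gemini 3.5 Flash | 3/5 | 1/5 | 3/5 |
| Gemini 2.5 Flash | 2–3/5 | 0–4/5 | 0–3/5 |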

No model has ever been trained on GCF. It didn't exist before we built it. Yet every frontier model (Opus, Sonnet, GPT-5.5, Gemini 2.5 Pro, Gemini 3.1 Pro) produces valid, decoder-parseable output on first exposure with a 3-line primer.

TOON has been published for months. It has documentation, examples, a playground, SDK implementations. And Claude Opus scores 0/5. Gemini 3.1 Pro scores 0/5. GPT-5.4 scores 0/5.

The exact failure

Every TOON generation failure produces the same error:

INVALID: symbols: index 0: distance: toon: cannot assign string to int

The model writes:

symbols[5]{name,kind,score,provenance,distance}:
  pkg/api.HandleRequest,function,0.95,lsp_resolved,target

TOON expects:

symbols[5]{name,kind,score,provenance,distance}:
  pkg/api.HandleRequest,function,0.95,lsp_resolved,0

The model is told "this symbol is a target." It writes target. TOON's decoder rejects it because it expects the integer 0. The model would need to know, unprompted, that "target" maps to 0, "related" maps to 1, "extended" maps to 2. No model does this.

This isn't a training problem. This is a design flaw. TOON's flat tabular format encodes semantic categories as integers. The model has to perform a mapping step that has no structural cue in the format itself. When does a column value need to be an integer? When is a string acceptable? TOON gives no signal. The model guesses wrong.
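The class of error is easy to reproduce with any strictly typed column. The sketch below is an illustration only and not TOON's decoder: `parseDistance` is a hypothetical function that treats the last comma-separated column as an integer and rejects everything else, using Go's `strconv.Atoi`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseDistance mimics a strictly typed integer column: the last
// comma-separated value must be an integer literal or the row is rejected.
func parseDistance(row string) (int, error) {
	cols := strings.Split(row, ",")
	last := strings.TrimSpace(cols[len(cols)-1])
	d, err := strconv.Atoi(last)
	if err != nil {
		return 0, fmt.Errorf("distance: cannot assign %q to int", last)
	}
	return d, nil
}

func main() {
	rows := []string{
		"pkg/api.HandleRequest,function,0.95,lsp_resolved,target", // what the model writes
		"pkg/api.HandleRequest,function,0.95,lsp_resolved,0",      // what the decoder accepts
	}
	for _, row := range rows {
		d, err := parseDistance(row)
		fmt.Println(d, err)
	}
}
```

Nothing in the row itself tells the writer which of the two spellings the column requires. That knowledge lives only in the decoder's schema.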

GCF never has this problem

GCF expresses distance through section placement:

## targets
@0 fn pkg.HandleRequest 0.95 lsp_resolved
## related
@1 type pkg.ProcessResponse 0.74 ast_inferred
## extended
@2 method pkg.ValidateConfig 0.52 structural

The model is told "this symbol is a target." It writes it in ## targets. No integer mapping. No encoding step. The format aligns with how the model naturally expresses grouped data. Sections are categories. That's how markdown works. That's how every model already thinks.
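On the writer side, section placement amounts to a group-by on the label. A minimal sketch, assuming the line layout shown in the example above (`@id kind name score provenance`) and assuming the `@` number is a sequential ID. The `Symbol` type and `encodeSections` are hypothetical names, not the gcf-go API.

```go
package main

import (
	"fmt"
	"strings"
)

// Symbol is an illustrative record; Section holds the natural-language
// group the symbol belongs to ("targets", "related", or "extended").
type Symbol struct {
	Kind, Name, Provenance, Section string
	Score                           float64
}

// sectionOrder fixes the order in which sections are written.
var sectionOrder = []string{"targets", "related", "extended"}

// encodeSections groups symbols by section and writes one header per
// non-empty group. Symbols with an unknown section are skipped.
func encodeSections(symbols []Symbol) string {
	groups := make(map[string][]Symbol)
	for _, s := range symbols {
		groups[s.Section] = append(groups[s.Section], s)
	}
	var b strings.Builder
	id := 0 // assumed: IDs are assigned sequentially across sections
	for _, name := range sectionOrder {
		if len(groups[name]) == 0 {
			continue
		}
		fmt.Fprintf(&b, "## %s\n", name)
		for _, s := range groups[name] {
			fmt.Fprintf(&b, "@%d %s %s %.2f %s\n", id, s.Kind, s.Name, s.Score, s.Provenance)
			id++
		}
	}
	return b.String()
}

func main() {
	fmt.Print(encodeSections([]Symbol{
		{Kind: "type", Name: "pkg.ProcessResponse", Provenance: "ast_inferred", Section: "related", Score: 0.74},
		{Kind: "fn", Name: "pkg.HandleRequest", Provenance: "lsp_resolved", Section: "targets", Score: 0.95},
		{Kind: "method", Name: "pkg.ValidateConfig", Provenance: "structural", Section: "extended", Score: 0.52},
	}))
}
```

The label never has to be translated into a number: it only decides which header a line is written under.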

Even with hand-holding, GCF wins

When we explicitly pre-encode distances as integers in the prompt ("distance 0" instead of "target"), TOON passes. But this means the caller must know TOON's internal encoding and pre-process every field before the model can write valid output.

| Format | Prompt style | Valid | Output size (100 symbols) |
|---|---|---|---|
| GCF | natural labels | 5/5 | 5,984 B |
| TOON | hand-held (integers) | 5/5 | 8,336 B |
| TOON | natural labels | 0/5 | invalid |
| JSON | natural labels | 5/5 | 16,121 B |

GCF works with natural language. TOON requires a preprocessing step. And even with that step, GCF output is 28% smaller.
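The preprocessing step that the hand-held prompt relies on looks roughly like this. The mapping (target to 0, related to 1, extended to 2) comes from the text. The function names and the row layout are assumptions for illustration, not part of any TOON SDK.

```go
package main

import "fmt"

// distanceCode is the label-to-integer mapping the caller must know
// before a model can be prompted with pre-encoded distances.
var distanceCode = map[string]int{"target": 0, "related": 1, "extended": 2}

// encodeDistance converts a natural-language label to its integer code.
func encodeDistance(label string) (int, error) {
	code, ok := distanceCode[label]
	if !ok {
		return 0, fmt.Errorf("unknown distance label %q", label)
	}
	return code, nil
}

// toonRow builds one comma-separated row with the distance pre-encoded.
func toonRow(name, kind string, score float64, provenance, label string) (string, error) {
	code, err := encodeDistance(label)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%s,%s,%g,%s,%d", name, kind, score, provenance, code), nil
}

func main() {
	row, err := toonRow("pkg/api.HandleRequest", "function", 0.95, "lsp_resolved", "target")
	fmt.Println(row, err)
}
```

Every caller has to carry this table and apply it to each field with a typed column before the model sees the data.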


GCF Works Without Training

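No model has seen GCF in training. And yet every frontier model writes valid, decoder-parseable GCF from a 3-line primer, and four models answer all 13 comprehension questions correctly with no format instructions.

This happens because GCF is aligned with patterns LLMs already understand: section headers that group records by category, as in markdown, and counts stated once in a header instead of implied by hundreds of rows.

The format was designed for the machine's native expression patterns. JSON was designed for human readability. TOON was designed as a compromise between humans and machines. Neither format was designed for the reader that's actually doing the work.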


TOON's Own Benchmark: GCF Wins All 6 Datasets

We forked TOON's benchmark repository, added a GCF formatter, and ran their datasets with their tokenizer and their methodology.

| Dataset | GCF tokens | TOON tokens | Result |
|---|---|---|---|
| Semi-uniform event logs | 108,158 | 154,032 | GCF 30% smaller |
| E-commerce orders | 61,593 | 73,246 | GCF 16% smaller |
| Deeply nested config | 616 | 618 | GCF 0.3% smaller |
| Employee records | 49,055 | 49,966 | GCF 2% smaller |
| Analytics time-series | 8,398 | 9,127 | GCF 8% smaller |
| GitHub repos | 8,576 | 8,744 | GCF 2% smaller |

TOON's home turf. TOON's datasets. TOON's methodology. GCF wins every single one.

Even on flat tabular employee records, the dataset TOON was literally designed for, GCF is smaller. On semi-uniform data where structures vary, the gap widens: GCF uses 30% fewer tokens than TOON, or equivalently TOON uses 42% more than GCF.
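The token counts can be compared in two directions, and the two percentages are not interchangeable: "GCF is X% smaller" divides the gap by the TOON count, while "TOON is Y% larger" divides it by the GCF count. A short worked example over the six datasets, with helper names chosen here for illustration:

```go
package main

import "fmt"

// percentSmaller: how much smaller the small count is, relative to the large one.
func percentSmaller(small, large int) float64 {
	return 100 * float64(large-small) / float64(large)
}

// percentLarger: how much larger the large count is, relative to the small one.
func percentLarger(large, small int) float64 {
	return 100 * float64(large-small) / float64(small)
}

func main() {
	datasets := []struct {
		name      string
		gcf, toon int
	}{
		{"Semi-uniform event logs", 108158, 154032},
		{"E-commerce orders", 61593, 73246},
		{"Deeply nested config", 616, 618},
		{"Employee records", 49055, 49966},
		{"Analytics time-series", 8398, 9127},
		{"GitHub repos", 8576, 8744},
	}
	for _, d := range datasets {
		fmt.Printf("%-24s GCF %.1f%% smaller, TOON %.1f%% larger\n",
			d.name, percentSmaller(d.gcf, d.toon), percentLarger(d.toon, d.gcf))
	}
}
```

For the event logs this gives roughly 29.8% smaller and 42.4% larger. The two figures converge as the gap shrinks, which is why the smaller datasets read the same either way.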


Session Statefulness: The Compounding Advantage

GCF has a feature no other format supports: session statefulness. Symbols seen in prior tool calls are referenced by ID instead of re-serialized.

First call: full payload. Second call: only new symbols, plus @ref IDs for previously-seen ones. By the 5th call in a conversation: 92.7% token savings.

TOON and JSON re-serialize everything on every call. There is no mechanism for cross-call deduplication. Every tool response pays full price regardless of what the model already knows.

This is where GCF's advantage compounds over a session. The per-call savings (32–79% vs JSON) multiply across 5–10 tool calls in a typical agent interaction.
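The mechanism can be sketched as an encoder that remembers which symbols it has already sent. This is a simplified illustration and not the gcf-go implementation: the `Session` type, its `Encode` method, and the bare `@id` reference syntax are all assumptions, since the text only says that previously seen symbols are referenced by ID.

```go
package main

import (
	"fmt"
	"strings"
)

// Symbol is an illustrative record.
type Symbol struct {
	Kind, Name, Provenance string
	Score                  float64
}

// Session remembers the ID assigned to each symbol on first sight.
type Session struct {
	ids map[string]int
}

func NewSession() *Session { return &Session{ids: make(map[string]int)} }

// Encode writes a full line for symbols not seen before in this session
// and a bare "@id" reference (assumed syntax) for symbols already sent.
func (s *Session) Encode(symbols []Symbol) string {
	var b strings.Builder
	for _, sym := range symbols {
		if id, seen := s.ids[sym.Name]; seen {
			fmt.Fprintf(&b, "@%d\n", id)
			continue
		}
		id := len(s.ids)
		s.ids[sym.Name] = id
		fmt.Fprintf(&b, "@%d %s %s %.2f %s\n", id, sym.Kind, sym.Name, sym.Score, sym.Provenance)
	}
	return b.String()
}

func main() {
	session := NewSession()
	handle := Symbol{Kind: "fn", Name: "pkg.HandleRequest", Provenance: "lsp_resolved", Score: 0.95}
	validate := Symbol{Kind: "method", Name: "pkg.ValidateConfig", Provenance: "structural", Score: 0.52}

	first := session.Encode([]Symbol{handle})
	second := session.Encode([]Symbol{handle, validate})
	fmt.Printf("call 1 (%d bytes):\n%s", len(first), first)
	fmt.Printf("call 2 (%d bytes):\n%s", len(second), second)
}
```

The more a later call overlaps with earlier ones, the more of its payload collapses to short references, which is where the per-session savings come from.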

JSON was designed for humans to read. TOON was designed as a compromise between humans and machines. GCF was designed for the machine. The consumer of your API responses is not a human. Optimize for the reader that matters.


Reproduce Everything

The eval is open source. Every result is committed. Every log file is in the repository.

git clone https://github.com/blackwell-systems/gcf-go
cd gcf-go/eval

# Comprehension (any provider)
GOWORK=off go test -run TestComprehension -v -timeout 0
EVAL_BACKEND=openai OPENAI_API_KEY=... EVAL_MODEL=gpt-5.5 GOWORK=off go test -run TestComprehension -v -timeout 0
EVAL_BACKEND=google GOOGLE_API_KEY=... EVAL_MODEL=gemini-2.5-flash GOWORK=off go test -run TestComprehension -v -timeout 0

# Generation (all three formats)
GOWORK=off go test -run "TestGeneration$|TestGenerationTOON|TestGenerationJSON" -v -timeout 0

# Token efficiency (TOON's benchmark)
git clone https://github.com/blackwell-systems/toon.git
cd toon && git checkout gcf-comparison && cd benchmarks && pnpm install && pnpm benchmark:tokens

Run it yourself. The numbers don't change.