GCF is Better Than JSON

LLM Wire Format Benchmark: Which Format Can AI Actually Read and Write?

Every LLM wire format claims token savings. Nobody proves whether AI models can actually comprehend the format at scale, or produce valid output in it. We did both: 1,300+ LLM evaluations across 10 models from Anthropic, OpenAI, and Google. Deterministic ground truth, no LLM judge, reproducible from one command.

The results are unambiguous. JSON breaks at 500 records. GPT-5.5 returns empty strings: it can't even attempt an answer at 53,000 tokens of repeated field names. Opus spends 143 lines manually enumerating symbols to count them and still gets the wrong answer. The format designed for "human readability" is incomprehensible to the systems actually reading it.

TOON is worse than it looks. Its official decoder rejects LLM-generated output on 7 of 9 models tested. Claude Opus scores 0/5 on TOON generation. GPT-5.4: 0/5. Gemini 3.1 Pro: 0/5. The error is always the same: toon: cannot assign string to int. The model writes "target" in the distance column because that's what it was told. TOON expects 0. The format's flat tabular design forces an encoding step that no model performs unprompted. This is a structural design flaw, not a training problem.

GCF leads on comprehension for every model tested and produces valid output on every frontier model. Four models achieve 100% comprehension: Claude Sonnet, Gemini 2.5 Pro, Gemini 3.1 Pro, and Gemini 3.5 Flash. Frontier models produce valid GCF at 5/5 from a 3-line primer (GPT-5.5 at 4–5/5). No model has ever been trained on GCF. The format didn't exist until we built it, and the models handle it anyway, which suggests the structure aligns with how LLMs already process information.

1,300+ LLM evaluations
79% fewer input tokens than JSON
22/23 comprehension runs won
5/5 generation on every frontier model

Comprehension: 500 Symbols, 13 Questions, Zero Instructions

A 500-symbol, 200-edge code graph. Encoded in GCF, TOON, and JSON. 13 structured extraction questions. The model gets the payload and a question. No format instructions. No system prompt. No hints.

23 runs. 22 wins. 0 losses.

Model	Runs	GCF avg	TOON avg	JSON avg	GCF margin
Claude Opus 4.6	2	96.2%	84.6%	73.1%	+11.6 vs TOON
Claude Sonnet 4.6	2	100%	73.1%	53.8%	+26.9 vs TOON
Claude Haiku 4.5	2	96.2%	69.2%	57.7%	+27.0 vs TOON
GPT-5.5	5	84.1%	67.7%	45.8%	+16.4 vs TOON
GPT-5.4	4	76.4%	56.0%	44.1%	+20.4 vs TOON
GPT-5.4-mini	2	71.8%	64.1%	54.2%	+7.7 vs TOON
Gemini 2.5 Flash	3	80.6%	54.6%	57.0%	+26.0 vs TOON
Gemini 2.5 Pro	1	100%	76.9%	58.3%	+23.1 vs TOON
Gemini 3.1 Pro	1	100%	76.9%	46.2%	+23.1 vs TOON
Gemini 3.5 Flash	1	100%	61.5%	46.2%	+38.5 vs TOON

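GCF scores highest on every model from every provider. No exceptions. TOON beats JSON on nine of the ten models; on Gemini 2.5 Flash, JSON edges out TOON (57.0% vs 54.6%). Four models achieve 100% on GCF: Claude Sonnet, Gemini 2.5 Pro, Gemini 3.1 Pro, Gemini 3.5 Flash.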
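The averages in the table are consistent with exact-match scoring over 13 questions per run (for example, 25 correct out of 26 across two runs rounds to 96.2%). A minimal sketch of such a deterministic scorer follows; the exact-match rule, the normalization, and the function names are assumptions for illustration, not the eval's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// normalize trims whitespace and lowercases an answer before comparison.
// This normalization is an assumption; the real eval may differ.
func normalize(s string) string {
	return strings.ToLower(strings.TrimSpace(s))
}

// score returns the fraction of answers that exactly match the ground truth.
// A missing or empty answer never matches, so it counts as wrong.
func score(answers, truth []string) float64 {
	if len(truth) == 0 {
		return 0
	}
	correct := 0
	for i, want := range truth {
		if i < len(answers) && normalize(answers[i]) == normalize(want) {
			correct++
		}
	}
	return float64(correct) / float64(len(truth))
}

func main() {
	// Ground-truth values quoted in this article: 500 symbols, 200 edges,
	// 166 targets, and "interface" as the kind of the last symbol.
	truth := []string{"500", "200", "166", "interface"}
	answers := []string{"500", " 200 ", "107", "Interface"}
	fmt.Printf("%.1f%%\n", 100*score(answers, truth)) // prints 75.0%
}
```

Because the comparison is a plain string match against known values, no LLM judge is needed and an empty response simply scores zero for that question.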

Token cost for the same payload

Format	Tokens	vs JSON
GCF	11,090	79% fewer
TOON	16,378	69% fewer
JSON	53,341	baseline

GCF is the cheapest format. It's also the most accurate. Usually you trade cost for quality. Not here.

How JSON Dies at Scale

At 8 symbols, JSON scores 100%. Everything works. At 500 symbols, it falls apart.

GPT-5.5 returns empty strings. Not wrong answers. Nothing. The model receives 53,341 tokens of {"qualifiedName": "...", "kind": "...", "score": ..., "provenance": "...", "distance": ...} repeated 500 times and cannot produce any response. Ask "how many symbols?" and it returns "". The attention mechanism drowns in 2,500 identical field-name tokens.

Claude Opus enumerates 143 symbols by hand. Asked "how many related symbols?" (answer: 167), Opus responds with:

Let me count precisely by going through the list:

1. handler.Response.Notify
2. model.SubscribeConfig
3. service.PublishOptions
...
143. store.DispatchConfig

So: 143.

143 lines of output tokens. Wrong answer. This happened on two separate runs with different payloads (143 on run 1, 119 on run 2). The most capable model in the world cannot count JSON objects because the structural noise overwhelms the signal. GCF answers the same question from a three-digit count in the section header: [167].

Every model fails distance filtering. "How many symbols have distance 0?" requires parsing 500 JSON objects, reading the distance field on each, and counting matches. Correct answer: 166. Opus answers 200 (it read the edge count instead). GPT-5.4 answers 300–404. GPT-5.4-mini answers 300.

JSON repeats "qualified_name":, "kind":, "score":, "provenance":, "distance": on every single record. That's 2,500 structurally identical tokens carrying zero semantic content. They exist for human readability. The consumer isn't a human.
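The 2,500 figure is simply five keys times 500 records. The sketch below makes that concrete by marshaling 500 records with Go's encoding/json and counting key occurrences. The `Symbol` type and the generated values are hypothetical stand-ins; the key names follow the snake_case spelling quoted above. It counts key occurrences, not tokens, since the token cost of each key depends on the tokenizer.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Symbol is a hypothetical record with the five fields named in the article.
type Symbol struct {
	QualifiedName string  `json:"qualified_name"`
	Kind          string  `json:"kind"`
	Score         float64 `json:"score"`
	Provenance    string  `json:"provenance"`
	Distance      int     `json:"distance"`
}

// countKeys marshals n records and counts how often each key is written out.
func countKeys(n int) (int, error) {
	records := make([]Symbol, n)
	for i := range records {
		records[i] = Symbol{
			QualifiedName: fmt.Sprintf("pkg.Symbol%d", i),
			Kind:          "function",
			Score:         0.5,
			Provenance:    "lsp_resolved",
			Distance:      i % 3,
		}
	}
	data, err := json.Marshal(records)
	if err != nil {
		return 0, err
	}
	total := 0
	for _, key := range []string{"qualified_name", "kind", "score", "provenance", "distance"} {
		// Each key appears in the output as "key": exactly once per record.
		total += strings.Count(string(data), `"`+key+`":`)
	}
	return total, nil
}

func main() {
	total, err := countKeys(500)
	if err != nil {
		panic(err)
	}
	fmt.Println(total) // prints 2500
}
```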

JSON failure taxonomy

Failure type	Count	Models	Cause
Empty string response	33	GPT-5.5	53k tokens of repeated field names overwhelms attention. Model produces nothing.
Massive undercount	9	Opus/Sonnet, Haiku, GPT-5.4, mini	Field-name repetition dilutes signal. Model loses count mid-scan.
Distance filter failure	29	Opus/Sonnet, Haiku, GPT-5.4, mini	Must parse JSON objects AND filter by field value. Fails consistently.
Field confusion	3	GPT-5.4	Reads edge type instead of symbol kind.

JSON median error magnitude: 56. GCF median error magnitude: 4.

How TOON Fails on Grouping

TOON does better than JSON on counting — it gets symbol_count=500 correct. But it fails on anything that requires filtering by column value.

Distance grouping fails on every model. "How many targets (distance 0)?" requires scanning 500 TOON rows and filtering by the last column. Correct answer: 166.

Opus: 107
Haiku: 100, 200, 214
GPT-5.4: 169, 229, 200
GPT-5.4-mini: 26, 28

The answers are wildly inconsistent across runs. The models aren't wrong in a systematic way; they're guessing. TOON has no section headers for distance groups. The only way to answer "how many targets?" is to scan every row and count. At 500 rows, some models give up and guess round numbers such as 100 or 200.

Attention decays by row 500. "What kind is the last symbol?" should be trivial. Given the TOON payload, multiple models answer "method" instead of "interface". By the time the model reaches row 500 of a flat table, attention has diluted to noise.

TOON failure taxonomy

Failure type	Count	Models	Cause
Distance grouping failure	25	Opus/Sonnet, Haiku, GPT-5.4, mini	Must scan 500 rows and filter by distance column. Wildly inconsistent answers.
Round-number guessing	7	Haiku, mini	Model gives up counting and guesses "100".
Attention decay (last row)	5	Opus/Sonnet, Haiku, GPT-5.4	last_symbol_kind wrong. Loses track at row 500.
Empty response	20	GPT-5.5	Context overwhelm. Same as JSON.

TOON median error magnitude: 53.

How GCF Solves Both Problems

GCF answers are structural, not computational.

"How many symbols?" Read the header: symbols=500. Done.

"How many edges?" Read the section header: ## edges [200]. Done.

"How many targets?" Count lines in ## targets. The section boundary gives the grouping for free. No column filtering. No scanning 500 rows.

"What kind is the last symbol?" The last line in ## extended is the last symbol. The model reads the last line of the last section. No attention decay across 500 flat rows.

One design decision creates this gap: hierarchical sections vs flat tabular. GCF groups data by category. TOON and JSON present flat lists and force the model to compute groupings from raw values. At scale, that computation fails.
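The difference between a structural answer and a computed one can be sketched in a few lines. In the example below, `sectionCount` reads a count straight from a header of the form `## name [N]` (the syntax shown in this article), while `countByLastColumn` has to visit every flat row and compare a column. Both function names and the sample payloads are hypothetical, and attaching a count to every section header is an assumption; this is not the gcf-go decoder.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// headerRe matches section headers such as "## edges [200]".
var headerRe = regexp.MustCompile(`(?m)^## (\w+) \[(\d+)\]$`)

// sectionCount returns the count declared in the named section's header.
// It never looks at the rows under the header.
func sectionCount(payload, section string) (int, bool) {
	for _, m := range headerRe.FindAllStringSubmatch(payload, -1) {
		if m[1] == section {
			n, err := strconv.Atoi(m[2])
			return n, err == nil
		}
	}
	return 0, false
}

// countByLastColumn scans flat comma-separated rows and counts those whose
// last column equals want. The cost grows with the number of rows.
func countByLastColumn(rows, want string) int {
	n := 0
	for _, line := range strings.Split(strings.TrimSpace(rows), "\n") {
		cols := strings.Split(strings.TrimSpace(line), ",")
		if cols[len(cols)-1] == want {
			n++
		}
	}
	return n
}

func main() {
	gcf := "## targets [166]\n## related [167]\n## edges [200]\n"
	if n, ok := sectionCount(gcf, "edges"); ok {
		fmt.Println("edges from header:", n)
	}
	flat := "pkg.A,function,0.95,lsp_resolved,0\n" +
		"pkg.B,type,0.74,ast_inferred,1\n" +
		"pkg.C,method,0.52,structural,0"
	fmt.Println("distance-0 rows from scan:", countByLastColumn(flat, "0"))
}
```

A parser gets both answers right at any size; the point of the comparison is that only the second one requires per-row work, which is where the models in this benchmark lose count.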

GCF failure taxonomy (precision errors only)

Failure type	Count	Models	Cause
Off-by-1–2 header misread	5	Haiku, GPT-5.4, mini	Header says [167], model reads 166. Tokenization artifact.
Column scan miscount	10	GPT-5.4, mini	Must scan `fn` kind across rows. Deterministic: function_count=84 every run.
Field confusion	2	GPT-5.4, mini	Read symbol count instead of edge count.
Empty response	10	GPT-5.5	Context overwhelm at 53k+ input tokens (JSON payload size).

GCF median error magnitude: 4. GCF failures on Claude are near-zero. GCF failures on OpenAI are deterministic and repeatable — same wrong number every run — suggesting a tokenizer-level parsing difference, not a comprehension failure.

Generation: TOON is Broken

We asked every model to produce structured output in each format. 3-line primer in the prompt. Output validated through the real decoder. No hand-holding.

11 models. 3 providers. GCF reaches 5/5 on seven of them and 4–5/5 on two more. TOON with natural labels never exceeds 4/5 on any model.

Model	GCF	TOON (natural)	JSON
Claude Opus 4.6	5/5	0/5	5/5
Claude Sonnet 4.6	5/5	2–3/5	5/5
Claude Haiku 4.5	5/5	1–3/5	5/5
GPT-5.5	4–5/5	1–2/5	5/5
GPT-5.4	5/5	0/5	5/5
GPT-5.4-mini	5/5	0/5	5/5
Gemini 2.5 Pro	5/5	1/5	5/5
Gemini 3.1 Pro	5/5	0/5	5/5
Gemini 3.1 Flash Lite	4–5/5	0/5	4/5
Gemini 3.5 Flash	3/5	1/5	3/5
Gemini 2.5 Flash	2–3/5	0–4/5	0–3/5

No model has ever been trained on GCF. It didn't exist before we built it. Yet every frontier model (Opus, Sonnet, GPT-5.5, Gemini 2.5 Pro, Gemini 3.1 Pro) produces valid, decoder-parseable output on first exposure with a 3-line primer.

TOON has been published for months. It has documentation, examples, a playground, SDK implementations. And Claude Opus scores 0/5. Gemini 3.1 Pro scores 0/5. GPT-5.4 scores 0/5.

The exact failure

Every TOON generation failure produces the same error:

INVALID: symbols: index 0: distance: toon: cannot assign string to int

The model writes:

symbols[5]{name,kind,score,provenance,distance}:
  pkg/api.HandleRequest,function,0.95,lsp_resolved,target

TOON expects:

symbols[5]{name,kind,score,provenance,distance}:
  pkg/api.HandleRequest,function,0.95,lsp_resolved,0

The model is told "this symbol is a target." It writes target. TOON's decoder rejects it because it expects the integer 0. The model would need to know, unprompted, that "target" maps to 0, "related" maps to 1, "extended" maps to 2. No model does this.

This isn't a training problem. This is a design flaw. TOON's flat tabular format encodes semantic categories as integers. The model has to perform a mapping step that has no structural cue in the format itself. When does a column value need to be an integer? When is a string acceptable? TOON gives no signal. The model guesses wrong.
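The rejection is easy to reproduce with any strictly typed column. The sketch below uses a hypothetical `decodeDistance` helper built on `strconv.Atoi`; it is not TOON's decoder, and the error text only imitates the message quoted above.

```go
package main

import (
	"fmt"
	"strconv"
)

// decodeDistance mimics a column typed as int: anything that does not parse
// as an integer is rejected. Hypothetical stand-in, not TOON's decoder.
func decodeDistance(cell string) (int, error) {
	n, err := strconv.Atoi(cell)
	if err != nil {
		return 0, fmt.Errorf("distance: cannot assign string %q to int", cell)
	}
	return n, nil
}

func main() {
	for _, cell := range []string{"0", "target"} {
		n, err := decodeDistance(cell)
		if err != nil {
			fmt.Println("INVALID:", err)
			continue
		}
		fmt.Println("ok:", n)
	}
}
```

Nothing in the row itself tells the writer which of the two cells is acceptable; that knowledge lives only in the decoder's schema.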

GCF never has this problem

GCF expresses distance through section placement:

## targets
@0 fn pkg.HandleRequest 0.95 lsp_resolved
## related
@1 type pkg.ProcessResponse 0.74 ast_inferred
## extended
@2 method pkg.ValidateConfig 0.52 structural

The model is told "this symbol is a target." It writes it in ## targets. No integer mapping. No encoding step. The format aligns with how the model naturally expresses grouped data. Sections are categories. That's how markdown works. That's how every model already thinks.
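Writing by section placement can be sketched as a simple group-and-emit loop. The `Symbol` type and `encodeSections` function below are hypothetical; the line layout (`@id kind name score provenance`) is copied from the sample above rather than from the GCF spec, so treat it as an approximation of the real encoder.

```go
package main

import (
	"fmt"
	"strings"
)

// Symbol is a hypothetical record; Label is the natural-language category
// ("targets", "related", or "extended") rather than an integer code.
type Symbol struct {
	Kind, Name, Provenance string
	Score                  float64
	Label                  string
}

// encodeSections writes each symbol under the header for its label, so the
// grouping is carried by position instead of by a column value.
func encodeSections(symbols []Symbol) string {
	var b strings.Builder
	id := 0
	for _, label := range []string{"targets", "related", "extended"} {
		wroteHeader := false
		for _, s := range symbols {
			if s.Label != label {
				continue
			}
			if !wroteHeader {
				fmt.Fprintf(&b, "## %s\n", label)
				wroteHeader = true
			}
			fmt.Fprintf(&b, "@%d %s %s %.2f %s\n", id, s.Kind, s.Name, s.Score, s.Provenance)
			id++
		}
	}
	return b.String()
}

func main() {
	fmt.Print(encodeSections([]Symbol{
		{Kind: "fn", Name: "pkg.HandleRequest", Provenance: "lsp_resolved", Score: 0.95, Label: "targets"},
		{Kind: "type", Name: "pkg.ProcessResponse", Provenance: "ast_inferred", Score: 0.74, Label: "related"},
		{Kind: "method", Name: "pkg.ValidateConfig", Provenance: "structural", Score: 0.52, Label: "extended"},
	}))
}
```

Run as written, this reproduces the three-section sample shown above. The label is consumed when choosing a section and never appears as a cell that could be mistyped.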

Even with hand-holding, GCF wins

When we explicitly pre-encode distances as integers in the prompt ("distance 0" instead of "target"), TOON passes. But this means the caller must know TOON's internal encoding and pre-process every field before the model can write valid output.

Format	Prompt style	Valid	Output size (100 symbols)
GCF	natural labels	5/5	5,984 B
TOON	hand-held (integers)	5/5	8,336 B
TOON	natural labels	0/5	invalid
JSON	natural labels	5/5	16,121 B

GCF works with natural language. TOON requires a preprocessing step. And even with that step, GCF output is 28% smaller.
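The preprocessing step the caller takes on looks roughly like the sketch below. The label-to-integer mapping is the one described earlier in this article (target 0, related 1, extended 2); the `encodeDistance` helper and its error handling are hypothetical.

```go
package main

import "fmt"

// distanceCode is the mapping the caller must apply before prompting,
// as described in this article.
var distanceCode = map[string]int{"target": 0, "related": 1, "extended": 2}

// encodeDistance converts a natural label into the integer a typed
// distance column accepts. Hypothetical helper for illustration.
func encodeDistance(label string) (int, error) {
	code, ok := distanceCode[label]
	if !ok {
		return 0, fmt.Errorf("unknown distance label %q", label)
	}
	return code, nil
}

func main() {
	for _, label := range []string{"target", "related", "extended"} {
		code, err := encodeDistance(label)
		if err != nil {
			panic(err)
		}
		// The prompt would then say "distance 0" instead of "target".
		fmt.Printf("%s -> distance %d\n", label, code)
	}
}
```

The function is trivial; the cost is that every caller has to know the mapping exists and apply it to every field before the model sees the prompt.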

GCF Works Without Training

No model has seen GCF in training. And yet:

Claude Opus 4.6: 5/5 valid (zero variance across 2 runs)
Claude Sonnet 4.6: 5/5 valid (zero variance across 2 runs)
Claude Haiku 4.5: 5/5 valid (2 runs)
GPT-5.5: 4–5/5 valid
GPT-5.4: 5/5 valid
GPT-5.4-mini: 5/5 valid (zero variance across 2 runs)
Gemini 2.5 Pro: 5/5 valid (zero variance across 2 runs)
Gemini 3.1 Pro: 5/5 valid
Gemini 3.1 Flash Lite: 4–5/5 valid (3 runs)

This happens because GCF is aligned with patterns LLMs already understand:

## section_name is a markdown header. Every model knows this.
@0 fn pkg.Auth 0.78 lsp_resolved is positional. One token per field. No ambiguity.
@1<@0 calls is 4 tokens. Self-contained. No nested objects.

The format was designed for the machine's native expression patterns. TOON was designed for human readability. JSON was designed for human readability. Neither format was designed for the reader that's actually doing the work.

TOON's Own Benchmark: GCF Wins All 6 Datasets

We forked TOON's benchmark repository, added a GCF formatter, and ran their datasets with their tokenizer and their methodology.

Dataset	GCF tokens	TOON tokens	Result
Semi-uniform event logs	108,158	154,032	GCF 30% smaller
E-commerce orders	61,593	73,246	GCF 16% smaller
Deeply nested config	616	618	GCF 0.3% smaller
Employee records	49,055	49,966	GCF 2% smaller
Analytics time-series	8,398	9,127	GCF 8% smaller
GitHub repos	8,576	8,744	GCF 2% smaller

TOON's home turf. TOON's datasets. TOON's methodology. GCF wins every single one.

Even on flat tabular employee records, the dataset TOON was literally designed for, GCF is smaller. On semi-uniform data where structures vary, the gap widens to 30% (put the other way, TOON needs 42% more tokens than GCF).
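"Percent smaller" and "percent larger" use different denominators, which is why the same pair of token counts can yield both 30% and 42%. A short worked example using the event-log counts from the table (the function names are illustrative):

```go
package main

import "fmt"

// percentSmaller reports how much smaller a is than b, relative to b.
func percentSmaller(a, b float64) float64 {
	return 100 * (b - a) / b
}

// percentLarger reports how much larger b is than a, relative to a.
func percentLarger(a, b float64) float64 {
	return 100 * (b - a) / a
}

func main() {
	gcf, toon := 108158.0, 154032.0 // semi-uniform event logs
	fmt.Printf("GCF is %.1f%% smaller than TOON\n", percentSmaller(gcf, toon)) // 29.8%
	fmt.Printf("TOON is %.1f%% larger than GCF\n", percentLarger(gcf, toon))   // 42.4%
}
```

The same calculation on the e-commerce row gives 15.9% smaller, or equivalently 18.9% larger.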

Session Statefulness: The Compounding Advantage

GCF has a feature no other format supports: session statefulness. Symbols seen in prior tool calls are referenced by ID instead of re-serialized.

First call: full payload. Second call: only new symbols, plus @ref IDs for previously-seen ones. By the 5th call in a conversation: 92.7% token savings.

TOON and JSON re-serialize everything on every call. There is no mechanism for cross-call deduplication. Every tool response pays full price regardless of what the model already knows.
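Cross-call deduplication of this kind can be sketched as an encoder that remembers what it has already sent. The `Session` type below and the `@ref N` line it emits are placeholders chosen for illustration; GCF's actual reference syntax and session handling are defined by its spec, not by this example.

```go
package main

import (
	"fmt"
	"strings"
)

// Session remembers the ID assigned to each symbol already sent.
type Session struct {
	ids map[string]int
}

// NewSession returns an empty session with no symbols seen yet.
func NewSession() *Session {
	return &Session{ids: map[string]int{}}
}

// Encode writes a full line for symbols seen for the first time and a short
// reference for symbols sent in an earlier call. The "@ref" form is a
// placeholder, not GCF's actual wire syntax.
func (s *Session) Encode(symbols []string) string {
	var b strings.Builder
	for _, name := range symbols {
		if id, seen := s.ids[name]; seen {
			fmt.Fprintf(&b, "@ref %d\n", id)
			continue
		}
		id := len(s.ids)
		s.ids[name] = id
		fmt.Fprintf(&b, "@%d %s\n", id, name)
	}
	return b.String()
}

func main() {
	s := NewSession()
	fmt.Print(s.Encode([]string{"pkg.HandleRequest", "pkg.ProcessResponse"}))
	fmt.Println("-- second call --")
	fmt.Print(s.Encode([]string{"pkg.HandleRequest", "pkg.ValidateConfig"}))
}
```

On the second call only `pkg.ValidateConfig` is written out in full; the repeated symbol costs a single short reference line, and the saving grows with each call that revisits known symbols.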

This is where GCF's advantage compounds over a session. The per-call savings (32–79% vs JSON) multiply across 5–10 tool calls in a typical agent interaction.

JSON was designed for humans to read. TOON was designed as a compromise between humans and machines. GCF was designed for the machine. The consumer of your API responses is not a human. Optimize for the reader that matters.

Reproduce Everything

The eval is open source. Every result is committed. Every log file is in the repository.

git clone https://github.com/blackwell-systems/gcf-go
cd gcf-go/eval

# Comprehension (any provider)
GOWORK=off go test -run TestComprehension -v -timeout 0
EVAL_BACKEND=openai OPENAI_API_KEY=... EVAL_MODEL=gpt-5.5 GOWORK=off go test -run TestComprehension -v -timeout 0
EVAL_BACKEND=google GOOGLE_API_KEY=... EVAL_MODEL=gemini-2.5-flash GOWORK=off go test -run TestComprehension -v -timeout 0

# Generation (all three formats)
GOWORK=off go test -run "TestGeneration$|TestGenerationTOON|TestGenerationJSON" -v -timeout 0

# Token efficiency (TOON's benchmark)
git clone https://github.com/blackwell-systems/toon.git
cd toon && git checkout gcf-comparison && cd benchmarks && pnpm install && pnpm benchmark:tokens

Run it yourself. The numbers don't change.
