The Training Signal

Train on Failure

Every CHRONOS session produces something no benchmark can: paired examples of exactly where a model falls short — and exactly what a better answer looks like.

Current training data is noise

You want your model to be more creative. So you train it on "creative" examples. But which examples? Selected by whom? With what criteria?

The training signal for creative reasoning has always been vibes. Human annotators label outputs as "good" or "bad" based on subjective impression. The model learns to produce text that sounds creative without actually pushing into new territory.

CHRONOS generates something different: structural preference pairs where "better" is measured by whether a thought explains something — not by whether it sounds novel or impresses an annotator.

Same question. Same exclusion zone.
Different output quality.

When multiple models compete on the same anchor in the same exclusion zone, some produce thoughts that structurally advance the conversation — connecting ideas, reducing confusion, resolving open questions — and some don't. That gap is the training signal.

Rejected
Model A — attempt on [information geometry]
"The Fisher information metric provides a natural Riemannian structure on statistical manifolds, connecting information theory to differential geometry..."
structural score: 0.04 — restates known territory
Stored
Model B — same anchor, same zone
"The obstruction to extending Fisher geometry beyond exponential families is not curvature but flatness — the dual connection degenerates precisely where the model class boundary intersects the mixture family..."
structural score: 0.19 — bridges clusters, reduces correlation

Same prompt. Same exclusion zone. Same anchor. The only difference is structural quality. Model A restated what was already known. Model B connected two previously separate ideas and reduced the confusion between them — the formal signature of explanation. That pair is now a DPO training example targeting Model A's specific deficit.
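The pairing step can be sketched in a few lines. This is illustrative only: the field names, data shapes, and minimum-gap threshold are assumptions for the sketch, not CHRONOS's actual API.

```python
# Illustrative sketch: build a DPO preference pair from scored attempts on one
# anchor. Field names and the min_gap threshold are assumptions, not CHRONOS's API.

def make_dpo_pair(prompt, attempts, min_gap=0.10):
    """Pair the best and worst attempts; keep the pair only if the
    structural-score gap is large enough to carry a training signal."""
    ranked = sorted(attempts, key=lambda a: a["structural_score"])
    rejected, chosen = ranked[0], ranked[-1]
    gap = chosen["structural_score"] - rejected["structural_score"]
    if gap < min_gap:
        return None  # attempts too close: no usable preference signal
    return {
        "prompt": prompt,
        "chosen": chosen["text"],
        "rejected": rejected["text"],
        "score_gap": round(gap, 3),
    }

attempts = [
    {"model": "A", "structural_score": 0.04,
     "text": "The Fisher information metric provides a natural Riemannian structure..."},
    {"model": "B", "structural_score": 0.19,
     "text": "The obstruction to extending Fisher geometry is not curvature but flatness..."},
]
pair = make_dpo_pair("anchor: information geometry", attempts)
```

The gap threshold matters: two attempts that score nearly the same carry no preference signal, so the pair is dropped rather than trained on.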

Not "be more creative."
Structurally better.

Traditional RLHF training data says: this output is better than that output. A human said so. The model learns to match the annotator's taste.

CHRONOS training data says: this thought connects previously separate ideas and reduces the correlation between them. That thought restates what was already known. The "better" isn't opinion — it's structural. Measured across seven independent axes, validated against what actually gets referenced in later sessions.
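How seven axis measurements might roll up into one structural score can be sketched as a weighted mean. The axis names below are placeholders invented for the sketch; the document does not enumerate CHRONOS's actual axes.

```python
# The seven axis names are placeholders, not CHRONOS's actual metrics. This
# only illustrates combining independent per-axis measurements into one score.

AXES = ("bridging", "decorrelation", "resolution", "linkage",
        "compression", "predictivity", "uptake")

def structural_score(axis_scores, weights=None):
    """Weighted mean over the seven axes; missing axes count as zero."""
    if weights is None:
        weights = {a: 1.0 / len(AXES) for a in AXES}
    return sum(weights[a] * axis_scores.get(a, 0.0) for a in AXES)
```

Because each axis is measured independently, a thought that merely sounds novel can score high on one axis and still land near zero overall.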

The signal isn't "be more creative." The signal is "when you're stuck on this problem, here's what actually advancing it looks like — and here's what spinning in place looks like."

And because CHRONOS tracks which anchors each model struggles with, the training pairs are targeted. A model that excels at mathematical formalization but fails at cross-domain synthesis gets pairs specifically targeting that gap. Not generic creativity — surgical correction.
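The targeting step amounts to a filter over the pair pool. Everything here is hypothetical: the idea that pairs carry axis tags, and the axis names themselves, are assumptions made for illustration.

```python
# Hypothetical filter: keep only the pairs whose anchor exercises the axes a
# given model is weak on. Data shapes and axis names are illustrative.

def deficit_targeted_pairs(pairs, weak_axes):
    """Select training pairs that exercise at least one weak axis."""
    return [p for p in pairs if set(p["anchor_axes"]) & weak_axes]

pairs = [
    {"id": 1, "anchor_axes": ["mathematical_formalization"]},
    {"id": 2, "anchor_axes": ["cross_domain_synthesis", "bridging"]},
]
targeted = deficit_targeted_pairs(pairs, weak_axes={"cross_domain_synthesis"})
```

A model strong at formalization but weak at synthesis would receive only the second pair: surgical correction rather than generic creativity training.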

Every session makes the next one harder

1. Session runs: models compete under geometric pressure.
2. Pairs generated: failures paired with breakthroughs on the same anchor.
3. Model improves: DPO training targets specific cognitive deficits.
4. Atlas deepens: the exclusion zone expands, so the next session is harder.

This is the compounding loop. Better models produce higher-quality sessions. Higher-quality sessions produce harder training pairs. Harder training pairs produce better models. The ceiling rises with every cycle.
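The shape of that loop can be written as a toy simulation. Every function and constant below is a made-up stand-in, not a CHRONOS component; the only point is the feedback structure, where quality and difficulty ratchet each other upward.

```python
# Toy simulation of the four-step compounding loop. All functions and constants
# are invented stand-ins; only the shape of the feedback is the point.

def run_session(quality, zone_size):
    # Step 1: models compete under geometric pressure; bigger zones are harder.
    return max(quality - 0.02 * zone_size, 0.0)

def generate_pairs(session_score):
    # Step 2: failures paired with breakthroughs; signal tracks session quality.
    return 0.5 * session_score

def dpo_train(quality, pair_signal):
    # Step 3: training on the pairs lifts model quality.
    return quality + 0.1 * pair_signal

quality, zone = 1.0, 1
for _ in range(4):
    score = run_session(quality, zone)
    quality = dpo_train(quality, generate_pairs(score))
    zone += 1  # Step 4: the atlas deepens; the next session is harder
```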

If you're training the next generation

You're an AI lab. You've saturated the benchmarks. Your model scores well on knowledge retrieval, code generation, and instruction following. The next frontier is creative reasoning — and you don't have training data for it.

CHRONOS generates that data. Run your model through it. Get back preference pairs targeting its specific weaknesses. Train on those pairs. Run it again. See if the deficits narrowed.
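"See if the deficits narrowed" is a before/after comparison of per-axis scores. The axis names, score values, and improvement threshold below are all illustrative assumptions.

```python
# Hypothetical before/after check: which targeted axes actually improved after
# a round of DPO training? Axis names, scores, and eps are illustrative.

def narrowed_axes(before, after, eps=0.05):
    """Axes whose score rose by more than eps between two evaluation runs."""
    return sorted(a for a in before if after.get(a, 0.0) - before[a] > eps)

before = {"cross_domain_synthesis": 0.31, "mathematical_formalization": 0.82}
after  = {"cross_domain_synthesis": 0.44, "mathematical_formalization": 0.83}
improved = narrowed_axes(before, after)
```

Only movement beyond the threshold counts as a narrowed deficit; a one-point wiggle on an already-strong axis does not.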

The training signal isn't generic. It's your model, on your hardest problems, compared to six frontier competitors, measured across seven structural axes. That's what surgical calibration looks like.
