Are You Smarter Than a 0.5B Model?

Your estimate - First interval appears after five probes.

IKP score 0.0% 0 probes answered

0.5B benchmark - Beat roughly 26.5% IKP score.

Current streak 0 Correct answers in a row

Your answer

Verdict

IKP calibration

Each dot is an evaluated LLM. Your dot moves after every answer.

Legend: Known size, Estimated, You

About this game

This game uses the 1,400-question probe set from Bojie Li's paper, Incompressible Knowledge Probes, and the public IKP repository.

The paper calibrates factual probe performance against model size with a log-linear curve. Here, your automatically graded answers are mapped onto that same curve, so the result is an LLM-equivalent factual capacity estimate, not a claim about human brains.

The original benchmark uses an LLM judge with strict rules. This site uses a deterministic approximation: normalized exact/fuzzy matching, numeric tolerance, alternate-answer handling, and refusal detection.
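
As a rough illustration of what such a deterministic grader might look like, here is a minimal Python sketch. It is not the site's actual code: the function names, the refusal phrase list, the 1% numeric tolerance, and the 0.9 fuzzy-match threshold are all assumptions chosen for the example.

```python
import math
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text).lower()
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9.\- ]+", " ", text)
    return " ".join(text.split())


# Hypothetical refusal phrases, stored in normalized form.
REFUSALS = {normalize(p) for p in ("", "i don't know", "idk", "not sure", "pass")}


def as_number(text: str):
    """Return the text as a float if it parses as one, else None."""
    try:
        return float(text.strip().replace(",", ""))
    except ValueError:
        return None


def grade(answer: str, accepted: list[str],
          rel_tol: float = 0.01, fuzzy_threshold: float = 0.9) -> str:
    """Classify an answer as 'correct', 'wrong', or 'refusal'.

    `accepted` holds the reference answer plus any alternates.
    """
    norm = normalize(answer)
    if norm in REFUSALS:
        return "refusal"
    num = as_number(answer)
    for target in accepted:
        target_num = as_number(target)
        if num is not None and target_num is not None:
            # Numeric answers are compared with a relative tolerance only.
            if math.isclose(num, target_num, rel_tol=rel_tol):
                return "correct"
            continue
        # Text answers: exact match after normalization gives ratio 1.0.
        ratio = SequenceMatcher(None, norm, normalize(target)).ratio()
        if ratio >= fuzzy_threshold:
            return "correct"
    return "wrong"
```

For instance, `grade("Pariss", ["Paris"])` passes the fuzzy threshold, while `grade("I don't know", ["Paris"])` is treated as a refusal rather than a wrong guess.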

The displayed IKP score is a penalized score, not raw accuracy: each correct answer counts as +1, each refusal as 0, and each wrong guess as -0.5, and the result is floored at 0. The paper's evaluation code applies this hallucination penalty and floors each tier score at zero before averaging across tiers; this game applies the same penalty to the sample you have answered.
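
The penalty arithmetic can be sketched in a few lines of Python. The function name is hypothetical, and dividing by the total number of answered probes (refusals included) is an assumption about how the per-answer points are averaged.

```python
def ikp_score(correct: int, wrong: int, refused: int) -> float:
    """Penalized score: +1 per correct, 0 per refusal, -0.5 per wrong, floored at 0."""
    answered = correct + wrong + refused
    if answered == 0:
        return 0.0
    # Assumption: points are averaged over every answered probe.
    raw = (1.0 * correct - 0.5 * wrong) / answered
    return max(0.0, raw)


# 6 correct, 2 wrong, 2 refusals: (6 - 1) / 10 = 0.5, shown as 50.0%
print(f"{ikp_score(6, 2, 2):.1%}")
```

Under this rule a run with 1 correct and 4 wrong answers averages (1 - 2) / 5 = -0.2 before the floor, so it displays as 0%.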

The estimate starts from a deliberately conservative prior: before seeing your answers, the 90% equivalent-parameter interval is 10M to 1B. After each answer, your observed IKP score is mapped through log10(params_B) = 6.790 * score - 0.899. The score uncertainty from your answered sample and the paper's approximate calibration error are combined with that prior in log-parameter space. The displayed 90% CI (equiv) is the resulting posterior interval.
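
One way to picture this update is as a normal-normal combination in log10-parameter space. The sketch below is an illustration, not the site's implementation: it assumes the score is a fraction in [0, 1], that params_B is measured in billions of parameters, that the prior and the observation are both Gaussian in log space, and that `score_se` and `calib_sd` (hypothetical names) are supplied by the caller.

```python
import math

SLOPE = 6.790       # from the calibration formula in the text
INTERCEPT = -0.899  # from the calibration formula in the text
Z90 = 1.6449        # normal quantile for a two-sided 90% interval


def score_to_log10_params_b(score: float) -> float:
    """Map an IKP score (fraction) to log10 of the parameter count in billions."""
    return SLOPE * score + INTERCEPT


def prior_from_interval(lo_b: float, hi_b: float) -> tuple[float, float]:
    """Turn a 90% interval in billions into a Gaussian (mean, sd) in log10 space."""
    lo, hi = math.log10(lo_b), math.log10(hi_b)
    return (lo + hi) / 2, (hi - lo) / (2 * Z90)


def posterior_interval(score: float, score_se: float, calib_sd: float,
                       prior_lo_b: float = 0.01, prior_hi_b: float = 1.0):
    """Return (lower, point, upper) in billions for the 90% posterior interval."""
    prior_mean, prior_sd = prior_from_interval(prior_lo_b, prior_hi_b)
    obs_mean = score_to_log10_params_b(score)
    # Score noise is scaled by the slope, then added in quadrature
    # to the calibration error.
    obs_sd = math.hypot(SLOPE * score_se, calib_sd)
    if obs_sd <= 0:
        raise ValueError("observation uncertainty must be positive")
    w_prior, w_obs = 1 / prior_sd**2, 1 / obs_sd**2
    post_mean = (w_prior * prior_mean + w_obs * obs_mean) / (w_prior + w_obs)
    post_sd = math.sqrt(1 / (w_prior + w_obs))
    return (10 ** (post_mean - Z90 * post_sd),
            10 ** post_mean,
            10 ** (post_mean + Z90 * post_sd))


# Hypothetical inputs: 20% score, 5-point standard error, 0.4 dex calibration error.
low, point, high = posterior_interval(score=0.20, score_se=0.05, calib_sd=0.4)
print(f"{low:.2f}B to {high:.2f}B (point {point:.2f}B)")
```

With few answers `score_se` is large, so the prior dominates and the interval stays close to 10M to 1B; as the standard error shrinks, the observation takes over and the interval tightens around the calibrated value.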

Your first estimate

Answer more questions to tighten your confidence interval.

Report a question

What is wrong?

Correct answer or suggested fix

Submit to leaderboard

Display name

Ranking uses the lower bound of your interval, so a tighter 0.5B run can outrank a loose 1B run.
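
A small Python sketch of that ranking rule, using made-up runs; the `Run` type and its field names are hypothetical.

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str
    lower_b: float  # lower bound of the 90% interval, in billions
    point_b: float  # point estimate, in billions
    upper_b: float  # upper bound of the 90% interval, in billions


def rank(runs: list[Run]) -> list[Run]:
    """Order runs by interval lower bound, highest first."""
    return sorted(runs, key=lambda r: r.lower_b, reverse=True)


runs = [
    Run("loose-1B", lower_b=0.05, point_b=1.0, upper_b=8.0),
    Run("tight-0.5B", lower_b=0.20, point_b=0.5, upper_b=1.2),
]
leaderboard = rank(runs)
for position, run in enumerate(leaderboard, start=1):
    print(position, run.name, f"lower bound {run.lower_b}B")
```

Here the 0.5B run ranks first because its lower bound (0.20B) is above the 1B run's lower bound (0.05B), even though its point estimate is smaller.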