Introduction: The Hidden Pitfall in Benchmark Interpretations
When comparing model performance on a benchmark, it can seem clear-cut when one result is statistically significant and another is not. However, this apparent clarity often masks a critical issue: the benchmark's ability to detect the effect sizes you actually care about. A non-significant result doesn't necessarily mean there's no improvement—it may simply indicate that the benchmark lacks the statistical power to reliably measure small but meaningful gains.

In a recent evaluation, Delta A was reported as –2.34 points (95% CI [–11.09, +6.20], p = 0.71) and deemed not significant, while Delta B showed +22.18 points (95% CI [+14.43, +29.82], p = 0.0) and was considered significant. At first glance, this suggests one model clearly outperforms the other. But when the benchmark consists of only 216 binary pass/fail tasks, the picture changes drastically. This article addresses two practical questions: what size of improvement this benchmark can reliably detect, and why a reported p = 0.0 is statistically invalid.
The Gap Between Significance and Power
A p-value tells you how unusual the observed data are under a null hypothesis—it does not tell you whether your experiment was capable of detecting a true effect. A result like p = 0.71 can mean either that there is truly no effect, or that there is a small effect but the benchmark has low power to detect it. These two interpretations lead to very different decisions for model iteration and development, making power analysis an essential step.
Minimum Detectable Effect (MDE) in the 216-Task Benchmark
Using a standard two-proportion planning approximation with a baseline pass rate of ≈74%, alpha = 0.05 (two-sided), power = 0.80, and n = 216 tasks, the minimum detectable effect (MDE) is approximately +10.9 percentage points. This means the benchmark can reliably detect only improvements of about 11 points or more. If your practical target is a lift of +3 to +5 points, a 216-task benchmark is simply too small to have a reasonable chance of finding it.
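This MDE can be reproduced with a short calculation. The sketch below assumes the standard normal-approximation sample-size formula for comparing two proportions, and inverts it by bisection to find the smallest detectable lift for a fixed task count; the function names (`required_n`, `mde`) are illustrative, not from any particular library.

```python
from math import sqrt
from statistics import NormalDist

def required_n(p1, delta, alpha=0.05, power=0.80):
    """Standard two-proportion planning approximation: number of binary
    tasks needed to detect a lift of `delta` over baseline pass rate `p1`."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # critical value, two-sided test
    z_b = NormalDist().inv_cdf(power)           # quantile for the target power
    p2 = p1 + delta
    p_bar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return (num / delta) ** 2

def mde(p1, n, alpha=0.05, power=0.80, lo=1e-4, hi=0.25):
    """Smallest lift detectable with `n` tasks, found by bisection:
    required_n(delta) decreases as delta grows, so we search for the
    delta where it crosses n."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if required_n(p1, mid, alpha, power) > n:
            lo = mid   # this lift needs more than n tasks; look at larger lifts
        else:
            hi = mid
    return hi

print(round(mde(0.74, 216), 3))  # → 0.109, i.e. about +10.9 points
```

Running this with the article's settings (baseline 74%, n = 216) recovers the ≈ +10.9-point MDE quoted above.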
Reevaluating Non-Significant Results with Power in Mind
Given this power profile, Delta A with p = 0.71 should be interpreted as inconclusive for small effects, not as definitive evidence of no improvement. The approximate detection probabilities for the 216-task benchmark are:
- True +3 point effect → ~11% detection chance
- True +5 point effect → ~23% detection chance
- True +8 point effect → ~52% detection chance
Thus, failing to reject the null for small lifts is expected most of the time. A non-significant p-value should never be read as proof of no effect when statistical power is low.
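The detection probabilities above can be checked with the same normal approximation. The sketch below computes the approximate power of a two-sided two-proportion test for a given true lift, counting only the upper-tail rejection region (the lower tail is negligible for positive lifts of this size); `detection_power` is an illustrative name, not a library function.

```python
from math import sqrt
from statistics import NormalDist

def detection_power(p1, delta, n, alpha=0.05):
    """Approximate probability that a two-sided two-proportion test at
    level `alpha` rejects when the true lift over baseline `p1` is `delta`,
    with `n` binary tasks per arm."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    p2 = p1 + delta
    p_bar = (p1 + p2) / 2
    se0 = sqrt(2 * p_bar * (1 - p_bar))            # pooled SE under the null
    se1 = sqrt(p1 * (1 - p1) + p2 * (1 - p2))      # SE under the alternative
    z = (delta * sqrt(n) - z_a * se0) / se1
    return NormalDist().cdf(z)                      # upper-tail rejection only

for lift in (0.03, 0.05, 0.08):
    print(f"true lift +{lift:.0%} -> {detection_power(0.74, lift, 216):.0%} power")
```

With baseline 74% and n = 216, this reproduces the ~11%, ~23%, and ~52% detection chances listed above.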

Designing a Benchmark with Adequate Statistical Power
To reliably detect specific effect sizes at 80% power, the required number of tasks (at the same baseline and test settings) is:
- To detect +3 points: ≈3,226 tasks
- To detect +5 points: ≈1,128 tasks
- To detect +8 points: ≈420 tasks
These numbers highlight the importance of aligning benchmark size with your minimal meaningful improvement. If +5 points is your threshold, version 2.0 of the benchmark should aim for ~1,100+ tasks. If +3 matters, you need multi-thousand scale. Only when you care exclusively about large lifts like +8 points can a 400+ task benchmark suffice.
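These task counts follow directly from the same planning formula, run forward instead of inverted. The sketch below assumes the standard two-proportion normal approximation and rounds up to a whole number of tasks; `tasks_needed` is an illustrative name.

```python
from math import sqrt, ceil
from statistics import NormalDist

def tasks_needed(p1, delta, alpha=0.05, power=0.80):
    """Number of binary tasks the standard two-proportion approximation
    requires to detect a lift of `delta` over baseline pass rate `p1`
    at the given alpha (two-sided) and power."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p2 = p1 + delta
    p_bar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return ceil((num / delta) ** 2)   # round up: can't run a fractional task

for lift in (0.03, 0.05, 0.08):
    print(f"+{lift:.0%} -> {tasks_needed(0.74, lift)} tasks")
```

At baseline 74% this yields 3,226, 1,128, and 420 tasks for +3, +5, and +8 points respectively, matching the figures above.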
Fixing Common Bootstrap p-Value Reporting Errors
Reporting p = 0.0 is invalid when using bootstrap or Monte Carlo resampling with a finite number of samples (e.g., B = 2,000). The correct empirical p-value is calculated using the formula: p = (r + 1) / (B + 1), where r is the count of resamples at least as extreme as the observed statistic. With B = 2,000 and r = 0, the correct p = 1 / 2001 ≈ 0.00050. Acceptable reporting options include: bootstrap p ≈ 0.0005, bootstrap p ≤ 1/2001, or bootstrap p < 0.001.
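The correction is a one-line formula, sketched below; the +1 in numerator and denominator treats the observed statistic as one of the draws, so the empirical p-value can never be exactly zero. The function name `bootstrap_p` is illustrative.

```python
def bootstrap_p(r, B):
    """Empirical bootstrap / Monte Carlo p-value with the +1 correction:
    r = number of resamples at least as extreme as the observed statistic,
    B = total number of resamples. Never returns exactly 0."""
    return (r + 1) / (B + 1)

# The article's case: no resample reached the observed statistic (r = 0).
print(f"bootstrap p ≈ {bootstrap_p(0, 2000):.5f}")  # → 0.00050, report p < 0.001
```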
Practical Recommendations for Reporting Benchmark Results
A defensible rewrite of the original report would explicitly address power and correct the bootstrap p-value. For example: “Under a standard two-proportion planning approximation with baseline pass rate ≈74%, alpha = 0.05, and power = 0.80, the minimum detectable effect for 216 tasks is about +10.9 percentage points. Therefore, the non-significant result for Delta A is inconclusive for improvements below this threshold. Additionally, the reported p = 0.0 is corrected to bootstrap p ≈ 0.0005.” This approach encourages more nuanced interpretation and better experimental design in future benchmarks.