Is your A/B test actually a winner?

Free statistical significance calculator. Real Z-test for two proportions, the classical fixed-horizon test behind most A/B significance calculators. Plus a sample size calculator so you know when to stop a test before you start one.

Control (A · baseline): conversion rate 5.00%
Variant (B · challenger): conversion rate 6.00%

Verdict: Variant B is the winner
A 20% lift with 99.8% confidence. The result is statistically significant — you can roll out Variant B.
P-value: 0.0019 (below 0.05 = significant)
Confidence: 99.81% (1 − p-value)
Lift: +20.00% (relative improvement)
Z-score: 3.10 (|z| ≥ 1.96 → significant)
95% Confidence Interval
The true difference between Variant B and Control is between +0.37 and +1.63 percentage points. If you repeated this test many times, about 95% of the intervals built this way would contain the true effect.

Before you launch a test, figure out how big it needs to be. Underpowered tests fail to detect real winners: you keep the weaker version, thinking nothing happened.

You need at least 31,234 visitors per variant
Total: ~62,468 visitors across both variants
How long will this take?
At 1,000 daily visitors per variant, this test would run for about 31 days. Adjust your traffic estimate to see how long you'd actually wait.


What does "statistical significance" actually mean?

When you run an A/B test and Variant B converts at 6% versus Control's 5%, you have two possibilities. Either B is genuinely better, or the difference is random noise, the kind of variation you'd see by flipping the same coin 100 times on two different days and getting slightly different counts of heads.

Statistical significance is the formal answer to: "How sure am I that this isn't noise?" The convention across the industry is 95% confidence. If your test result is significant at 95%, it means that if A and B were truly identical, a difference at least this large would show up in fewer than 5% of tests.

That 5% threshold is called the significance level (or alpha). It is also where your p-value comes in. The p-value is the probability of seeing a result this extreme — or more extreme — if there were actually no difference between variants. A p-value of 0.02 means "if A and B were truly identical, only 2% of tests would produce a result this strong." That's strong enough to declare a winner at 95%.

How to read this calculator's output

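The big verdict at the top is your bottom line. Below it, four numbers explain why:

P-value. The probability of seeing a difference this large, or larger, if A and B were truly identical. Below 0.05 counts as significant.
Confidence. One minus the p-value, shown as a percentage.
Lift. The relative improvement of B over A.
Z-score. How many standard errors separate the two conversion rates. An absolute value of 1.96 or more is significant at 95%.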

The confidence interval at the bottom tells you the range your true effect probably lives in. If the interval crosses zero (one bound is negative, the other positive), the test is not significant — you can't even rule out that B is worse.
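As a rough sketch of where an interval like this comes from, the Python below computes a 95% confidence interval for the absolute difference using the unpooled standard error. The function name is hypothetical, and the 10,000 visitors per variant are an assumption: this page does not state the sample sizes behind the 5% vs 6% example, but that figure is consistent with the numbers shown.

```python
from math import sqrt
from statistics import NormalDist


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Normal-approximation (Wald) interval for p_B - p_A, using the unpooled standard error."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z_crit * se, diff + z_crit * se


# Assumed inputs: 500 of 10,000 visitors convert on A, 600 of 10,000 on B.
low, high = diff_confidence_interval(500, 10_000, 600, 10_000)
print(f"{low:+.2%} to {high:+.2%}")

# If this is True, the interval crosses zero and the test is not significant.
crosses_zero = low < 0 < high
```

With these assumed inputs both bounds are positive, so the interval does not cross zero.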

One-tailed vs two-tailed: which to use?

By default, this calculator uses a two-tailed test. That means it asks: "Is B different from A?" — either better or worse. This is the safer choice and the standard in academic research.

A one-tailed test asks instead: "Is B better than A?" It treats "B is worse" and "no difference" as the same outcome. One-tailed tests need a smaller sample to reach significance for the same lift, which is tempting. But they're only valid if you've decided in advance that you'll ship B no matter what, even if it underperforms. In practice, almost no one actually behaves that way. They'd kill B if it tanked. So they should be using two-tailed tests.

Use one-tailed only when your test has a true directional hypothesis you've committed to before launch — for example, a regulatory change that you must implement, and you're just verifying it doesn't hurt revenue too much.
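To make the difference concrete, here is a minimal sketch that converts a single Z-score into both p-values. The function name and the example Z-score of 1.80 are illustrative assumptions, not values taken from this calculator.

```python
from statistics import NormalDist


def p_values(z):
    """Return (two_tailed, one_tailed) p-values; one-tailed tests 'B is better than A'."""
    norm = NormalDist()
    one_tailed = 1 - norm.cdf(z)              # P(Z >= z) if there is no difference
    two_tailed = 2 * (1 - norm.cdf(abs(z)))   # extreme in either direction
    return two_tailed, one_tailed


two, one = p_values(1.80)
print(f"two-tailed p = {two:.4f}, one-tailed p = {one:.4f}")
```

For a Z-score of 1.80 the one-tailed p-value falls below 0.05 while the two-tailed p-value does not, which is exactly why the one-tailed test is tempting.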

Sample size: the hidden killer of A/B tests

Most A/B tests fail not because the variant is bad, but because the test was too small to detect the real winner. This is called being underpowered. With 500 visitors per variant, a 5% baseline, and a 1 percentage point lift, the Z-score is only about 0.7. The observed difference would have to be roughly three times larger just to clear the 95% threshold, and real-world lifts are rarely that large.

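The Sample Size mode of this calculator (toggle above) tells you the minimum visitors per variant you need before you start, based on your baseline conversion rate, the smallest lift you want to be able to detect, your significance level, and the statistical power you want.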

Practical rule of thumb: if you have fewer than 1,000 conversions per variant, you can only reliably detect lifts of 20% or more. Smaller lifts may be real, but they are invisible to your test.
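One way to see what "underpowered" means is to estimate the chance that a test of a given size detects a real lift. The sketch below uses the usual normal approximation for a two-sided test with equal group sizes. The function name is hypothetical, and the 5% to 6% scenario is an assumed example.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p_a, p_b, n_per_variant, alpha=0.05):
    """Approximate power of a two-sided two-proportion Z-test with equal group sizes."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    p_bar = (p_a + p_b) / 2
    null_sd = sqrt(2 * p_bar * (1 - p_bar))
    alt_sd = sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))
    # Probability that the statistic lands beyond the critical value in the
    # direction of the true effect (the opposite tail is ignored as negligible).
    shift = abs(p_b - p_a) * sqrt(n_per_variant)
    return norm.cdf((shift - z_alpha * null_sd) / alt_sd)


small = approx_power(0.05, 0.06, 500)
large = approx_power(0.05, 0.06, 10_000)
print(f"power at 500 per variant: {small:.0%}; at 10,000 per variant: {large:.0%}")
```

Under these assumptions the small test would miss a genuine 1 point lift most of the time, while the larger one would usually catch it.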

Common A/B testing mistakes that even big teams make

Peeking at results early. The single most expensive mistake. If you check significance daily and stop the test the moment p drops below 0.05, you've turned a 5% false positive rate into something closer to 25-30%. Decide on your sample size, run the full test, then look at the result. If you must monitor, use sequential testing (Bayesian methods or Always Valid Inference) — not naive frequentist checks.
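The inflation from peeking is easy to demonstrate with simulated A/A tests, where both variants share the same true conversion rate and every "winner" is a false positive. The sketch below is illustrative only. The 10% rate, 14 daily looks, 100 visitors per look, and the seed are all assumptions, and the exact inflation depends on how often you look.

```python
import random
from math import sqrt

Z_CRIT = 1.96  # two-sided 5% threshold


def is_significant(conv_a, conv_b, n):
    """Pooled two-proportion Z-test with n visitors in each variant."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return False
    return abs((conv_b - conv_a) / n / se) >= Z_CRIT


def simulate(n_tests=400, looks=14, visitors_per_look=100, rate=0.10, seed=1):
    """Return (false positive rate with peeking, false positive rate with one final look)."""
    rng = random.Random(seed)
    any_look_hits = 0    # significant at ANY daily look
    final_look_hits = 0  # significant at the planned end only
    for _ in range(n_tests):
        conv_a = conv_b = 0
        hit = False
        for look in range(1, looks + 1):
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_look))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_look))
            sig = is_significant(conv_a, conv_b, look * visitors_per_look)
            hit = hit or sig
        any_look_hits += hit
        final_look_hits += sig  # sig now holds the result of the last look
    return any_look_hits / n_tests, final_look_hits / n_tests


peeking_rate, fixed_rate = simulate()
print(f"false positives with daily peeking: {peeking_rate:.1%}")
print(f"false positives with one final look: {fixed_rate:.1%}")
```

The single final look should stay near the nominal 5%, while stopping at the first significant daily look produces a much higher rate.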

Stopping at "almost significant" (p = 0.06). There is no almost. The 0.05 threshold is somewhat arbitrary, but once you allow yourself to fudge it, the entire statistical framework collapses. If your test is at p = 0.06, you either need more data or you need to accept the result is inconclusive.

Running too many concurrent tests. If you run 20 independent tests simultaneously at 95% confidence and none of the variants has a real effect, you'd still expect about one false positive purely by chance, and the odds of seeing at least one are roughly 64%. This is the multiple comparisons problem. Solutions: stricter significance levels (Bonferroni correction), or simply running fewer tests at any one time.
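The arithmetic behind the multiple comparisons problem can be sketched in a few lines. The function names are hypothetical, and the calculation assumes the tests are independent and that no variant has a real effect.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests with no real effects."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(n_tests, alpha=0.05):
    """Per-test significance level that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


n = 20
print(f"expected false positives: {n * 0.05:.1f}")
print(f"chance of at least one:   {family_wise_error_rate(n):.0%}")
print(f"Bonferroni threshold:     p < {bonferroni_alpha(n):.4f}")
```

With 20 tests, the Bonferroni correction asks each individual test to reach p < 0.0025 rather than p < 0.05.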

Mistaking lift for impact. A 50% lift on a button that converts 0.1% of users adds 0.05 percentage points. A 5% lift on a 30% checkout flow adds 1.5 percentage points. Always translate lift into absolute conversion rate before you celebrate.
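The conversion from relative lift to absolute impact is a single multiplication. A tiny sketch, with a hypothetical helper name:

```python
def absolute_gain_points(baseline_rate, relative_lift):
    """Percentage points gained when a baseline rate improves by a relative lift."""
    return baseline_rate * relative_lift * 100


button = absolute_gain_points(0.001, 0.50)   # 50% lift on a 0.1% button
checkout = absolute_gain_points(0.30, 0.05)  # 5% lift on a 30% checkout flow
print(f"button: +{button:.2f} points, checkout: +{checkout:.2f} points")
```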

Novelty effects. New designs often outperform in week 1 and regress to baseline by week 3, because users notice the change and engage more. Run your tests for at least one full business cycle (usually 2 weeks) before reading the result.

Benchmarks: what conversion rates and lifts to expect

Funnel step                             Typical conversion rate   Realistic A/B lift
Homepage → Signup CTA click             3-8%                      5-20%
Signup form completion                  30-60%                    3-10%
Free trial → paid conversion (SaaS)     15-25%                    5-15%
Add-to-cart → checkout (e-commerce)     50-70%                    3-8%
Checkout → purchase                     60-80%                    2-7%
Email open rate                         15-30%                    10-30%
Email click rate                        2-5%                      10-25%
Onboarding completion (mobile app)      20-40%                    5-15%

If your test is showing a lift well above the top end of these ranges (say, +60% on signup completion), be suspicious. Either you discovered something brilliant, or there's a bug in your tracking, or the test wasn't randomized properly. Investigate before shipping.

The math behind this calculator

For the significance test, this calculator uses a two-sample Z-test for proportions with a pooled standard error. The Z-statistic is calculated as:

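Z = (p_B - p_A) / SE, where SE = √(p_pool × (1 - p_pool) × (1/n_A + 1/n_B))

Here p_A and p_B are the observed conversion rates, n_A and n_B are the visitors per variant, and p_pool is the combined rate: total conversions divided by total visitors across both variants.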

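The p-value comes from the standard normal distribution. This is the classical fixed-horizon approach behind most A/B significance calculators. Some commercial testing platforms report results from sequential or Bayesian engines instead, so their numbers may not match this test exactly. The normal approximation works well when conversion rates are between 1% and 99% and sample sizes per variant exceed about 100; a common additional check is that each variant has at least 10 conversions and 10 non-conversions.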
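The pooled Z-test can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions, not this calculator's actual source: the function name is hypothetical, and the 10,000 visitors per variant are assumed, since the sample sizes behind the 5% vs 6% example are not shown on this page.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided Z-test for two proportions with a pooled standard error."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    lift = (p_b - p_a) / p_a  # relative improvement of B over A
    return z, p_value, lift


# Assumed inputs: 10,000 visitors per variant, 500 vs 600 conversions.
z, p_value, lift = two_proportion_z_test(500, 10_000, 600, 10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}, lift = {lift:+.2%}")
```

With these assumed inputs the result lines up with the example at the top of the page: a Z-score of about 3.10 and a p-value of about 0.0019.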

For the sample size calculator, we use the standard formula based on the normal approximation:

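n = (Z_α/2 × √(2 × p̄ × (1 - p̄)) + Z_β × √(p_A × (1 - p_A) + p_B × (1 - p_B)))² / (p_B - p_A)²

Here p̄ = (p_A + p_B) / 2, Z_α/2 is the critical value for your significance level (1.96 for 95% confidence), and Z_β is the critical value for your power (0.84 for 80% power).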

For most use cases this is accurate within 1-2%. If you need exact answers — for very small samples or extreme conversion rates near 0% or 100% — use Fisher's exact test or a Bayesian framework instead. Those require more setup and aren't worth it for typical marketing A/B tests.
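The sample size formula translates directly into code. The sketch below is a hedged illustration with a hypothetical function name. The inputs (5% baseline, 10% relative lift, 95% confidence, 80% power) are assumptions, chosen because they land on or very near the 31,234 visitors shown in the example above; the page itself does not state which inputs produced that figure.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(p_a, p_b, alpha=0.05, power=0.80):
    """Visitors needed per variant for a two-sided test, via the normal approximation."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_beta = norm.inv_cdf(power)
    p_bar = (p_a + p_b) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))) ** 2
    return ceil(numerator / (p_b - p_a) ** 2)


# Assumed inputs: 5% baseline, 10% relative lift (5.0% -> 5.5%), 95% confidence, 80% power.
n = sample_size_per_variant(0.05, 0.055)
days = n / 1_000  # at 1,000 daily visitors per variant
print(f"{n:,} visitors per variant, about {days:.0f} days")
```

Halving the lift you want to detect roughly quadruples the required sample, because the difference is squared in the denominator.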
