What does "statistical significance" actually mean?

When you run an A/B test and Variant B converts at 6% versus Control's 5%, you have two possibilities. Either B is genuinely better, or the difference is random noise — the kind of variation you'd see by flipping the same coin twice and getting slightly different results.

Statistical significance is the formal answer to: "Could this difference just be noise?" The convention across the industry is 95% confidence. If your test result is significant at 95%, it means that a difference at least this large would show up less than 5% of the time if A and B truly performed the same.

That 5% threshold is called the significance level (or alpha). It is also where your p-value comes in. The p-value is the probability of seeing a result this extreme — or more extreme — if there were actually no difference between variants. A p-value of 0.02 means "if A and B were truly identical, only 2% of tests would produce a result this strong." That's strong enough to declare a winner at 95%.

How to read this calculator's output

The big verdict at the top is your bottom line. Below it, four numbers explain why:

P-value: Your central evidence. Below 0.05 = significant at 95%. Below 0.01 = significant at 99%. Above 0.05 = inconclusive, you can't declare a winner.
Confidence: Just 1 minus the p-value, expressed as a percent. Easier to read at a glance: "99.8% confident" lands harder than "p = 0.002." It restates the p-value; it is not the probability that B is the better variant.
Lift: The relative improvement of B over A. A 5% to 6% jump is a +20% lift, not a "1% lift." Marketers care about relative lift; statisticians care about absolute difference. Both matter.
Z-score: The standardized distance between A and B. |Z| ≥ 1.96 means you've crossed the 95% threshold. |Z| ≥ 2.58 means you've crossed 99%. This is the raw signal-to-noise ratio of your test.

The confidence interval at the bottom tells you the range your true effect probably lives in. If the interval crosses zero (one bound is negative, the other positive), the test is not significant — you can't even rule out that B is worse.
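As a rough illustration of the "crosses zero" check, here is a minimal Python sketch of a 95% interval for the absolute difference B minus A. It assumes a Wald interval with an unpooled standard error; the page does not say which interval the calculator actually uses, and the function name `diff_confidence_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald interval for the absolute difference p_B - p_A (assumed method)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Unpooled standard error: each variant keeps its own variance.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z_crit * se, diff + z_crit * se


# Illustrative numbers: 5% vs 6% with 10,000 visitors per variant.
low, high = diff_confidence_interval(500, 10_000, 600, 10_000)
crosses_zero = low < 0 < high
print(f"95% CI for B - A: [{low:+.4f}, {high:+.4f}], crosses zero: {crosses_zero}")
```

With the same 5% and 6% rates but only 500 visitors per variant, the interval from this sketch stretches to both sides of zero, which is the inconclusive case described above.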

One-tailed vs two-tailed: which to use?

By default, this calculator uses a two-tailed test. That means it asks: "Is B different from A?" — either better or worse. This is the safer choice and the standard in academic research.

A one-tailed test asks instead: "Is B better than A?" It treats "B is worse" and "no difference" as the same outcome. One-tailed tests need a smaller sample to reach significance for the same lift, which is tempting. But they're only valid if you've decided in advance that you'll ship B no matter what, even if it underperforms. In practice, almost no one actually behaves that way. They'd kill B if it tanked. So they should be using two-tailed tests.

Use one-tailed only when your test has a true directional hypothesis you've committed to before launch — for example, a regulatory change that you must implement, and you're just verifying it doesn't hurt revenue too much.
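To see why one-tailed tests are tempting, the sketch below turns a single Z-score into both p-values using the standard normal distribution. The helper name `p_values_from_z` and the example Z-score of 1.80 are illustrative assumptions, not values from the calculator.

```python
from statistics import NormalDist


def p_values_from_z(z):
    """Return (one_tailed, two_tailed) p-values for a Z-score.

    The one-tailed value only tests the direction "B is better than A".
    """
    std_normal = NormalDist()
    one_tailed = 1 - std_normal.cdf(z)
    two_tailed = 2 * (1 - std_normal.cdf(abs(z)))
    return one_tailed, two_tailed


one, two = p_values_from_z(1.80)
print(f"one-tailed p = {one:.3f}, two-tailed p = {two:.3f}")
```

For a positive Z-score the one-tailed p-value is exactly half the two-tailed one, so Z = 1.80 passes the 0.05 threshold one-tailed (about 0.036) but not two-tailed (about 0.072).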

Sample size: the hidden killer of A/B tests

Most A/B tests fail not because the variant is bad, but because the test was too small to detect the real winner. This is called being underpowered. With 500 visitors per variant, a 1 percentage point difference on a 5% baseline (5% vs 6%) gives a Z-score of only about 0.7; the observed gap would need to be roughly three times larger to clear the 95% threshold. Real-world lifts are rarely that large.
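A quick way to put a number on "underpowered" is to estimate the chance that a test of a given size detects a real effect. The sketch below uses the normal approximation and assumes a true 5% vs 6% difference; `approximate_power` is a hypothetical helper, not part of the calculator.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(p_a, p_b, n_per_variant, alpha=0.05):
    """Approximate power of a two-tailed test to detect p_a vs p_b."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    se = sqrt(p_a * (1 - p_a) / n_per_variant + p_b * (1 - p_b) / n_per_variant)
    # Probability that the observed Z clears the threshold in the right
    # direction; the tiny chance of a significant result in the wrong
    # direction is ignored.
    return 1 - std_normal.cdf(z_crit - abs(p_b - p_a) / se)


small = approximate_power(0.05, 0.06, 500)
large = approximate_power(0.05, 0.06, 10_000)
print(f"power at 500 per variant: {small:.0%}, at 10,000 per variant: {large:.0%}")
```

Under these assumptions the 500-visitor test finds the real 1 point improvement only about one time in ten, while 10,000 visitors per variant pushes the odds above 80%.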

The Sample Size mode of this calculator (toggle above) tells you the minimum visitors per variant you need before you start, based on:

Baseline conversion rate: Your current rate before testing.
Minimum detectable effect (MDE): The smallest lift worth caring about. A 1% lift on a 5% baseline (so 5% → 5.05%) requires enormous samples. A 20% lift (5% → 6%) is much faster.
Statistical power: Your probability of detecting a real effect if it exists. 80% is conventional; 90% is recommended for high-stakes decisions.
Significance level: The same threshold as before (alpha = 0.05, i.e. 95% confidence), applied to the test design rather than the result.

Practical rule of thumb: if you have fewer than 1,000 conversions per variant, you can only reliably detect lifts of 20% or more. Smaller lifts may be real but are invisible to your test.

Common A/B testing mistakes that even big teams make

Peeking at results early. The single most expensive mistake. If you check significance daily and stop the test the moment p drops below 0.05, you've turned a 5% false positive rate into something closer to 25-30%. Decide on your sample size, run the full test, then look at the result. If you must monitor, use a method designed for repeated looks, such as sequential testing (for example Always Valid Inference) or Bayesian methods, not naive frequentist checks.
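The inflation from peeking can be reproduced with a small simulation. The sketch below is a simplified model, not the calculator's code: it assumes A/A tests with no real difference, 20 equally spaced peeks, and a Z-score that behaves like a normalized random walk. The function name and parameters are illustrative.

```python
import random
from math import sqrt


def false_positive_rates(n_tests=2000, n_peeks=20, z_crit=1.96, seed=42):
    """Simulate A/A tests (no real difference) under two stopping rules.

    Each peek adds an equal batch of data, modelled as one standard normal
    increment to a running sum; the Z-score at peek k is sum / sqrt(k).
    Returns (rate when stopping at any significant peek,
             rate when looking only once at the end).
    """
    rng = random.Random(seed)
    crossed_at_any_peek = 0
    significant_at_end = 0
    for _ in range(n_tests):
        running_sum = 0.0
        crossed = False
        z = 0.0
        for k in range(1, n_peeks + 1):
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / sqrt(k)
            if abs(z) >= z_crit:
                crossed = True
        crossed_at_any_peek += crossed
        significant_at_end += abs(z) >= z_crit
    return crossed_at_any_peek / n_tests, significant_at_end / n_tests


peeking, single_look = false_positive_rates()
print(f"false positives with peeking: {peeking:.1%}, single look: {single_look:.1%}")
```

In this model the single look at the end stays near the nominal 5%, while stopping at the first significant peek lands in the neighborhood of the 25% figure quoted above. The exact number depends on how many times you look.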

Stopping at "almost significant" (p = 0.06). There is no almost. The 0.05 threshold is somewhat arbitrary, but once you allow yourself to fudge it, the entire statistical framework collapses. If your test is at p = 0.06, you either need more data or you need to accept the result is inconclusive.

Running too many concurrent tests. If you run 20 tests simultaneously at 95% confidence and none of the variants has a real effect, you'd still expect one false positive purely by chance, and the odds of seeing at least one are about 64%. This is the multiple comparisons problem. Solutions: stricter significance levels (Bonferroni correction), or simply running fewer tests at any one time.
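The arithmetic behind the multiple comparisons problem is short enough to check directly. This sketch assumes the tests are independent and that no variant has a real effect; both helper names are hypothetical.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent null tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(n_tests, alpha=0.05):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / n_tests


n = 20
expected_false_positives = n * 0.05
print(f"expected false positives: {expected_false_positives:.1f}")
print(f"chance of at least one: {family_wise_error_rate(n):.0%}")
print(f"Bonferroni per-test alpha: {bonferroni_alpha(n):.4f}")
print(f"chance of at least one after correction: "
      f"{family_wise_error_rate(n, bonferroni_alpha(n)):.1%}")
```

With the corrected per-test threshold of 0.0025, the chance of any false positive across all 20 tests drops back to just under 5%, at the cost of needing stronger evidence for each individual test.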

Mistaking lift for impact. A 50% lift on a button that converts 0.1% of users adds 0.05 percentage points. A 5% lift on a 30% checkout flow adds 1.5 percentage points. Always translate lift into absolute conversion rate before you celebrate.
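The conversion from relative lift to absolute impact is a single multiplication, sketched here with the two examples from the paragraph above. The function names are illustrative.

```python
def relative_lift(rate_a, rate_b):
    """Relative improvement of B over A, e.g. 0.20 for a +20% lift."""
    return (rate_b - rate_a) / rate_a


def absolute_gain_pp(baseline_rate, lift):
    """Percentage points added by a relative lift on a baseline rate."""
    return baseline_rate * lift * 100


button = absolute_gain_pp(0.001, 0.50)   # 50% lift on a 0.1% button
checkout = absolute_gain_pp(0.30, 0.05)  # 5% lift on a 30% checkout flow
print(f"button: +{button:.2f} pp, checkout: +{checkout:.2f} pp")
print(f"5% -> 6% is a {relative_lift(0.05, 0.06):+.0%} relative lift")
```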

Novelty effects. New designs often outperform in week 1 and regress to baseline by week 3, because users notice the change and engage more. Run your tests for at least one full business cycle (usually 2 weeks) before reading the result.

Benchmarks: what conversion rates and lifts to expect

Funnel step	Typical conversion rate	Realistic A/B lift
Homepage → Signup CTA click	3-8%	5-20%
Signup form completion	30-60%	3-10%
Free trial → paid conversion (SaaS)	15-25%	5-15%
Add-to-cart → checkout (e-commerce)	50-70%	3-8%
Checkout → purchase	60-80%	2-7%
Email open rate	15-30%	10-30%
Email click rate	2-5%	10-25%
Onboarding completion (mobile app)	20-40%	5-15%

If your test is showing a lift well above the top end of these ranges (say, +60% on signup completion), be suspicious. Either you discovered something brilliant, or there's a bug in your tracking, or the test wasn't randomized properly. Investigate before shipping.

The math behind this calculator

For the significance test, this calculator uses a two-sample Z-test for proportions with a pooled standard error. The Z-statistic is calculated as:

Z = (p_B - p_A) / SE, where SE = √(p_pool × (1 - p_pool) × (1/n_A + 1/n_B)) and p_pool = (conversions_A + conversions_B) / (n_A + n_B)

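The p-value comes from the standard normal distribution. This is the classic fixed-horizon frequentist approach. Commercial platforms such as Optimizely, VWO, AB Tasty, and Google's old Optimize tool have largely used sequential or Bayesian engines instead, so their numbers may not match this calculator exactly. The approximation works well when conversion rates are between 1% and 99% and each variant has at least about 10 conversions and 10 non-conversions, which usually means well over 100 visitors per variant.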
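The Z-test described above can be sketched in a few lines of Python using only the standard library. This is an illustration of the formula, not the calculator's actual source; the function name `two_proportion_z_test` and the example counts are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-tailed, pooled two-sample Z-test for proportions.

    Returns (z, p_value). Assumes at least one conversion and one
    non-conversion overall, otherwise the standard error is zero.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate: treat both variants as one sample under "no difference".
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-tailed p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Illustrative numbers: 5% vs 6% with 10,000 visitors per variant.
z, p = two_proportion_z_test(500, 10_000, 600, 10_000)
print(f"Z = {z:.2f}, p-value = {p:.4f}, confidence = {1 - p:.1%}")
```

Running the same rates with 500 visitors per variant (25 vs 30 conversions) gives a Z-score well below 1.96, which shows how strongly the verdict depends on sample size.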

For the sample size calculator, we use the standard formula based on the normal approximation:

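n = (Z_α/2 × √(2 × p̄ × (1 - p̄)) + Z_β × √(p_A × (1 - p_A) + p_B × (1 - p_B)))² / (p_B - p_A)²

Here n is the number of visitors per variant, p̄ = (p_A + p_B) / 2 is the average of the two rates, Z_α/2 is the two-tailed critical value for the significance level (1.96 for 95%), and Z_β is the critical value for the chosen power (0.84 for 80%).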

For most use cases this is accurate within 1-2%. If you need exact answers — for very small samples or extreme conversion rates near 0% or 100% — use Fisher's exact test or a Bayesian framework instead. Those require more setup and aren't worth it for typical marketing A/B tests.
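The sample size formula above translates directly into code. The sketch below is one possible reading of it, with the MDE expressed as a relative lift on the baseline; the function name and the default values for alpha and power are assumptions chosen to match the conventions mentioned earlier.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed per variant, using the normal approximation."""
    p_a = baseline
    p_b = baseline * (1 + relative_mde)
    p_bar = (p_a + p_b) / 2
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = std_normal.inv_cdf(power)
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))
    ) ** 2
    return ceil(numerator / (p_b - p_a) ** 2)


big_lift = sample_size_per_variant(0.05, 0.20)    # 5% -> 6%
tiny_lift = sample_size_per_variant(0.05, 0.01)   # 5% -> 5.05%
print(f"20% lift: {big_lift:,} per variant; 1% lift: {tiny_lift:,} per variant")
```

On a 5% baseline this comes out to roughly 8,200 visitors per variant for a 20% lift and around 3 million for a 1% lift, which is why the choice of MDE dominates test duration.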
