A/B testing for marketing teams in 2026: when it works, when it doesn't

Alejandro Rioja

Most marketing A/B testing advice assumes you have Netflix-scale traffic. You don’t. A typical B2B site gets 2,000-10,000 monthly visitors. A typical D2C brand gets 30,000-100,000. At those volumes, almost no test you run will reach statistical significance in a reasonable window — and pretending otherwise wastes months.

This is the honest framework we use across our portfolio. It’s less about p-values and more about deciding which tests are worth running at all.

Test only when the delta could be material

The first filter: would a 20% improvement in this metric actually matter to the business? If the answer is no, don’t test.

Examples of tests that meet this bar:

Examples that almost never meet this bar:

The button-color tests became internet famous because Google ran them at scale. At your scale, even a real 20% lift would take 6 months to detect with confidence, by which point the underlying market has changed.
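The materiality filter at the top of this section comes down to simple arithmetic. Here is a minimal sketch of it; the function names, the traffic and order-value inputs, and the $25,000 threshold are all hypothetical placeholders, not figures from our portfolio.

```python
def annual_revenue_gain(monthly_visitors, conversion_rate,
                        value_per_conversion, relative_lift=0.20):
    """Extra revenue per year if the metric improves by `relative_lift`."""
    baseline_conversions = monthly_visitors * conversion_rate
    extra_conversions = baseline_conversions * relative_lift
    return extra_conversions * value_per_conversion * 12


def worth_testing(gain, materiality_threshold):
    """The first filter: only test when the upside clears a bar you set yourself."""
    return gain >= materiality_threshold


# Hypothetical inputs: same traffic, very different value per conversion.
high_value_surface = annual_revenue_gain(5_000, 0.03, 1_200)
low_value_surface = annual_revenue_gain(5_000, 0.002, 50)

THRESHOLD = 25_000  # assumed materiality bar, in dollars per year

print(round(high_value_surface), worth_testing(high_value_surface, THRESHOLD))
print(round(low_value_surface), worth_testing(low_value_surface, THRESHOLD))
```

If a 20% lift on a surface would not clear your own threshold, the test is not worth the traffic it consumes.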

Sample size math, in plain English

To detect a 10% relative lift on a 5% baseline conversion rate (5.0% to 5.5%) at 95% confidence and 80% statistical power, you need roughly 30,000 visitors per variant, or 60,000 total. If your site gets 5,000 monthly visitors, that’s a year-long test.
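To check that figure yourself, here is a minimal sketch using the standard normal-approximation formula for a two-sided test comparing two proportions. It assumes a 5% significance level (95% confidence) and a 50/50 traffic split; the function name is ours, and dedicated calculators may differ by a few percent depending on the exact formula they use.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate sample size per variant for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


n = visitors_per_variant(baseline=0.05, relative_lift=0.10)
print(n)      # roughly 31,000 per variant
print(2 * n)  # roughly 62,000 across both variants
```

Note how quickly the requirement grows as the lift shrinks: the sample size scales with the inverse square of the difference you want to detect, so halving the expected lift roughly quadruples the traffic you need.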

The rule of thumb that’s served us across 11 brands: if you can’t reach 95% confidence and ship the result within 6 weeks, you don’t have enough traffic to test that hypothesis. Make a decision instead.

Bar chart titled "Months to detect a 10% lift at 95% confidence." Five bars from left to right: 2K visitors per month requires 30 months; 10K per month requires 6 months; 50K per month requires 1.2 months; 200K per month requires 0.3 months; 1M per month finishes in under one week. A green dashed horizontal line marks a 6-week ship threshold. Takeaway noted at the bottom: under 50K monthly visits, ship the variant your judgment says wins and revisit in 4 weeks.
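The arithmetic behind the chart is simple enough to rerun with your own traffic. This sketch assumes the 60,000-visitor total from the example above and that every visitor enters the test; the constant and function names are illustrative.

```python
TOTAL_SAMPLE = 60_000        # both variants combined, from the example above
SHIP_THRESHOLD_WEEKS = 6     # the rule-of-thumb cutoff
WEEKS_PER_MONTH = 52 / 12


def months_to_finish(monthly_visitors, total_sample=TOTAL_SAMPLE):
    """Months of traffic needed, assuming all visitors are enrolled in the test."""
    return total_sample / monthly_visitors


def can_test(monthly_visitors, total_sample=TOTAL_SAMPLE):
    """True if the test would finish inside the 6-week ship threshold."""
    weeks = months_to_finish(monthly_visitors, total_sample) * WEEKS_PER_MONTH
    return weeks <= SHIP_THRESHOLD_WEEKS


for traffic in (2_000, 10_000, 50_000, 200_000, 1_000_000):
    print(f"{traffic:>9,} visitors/mo: {months_to_finish(traffic):5.2f} months, "
          f"testable={can_test(traffic)}")
```

If only part of your traffic reaches the page under test, pass that smaller number as `monthly_visitors`; the real duration is usually longer than the sitewide figure suggests.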

Use Evan Miller’s calculator to sanity-check sample size before starting. Most teams skip this and learn 8 weeks in that they were never going to reach significance.

What to do when you can’t test

This is the most under-discussed reality of small-team A/B testing: most decisions should be made via judgment, not stats.

Three frameworks we use when we can’t reach significance:

Look at the qualitative signal first. Run session recordings (we use Microsoft Clarity, which is free). If the variant fixes a problem you can SEE — users hesitating, scrolling past the CTA, abandoning a form — that’s a stronger signal than a 60%-confidence quant result.

Use external benchmarks. If your homepage conversion is 1.2% and similar B2B SaaS sites benchmark at 2.5%, you have headroom regardless of which variant tests “better” on your tiny sample.

Make a reversible decision and revisit. Ship the variant your judgment says wins. Watch the topline metric for 4 weeks. If it doesn’t get worse, keep it. If it does, revert. This is faster than testing for most small teams.
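The ship-and-revisit rule can be written down as a small check so nobody forgets the revisit. This is a sketch under our own assumptions: the metric is tracked weekly, "worse" means the 4-week average drops below the pre-ship average, and the optional `tolerance` parameter is a hypothetical addition for teams that want to ignore small dips.

```python
from statistics import mean


def keep_or_revert(baseline_weeks, post_ship_weeks, tolerance=0.0):
    """Compare the first 4 post-ship weeks of a topline metric to the pre-ship average."""
    if len(post_ship_weeks) < 4:
        return "keep watching"
    baseline = mean(baseline_weeks)
    after = mean(post_ship_weeks[:4])
    if after < baseline * (1 - tolerance):
        return "revert"
    return "keep"


# Hypothetical weekly conversion rates before and after shipping the variant.
before = [0.012, 0.011, 0.013, 0.012]
print(keep_or_revert(before, [0.013, 0.012, 0.014, 0.012]))  # keep
print(keep_or_revert(before, [0.009, 0.010, 0.009, 0.010]))  # revert
print(keep_or_revert(before, [0.013, 0.012]))                # keep watching
```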

What to actually test

When you DO have the traffic and the delta would be material, here’s the priority order we use:

  1. Pricing page. Highest leverage. A 10% lift here moves topline revenue more than any other test.
  2. Top-of-funnel CTA copy and placement. A bigger top-of-funnel lifts every subsequent metric.
  3. Onboarding step 1. Activation drops here are catastrophic — fix this before testing anything downstream.
  4. Email subject lines (on lists over 50K). Highest-velocity test you can run.
  5. Paid ad creative. If you’re spending over $10K/mo on paid, creative tests pay for themselves fast.

Things we explicitly de-prioritize:

Tool choice

For small teams we use:

For enterprise we use Optimizely or LaunchDarkly (the latter is primarily a feature-flagging tool, but its experimentation features work well).

How to read results when you don’t have significance

You’ll often reach 90% confidence after 3-4 weeks but never get to the 95% threshold. Two options:

Ship the leading variant if the cost of being wrong is low. A change to a homepage CTA that’s marginally better than the control is fine to ship at 80% confidence. You can revert.

Don’t ship if the cost of being wrong is high. A pricing page change at 80% confidence isn’t safe. Either run longer or change the test design.
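Here is a rough sketch of the two options as a decision rule. It assumes "confidence" means one minus the two-sided p-value of a pooled two-proportion z-test, which is how many testing tools report it; your tool may compute it differently. The conversion counts are made up, and the 80% and 95% cutoffs come from the examples above.

```python
from math import sqrt
from statistics import NormalDist


def confidence(conv_a, n_a, conv_b, n_b):
    """1 minus the two-sided p-value of a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return 1 - p_value


def decide(conf, variant_ahead, cost_of_being_wrong):
    """cost_of_being_wrong is 'low' (easy to revert) or 'high' (e.g. pricing)."""
    if not variant_ahead:
        return "keep control"
    if conf >= 0.95:
        return "ship"
    if cost_of_being_wrong == "low" and conf >= 0.80:
        return "ship and be ready to revert"
    return "run longer or redesign the test"


# Hypothetical result: control 100/2,000 conversions, variant 125/2,000.
conf = confidence(100, 2_000, 125, 2_000)
print(f"{conf:.0%}")
print(decide(conf, variant_ahead=True, cost_of_being_wrong="low"))
print(decide(conf, variant_ahead=True, cost_of_being_wrong="high"))
```

The same result produces two different decisions, which is the point: the threshold depends on how expensive a wrong call would be, not on the number alone.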

The single most expensive A/B testing mistake

Running too many tests in parallel on the same audience. If you’re testing the homepage hero AND the pricing page AND the email subject line in the same week, the variants interact and you can’t attribute the lift to any one change. We’ve seen teams burn 3 months chasing a result that turned out to come from one of three concurrent tests they didn’t realize were colliding.

Sequence, don’t parallelize. One major test at a time.
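Collisions are easy to miss when different people own different surfaces. A simple calendar check catches them before launch. This sketch assumes you keep a list of planned tests with an audience label and date range; the `Experiment` class, the audience names, and the dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass(frozen=True)
class Experiment:
    name: str
    audience: str
    start: date
    end: date


def colliding_pairs(experiments):
    """Return name pairs of experiments that overlap in time on the same audience."""
    pairs = []
    for a, b in combinations(experiments, 2):
        same_audience = a.audience == b.audience
        overlaps = a.start <= b.end and b.start <= a.end
        if same_audience and overlaps:
            pairs.append((a.name, b.name))
    return pairs


planned = [
    Experiment("homepage hero", "site visitors", date(2026, 3, 2), date(2026, 4, 12)),
    Experiment("pricing page", "site visitors", date(2026, 3, 16), date(2026, 4, 26)),
    Experiment("email subject line", "newsletter list", date(2026, 3, 2), date(2026, 3, 9)),
    Experiment("onboarding step 1", "site visitors", date(2026, 5, 4), date(2026, 6, 14)),
]

print(colliding_pairs(planned))  # [('homepage hero', 'pricing page')]
```

How you define "same audience" is a judgment call. If your email list and your site visitors are largely the same people, give them the same label so the check flags the overlap.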

What we run for clients

For the clients we work with, we typically run 2-3 tests per quarter on the highest-leverage surfaces. Quality of judgment matters more than test volume.

If you want help figuring out which tests are worth running on your funnel — or whether you should be testing at all — tell us what you’re working on.


Written by Alejandro Rioja

Operator who builds and sells marketing-focused brands. Founder of Pickleland, founder of Flux.LA, writing about AI SEO + GEO at alejandrorioja.com.
