Design of experiments

Statistical foundations for causal inference

Alex Deng alexdeng.github.io/ab-stats

Forked from Lukas Vermeer's simulation based presentation
@lukasvermeer lukasvermeer.nl/ab-stats

Making good decisions needs more than correlation

Americans and the English eat a lot of fatty food, and there is a high rate of cardiovascular disease in the US and the UK.
The French eat a lot of fatty food, yet they have a low(er) rate of cardiovascular disease.
Americans and the English drink a lot of alcohol, and there is a high rate of cardiovascular disease in the US and the UK.
Italians drink a lot of alcohol, yet they too have a low(er) rate of cardiovascular disease.

Conclusion? Eat and drink what you want. And you have a higher chance of getting a heart attack if you speak English!

Drinking could help you live longer: according to one study, people who live to 90 or beyond often drink moderately.

Counterfactual framework (Rubin Causal Model)

The fundamental problem of

causal inference

  1. Missing data: we cannot expose a unit to both treatments at once, so we never observe the counterfactual outcome (see the notation sketch below)
  2. Noise: we do not observe the underlying probabilities directly; observations are noisy
  3. Sampling: we do not observe everyone, only a sample
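A minimal notation sketch of this missing-data view, assuming the usual potential-outcome symbols Y_i(1), Y_i(0) and assignment indicator Z_i:

```latex
% Potential outcomes for unit i: Y_i(1) under treatment, Y_i(0) under control.
% Z_i \in \{0, 1\} is the treatment assignment.
% We only ever record one of the two potential outcomes:
Y_i^{\mathrm{obs}} = Z_i \, Y_i(1) + (1 - Z_i) \, Y_i(0)
% The unit-level causal effect is therefore never directly observable:
\tau_i = Y_i(1) - Y_i(0)
```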

What we can measure (Rubin Causal Model)

What randomization gives us (Rubin Causal Model)
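A one-line sketch of what randomization buys us, under the same assumed notation: random assignment makes Z independent of the potential outcomes, so the observable difference in group means is unbiased for the average treatment effect (ATE):

```latex
% With Z independent of (Y(1), Y(0)) by randomization:
\mathbb{E}[Y^{\mathrm{obs}} \mid Z = 1] - \mathbb{E}[Y^{\mathrm{obs}} \mid Z = 0]
  = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)]
  = \mathrm{ATE}
```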

Repeating the same experiment (Expectation)

We want to reject the

null hypothesis

The null hypothesis assumes the average treatment effect is zero; any difference we observe is simply due to chance.

If we could reasonably rule out chance, we might reject the null and consider this to be evidence for the alternative hypothesis.

We compute a

p-value

We assume the null hypothesis is true and compute the p-value.

Assuming there is no effect, the p-value is the probability of seeing a result at least as extreme as the one observed, purely by chance.

How likely is this result (or a more extreme one), assuming the null is true? Reject the null if the p-value falls below a pre-specified threshold.
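A minimal sketch in Python of computing a p-value for a two-sample comparison, here with a Welch t-test; the metric, sample sizes, and tiny lift are made-up illustration values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical metric for control and treatment (made-up distributions).
control = rng.normal(loc=0.10, scale=0.30, size=10_000)
treatment = rng.normal(loc=0.102, scale=0.30, size=10_000)

# Welch two-sample t-test: how likely is a difference at least this
# extreme if the true difference were zero?
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"observed lift: {treatment.mean() - control.mean():.4f}")
print(f"p-value:       {p_value:.3f}")

# Reject the null at a pre-specified threshold, e.g. alpha = 0.05.
alpha = 0.05
print("reject null" if p_value < alpha else "fail to reject null")
```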

Two types of

errors

  1. Type-I (False Positive) is the incorrect rejection of a true null hypothesis; we cried wolf when there was none
  2. Type-II (False Negative) is the failure to reject a false null hypothesis; we failed to detect a real effect

Rejecting whenever the p-value falls below a threshold α yields a Type-I error rate of α when the null is true; the threshold we set for the p-value controls the Type-I error rate
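A small simulation sketch of that guarantee: when the null is true (an A/A comparison), rejecting at p < 0.05 should cry wolf about 5% of the time. Sample size and number of replications below are arbitrary choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, reps = 0.05, 2_000, 5_000

false_positives = 0
for _ in range(reps):
    # Both groups come from the same distribution: the null is true.
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(0.0, 1.0, n)
    _, p = stats.ttest_ind(a, b, equal_var=False)
    false_positives += p < alpha

# Should print something close to alpha (~0.05).
print(f"Type-I error rate: {false_positives / reps:.3f}")
```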

Repeating the same experiment (No effect)

Repeating the same experiment (Small effect)

The importance of

Statistical power

Statistical power is the probability that the test correctly rejects the null hypothesis when the alternative hypothesis is true.
(1 - Type-II error rate)

Two main things affect statistical power (see the simulation sketch below):

  • Sample size (larger samples give more power)
  • Effect size (larger true effects are easier to detect)
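A simulation sketch of how sample size and effect size drive power; the grid of sizes and effects, the noise level, and the replication count are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
alpha, reps = 0.05, 2_000

def power(n, effect, sd=1.0):
    """Fraction of simulated experiments that reject the null."""
    rejections = 0
    for _ in range(reps):
        control = rng.normal(0.0, sd, n)
        treatment = rng.normal(effect, sd, n)
        _, p = stats.ttest_ind(treatment, control, equal_var=False)
        rejections += p < alpha
    return rejections / reps

for n in (100, 500, 2_000):
    for effect in (0.05, 0.1, 0.2):
        print(f"n={n:5d}  effect={effect:.2f}  power={power(n, effect):.2f}")
```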

Repeating the same experiment (More power)

Repeating the same experiment (More power from larger effect)

Another two types of

errors

  1. Type-M (Magnitude) is the expected ratio of the estimated effect to the true effect, given rejection of the null
  2. Type-S (Sign) is the probability that our estimate has the opposite sign from the true effect, given rejection of the null (both are illustrated in the simulation sketch below)
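A simulation sketch of both quantities, conditioning on rejection; the true effect, noise level, and sample size are made-up values chosen to land near 50% power:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n, reps = 0.05, 200, 10_000
true_effect, sd = 0.2, 1.0

estimates = []
for _ in range(reps):
    control = rng.normal(0.0, sd, n)
    treatment = rng.normal(true_effect, sd, n)
    _, p = stats.ttest_ind(treatment, control, equal_var=False)
    if p < alpha:  # keep only the "significant" experiments
        estimates.append(treatment.mean() - control.mean())

estimates = np.array(estimates)
power = len(estimates) / reps
type_m = np.mean(np.abs(estimates)) / true_effect   # exaggeration ratio
type_s = np.mean(estimates < 0)                     # wrong-sign share

print(f"power ~ {power:.2f}, Type-M ~ {type_m:.2f}x, Type-S ~ {type_s:.4f}")
```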

Type-S and Type-M: moderate power

Type-S and Type-M: very low power

Type-S and Type-M: high power

Type-S and Type-M: very high power

The importance of

Statistical power, AGAIN

Type-M error always exists in the NHST framework: conditional on rejecting the null, estimates tend to overestimate the true effect.

Type-M error can still be high at moderate power such as 50%: although there is a fair chance of rejecting the null, the estimated effects are on average exaggerated by about 50% (the Winner's Curse).

Type-M error is less severe for power above 80%

Fortunately, Type-S error is very rare

Jiannan Lu, Yixuan Qiu, Alex Deng 2018 "A note on type S/M errors in hypothesis testing"

About Peeking and multiple testing

The methods described assume the PROTOCOL of testing one hypothesis, with one analysis, on one set of data; violations of the protocol, such as peeking and multiple testing, inflate the Type-I error rate

Peeking twice

Peeking 100x
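A simulation sketch of what peeking does to the Type-I error rate: the null is true, but we test after every batch and stop at the first p < 0.05; batch size, number of peeks, and replications are arbitrary:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha, batch, peeks, reps = 0.05, 100, 100, 1_000

rejected = 0
for _ in range(reps):
    a = np.empty(0)
    b = np.empty(0)
    for _ in range(peeks):
        # The null is true: both arms draw from the same distribution.
        a = np.concatenate([a, rng.normal(0.0, 1.0, batch)])
        b = np.concatenate([b, rng.normal(0.0, 1.0, batch)])
        _, p = stats.ttest_ind(a, b, equal_var=False)
        if p < alpha:       # stop at the first significant peek
            rejected += 1
            break

# Far above the nominal 5% when we peek this often.
print(f"Type-I error rate with peeking: {rejected / reps:.2f}")
```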

Reasons for violating

Protocol

More flexible protocols may be desirable

  • early stopping rules to mitigate damage
  • early shipping to minimize opportunity cost
  • multiple variants to test several alternatives
  • multiple metrics to guard business KPIs

All these require protocol adjustments
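As one concrete (and deliberately simple) example of such an adjustment, a sketch of a Bonferroni correction when several variants or metrics are tested at once; the p-values below are made-up placeholders:

```python
# Hypothetical p-values from testing several variants/metrics against control.
p_values = {"variant_B": 0.012, "variant_C": 0.034, "variant_D": 0.20}

alpha = 0.05
# Bonferroni: compare each p-value against alpha / (number of tests),
# which keeps the family-wise Type-I error rate at or below alpha.
threshold = alpha / len(p_values)

for name, p in p_values.items():
    verdict = "reject null" if p < threshold else "fail to reject"
    print(f"{name}: p={p:.3f} vs threshold={threshold:.4f} -> {verdict}")
```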

References

  1. Rubin, Donald B. 1974. “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology 66 (5): 688–701. (link)
  2. Goodman, Steve. 2008. “A Dirty Dozen: Twelve P-Value Misconceptions.” Seminars in Hematology 45: 135–140. (link)
  3. Kohavi, R., R. Longbotham, D. Sommerfield, et al. 2009. “Controlled Experiments on the Web: Survey and Practical Guide.” Data Mining and Knowledge Discovery 18: 140. (link)
  4. Deng, Alex, Tianxi Li, and Yu Guo. 2014. “Statistical Inference in Two-Stage Online Controlled Experiments with Treatment Selection and Validation.” WWW ’14: 609–618. (link)
  5. Lu, Jiannan, Yixuan Qiu, and Alex Deng. 2018. “A Note on Type S/M Errors in Hypothesis Testing.” (link)
  6. Deng, Alex, Jiannan Lu, and Shouyuan Chen. 2016. “Continuous Monitoring of A/B Tests Without Pain: Optional Stopping in Bayesian Testing.” (link)