Americans and Britons eat a lot of fatty food, and there is a high rate of cardiovascular disease in the US and UK. The French also eat a lot of fatty food, yet they have a low(er) rate of cardiovascular disease. Americans and Britons drink a lot of alcohol, and again there is a high rate of cardiovascular disease in the US and UK. Italians also drink a lot of alcohol, yet they too have a low(er) rate of cardiovascular disease.
Conclusion? Eat and drink what you want. And you have a higher chance of getting a heart attack if you speak English!
Drinking could help you live longer: according to one study, people who live to 90 or beyond often drink moderately.
Counterfactual framework (Rubin Causal Model)
The fundamental problem of
causal inference
Missing Data: We cannot expose units to both treatments simultaneously, so we never observe the counterfactual.
Noise: We don't directly observe the underlying probabilities; observations are noisy.
Sampling: We don't observe everyone, only a sample.
What we can measure (Rubin Causal Model)
What randomization gives us (Rubin Causal Model)
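A minimal sketch of what randomization buys us, in standard potential-outcomes notation (the symbols Y(1), Y(0), and T are the usual Rubin-model conventions, not taken from the slides):

```latex
% Each unit has two potential outcomes: Y(1) under treatment, Y(0) under control.
% Only one is ever observed -- the fundamental problem of causal inference.
\mathrm{ATE} = \mathbb{E}\left[\, Y(1) - Y(0) \,\right]
% Randomization makes assignment T independent of (Y(1), Y(0)), so
\mathbb{E}\left[\, Y \mid T = 1 \,\right] - \mathbb{E}\left[\, Y \mid T = 0 \,\right]
  = \mathbb{E}\left[\, Y(1) \,\right] - \mathbb{E}\left[\, Y(0) \,\right] = \mathrm{ATE}
% i.e., the simple difference in observed group means is unbiased for the ATE.
```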
Repeating the same experiment (Expectation)
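To make "repeating the same experiment" concrete, here is a small simulation sketch (NumPy; the effect size, noise level, and sample size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
true_effect, sigma, n = 0.2, 1.0, 500  # assumed values for illustration

# Repeat the same randomized experiment many times and record the
# estimated treatment effect (difference in sample means) each time.
estimates = []
for _ in range(10_000):
    control = rng.normal(0.0, sigma, n)
    treatment = rng.normal(true_effect, sigma, n)
    estimates.append(treatment.mean() - control.mean())

# Each estimate is noisy, but their average converges to the true effect.
print(f"mean estimate = {np.mean(estimates):.3f} (true effect = {true_effect})")
```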
We want to reject the
null hypothesis
The null hypothesis assumes the average treatment effect is zero; any difference we observe is simply due to chance.
If we could reasonably rule out chance, we might reject the null and consider this to be evidence for the alternative hypothesis.
We compute a
p-value
We assume the null hypothesis is true and compute the p-value.
Assuming there is no effect, the p-value is the probability of seeing a particular result, or one more extreme, by chance.
How likely is this result (or a more extreme one), assuming the null is true? Reject the null if the p-value is below a threshold.
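A sketch of this computation for a two-sample comparison, using SciPy's Welch t-test (the group sizes and underlying means are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(0.00, 1.0, 1_000)    # assumed control observations
treatment = rng.normal(0.10, 1.0, 1_000)  # assumed treatment observations

# Under the null of zero effect, how likely is a difference at least this extreme?
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")

# Reject the null if the p-value falls below a pre-chosen threshold (e.g. 0.05).
alpha = 0.05
print("reject null" if p_value < alpha else "fail to reject null")
```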
Two types of
errors
Type-I (False Positive) is the incorrect rejection of a true null hypothesis; we cried wolf when there was none
Type-II (False Negative) is the failure to reject a false null hypothesis; we failed to detect a real effect
Rejecting whenever the p-value falls below a threshold α yields a Type-I error rate of α when the null is true; the threshold we set thus controls the Type-I error rate.
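A simulation sketch of that claim, using an A/A test where the null is true by construction (all parameters are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, n, runs = 0.05, 500, 10_000

# A/A test: both groups come from the same distribution, so the null holds
# and every rejection is a Type-I error (a false positive).
false_positives = 0
for _ in range(runs):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(0.0, 1.0, n)
    _, p = stats.ttest_ind(a, b, equal_var=False)
    false_positives += p < alpha

# The observed false-positive rate should land close to alpha (~0.05).
print(f"Type-I error rate ≈ {false_positives / runs:.3f}")
```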
Repeating the same experiment (No effect)
Repeating the same experiment (Small effect)
The importance of
Statistical power
Statistical power is the probability that the test correctly rejects the null hypothesis when the alternative hypothesis is true. (1 - Type-II error rate)
Two main things affect statistical power:
Sample size (more is better)
Effect size (more is better)
Repeating the same experiment (More power)
Repeating the same experiment (More power from larger effect)
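The two "more power" slides above can be reproduced with a small simulation sweeping sample size and effect size (the grid of values is an illustrative assumption):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha, runs = 0.05, 2_000

def power(effect, n):
    """Fraction of simulated experiments that correctly reject the null."""
    rejections = 0
    for _ in range(runs):
        control = rng.normal(0.0, 1.0, n)
        treatment = rng.normal(effect, 1.0, n)
        _, p = stats.ttest_ind(treatment, control, equal_var=False)
        rejections += p < alpha
    return rejections / runs

# Power rises with both sample size and effect size.
for effect in (0.1, 0.2):
    for n in (100, 500, 2_000):
        print(f"effect={effect}, n={n}: power ≈ {power(effect, n):.2f}")
```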
Another two types of
errors
Type-M (Magnitude) is the expected ratio of the estimated effect to the true effect, given rejection of the null
Type-S (Sign) is the probability that our estimate has a different sign than the true effect, given rejection of the null
Type-S and Type-M: moderate power
Type-S and Type-M: very low power
Type-S and Type-M: high power
Type-S and Type-M: very high power
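A simulation sketch of the four scenarios above, estimating Type-S and Type-M errors as power sweeps from very low to very high (the true effect and noise level are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
alpha, true_effect, runs = 0.05, 0.1, 10_000

# Vary sample size to sweep power from very low to very high.
for n in (50, 200, 800, 3_200):
    significant = []
    for _ in range(runs):
        control = rng.normal(0.0, 1.0, n)
        treatment = rng.normal(true_effect, 1.0, n)
        est = treatment.mean() - control.mean()
        _, p = stats.ttest_ind(treatment, control, equal_var=False)
        if p < alpha:
            significant.append(est)
    significant = np.array(significant)
    power = len(significant) / runs
    type_s = np.mean(significant < 0)                    # wrong sign, given rejection
    type_m = np.mean(np.abs(significant)) / true_effect  # exaggeration, given rejection
    print(f"n={n}: power≈{power:.2f}, Type-S≈{type_s:.3f}, Type-M≈{type_m:.2f}")
```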
The importance of
Statistical power, AGAIN
Type-M error always exists in the NHST framework: conditional on rejection, estimates are more likely to overestimate the true effect.
Type-M error can still be high at moderate power, e.g. 50%: although there is a fair chance of rejecting the null, the estimated effects are on average exaggerated by 50%. (Winner's Curse)
Type-M error is less severe for power above 80%
Fortunately, Type-S error is very rare
See Lu, Qiu, and Deng 2018, “A note on type S/M errors in hypothesis testing.”
About Peeking and multiple testing
The methods described assume the PROTOCOL of testing one hypothesis with one analysis using one set of data; violations of protocol such as peeking and multiple testing increase the Type-I error rate
Peeking twice
Peeking 100x
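A simulation sketch of how peeking inflates the Type-I error rate: an A/A test checked at several interim looks, stopping at the first "significant" result (the number and timing of looks are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
alpha, n_total, looks, runs = 0.05, 1_000, 10, 5_000

false_positives = 0
for _ in range(runs):
    # A/A test: the null is true, so any rejection is a false positive.
    a = rng.normal(0.0, 1.0, n_total)
    b = rng.normal(0.0, 1.0, n_total)
    # Peek at evenly spaced interim sample sizes; stop at the first rejection.
    for n in np.linspace(n_total // looks, n_total, looks, dtype=int):
        _, p = stats.ttest_ind(a[:n], b[:n], equal_var=False)
        if p < alpha:
            false_positives += 1
            break

# With a single look the rate would be ~0.05; with 10 peeks it is much higher.
print(f"Type-I error rate with {looks} peeks ≈ {false_positives / runs:.3f}")
```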
Reasons for violating
Protocol
More flexible protocols may be desirable:
early stopping rules to mitigate damage
early shipping to minimize opportunity cost
multiple variants to test several alternatives
multiple metrics to guard business KPIs
All these require protocol adjustments
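As one concrete example of such an adjustment (the slides do not prescribe a specific method), here is a sketch of Bonferroni and Holm corrections for testing multiple variants against a control; the p-values are illustrative:

```python
import numpy as np

# Illustrative p-values from comparing several variants against control.
p_values = np.array([0.012, 0.034, 0.041, 0.22])
alpha = 0.05
m = len(p_values)

# Bonferroni: compare each p-value against alpha / m.
bonferroni = p_values < alpha / m

# Holm step-down: compare the k-th smallest p-value against alpha / (m - k),
# stopping at the first failure (more powerful, same Type-I guarantee).
order = np.argsort(p_values)
holm = np.zeros(m, dtype=bool)
for k, idx in enumerate(order):
    if p_values[idx] < alpha / (m - k):
        holm[idx] = True
    else:
        break

print("Bonferroni rejects:", bonferroni)  # only p < 0.0125 survives
print("Holm rejects:      ", holm)
```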
References
Rubin, Donald B. 1974. “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology 66 (5): 688–701. (link)
Goodman, Steven. 2008. “A Dirty Dozen: Twelve P-value Misconceptions.” Seminars in Hematology 45: 135–140. (link)
Kohavi, Ron, Roger Longbotham, Dan Sommerfield, et al. 2009. “Controlled Experiments on the Web: Survey and Practical Guide.” Data Mining and Knowledge Discovery 18 (1): 140–181. (link)
Deng, Alex, Tianxi Li, and Yu Guo. 2014. “Statistical Inference in Two-Stage Online Controlled Experiments with Treatment Selection and Validation.” WWW '14: 609–618. (link)
Lu, Jiannan, Yixuan Qiu, and Alex Deng. 2018. “A Note on Type S/M Errors in Hypothesis Testing.” (link)
Deng, Alex, Jiannan Lu, and Shouyuan Chen. 2016. “Continuous Monitoring of A/B Tests without Pain: Optional Stopping in Bayesian Testing.” (link)