Chapter 1 Introduction

Causality concerns the causal relationship between two things: if such a relationship exists, one thing is responsible for causing the other, but not the reverse. Causality is fundamental to all natural sciences and accounts for much of what we mean by knowledge. In classical physics, every law is a form of deterministic causality. That is, the state of the world at the next moment can be exactly predicted from its state at this moment, because of the predictive nature of cause and effect. The arrow of time can also be seen as an experience of cause and effect, since the causal relationship is not reversible. A shattered vase reassembling itself from pieces without intervention is surely against the laws of physics.

In a deterministic world, establishing causality sounds straightforward. Any law of physics can be replicated as many times as we want; in this view, causality comes from repeated experience. “Knowledge is based on experience,” as an empiricist like David Hume might say. When there is significant noise, or the measurement of interest is intrinsically stochastic, we have to define causality through probabilistic statements. Consider two random variables \(X\) and \(Y\). Roughly speaking, \(X\) has a causal effect on \(Y\) if changing \(X\) through intervention changes the distribution of \(Y\). Causal inference is about answering whether there is a causal link between two random variables, and how to quantitatively identify the change in the distribution of \(Y\) when we change \(X\) from one state to another.
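
To make the definition concrete, here is a minimal simulation sketch. The structural model \(Y = 2X + \varepsilon\) and all numbers in it are hypothetical, chosen only for illustration: intervening to set \(X\) to different values shifts the distribution of \(Y\), which is exactly what the verbal definition requires.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical structural model: Y is driven by X plus independent noise.
def sample_y(x):
    return 2.0 * x + rng.normal(0.0, 1.0, size=len(x))

# Intervene: set X = 0 for every unit, then set X = 1 for every unit.
y_do_x0 = sample_y(np.zeros(n))
y_do_x1 = sample_y(np.ones(n))

# The distribution of Y shifts under the intervention, so X has a
# causal effect on Y in the sense of the definition above.
print(y_do_x0.mean(), y_do_x1.mean())   # approximately 0.0 vs 2.0
```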

A clear understanding of what it means to change \(X\) through intervention is fundamental to the verbal definition of causal effect above. Unfortunately, (multivariate) statistics based on (joint) probability distributions concerns only association and correlation, not causation. Entirely new frameworks are required to describe causality in a rigorous way. The two most influential frameworks, both of which emerged in the second half of the last century, are the potential outcomes framework, also known as the Neyman-Rubin Causal Model (or simply the Rubin Causal Model), and Judea Pearl’s causal graphical model. Both frameworks focus on valid statistical inference when an intervention takes place.

Regardless of the choice of causal framework, real-life causal studies fall into two different types: experimental and observational. An experimental study requires the investigators of a causal question to design and conduct an experiment, so that controlled intervention is part of the process that generates the data. For the purpose of causal inference, randomization is the key controlled intervention employed by experimenters. In contrast, the data-generating process of an observational study either involves no intervention at all, or an intervention that cannot be controlled, such as a new policy affecting the entire population.
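
This distinction matters for estimation. The following sketch, built on an assumed data-generating process with a confounder \(U\) and a true treatment effect of 1.0, contrasts the two: the naive comparison on self-selected (observational) data is biased, while the same comparison under randomized assignment recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical data-generating process: a confounder U drives both
# treatment uptake and the outcome; the true treatment effect is 1.0.
u = rng.normal(size=n)
y0 = u + rng.normal(size=n)   # potential outcome without treatment
y1 = y0 + 1.0                 # potential outcome with treatment

# Observational study: units with larger U self-select into treatment.
t_obs = (u + rng.normal(size=n)) > 0
y_obs = np.where(t_obs, y1, y0)
naive = y_obs[t_obs].mean() - y_obs[~t_obs].mean()
print("naive observational estimate:", naive)   # biased well above 1.0

# Experimental study: randomization makes treatment independent of U.
t_rct = rng.random(n) < 0.5
y_rct = np.where(t_rct, y1, y0)
rand = y_rct[t_rct].mean() - y_rct[~t_rct].mean()
print("randomized estimate:", rand)             # close to 1.0
```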

This Part offers a brief overview of causal inference in the language of statistics, introducing only the most fundamental and useful concepts. We introduce the Rubin Causal Model and the Causal Graphical Model as foundations for both experimental and observational studies. The core problem we focus on in this Part is the identification strategy for a causal effect: that is, whether we can unbiasedly estimate a causal effect using observed data, and, if so, how.

There are further inferences beyond the unbiased point estimation of a causal effect, such as hypothesis testing as well as efficiency and accuracy, e.g., variance and confidence intervals. Sometimes unbiasedness needs to be sacrificed for accuracy, as a direct implication of the bias-variance trade-off. We only touch on these topics here and leave more detailed discussion to later chapters. Nevertheless, unbiased estimation and identification strategy are the foundation, and they encapsulate concepts such as selection bias, confounders, and inverse propensity score weighting that we deem essential for every data scientist. These concepts also find value in many other areas beyond causal inference, including machine learning and AI.
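
As a small preview of inverse propensity score weighting (treated properly in later chapters), the sketch below uses a hypothetical binary confounder \(Z\) with a known propensity score: reweighting each unit by the inverse probability of the treatment it actually received removes the confounding bias that distorts the naive comparison.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Hypothetical setup: a binary confounder Z raises both the outcome and
# the probability of receiving treatment; the true effect is again 1.0.
z = rng.random(n) < 0.5
p = np.where(z, 0.8, 0.2)        # known propensity score P(T = 1 | Z)
t = rng.random(n) < p
y = 1.0 * t + 2.0 * z + rng.normal(size=n)

# Naive difference in means is confounded by Z.
print("naive:", y[t].mean() - y[~t].mean())

# Inverse propensity weighting: weight each unit by the inverse
# probability of its observed treatment to recover the average
# treatment effect.
ate_ipw = np.mean(t * y / p) - np.mean((~t) * y / (1 - p))
print("IPW:", ate_ipw)           # close to 1.0
```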

Key concepts covered in this part include:

  1. Simpson’s paradox. Correlation/Association does not imply causation.
  2. Potential outcomes and counterfactuals.
  3. Randomization, unconfoundedness and naive estimation.
  4. Matching and conditional unconfoundedness.
  5. Covariate balancing and propensity.
  6. Weighted sample and reweighting.
  7. Missing data perspective of potential outcomes.
  8. Causal Graphical Models/Causal Diagrams.
  9. The \(do\) operator as the formal meaning of change through intervention.
  10. Back-door criterion, front-door criterion, and more general rules.