Chapter 5 Causal Graphical Model

These notes need more work to become a readable chapter. Here is an outline of this section.

  1. The laws of physics are sets of formulas that allow us to model causes and effects in a deterministic world. A structural equation model is the natural analogue that describes the data-generating process of a set of random variables using a set of equations.
  2. These models are often Markovian, i.e. each variable only affects its descendants and there are no cycles or feedback loops. Equivalently, any Markovian model can be factorized into a set of conditional probability models, one corresponding to each structural equation.
  3. Much of the key information in a structural equation model can be visually encoded in a directed acyclic graph/Bayesian network, including many conditional and unconditional independence relationships among the random variables. A useful tool called d-separation allows us to easily read conditional and unconditional independences off the graph.
  4. The effect of changing a variable by intervention is called a do operation. A do operation changes the factorized Markovian model into a truncated factorization. Knowing the truncated factorization allows us to study the effect of the do operation. The analysis of do operations is called do calculus. The goal of do calculus is to turn an expression containing a do operation into a do-free expression involving only observable variables, so that we can come up with an estimator.
  5. The front-door and back-door criteria represent two different identification strategies, and the morals of these two criteria can be extended to build identification strategies for more complex graphs.

5.1 Structural Equation Model, Causal Diagram and d-separation

A directed graph \(G\) represents a set of vertices with arrowed connections. Mathematically it is a set of vertices \(V\) and an edge set \(E\) of ordered pairs of vertices, where the ordering represents the direction of the arrow. Each vertex or node will correspond to a random variable. If the ordered pair \((X, Y)\in E\) then there is an arrow pointing from \(X\) to \(Y\), and we say \(X\) is a parent of \(Y\) and \(Y\) is a child of \(X\). A directed path between two vertices is a path with arrows all pointing in the same direction. \(Z\) is said to be a descendant of \(X\) if there is a directed path leading from \(X\) to \(Z\). A directed path that starts and ends at the same node is called a cycle. A directed graph with no cycles is called acyclic, and such a graph \(G\) is called a directed acyclic graph or DAG.

For this chapter we forbid cycles (feedback effects) and leave the analysis of feedback loops outside our scope. From now on we focus only on DAGs. A DAG \(G\) with vertices \(V=(V_1,\dots,V_k)\) is often associated with a Markovian model, defined as follows.

Definition 5.1 (Markovian Model/Bayesian Belief Network) If \(P\) is a distribution on \(V\) with probability function \(f\), we say \(P\) is Markov to \(G\), or that \(G\) represents \(P\), if \[\begin{equation} f(v) = \prod_i f(v_i|\pi_i) \tag{5.1} \end{equation}\] where \(\pi_i\) denotes the parents of \(V_i\).

Markovian models are also called Bayesian Networks or Belief Networks.

When a distribution \(P\) is Markov to a DAG \(G\), the graph encodes part of the conditional independence structure. This means we can read conditional independences off the graph structure, and these conditional independences must hold for the distribution. (We will show how to do that soon.) Unfortunately graphs have limited expressiveness, so it is not true that all conditional independences in a distribution \(P\) can be represented in a graph, since some might depend on the special structure of the factors \(f(v_i|\pi_i)\). Technically speaking, this makes the graph \(G\) an I-Map of \(P\) (D. Barber 2012), meaning that the conditional independences induced by the graph are a subset of the set of all conditional independences of \(P\). We will shortly see that adding more edges to a graph can only shrink the set of conditional independences it induces. So for any distribution \(P\) we would like to find the minimal graph that is an I-Map of \(P\). Such a graph is called a minimal I-Map: a graph that captures the most complete set of conditional independence relationships of \(P\).

The type of conditional independence that can be expressed by the graph is the most important kind, because it only requires the distribution to factorize as in Definition 5.1, without extra assumptions on what the specific factors \(f(v_i|\pi_i)\) look like. In this sense the independence relationships represented by the graph are more general and robust.

The complete study of DAGs, or even graphs with cycles, is the subject of graphical models. Interested readers with a taste for machine learning can refer to D. Barber (2012) or Murphy (2012) for a more detailed treatment of the subject. Wasserman (2003), chap. 17, also provides a concise introduction. Lauritzen (1996) is more mathematics-heavy.

The factorization property (5.1) is equivalent to the following Directed Local Markov Property (Lauritzen 1996).

Theorem 5.1 (Directed Local Markov Property) A distribution \(P\) is Markov to a DAG \(G\) if and only if for any variable \(V\) \[\begin{equation} V \perp \! \! \! \! \perp nd(V) | \pi_V \tag{5.2} \end{equation}\] where \(\pi_V\) denotes the parents of \(V\) and \(nd(V)\) denotes all other variables that are not descendants of \(V\).

A full proof can be found in Lauritzen (1996), but (5.2) is intuitively easy to understand. It confirms that \(V\) can only influence its descendants and that its dependence on its ancestors runs only through its direct parents \(\pi_V\). So given \(\pi_V\), \(V\) is independent of its non-descendants. Note that (5.2) is trivially equivalent to \[ V \perp \! \! \! \! \perp nd(V)\backslash \pi_V | \pi_V \] since \(\pi_V\) is already given. We will give a proof of the "only if" part using d-separation at the end of this section.

Next we will see how to read conditional independences from a graph. No matter how complicated the DAG is, there are only three core patterns we need to know: the chain, the fork and the collider.

The simplest form is the chain (Figure 5.1), also known as “Mediation.”

Figure 5.1: The chain pattern.

In a chain as in Figure 5.1, all information in A for predicting C has been included in B. We have the following conditional independence: \[ A \perp C | B. \]
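The chain claim can be checked numerically. The sketch below assumes an illustrative linear-Gaussian chain \(A \rightarrow B \rightarrow C\) (our own model, not one from the text): the marginal correlation of \(A\) and \(C\) is far from zero, while their partial correlation given \(B\) is near zero, consistent with \(A \perp C | B\).

```python
import numpy as np

# A linear-Gaussian chain A -> B -> C (illustrative model and parameters).
rng = np.random.default_rng(0)
n = 200_000
A = rng.normal(size=n)
B = A + rng.normal(size=n)
C = B + rng.normal(size=n)

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

r_ac, r_ab, r_bc = corr(A, C), corr(A, B), corr(B, C)

# Partial correlation of A and C given B: near 0, consistent with A ⟂ C | B,
# even though the marginal correlation r_ac is clearly positive.
partial = (r_ac - r_ab * r_bc) / np.sqrt((1 - r_ab**2) * (1 - r_bc**2))
print(round(r_ac, 2), round(partial, 2))
```

For a linear-Gaussian model, zero partial correlation coincides with conditional independence, which is why this simple check suffices here.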

The second form is the fork (Figure 5.2), also known as "Mutual Dependence."

Figure 5.2: The fork pattern.

We have already seen Figure 5.2 earlier in an explanation of Simpson's paradox, where \(A\) is a confounder. In a fork, conditioning on the common cause/confounder \(A\) makes \(B\) and \(C\) conditionally independent. But they are not marginally independent. In other words, \(B \perp C | A\) holds but \(B \perp C\) is generally not true. This is an example of

Conditional Independence \(\nRightarrow\) Marginal Independence!

The last pattern is the collider (Figure 5.3), also known as "Mutual Causation." In Figure 5.3, \(A\) and \(B\) are marginally independent and \(C\)'s distribution is affected by the values of \(A\) and \(B\) together.

Figure 5.3: The collider pattern.

This pattern is the most interesting, as it provides a result that is a bit counterintuitive: \(A\) and \(B\) are not independent conditioned on \(C\), although without observing \(C\) they are independent. This is an example of

Marginal Independence \(\nRightarrow\) Conditional Independence!

The reason why \(A\) and \(B\) are conditionally dependent is often referred to as the "explaining away" effect. We use an interesting example from Jordan (2004) to make the explanation more palatable.

Your friend appears to be late for a meeting with you. There are two explanations: she was abducted by aliens or you forgot to set your watch ahead one hour for daylight savings time. Figure 5.4 shows this in a collider pattern.

Figure 5.4: The alien abduction example (Jordan 2004).

Consider a world where alien abduction is possible although unlikely: the probability of your friend showing up late due to alien abduction is extremely small. However, if you also realize that you have forgotten to set your watch to adjust for daylight saving time, you will be almost certain that this is the reason your friend appears "late," and the probability of your friend having been abducted by aliens is further reduced. The fact that you forgot to set your watch explains away the other potential cause, an alien abduction. On the contrary, if you check your watch and realize that it is already correctly set, and, for illustration, we assume there is no other possible cause of your friend being late except alien abduction, then you are certain that your friend has been abducted by aliens, no matter how small the chance originally was:

Once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth.

— Sherlock Holmes

In this example, conditioned on the fact that your friend showed up late (\(C\)), information of whether you have set your watch (\(A\)) changes your belief of whether your friend has been taken by aliens (\(B\)). This means the conditional distribution of \(B\) given \(A\) and \(C\) is different from the distribution of \(B\) given \(C\) alone — \(A\) and \(B\) is not independent given \(C\)!
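The explaining-away effect can be made quantitative. The sketch below assigns hypothetical probabilities to the alien example (the numbers, and the assumption that lateness \(C\) occurs exactly when \(A\) or \(B\) does, are ours for illustration) and computes the posterior probability of abduction under different observations.

```python
from itertools import product

# Hypothetical numbers (ours, for illustration): A = "watch not set",
# B = "abducted by aliens", C = "friend appears late", with C = A OR B exactly,
# i.e. no other causes of lateness.
pA, pB = 0.1, 1e-6
joint = {(a, b, int(a or b)): (pA if a else 1 - pA) * (pB if b else 1 - pB)
         for a, b in product((0, 1), repeat=2)}

def p(pred):
    return sum(v for k, v in joint.items() if pred(*k))

prior_b = p(lambda a, b, c: b == 1)                        # ~1e-6
b_given_late = (p(lambda a, b, c: b == 1 and c == 1)
                / p(lambda a, b, c: c == 1))               # ~1e-5: lateness raises belief
b_given_late_watch = (p(lambda a, b, c: b == 1 and c == 1 and a == 1)
                      / p(lambda a, b, c: c == 1 and a == 1))   # back to ~1e-6: explained away
b_given_late_nowatch = (p(lambda a, b, c: b == 1 and c == 1 and a == 0)
                        / p(lambda a, b, c: c == 1 and a == 0))  # exactly 1: Holmes's maxim
print(prior_b, b_given_late, b_given_late_watch, b_given_late_nowatch)
```

Conditioning on \(C\) makes \(A\) and \(B\) dependent: learning \(A=1\) drops the posterior of \(B\) back to its prior, while ruling out \(A\) pushes it all the way to one.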

These three patterns are all we need to identify conditional and unconditional independence in a causal diagram, as shown in the next theorem.

Theorem 5.2 (d-separation) A set \(S\) of vertices is said to block a path \(p\) if either of the following two conditions is satisfied:

  1. \(p\) contains at least one arrow-emitting node (middle node in a chain or a fork pattern) that is in \(S\).
  2. \(p\) contains at least one collision node (middle node in a collider pattern) that is outside \(S\) and has no descendant in \(S\).

If \(S\) blocks all paths from \(X\) to \(Y\), it is said to "d-separate \(X\) and \(Y\)." Two sets \(A\) and \(B\) are said to be d-separated by \(S\) if every pair \(X\) and \(Y\) from the two sets is d-separated by \(S\). If so, then we have the following directed global Markov property: \[\begin{equation} A \perp \! \! \! \! \perp B | S \tag{5.3} \end{equation}\]

\(S\) can be the empty set, in which case the independence is unconditional.

The first condition of d-separation basically says information can flow through a chain or a fork, and the only way to block it is to condition on the arrow-emitting node. The second condition is somewhat the opposite: for a collider pattern, information is naturally blocked unless the collision node or any of its descendants is observed, creating the "explaining away" effect introduced previously. To make two sets of variables \(X\) and \(Y\) independent, we need to condition on a blocking set \(S\) that blocks all paths from \(X\) to \(Y\). d-separation is hence very natural and intuitive once we master the three basic patterns and understand the special case of the collider pattern.

It can be shown (Lauritzen 1996) that the global Markov property (5.3), the local Markov property (5.2) and the factorization property (5.1) are all equivalent. Here we show Theorem 5.2 implies (5.2) to illustrate the usage of d-separation.

Let \(\widetilde{V}\) be \(nd(V) \backslash \pi_V\). Any path connecting \(\widetilde{V}\) and \(V\) must be either a back-door path (starting with an arrow pointing into \(V\)) or a front-door path (starting with an arrow emitting from \(V\)). All back-door paths are blocked by \(\pi_V\) already. For front-door paths, since \(\widetilde{V}\) contains no descendants of \(V\), there is no directed path from \(V\) to \(\widetilde{V}\). Because front-door paths start with an arrow emitting from \(V\), any non-directed path connecting \(V\) to \(\widetilde{V}\) must contain a collider vertex which is a descendant of \(V\). This vertex and all its descendants are also descendants of \(V\), hence none of them can be in \(\pi_V\), so the second blocking condition applies. So all front-door paths are also blocked. We have shown all paths connecting \(V\) and \(\widetilde{V}\) are blocked by \(\pi_V\). By Theorem 5.2, this proves \(V \perp \! \! \! \! \perp nd(V)\backslash \pi_V | \pi_V\), which is equivalent to (5.2).
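The d-separation test of Theorem 5.2 can be mechanized. One standard route (Lauritzen 1996) is the moralization criterion: \(X\) and \(Y\) are d-separated by \(S\) if and only if they are disconnected, after removing \(S\), in the moral graph of the ancestral subgraph of \(X \cup Y \cup S\). A minimal sketch with our own helper names:

```python
# dag: {node: set of children}.  Helper names below are ours.

def ancestors(dag, nodes):
    """`nodes` together with every vertex that has a directed path into them."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        v = stack.pop()
        for u, children in dag.items():
            if v in children and u not in seen:
                seen.add(u)
                stack.append(u)
    return seen

def d_separated(dag, x, y, s):
    """True iff x and y are d-separated given the set s in the DAG."""
    keep = ancestors(dag, {x, y} | set(s))
    # Moralize the ancestral subgraph: drop directions and marry co-parents.
    adj = {v: set() for v in keep}
    for u in keep:
        for v in dag.get(u, set()) & keep:
            adj[u].add(v)
            adj[v].add(u)
    for w in keep:
        parents = [u for u in keep if w in dag.get(u, set())]
        for i, u in enumerate(parents):
            for v in parents[i + 1:]:
                adj[u].add(v)
                adj[v].add(u)
    # Remove s, then test reachability of y from x by depth-first search.
    frontier, seen = [x], {x}
    while frontier:
        v = frontier.pop()
        if v == y:
            return False
        for u in adj[v] - set(s):
            if u not in seen:
                seen.add(u)
                frontier.append(u)
    return True

# The three basic patterns:
chain = {"A": {"B"}, "B": {"C"}, "C": set()}
fork = {"A": {"B", "C"}, "B": set(), "C": set()}
collider = {"A": {"C"}, "B": {"C"}, "C": set()}

print(d_separated(chain, "A", "C", {"B"}))     # True: B blocks the chain
print(d_separated(fork, "B", "C", {"A"}))      # True: A blocks the fork
print(d_separated(collider, "A", "B", set()))  # True: the collider blocks by default
print(d_separated(collider, "A", "B", {"C"}))  # False: conditioning on C opens the path
```

Note how the "marry co-parents" step encodes the collider exception: conditioning on a collision node connects its parents in the moral graph.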

5.2 The do Operator

The central difference between the causal graphical model (CGM) and the potential outcome framework is how the concept of change under intervention is modeled. In the potential outcome framework, we augment the probability distribution by hypothesizing a counterfactual pair that represents all potential outcomes when different interventions were to be applied. This way we conceptually circumvent the issue of understanding how forcing a change through intervention affects the joint distribution \(P\) itself. In CGM, because the factorization of a distribution \(P\) encodes how the random variables locally affect each other, we can directly analyze the probability distribution generated by the intervention. We call this generated distribution the post-intervention distribution. The intervention operation is called the do operator.

If we know how to handle the do operator and can calculate \(P(Y|do(X=x))\) for all \(x\), then we can answer any causal question about the effect of \(X\) on \(Y\). For example, the ATE of changing \(X\) from \(x_0\) to \(x_1\) is just \[ E(Y|do(X=x_1)) - E(Y|do(X=x_0)). \] Calculations involving the do operator are called do calculus. The goal of do calculus is to transform an expression with do operators into a do-free expression which only involves observable variables. This is analogous to the potential outcome framework, where we need to transform expressions involving counterfactuals into counterfactual-free expressions.

To handle the do operator, we need to understand the post-intervention distribution. The following theorem characterizes the post-intervention distribution \(P(V|do(X=x))\), i.e. the new probability distribution governing \(V\) after we manually fix \(X\) to be \(x\).

Theorem 5.3 (Truncated factorization) For any distribution \(P\) (with probability function \(f\)) Markov to a DAG \(G\), the post-intervention distribution \(P^*\) generated by an intervention \(do(X=x)\) has its probability function \(f^*\) given by the truncated factorization \[\begin{equation} f^*(v_1,\dots,v_k) =\prod_{i|V_i\notin X} f(v_i|\pi_i)|_{X=x}, \tag{5.4} \end{equation}\] where \(f(v_i|\pi_i)|_{X=x}\) is the original conditional probability of \(v_i\) given its parents \(\pi_i\), with any parent in \(X\) fixed according to the intervention \(X=x\).

Graphically, the effect of \(do(X=x)\) is to remove all arrows from \(X\)'s parents to \(X\) in \(G\) and then condition on \(X=x\). (5.4) is equivalent to the following \[\begin{equation} P(v_1,\dots,v_k|do(X=x)) = \frac{P(v_1,\dots,v_k)}{P(X=x|\pi_X)}, \tag{5.5} \end{equation}\] where the denominator \(P(X=x|\pi_X)\) corresponds to the removed arrows.
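A tiny discrete example makes the truncated factorization concrete. The sketch below uses an illustrative confounded model \(U \rightarrow X\), \(U \rightarrow Y\), \(X \rightarrow Y\) (our own numbers) and computes \(P(Y=1|do(X=1))\) from (5.4), contrasting it with the observational \(P(Y=1|X=1)\).

```python
from itertools import product

# Confounded model U -> X, U -> Y, X -> Y with illustrative numbers.
pU1 = 0.5
pX1_U = {0: 0.2, 1: 0.8}                  # P(X=1 | U=u)
pY1_XU = {(0, 0): 0.1, (0, 1): 0.5,       # P(Y=1 | X=x, U=u)
          (1, 0): 0.4, (1, 1): 0.8}

def f(u, x, y):
    """Observational factorization (5.1): f(u) f(x|u) f(y|x,u)."""
    t = pU1 if u else 1 - pU1
    t *= pX1_U[u] if x else 1 - pX1_U[u]
    t *= pY1_XU[(x, u)] if y else 1 - pY1_XU[(x, u)]
    return t

def f_do(u, y, x_fixed):
    """Truncated factorization (5.4): drop the factor f(x|u), fix X = x_fixed."""
    t = pU1 if u else 1 - pU1
    t *= pY1_XU[(x_fixed, u)] if y else 1 - pY1_XU[(x_fixed, u)]
    return t

p_do = sum(f_do(u, 1, 1) for u in (0, 1))                          # P(Y=1 | do(X=1)) = 0.6
p_obs = (sum(f(u, 1, 1) for u in (0, 1))
         / sum(f(u, 1, y) for u, y in product((0, 1), repeat=2)))  # P(Y=1 | X=1) = 0.72
print(p_do, p_obs)
```

The gap between the two numbers is exactly the confounding bias contributed by \(U\).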

With Theorem 5.3, it may seem the problem of causal inference in a DAG is solved, because we have derived the post-intervention distribution using only the factors \(f(v_i|\pi_i)\) of the original distribution. If we know these factors, we can perform any do calculus with ease.

Conceptually the answer is yes. But practically Theorem 5.3 alone can rarely be used as a formula for \(P(Y|do(X=x))\). To use the truncated factorization, we need to observe all random variables in the DAG, or at least the subgraph that contains all ancestors and descendants of \(X\) and \(Y\)13. But in reality we are always limited by the amount of measurements we can gather, and there are almost always nodes in the DAG that are not observed. Many vertices and edges are added to the graph only to account for unknown dependency links and to register the existence of certain unknown confounders. It is often true that for the purpose of studying the causal relationship between two nodes \(X\) and \(Y\), many details in the graph are irrelevant and can be simplified or ignored. For this reason, a simple identification strategy requiring only a small set of observations is desired.

In the potential outcome framework, we have seen that the key is to condition on a set of covariates that meets the conditional unconfoundedness condition (Definition 4.3). In CGM, with a DAG and an understanding of the post-intervention distribution, a similar idea is to find a set \(S\) such that \(P(Y|do(X=x),S=s) = P(Y|X=x, S=s)\). This is called the back-door criterion. A different strategy called the front-door criterion identifies a set \(S\) such that \(P(Y|do(X=x),S=s) = P(Y|do(S=s))\), where \(P(Y|do(S=s))\) can be easily identified.

The back-door and front-door criteria are special cases of more general methods based on the rules of do calculus. We introduce the back-door and front-door criteria in the next two sections. These two criteria have conceptual importance and can be directly applied for causal effect identification in many simple graphs. They also help build up intuition to facilitate our discussion of more general strategies.

5.3 The Back-door Criterion

We first introduce the back-door criterion, which is closely related to conditional unconfoundedness in the potential outcome framework.

Equation (5.5) shows that the only change a do operation makes on \(P\), beyond conditioning on \(X\) being fixed at a value, is to remove all arrows from \(X\)'s parents \(\pi_X\) to \(X\). Denote the post-intervention distribution by \(P(\cdot|do(X=x))\). There are two simple facts we can directly derive from (5.5):

  1. \(P(\pi_X=t|do(X=x)) = P(\pi_X=t)\);
  2. \(P(\cdot|do(X=x),\pi_X = t) = P(\cdot|X=x, \pi_X=t)\), where \(P(\cdot|do(X=x),\pi_X = t)\) is the conditional distribution of the post-intervention distribution given \(\pi_X=t\): \[P(\cdot|do(X=x),\pi_X = t) = \frac{P(\cdot,\pi_X=t|do(X=x))}{P(\pi_X=t|do(X=x))}.\]

The first result says the do operation has no effect on the distribution of \(X\)'s parents. The second says that if we further condition on \(\pi_X\), the conditional distribution \(P(\cdot|do(X=x), \pi_X=t)\) is the same as \(P(\cdot|X=x, \pi_X=t)\), because the presence or absence of arrows between \(\pi_X\) and \(X\) makes no difference once the values of \(\pi_X\) and \(X\) are given.

If we are interested in \(P(Y|do(X=x))\), then \[\begin{align*} P(Y|do(X=x)) & = \sum_{t} P(Y|do(X=x),\pi_X = t)\times P(\pi_X=t|do(X=x)). \end{align*}\] Together with \(P(Y|do(X=x),\pi_X=t) = P(Y|X=x,\pi_X=t)\) and \(P(\pi_X=t|do(X=x)) = P(\pi_X=t)\), we have proven the following:

Theorem 5.4 (Adjustment by Parents) \[\begin{equation} P(Y|do(X=x)) = \sum_t P(Y|X=x,\pi_X = t) \times P(\pi_X=t) \end{equation}\]

Theorem 5.4 successfully turns a do expression into a do-free expression by conditioning/adjustment on the set \(\pi_X\). Paths connecting \(X\) to \(Y\) that start with an arrow pointing into \(X\) from one of its parents are called back-door paths. Back-door paths are all blocked by \(\pi_X\), because the pattern at a parent along such a path can only be a chain or a fork and never a collider. It turns out that, together with d-separation, this result can be easily generalized from using all parents of \(X\) as the adjustment set to using any set that blocks all back-door paths without involving any descendants of \(X\). This is the back-door criterion.

Theorem 5.5 (Back-door Criterion) A set \(S\) in a DAG \(G\) is admissible or sufficient for adjustment if the following two conditions hold:

  1. No vertex in \(S\) is a descendant of \(X\)
  2. \(S\) blocks all back-door paths (all paths from \(X\) to \(Y\) that start with an arrow pointing into \(X\)).

For any admissible set \(S\), \[\begin{equation} P(Y|do(X=x)) = \sum_s P(Y|X=x, S=s)P(S=s) \tag{5.6} \end{equation}\] provides an identification strategy for \(P(Y|do(X=x))\).

In particular, when \(S\) is the empty set, \(P(Y|do(X=x)) = P(Y|X=x)\).

Proof. By Theorem 5.4 we can always adjust by the parents of \(X\). We can always further condition on \(S\). This leads to \[\begin{align} & P(Y|do(X=x)) \notag \\ = &\sum_t P(Y|X=x, \pi_X=t)P(\pi_X=t) \notag \\ = &\sum_t\left[\left(\sum_s P(Y|X=x, \pi_X=t, S=s) P(S=s|X=x, \pi_X=t)\right) P(\pi_X=t) \right]. \tag{5.7} \end{align}\]

We make the following two claims (proofs will follow):

  1. \(X\) and \(S\) are d-separated by \(\pi_X\),
  2. \(\pi_X\) and \(Y\) are d-separated by \(S\) and \(X\).

So by Theorem 5.2,

\[ P(S=s|X=x,\pi_X=t) = P(S=s|\pi_X=t), \] \[ P(Y|X=x,\pi_X=t, S=s) = P(Y|X=x, S=s) \]

Plugging the two above into (5.7) and switching the order of two summations leads to \[\begin{align*} & P(Y|do(X=x)) = \sum_t\left[(\sum_s P(Y|X=x, S=s) P(S=s|\pi_X=t)) P(\pi_X=t)\right] \\ = & \sum_s P(Y|X=x, S=s) \sum_t \left[P(S=s|\pi_X=t)P(\pi_X=t)\right] \\ = & \sum_s P(Y|X=x, S=s) P(S=s). \end{align*}\]

The proof is complete once we prove the two claims above. The first is a direct result of the first back-door condition (that \(S\) contains no descendants of \(X\)) and the Markov property (5.2), which says \(X\) is conditionally independent of all its non-descendants given its parents.

For the second claim, suppose \(Y\) and a vertex \(R\) in \(\pi_X\) were d-connected by a path \(p\) given \(S\) and \(X\). Then the path \[ X \longleftarrow R \leftarrow \overset{p}{---} \rightarrow Y \] must also d-connect \(X\) with \(Y\), and it is a back-door path. This contradicts the second condition that \(S\) blocks all back-door paths, and completes the proof.

We use Figure 5.5 to illustrate how to use the back-door criterion. The double-arrow dashed lines mean we do not assume any details of the path between two vertices other than that they are connected. There are two possible admissible sets \(S\) in this case. The first choice is to condition on \(C\), noticing that all back-door paths must go through \(C\) to reach \(Y\), and that \(C\) has an emitting arrow to \(Y\), so the pattern at \(C\) along any such path can only be a chain or a fork. Another choice is to condition on both \(A\) and \(B\). \(B\) blocks all back-door paths through \(B\) because it is not a collider. \(A\) blocks all back-door paths through \(A\) because \(A\) cannot be a collider on paths like \(X \longleftarrow A \leftarrow --- \rightarrow C \longrightarrow Y\). There are no other possible back-door paths, so they are all blocked.

Figure 5.5: A causal diagram illustrating the back-door criterion. The two double-arrow dashed lines mean we don't assume any details of the path between two vertices other than that they are connected. Here all back-door paths from \(X\) to \(Y\) can be blocked by conditioning on either \(C\) or both \(A\) and \(B\).

When using the back-door criterion, beyond the obvious need for observing \(X\) and \(Y\), we only need to further observe \(C\), or \(A\) and \(B\), to identify the causal effect of \(X\) on \(Y\).
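The back-door adjustment formula (5.6) can also be checked by simulation. The sketch below assumes an illustrative model \(C \rightarrow X\), \(C \rightarrow Y\), \(X \rightarrow Y\) with our own parameter choices; the adjustment recovers the interventional mean that the naive conditional mean misses.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# Illustrative model: C -> X, C -> Y, X -> Y; C is the observed confounder.
C = rng.binomial(1, 0.5, n)
X = rng.binomial(1, np.where(C == 1, 0.8, 0.2))
Y = rng.binomial(1, 0.1 + 0.3 * X + 0.4 * C)

# Back-door adjustment (5.6): sum_c P(Y=1 | X=1, C=c) P(C=c)
adj = sum(Y[(X == 1) & (C == c)].mean() * (C == c).mean() for c in (0, 1))

# Ground truth by simulating the intervention do(X=1): the C -> X edge is cut.
truth = rng.binomial(1, 0.1 + 0.3 * 1 + 0.4 * C).mean()

# Naive conditional mean, biased upward because high-C units choose X=1 more often.
naive = Y[X == 1].mean()
print(round(adj, 2), round(truth, 2), round(naive, 2))
```

With these parameters the adjusted estimate and the interventional truth both settle near 0.6, while the naive conditional mean settles near 0.72.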

Next we prove a simple result which says a do operation on a vertex only has effects on its descendants. In other words, the arrows in the DAG dictate the direction of possible causal effects.

Theorem 5.6 (The Arrow of Causal Effect) If there is no directed path from \(X\) to \(Y\), then \[ P(Y|do(X=x)) = P(Y). \] That is, \(X\) has no causal effect on \(Y\). So for \(X\) to have a causal effect on \(Y\), \(X\) must be an ancestor of \(Y\).
Proof. By the directed local Markov Property (5.2) and the fact that \(Y\in nd(X)\), \[ X \perp \!\!\!\! \perp Y | \pi_X. \] So \(P(Y|X=x, \pi_X=t) = P(Y|\pi_X=t)\) and \[\begin{align*} & P(Y|do(X=x)) = \sum_t P(Y|X=x, \pi_X=t)P(\pi_X=t) \\ = & \sum_t P(Y|\pi_X=t)P(\pi_X=t) = P(Y). \end{align*}\]

It should be clear that the back-door adjustment, conditioning on a set of variables \(S\) such that \[ Y \perp \!\!\!\! \perp X | S, \] is very similar to the conditional unconfoundedness assumption in the potential outcome framework, which posits \[ (Y(1),Y(0)) \perp \!\!\!\! \perp X | S. \] The requirement that \(S\) contain no descendant of \(X\), together with Theorem 5.6, guarantees \(X\) cannot have a causal effect on any variable in \(S\), so the variables in \(S\) are indeed covariates.

5.4 Causal Mechanism and the Front-door Criterion

The back-door criterion does not directly work for the following basic confounder pattern when the confounder \(U\) is not observed.

Figure 5.6: Back-door path not blocked when \(U\) is not observed. Dashed vertex and edges indicate unobservable confounding effect.

In fact, if diagram 5.6 is all we know about the relationship of \(X\) and \(Y\), and we know there exists some confounder \(U\) that is not observable, there is fundamentally no way for us to differentiate the true causal effect from the confounding effect. This pattern, also called the bow pattern (Pearl 1995), was the basis of Fisher's defense that smoking does not necessarily cause lung cancer, even though a 1950 study showed a positive association between smoking tobacco and lung cancer. He hypothesized a confounder (a genetic trait) that is linked to lung cancer but also makes one more likely to enjoy smoking. The causal link between tobacco use and lung cancer was finally recognized by the medical community only after people discovered the mechanism: tobacco residue, also known as tar, once in the lung, coats the cilia, causing them to stop working and eventually die, which allows toxic particles in tobacco smoke to enter the alveoli directly. These toxic particles include carcinogens such as benzene, acrylamide and acrylonitrile that are commonly found in tar. In other words, people discovered the mechanism of how tobacco smoking causes lung cancer, which is through the formation of toxic tar in the lung. It is even believed that 90 percent of lung cancer cases can be attributed to smoking.14

The above deduction using a mechanism can be abstracted into the following diagram, where \(X\) takes effect on \(Y\) exclusively through \(M\).

Figure 5.7: \(X\) takes effect on \(Y\) exclusively through \(M\), and all back-door paths from \(M\) to \(Y\) are blocked by \(X\).

The identification strategy can be described in two steps. Recall \(U\) is not observed in Figure 5.7. But there is no back-door path from \(X\) to \(M\), and the only back-door path from \(M\) to \(Y\) can be blocked by \(X\). In other words, both \(P(M|do(X=x))\) and \(P(Y|do(M=m))\) can be identified using the back-door criterion. The second step relies on the argument that \(X\) can impact \(Y\) only through \(M\), therefore intuitively \[\begin{equation} P(Y|do(X=x)) = \sum_m P(M=m|do(X=x))P(Y|do(M=m)). \tag{5.8} \end{equation}\] With (5.8), a do-free expression of \(P(Y|do(X=x))\) can be obtained by replacing \(P(M=m|do(X=x))\) and \(P(Y|do(M=m))\) with their corresponding do-free expressions. We call (5.8) do propagation.

Theorem 5.7 (Front-door Criterion) A set of nodes \(M\) is said to satisfy the front-door criterion for an ordered pair \((X,Y)\) if:

  1. \(M\) intercepts all directed paths from \(X\) to \(Y\);
  2. all back-door paths from \(M\) to \(Y\) are blocked by \(X\);
  3. there are no back-door paths from \(X\) to \(M\).

When such a set \(M\) exists, \[ P(Y|do(X=x)) =\sum_m P(M=m|X=x)\left(\sum_{x'} P(Y|X=x',M=m) P(X=x')\right) \] is a valid identification strategy.

Conditions 2 and 3 ensure the two sub-mechanisms \(M \rightarrow Y\) and \(X\rightarrow M\) are identifiable by the back-door criterion. It remains to prove (5.8). The following lemma shows conditions 1 and 2 together justify the do propagation (5.8).

Lemma 5.1 (do propagation) Let \(X\), \(M\) and \(Y\) be sets of vertices in a DAG \(G\). A sufficient condition for \[ P(Y|do(X=x)) = \sum_m P(M=m|do(X=x))P(Y|do(M=m)) \] is that

  1. \(M\) intercepts all directed paths from \(X\) to \(Y\);
  2. all back-door paths from \(M\) to \(Y\) are blocked by \(X\).

Proof. Because the post-intervention distribution is still a probability distribution, it satisfies the law of total probability: \[ P(Y|do(X=x)) = \sum_m P(M=m|do(X=x))P(Y|do(X=x),M=m). \] Lemma 5.1 is proven once we show the following: \[ P(Y|do(X=x),M=m) = P(Y|do(X=x),do(M=m)) = P(Y|do(M=m)). \]

The first equality follows from applying the back-door criterion with an empty adjustment set to identify the causal effect of \(M\) on \(Y\) under the post-intervention distribution \(P(\cdot|do(X=x))\). In the \(do(X)\) distribution, all back-door arrows into \(X\) are already removed, and because of condition 2, no back-door path from \(M\) to \(Y\) remains. So without any adjustment, we can replace the do operator \(do(M=m)\) by simple observational conditioning.

The second equality is a result of Theorem 5.6 and condition 1. In the post-\(do(M=m)\) distribution, \(Y\) is no longer a descendant of \(X\), because \(M\) intercepts all directed paths from \(X\) to \(Y\), and these paths are broken once all back-door edges of \(M\) (the arrows into \(M\)) are removed.

In the do calculus above we used the back-door criterion (Theorem 5.5) to show the equivalence of the operation \(do(M=m)\) and conditioning on \(M=m\). We further used Theorem 5.6 to remove the do operation \(do(X=x)\). These are two of Pearl's three rules of do calculus (Pearl 1995); the other one is related to d-separation. See Section 5.5.

The front-door criterion, and more generally the idea of identifying causal effects by combining do propagation with a chain of identifiable do expressions, is very inspiring. It allows us to search for identification strategies in very complicated graphs using divide and conquer. If we can find a chain \(M_0=X,M_1,\dots,M_K=Y\) with every \(P(M_k|do(M_{k-1}=m))\) identifiable, and do propagation can be recursively applied to transform \(P(Y|do(X=x))\) into an expression using only the sub-mechanisms \(P(M_k|do(M_{k-1}=m))\), then \(P(Y|do(X=x))\) can be identified.

Some authors call condition 1 in Theorem 5.7 the exhaustion condition and the other two conditions the isolation conditions (Morgan and Winship 2014). Although the two isolation conditions imply identification of the two sub-mechanisms \(P(Y|do(M=m))\) and \(P(M|do(X=x))\), it is not generally true that the exhaustion condition alone guarantees the do propagation equation (5.8). This is why we need one additional condition beyond the exhaustion condition in Lemma 5.1. Figure 5.8 presents an example satisfying only the exhaustion condition: there is a back-door path from \(M\) to \(Y\) that is not blocked by \(X\). As a result, it is no longer true that \(P(Y|do(X=x),do(M=m)) = P(Y|do(X=x), M=m)\). On the other hand, \(P(Y|do(X=x),do(M=m)) = P(Y|do(M=m))\) still holds. Hence \(P(Y|do(M=m))\neq P(Y|do(X=x), M=m)\).

Figure 5.8: \(M\) satisfies the exhaustion condition but the do propagation (5.8) is not true.

It is worth noting that \(M\) in the front-door criterion can contain more than one node. In Figure 5.9, the effects of \(X\) on \(M\) and on \(N\) can both be identified, and all back-door paths from \(M\) or \(N\) to \(Y\) go through \(X\). If \(M\) is observed but \(N\) is not, then the front-door mechanism breaks and we can only learn the partial effect of \(X\) on \(Y\) through \(M\). The partial effect can sometimes be interesting in its own right.

Figure 5.9: A causal diagram illustrating the front-door criterion. Here the causal mechanism is \(X\) affecting \(M\) and \(N\) and then to \(Y\). All back-door path from \(M\) to \(Y\) and \(N\) to \(Y\) are blocked by \(X\).
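The front-door formula can be verified by enumeration on a small discrete model (our own illustrative numbers): \(U \rightarrow X\), \(U \rightarrow Y\), \(X \rightarrow M\), \(M \rightarrow Y\), with \(U\) hidden from the estimator. The formula of Theorem 5.7, built only from the observables \(X\), \(M\), \(Y\), matches the ground truth computed from the truncated factorization.

```python
from itertools import product

# Front-door model U -> X, U -> Y, X -> M, M -> Y (illustrative numbers).
# U is unobserved: the front-door estimator below never uses it.
pU1 = 0.5
pX1_U = {0: 0.2, 1: 0.8}                  # P(X=1 | U=u)
pM1_X = {0: 0.1, 1: 0.9}                  # P(M=1 | X=x)
pY1_MU = {(0, 0): 0.2, (0, 1): 0.6,       # P(Y=1 | M=m, U=u)
          (1, 0): 0.4, (1, 1): 0.9}

def f(u, x, m, y):
    t = pU1 if u else 1 - pU1
    t *= pX1_U[u] if x else 1 - pX1_U[u]
    t *= pM1_X[x] if m else 1 - pM1_X[x]
    t *= pY1_MU[(m, u)] if y else 1 - pY1_MU[(m, u)]
    return t

def p(**cond):
    """Observational probability of an event such as p(x=1, m=0) by enumeration."""
    return sum(f(u, x, m, y)
               for u, x, m, y in product((0, 1), repeat=4)
               if all(dict(u=u, x=x, m=m, y=y)[k] == v for k, v in cond.items()))

# Ground truth P(Y=1 | do(X=1)) from the truncated factorization (this uses U,
# which we can do only because we wrote the model down ourselves).
truth = sum((pU1 if u else 1 - pU1)
            * (pM1_X[1] if m else 1 - pM1_X[1])
            * pY1_MU[(m, u)]
            for u, m in product((0, 1), repeat=2))

# Front-door formula of Theorem 5.7, built only from the observables X, M, Y.
front_door = sum(
    p(x=1, m=m) / p(x=1)
    * sum(p(x=xp, m=m, y=1) / p(x=xp, m=m) * p(x=xp) for xp in (0, 1))
    for m in (0, 1))
print(truth, front_door)
```

With these numbers both quantities come out to 0.625, even though the naive conditional \(P(Y=1|X=1)\) would be biased by the hidden \(U\).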

5.5 General Identification Strategy

A do expression has the form \(P(\cdot|\text{actions and observations})\), where actions are \(do\) operations and observations are ordinary conditioning. Pearl (1995) introduced do calculus with three fundamental rules. These rules give sufficient conditions for three basic transformations of a do expression to be valid. Any identification strategy for a causal effect \(P(Y|do(X=x))\) must follow from a series of these three operations, together with basic probability equalities, that transforms \(P(Y|do(X=x))\) into a do-free expression.

The three basic operations are:

  1. Insertion/deletion of observations
  2. Action-observation exchange
  3. Insertion/deletion of actions

We have already seen theorems justifying these operations in previous sections. For example, d-separation (Theorem 5.2) allows us to delete an observation from a conditional probability expression via conditional independence. The back-door criterion (Theorem 5.5) allows us to replace a do operation with an observation. The Arrow of Causal Effect (Theorem 5.6) dictates that an action can be removed when its target is not a descendant of the manipulated variable. These three theorems are special cases of Pearl's three rules and, interestingly, contain the essence of the three rules. We now introduce Pearl's three rules and explain why they are simple extensions of the three theorems.

Let \(X\), \(Y\), \(Z\) be disjoint sets of vertices in a causal diagram \(G\). Denote by \(G_{\overline{X}}\) the graph obtained by removing all back-door arrows of \(X\), i.e. the arrows pointing from parents of \(X\) into \(X\), and denote by \(G_{\underline{X}}\) the graph obtained by removing all front-door arrows of \(X\), i.e. the arrows emitted from \(X\). \(G_{\overline{X}\underline{Z}}\) is the graph obtained by removing all back-door arrows of \(X\) and all front-door arrows of \(Z\). For any DAG \(G\), \((X\perp \!\!\!\!\perp Y |Z)_G\) means that \(X\) and \(Y\) are d-separated by \(Z\) in \(G\).
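Under the edge-set representation of a DAG from Section 5.1, these mutilated graphs are just set operations on the edge set. A minimal sketch (the function names and the example graph are ours):

```python
def remove_incoming(edges, xs):
    """G with overline on xs: drop every arrow pointing into a node of xs
    (the back-door arrows of xs)."""
    return {(a, b) for (a, b) in edges if b not in xs}

def remove_outgoing(edges, xs):
    """G with underline on xs: drop every arrow emitted from a node of xs
    (the front-door arrows of xs)."""
    return {(a, b) for (a, b) in edges if a not in xs}

# Example: the confounded graph X <- U -> Y together with X -> Y.
edges = {("U", "X"), ("U", "Y"), ("X", "Y")}
g_bar_x = remove_incoming(edges, {"X"})    # {("U", "Y"), ("X", "Y")}
g_under_x = remove_outgoing(edges, {"X"})  # {("U", "X"), ("U", "Y")}
```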

Theorem 5.8 (3 Rules of do calculus) Rule 1 (Insertion/deletion of observations): \[ P(Y|do(X=x),Z=z,W=w) = P(Y|do(X=x),W=w) \quad \text{if} \ (Y\perp \!\!\!\! \perp Z|X,W)_{G_{\overline{X}}} \]

Rule 2 (Action-observation exchange): \[ P(Y|do(X=x),do(Z=z),W=w) = P(Y|do(X=x),Z=z,W=w) \quad \text{if} \ (Y\perp \!\!\!\! \perp Z|X,W)_{G_{\overline{X}\underline{Z}}} \]

Rule 3 (Insertion/deletion of actions): \[ P(Y|do(X=x),do(Z=z),W=w) = P(Y|do(X=x),W=w) \quad \text{if} \ (Y\perp \!\!\!\! \perp Z|X,W)_{G_{\overline{X}\overline{Z(W)}}} \] where \(Z(W)\) is the set of \(Z\) vertices that are not ancestors of \(W\) in \(G_{\overline{X}}\).
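Each rule's premise is a d-separation statement in a mutilated graph, and such statements can be checked mechanically. Below is a minimal sketch using the standard ancestral-moralization reduction (keep only ancestors of the relevant nodes, moralize, delete the conditioning set, test connectivity); the parent-map representation of the DAG is ours:

```python
from itertools import combinations

def d_separated(parents, xs, ys, zs):
    """Test whether node sets xs and ys are d-separated by zs in the DAG
    given by `parents`, a dict mapping each node to the set of its parents."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    # 1. Restrict to the ancestral subgraph of xs | ys | zs.
    relevant, frontier = set(), set(xs | ys | zs)
    while frontier:
        n = frontier.pop()
        if n not in relevant:
            relevant.add(n)
            frontier |= parents.get(n, set())
    # 2. Moralize: connect each node to its parents and marry co-parents.
    adj = {n: set() for n in relevant}
    for n in relevant:
        for p in parents.get(n, set()):
            adj[n].add(p)
            adj[p].add(n)
        for a, b in combinations(parents.get(n, set()), 2):
            adj[a].add(b)
            adj[b].add(a)
    # 3. Delete zs and check whether any x can still reach any y.
    seen, stack = set(), list(xs)
    while stack:
        n = stack.pop()
        if n in zs or n in seen:
            continue
        seen.add(n)
        stack.extend(adj[n] - zs)
    return not (seen & ys)

# Chain X -> Z -> Y: blocked by conditioning on Z, open otherwise.
chain = {"X": set(), "Z": {"X"}, "Y": {"Z"}}
# Collider X -> Z <- Y: open only when conditioning on Z.
collider = {"X": set(), "Y": set(), "Z": {"X", "Y"}}
```

Checking Rule 1's premise, for instance, amounts to calling `d_separated` on the parent map of \(G_{\overline{X}}\) with conditioning set \(X\cup W\).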

Rule 1 is basically d-separation in the post \(do(X=x)\) distribution.

Rule 2 is the back-door criterion applied to the post-\(do(X=x)\) distribution. To see this, note that the post-\(do(X=x)\) distribution is Markov with respect to \(G_{\overline{X}}\) together with the observation \(X=x\). Compared to \(G_{\overline{X}}\), \(G_{\overline{X}\underline{Z}}\) further removes the front-door edges of \(Z\). That \(Y\) and \(Z\) are d-separated in \(G_{\overline{X}\underline{Z}}\) by \(X\) and \(W\) means all back-door paths from \(Z\) to \(Y\) are blocked by \(X\) and \(W\). By the back-door criterion, we can exchange \(do(Z=z)\) with the observation \(Z=z\) in the post-\(do(X=x)\) distribution with observations \(X=x\) and \(W=w\).

Rule 3 is a generalization of the arrow of causal effect (5.6). Again we start with the post-\(do(X=x)\) distribution represented by \(G_{\overline{X}}\). First assume \(W\) is the empty set. That \(Y\) and \(Z\) are d-separated in \(G_{\overline{X}\overline{Z}}\) by \(X\) means there is no front-door path from \(Z\) to \(Y\) after conditioning on \(X\). This means \(Y\) cannot be a descendant of \(Z\) after conditioning on \(X\); otherwise there would be a directed path from \(Z\) to \(Y\) not blocked by \(X\), which is also a front-door path. Theorem 5.6 then says there is no additional effect of \(do(Z=z)\) when \(do(X=x)\) is already in place. Now if we add additional observations \(W\), will the previous argument still be valid?

Figure 5.10 explains the danger of conditioning on a descendant of \(Z\). In the simple graph \(Y \rightarrow Z \rightarrow W\), it is clear that \(P(Y|do(Z=z),W=w) = P(Y)\). However, \(P(Y|W=w)\neq P(Y)\). The reason is that observing \(W\) carries information about \(Y\) (back-propagation of information), whereas \(do(Z=z)\) blocks this information flow.
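The gap between the two quantities can be verified by direct enumeration. A small sketch with made-up binary probability tables for the chain \(Y \rightarrow Z \rightarrow W\):

```python
# Made-up conditional probability tables for the chain Y -> Z -> W.
p_y = {0: 0.5, 1: 0.5}                            # P(Y = y)
p_z = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.1, 1: 0.9}}  # p_z[y][z] = P(Z = z | Y = y)
p_w = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # p_w[z][w] = P(W = w | Z = z)

def joint(y, z, w):
    # Markovian factorization P(y) P(z|y) P(w|z).
    return p_y[y] * p_z[y][z] * p_w[z][w]

# Observational: P(Y = 1 | W = 1) differs from P(Y = 1) = 0.5.
num = sum(joint(1, z, 1) for z in (0, 1))
den = sum(joint(y, z, 1) for y in (0, 1) for z in (0, 1))
print(num / den)  # about 0.753

# Interventional: do(Z = 1) truncates the factorization to P(y) * P(w | Z = 1),
# so Y is independent of W and P(Y = 1 | do(Z = 1), W = 1) = P(Y = 1) = 0.5.
```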

On the other hand, if \(W\) is not a descendant of \(Z\), then any path between \(Z\) and \(W\) must contain a collider in the middle, or \(W\) is an ancestor of \(Z\) and \(Z\) itself must be a collider on the path. Therefore \(Y\) and \(W\) remain d-separated and \(P(Y|W=w)=P(Y) = P(Y|do(Z=z),W=w)\).

The discussion above explains why we need \(Z(W)\) in Rule 3. The arrow of causal effect holds when the conditioning set \(W\) contains no descendants of \(Z\); it breaks if any conditioned \(W\) is a descendant of \(Z\). \(Z(W)\) is a subset of \(Z\), so the difference between \(G_{\overline{X}\overline{Z(W)}}\) and \(G_{\overline{X}\overline{Z}}\) is that the former keeps the back-door edges into the \(Z\) vertices that are ancestors of \(W\). In other words, \(W\) might activate some back-door paths from \(Z\) to \(Y\) that would otherwise be blocked.

Figure 5.10: Illustration of why in Rule 3 we need to be careful about conditioning on descendants of \(Z\). Here \(P(Y|do(Z=z),W=w) \neq P(Y|W=w)\) even though \(Y\) is not a descendant of \(Z\).

Now let us discuss a general identification strategy for \(P(Y|do(X=x))\). If there is no back-door path from \(X\) to \(Y\), we can simply apply Rule 2 to change the expression into a conditional probability. If not, we can pick any set \(S\) such that \(P(S|do(X=x))\) is identifiable and use the law of total probability: \[ P(Y|do(X=x)) = \sum_s P(Y|do(X=x),S=s)P(S=s|do(X=x)). \] We have an identification strategy if we can change \(P(Y|do(X=x),S=s)\) into something identifiable. We then have three choices:

  1. Back-door adjustment to exchange action-observation using Rule 2: \(P(Y|do(X=x),S=s) = P(Y|X=x,S=s)\)
  2. Front-door mechanism with Rule 2 and Rule 3 to facilitate the transition \(P(Y|do(X=x),S=s) = P(Y|do(X=x),do(S=s)) = P(Y|do(S=s))\).
  3. Expand \(P(Y|do(X=x),S=s)\) further using the law of total probability.

In all options, we break \(P(Y|do(X=x))\) up into smaller identification problems, which should be easier. We repeat this procedure recursively.

As an example, consider the graph in Figure 5.11. We immediately see that \(P(M|do(X=x))\) is identifiable using the front-door criterion with \(N\) being the only mechanism. We can let \(M\) be our \(S\) when applying the law of total probability, and then focus on \(P(Y|do(X=x),M=m)\). It is also clear that \(M\) blocks all back-door paths from \(X\) to \(Y\), which means \(P(Y|do(X=x),M=m) = P(Y|X=x,M=m)\) by Rule 2. Putting the front-door identification of \(P(M|do(X=x))\) and \(P(Y|X=x,M=m)\) together solves the problem.
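Spelling out the combination (assuming, as the figure suggests, that \(N\) is the sole mediator from \(X\) to \(M\), so that the front-door formula applies to \(P(M|do(X=x))\)):

\[ P(Y|do(X=x)) = \sum_m P(M=m|do(X=x))\,P(Y|X=x,M=m), \]

\[ P(M=m|do(X=x)) = \sum_n P(N=n|X=x)\sum_{x'} P(M=m|N=n,X=x')\,P(X=x'). \]

Every term on the right-hand side is do-free and hence estimable from observational data.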

Figure 5.11: A case where front-door criterion and back-door criterion are combined together.

Note that Figure 5.11 is very similar to the back-door criterion, except that the set \(M\) blocking all back-door paths from \(X\) to \(Y\) is itself a descendant of \(X\). The back-door criterion is closely related to conditional unconfoundedness in the potential outcomes framework, where adjustment can only be performed on covariates unaffected by the treatment. Here \(X\) does affect \(M\), and yet adjusting for \(M\) is still useful. We see that do calculus can provide identification strategies that the potential outcomes framework cannot, at the cost of the extra assumption of a Markovian model.

In Figure 5.12, both \(P(M|do(X=x))\) and \(P(Y|do(M=m))\) can be identified using the front-door criterion. However, because of the back-door path \(M\leftarrow W \rightarrow Y\), we cannot apply do propagation (5.8) to combine the two sub-mechanisms into an identification strategy for \(P(Y|do(X=x))\). Also, because of the unblockable back-door path \(X \leftarrow U \rightarrow Y\), there exists no simple back-door adjustment. Nonetheless, we can first condition on \(W\): \(P(W|do(X=x)) = P(W)\) by Rule 3 since \(W\) is an ancestor of \(X\). For \(P(Y|do(X=x),W=w)\), we can identify the sub-mechanisms \(P(Y|do(M=m),W=w)\) and \(P(M|do(X=x),W=w)\) using Rule 2, because \(X\) and \(W\) block all back-door paths from \(M\) to \(Y\) and \(W\) blocks all back-door paths from \(X\) to \(M\). We claim the do propagation is now valid since the only back-door path from \(M\) to \(Y\) not through \(X\) is now blocked by \(W\). To be specific, \(P(Y|do(X=x),M=m,W=w) = P(Y|do(X=x),do(M=m),W=w)\) by Rule 2, and then \(P(Y|do(X=x),do(M=m),W=w) = P(Y|do(M=m),W=w)\) by Rule 3.
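The chain of equalities just described can be written out as follows:

\[ P(Y|do(X=x)) = \sum_w P(W=w)\sum_m P(M=m|do(X=x),W=w)\,P(Y|do(X=x),M=m,W=w), \]

using \(P(W=w|do(X=x))=P(W=w)\) by Rule 3, and then

\[ P(Y|do(X=x),M=m,W=w) = P(Y|do(X=x),do(M=m),W=w) = P(Y|do(M=m),W=w) \]

by Rule 2 followed by Rule 3.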

Figure 5.12: Front-door criterion applies only with conditioning.

To close our discussion of CGMs and identification strategies, an interesting result shows that the key to the identification problem is to ensure there exists no bow pattern between \(X\) and any of its children.

Theorem 5.9 (Tian and Pearl) A sufficient condition for identifying the causal effect \(P(y|do(x))\) is that every path between \(X\) and any of its children traces at least one arrow emanating from a measured variable.

Testing for the existence of an identification strategy in a CGM is a solved problem in the literature. A necessary and sufficient condition for identification, together with a computer algorithm that finds the strategy when one exists, is given in Shpitser and Pearl (2012).

5.6 RCM vs. CGM

To do: Discuss pros and cons


  1. In fact only ancestors of \(Y\) are needed, see Theorem 5.6.↩︎

  2. https://en.wikipedia.org/wiki/Tar_(tobacco_residue)↩︎