class: center, middle, inverse, title-slide # L6: Propensity Scores ### Jean Morrison ### University of Michigan ### 2022-01-31 (updated: 2022-04-05) --- # Outline Plan: This Lecture: - Propensity scores and several methods that make use of them: + Subclassification/standardization + Double robust methods + Matching Coming up: - G-Estimation (Ch. 14) - Instrumental Variable Analysis (Ch. 16) - Survival (Ch. 17) `\(\newcommand{\ci}{\perp\!\!\!\perp}\)` --- # Recap: IP Weighting - Last lecture we looked at parametric versions of IP weighting and standardization. - In IP weighting, we propose a parametric estimator of `\(\pi(L) = P[ A = 1 \vert L ]\)` - The IP weight for unit `\(i\)` is `\(1/\pi(L_i)\)` if `\(A_i = 1\)` or `\(1/(1-\pi(L_i))\)` if `\(A_i = 0\)`. - We use this weight to make our sample look like randomized sample where treatment is independent of confounders. - We then use a marginal structural model to estimate a causal parameter. - Recall a marginal structural model relates a potential outcome to our observed variables $$ E[Y(a)] = \beta_0 + \beta_1 a $$ --- # Recap: Outcome Regression - The alternative to specifying a model for `\(\pi(L)\)` is to specify a model for the outcome, `\(E[Y \vert A=a, L] = b_a(L)\)` - If `\(b_a(L)\)` is correctly specified, we can estimate `\(E[Y(a)]\)` by standardization. - In some cases, the coefficient on `\(A\)` is directly interpretable as a causal parameter. + This happens when there are no product terms between `\(A\)` and `\(L\)`. --- # Outcome Regression - Last time, we fit a linear regression that included a product term between the treatment (quitting smoking) and smoking intensity. $$ E[Y \vert A, L] = \beta_0 + \beta_1 A + \beta_2 A L_1 + \beta_3 L_1 + \dots $$ - Using standardization, we can still use this model to obtain the marginal effect. - Without standardization, we can interpret `\(\beta_1 + \beta_2 l\)` as the average effect of quitting smoking for individuals with initial smoking intensity `\(l\)`. --- # Drawbacks of Outcome Regression - It is hard to correctly specify the `\(E[Y \vert A, L]\)`. - This problem gets harder as the dimension of `\(L\)` increases. - The form of effect modification by `\(L\)` could be very complex. - We will probably be tempted to fit an overly simple model, which may give biased inference. --- # Propensity Scores - `\(\pi(L) = P[A = 1 \vert L]\)`, the probability of receiving treatment conditional on confounders `\(L\)` is the *propensity score*. - Within strata of `\(\pi(L)\)`, the distribution of `\(L\)` will be the same in cases and controls. - `\(\pi(l)\)` is a *balancing score*. $$ A \ci L \vert\ \pi(L) $$ - The propensity score is the coarsest possible balancing score: + If `\(b(L)\)` is a balancing score then `\(b(L) = f(\pi(L))\)` for some function `\(f\)`. + The finest possible balancing score is `\(A\)`. --- # Propensity Scores are Sufficient for Confounder Adjustment - If `\(Y(a) \ci A \vert L\)`, positivity holds, and `\(b(L)\)` is a balancing score, - Then `\(Y\)` and `\(A\)` are exchangeable conditional on `\(b(L)\)`, `\(Y(a) \ci A \vert\ b(L)\)`. - The propensity score is like an intermediate node in the DAG mediating all backdoor paths. <center>
</center>

---
# What to Include in a Propensity Score Model

- We have seen that all variables needed to block backdoor paths should be included in the propensity score model.

- We should not include colliders (or children of colliders), even if they improve our predictions of `\(A\)`.

- What about other variables?

---
# Example

- Should we include `\(L_1\)` in the predictive model of `\(A\)`?
--

- Suppose that `\(P[A = 1 \vert L_1 = 0] = 0.01\)` and `\(P[A = 1 \vert L_1 = 1] = 0.99\)` and we condition on `\(L_1\)` to compute `\(\hat{\pi}(L)\)`.

- 1% of our sample will have a weight of 100, while the remaining 99% will have a weight close to 1.

- The weighted estimator will have a much higher variance than the (also valid) unweighted estimator.

---
# Example

- Should we include `\(L_2\)` in the predictive model of `\(A\)`?
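---
# Example: Simulating the Variance Cost

- A minimal simulation sketch of the variance inflation described two slides back, assuming `\(L_1\)` strongly predicts `\(A\)` but has no effect on `\(Y\)`; the sample size, effect size, and propensity values here are illustrative, not from the original example.

```r
set.seed(1)
sims <- replicate(500, {
  N <- 2000
  L1 <- rbinom(N, 1, 0.5)
  A <- rbinom(N, 1, ifelse(L1 == 1, 0.99, 0.01))  # L1 strongly predicts treatment
  Y <- 1 + 0.5 * A + rnorm(N)                     # true effect 0.5; L1 does not affect Y
  ps <- glm(A ~ L1, family = binomial())$fitted.values
  w <- ifelse(A == 1, 1 / ps, 1 / (1 - ps))       # about 1% of units get weight near 100
  c(unweighted = mean(Y[A == 1]) - mean(Y[A == 0]),
    weighted = weighted.mean(Y[A == 1], w[A == 1]) -
      weighted.mean(Y[A == 0], w[A == 0]))
})
apply(sims, 1, var)  # both are unbiased here, but the weighted estimator typically shows much larger variance
```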
---
# Conditionality Principle

- In a randomized experiment, `\(Y(a) \ci A\)` unconditionally *on average*.

- So the true value of `\(\pi(L)\)` is constant for all individuals.

- However, in any given realization of the experiment, there may be imbalances in some confounders.
  + These imbalances can lead to bias even though they are "accidental".

- Adjusting for the estimated value, `\(\hat{\pi}(L)\)`, can eliminate this bias.

- Even if we knew the true value of `\(\pi(L)\)`, it would be better to use the estimate.

---
# Simulations

- Data are generated from the graph:
- A randomized trial with `\(N/2\)` units each in treatment and control groups.

- `\(A\)`, `\(Y\)`, and `\(L\)` are all binary.

- `\(P[Y = 1 \vert L ] = 0.05 + 0.5 L\)`; `\(P[L = 1] = 0.5\)`.

- In each simulation, we sample 250 units from this model and compute
  + `\(\hat{\beta}\)`, the sample difference in means and
  + `\(\hat{\beta}_S\)`, the stratified difference in means estimate.

<!-- - Recall that `\(\hat{\beta}_S\)` is the same as the IP weighted estimate because all of our variables are binary. -->

---
# Simulation Results

- Over 1000 simulations:

- The average bias times `\(\sqrt{N}\)` is
  + `\(\hat{\beta}\)`: 0.04, `\(\hat{\beta}_S\)`: 0.03

- The mean squared error times `\(N\)` is
  + `\(\hat{\beta}\)`: 0.92, `\(\hat{\beta}_S\)`: 0.64

- The Monte Carlo variance of each estimator times `\(\sqrt{N}\)` is
  + `\(\hat{\beta}\)`: 0.34, `\(\hat{\beta}_S\)`: 0.22

- In 60% of simulations, the stratified estimator is less biased than the non-stratified estimator.

- The adjusted estimator has lower bias and is more precise.

---
# Propensity Scores and Positivity

- If positivity holds, propensity scores should be bounded away from 0 and 1:
  + Everyone should have some chance of receiving either treatment.
  + Propensity scores close to 0 and 1 happen when there is perfect separation by one confounder or a combination of confounders.
  + There may be structural positivity violations.

- Propensity scores should have approximately the same range in both groups.
  + Non-overlapping ranges suggest random positivity violations (could also be structural).
  + Ok for IP weighting but a problem for matching and subclassification.

---
# Propensity Scores and Positivity

<img src="6_propensity_scores_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---
# Outcome Regression with Propensity Scores

- Because `\(\pi(L)\)` is sufficient to control confounding, we could propose a model for `\(E[Y(a) \vert L]\)` in terms of one-dimensional `\(\pi(L)\)` rather than high-dimensional `\(L\)`.

- Valid inference will depend on
  + Correctly estimating `\(\pi(L)\)`
  + Correctly specifying the functional form of `\(E[Y \vert A , \pi(L)]\)`

- A simple model like `$$E[Y \vert A, \pi(L)] = \beta_0 + \beta_1 A + \beta_2 \pi(L)$$` could be insufficient to control for confounding.

---
# Outcome Regression with Propensity Scores

- Because `\(\pi(L)\)` is one-dimensional, it is easy to propose a flexible model.

- Models which include no effect modification of `\(A\)` by `\(L\)` could include:
  + Fitting a cubic spline for the effect of `\(\pi(L)\)`
  + Dividing `\(\pi(L)\)` into deciles and fitting an effect for nine binary indicators of decile membership.

$$
E[Y \vert A, \pi(L)] = \beta_0 + \beta_1 A + g(\pi(L), \boldsymbol{\alpha})
$$

---
# Subclassification/Stratification

- Stratification over quantiles of `\(\hat{\pi}(L)\)` allows us to estimate a flexible model *with* effect modification by `\(\pi(L).\)`

- Let `\(\hat{C}_k = [\hat{c}_{k-1}, \hat{c}_k)\)`, `\(k = 1, \dots, K\)` be non-overlapping intervals dividing the observed range of `\(\hat{\pi}(L)\)`.
- Within each interval, we estimate `\(E[Y(a) \vert \hat{\pi} \in \hat{C}_k]\)` by `\(\hat{E}[Y \vert A = a, \hat{\pi} \in \hat{C}_k]\)`, so the within-interval effect estimate is
`$$\hat{E}[Y(1) \vert \hat{\pi} \in \hat{C}_k] - \hat{E}[Y(0) \vert \hat{\pi} \in \hat{C}_k] = \frac{1}{n_{1,k}}\sum_{i=1}^{N}A_i Y_i 1_{\hat{\pi}_i \in \hat{C}_k} - \frac{1}{n_{0,k}}\sum_{i=1}^{N}(1-A_i) Y_i 1_{\hat{\pi}_i \in \hat{C}_k}$$`

- This is the same as fitting the saturated linear regression
`$$E[Y \vert A, \hat{\pi}(L)] = \beta_0 + \beta_1 A + \sum_{j=2}^{K}\beta_j 1_{\hat{\pi} \in \hat{C}_j} + \sum_{j=2}^{K} \beta_{K-1+j} A 1_{\hat{\pi} \in \hat{C}_j}$$`

---
# Subclassification and Standardization

- We can then compute an estimate of the average treatment effect using our standardization formula
`$$\hat{\Delta}_S = \sum_{k = 1}^K \frac{n_k}{N} \left\lbrace \frac{1}{n_{1,k}}\sum_{i=1}^{N}A_i Y_i 1_{\hat{\pi}_i \in \hat{C}_k} - \frac{1}{n_{0,k}}\sum_{i=1}^{N}(1-A_i) Y_i 1_{\hat{\pi}_i \in \hat{C}_k} \right \rbrace$$`

---
# Subclassification Estimate in NHEFS

```r
delta_s <- function(df){
  # Estimate propensity scores
  fit <- glm(qsmk ~ sex + race + age + I(age ^ 2) + as.factor(education) +
               smokeintensity + I(smokeintensity ^ 2) + smokeyrs + I(smokeyrs ^ 2) +
               as.factor(exercise) + as.factor(active) + wt71 + I(wt71 ^ 2),
             family = binomial(), data = df)
  df$ps <- predict(fit, type = "response")
  # Deciles of PS
  df$ps_dec <- cut(df$ps,
                   breaks = c(quantile(df$ps, probs = seq(0, 1, 0.1))),
                   labels = seq(1:10), include.lowest = TRUE)
  # Fit the model for E[Y | A, pi]
  fit_psdec <- glm(wt82_71 ~ qsmk*as.factor(ps_dec), data = df)
  # Standardized Estimate
  df0 <- df1 <- df
  df0$qsmk <- 0; df1$qsmk <- 1
  Y0 <- predict(fit_psdec, newdata = df0, type = "response")
  Y1 <- predict(fit_psdec, newdata = df1, type = "response")
  return(mean(Y1) - mean(Y0))
}
```

```
## [1] 3.411467
```

---
# Bootstrap Variance

```r
set.seed(1)
# Resample row indices with replacement
samples <- replicate(n = 10,
                     expr = sample(seq(nrow(dat)), size = nrow(dat), replace = TRUE))
# Re-estimate on each bootstrap sample
res <- apply(samples, 2, FUN = function(ix){
  boot_dat <- dat[ix,]
  return(delta_s(boot_dat))
})
se <- sd(res)
bhat <- delta_s(dat)
ci <- bhat + c(-1, 1)*qnorm(0.975)*se
cat(bhat, "(", ci, ")")
```

```
## 3.411467 ( 2.614487 4.208448 )
```

---
# Number and Size of Divisions

- Rosenbaum and Rubin (1983) suggest using sample quintiles (`\(K=5\)`).
  + This generally leads to residual bias.

- More intervals lead to less bias but increased variance.

- The optimal choice of `\(K\)` and interval endpoints is an open problem.

- Intervals must be large enough that individuals with both treatments are in every interval.

---
# `\(\hat{\Delta}_S\)` is a Coarsened Horvitz-Thompson Estimator

- Recall the IP weighted Horvitz-Thompson estimator:
`$$\hat{\Delta}_{HT} = \frac{1}{N}\sum_{i=1}^N \left( \frac{A_i Y_i}{\hat{\pi}_i} - \frac{(1-A_i)Y_i}{1-\hat{\pi}_i} \right)$$`

- Within each interval, estimate a "subclass-specific" propensity score, `\(\hat{p}_k = P[A_i = 1 \vert \hat{\pi}_i \in \hat{C}_k] = \frac{n_{1,k}}{n_k}\)`.

- For each individual, set `\(\hat{\pi}^{C}_i = \sum_k 1_{\hat{\pi}_i \in \hat{C}_k}\hat{p}_{k}.\)`

- Plug these propensity scores into the formula for `\(\hat{\Delta}_{HT}\)` and get `\(\hat{\Delta}_S.\)`

---
# Asymptotic Results

- For a fixed value of `\(K\)`, some bias will always remain.
  + As `\(N\)` increases, `\(\hat{\Delta}_S\)` is not consistent for the ATE.

--

- Suppose that we have a rule for picking the number of divisions that is a function of the data.

- Is there a rule such that `\(\hat{\Delta}_S\)` is root-N consistent for the ATE?
--

- Wang et al (2016) show that:

- In order for `\(\hat{\Delta}_S\)` to be well defined, `\(K(N)\)` must grow slowly enough that there are always units with both values of `\(A\)` in every division.
  + `\([K(N)\log(K(N))]/N \rightarrow 0\)` as `\(N \rightarrow \infty\)`

- But consistency requires that the bin width get smaller as `\(N\)` increases:
  + `\(K(N)/\sqrt{N} \rightarrow \infty\)` as `\(N \rightarrow \infty\)`

---
# Full Subclassification

- Wang et al (2016) propose a strategy for allowing `\(K\)` to grow with `\(N\)` called full subclassification.

- Choose the largest number `\(K\)` such that every `\(1/K\)` quantile of `\(\hat{\pi}\)` has at least one case and at least one control.

- In simulations, the full subclassification estimator has smaller bias than `\(\hat{\Delta}_{S}\)`.

---
# Comparison with IP Weighting

- Both IP weighting and outcome regression with PS require a correct model for `\(P[A = 1 \vert L ]\)`.

- Both require a correct marginal model for the relationship between `\(Y(a)\)` and `\(a\)` (easy if `\(A\)` is dichotomous).

- Outcome regression with propensity scores additionally requires correctly specifying the relationship between `\(Y(a)\)` and `\(\pi(L)\)`.

- Subclassification allows us to use a very flexible model for the `\(Y\)`-`\(\pi\)` relationship.

- Some bias may remain if intervals are too wide.

---
# Comparison with IP Weighting

- IP Weighting:
  + Asymptotically unbiased
  + High variance if some weights are large
  + Very sensitive to misspecification of the model for `\(\pi(L)\)`

- Subclassification:
  + Biased for a fixed value of `\(K\)`
  + Lower variance in some cases
  + Less sensitive to misspecification of the model for `\(\pi(L)\)`

- In subclassification, we only rely on `\(\hat{\pi}(L)\)` to group units with similar propensities.

---
# Within Stratum Regression Modeling

- Bias in `\(\hat{\Delta}_S\)` occurs because `\(L\)` and `\(A\)` may not be independent within classes.

- One strategy to deal with this is to fit an outcome regression model for `\(E[Y \vert A, L]\)` *within* each `\(\hat{C}_k\)`.

- Let `\(\hat{Y}_{a,i,k}\)` be the estimate of `\(E[Y \vert A = a, L = L_i]\)` from the model fit within `\(\hat{C}_k\)`.

`$$\hat{\Delta}_{OR,k} = \frac{1}{n_k}\sum_{i = 1}^{N}1_{\hat{\pi}_i \in \hat{C}_k} \left(\hat{Y}_{1,i,k} - \hat{Y}_{0, i, k}\right)$$`

`$$\hat{\Delta}_{SR} = \sum_{k=1}^{K} \frac{n_k}{N}\hat{\Delta}_{OR,k}$$`

- If the form of the outcome regression is correct, this strategy will be consistent.

---
# Doubly Robust Methods

- Doubly robust methods combine propensity weighting and outcome regression approaches.

- DR methods are consistent if *either* the propensity score model or the outcome regression model is correctly specified.

- They can still be biased if both models are misspecified.

- Technically, `\(\hat{\Delta}_{SR}\)` is not doubly robust because it is inconsistent for fixed `\(K\)` if the outcome model is wrong.

---
# Doubly Robust Estimation

- Suppose we have an outcome regression model.

- For each unit, we can estimate `\(\hat{Y}_{a, i} = E[Y \vert A = a, L = L_i]\)`.
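- A minimal R sketch of this construction (the estimators `\(\hat{\Delta}_{DR,1}\)` and `\(\hat{\Delta}_{DR,0}\)` it computes are written out below); the data frame `df`, with outcome `Y`, binary treatment `A`, and confounders `L1` and `L2`, is hypothetical.

```r
# Hypothetical data frame df with outcome Y, binary treatment A, confounders L1, L2
ps_fit  <- glm(A ~ L1 + L2, family = binomial(), data = df)  # propensity score model
out_fit <- glm(Y ~ A + L1 + L2, data = df)                   # outcome regression model
pi_hat <- predict(ps_fit, type = "response")
df1 <- df0 <- df
df1$A <- 1; df0$A <- 0
Y1_hat <- predict(out_fit, newdata = df1, type = "response") # Y-hat_{1,i}
Y0_hat <- predict(out_fit, newdata = df0, type = "response") # Y-hat_{0,i}
A <- df$A; Y <- df$Y
EY1 <- mean(A * Y / pi_hat - (A / pi_hat - 1) * Y1_hat)                          # Delta_{DR,1}
EY0 <- mean((1 - A) * Y / (1 - pi_hat) - ((1 - A) / (1 - pi_hat) - 1) * Y0_hat)  # Delta_{DR,0}
EY1 - EY0  # doubly robust estimate of the ATE
```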
- We can construct estimators for `\(E[Y(1)]\)` and `\(E[Y(0)]\)` as

`$$\hat{\Delta}_{DR,1} = \frac{1}{N}\sum_{i = 1}^N \left \lbrace \frac{A_i Y_i}{\hat{\pi}(L_i)} - \left(\frac{A_i}{\hat{\pi}(L_i)} - 1 \right)\hat{Y}_{1,i} \right \rbrace$$`

`$$\hat{\Delta}_{DR,0} = \frac{1}{N}\sum_{i = 1}^N \left \lbrace \frac{(1-A_i) Y_i}{1-\hat{\pi}(L_i)} - \left(\frac{1-A_i}{1-\hat{\pi}(L_i)} - 1 \right)\hat{Y}_{0,i} \right \rbrace$$`

- Cassel, Särndal, and Wretman (1976); Robins, Rotnitzky, and Zhao (1994)

---
# Doubly Robust Estimation

`$$\hat{\Delta}_{DR,1} = \frac{1}{N}\sum_{i = 1}^N \left \lbrace \frac{A_i Y_i}{\hat{\pi}(L_i)} - \left(\frac{A_i}{\hat{\pi}(L_i)} - 1 \right)\hat{Y}_{1,i} \right \rbrace \\\ = \frac{1}{N} \sum_{i = 1}^N \left \lbrace \hat{Y}_{1,i} + \frac{A_i}{\hat{\pi}(L_i)}\left(Y_i - \hat{Y}_{1,i} \right) \right \rbrace$$`

- If the model for `\(\hat{\pi}\)` is correct then `\(E\left [\frac{A_i}{\hat{\pi}(L_i)}\right] = 1\)`.

- If the model for `\(\hat{Y}\)` is correct then `\(E[A_i(Y_i - \hat{Y}_{1,i})] = 0\)`.

- Also called bias-corrected OLS (Cassel et al 1976).

---
# Doubly Robust Estimation

- `\(\hat{\Delta}_{DR} = \hat{\Delta}_{DR,1} - \hat{\Delta}_{DR,0}\)` is a locally semiparametrically efficient estimator.

- If both models are correctly specified, then, asymptotically, `\(\hat{\Delta}_{DR}\)` has the smallest variance among estimators in the class defined by Robins and Rotnitzky (1995).

- `\(\hat{\Delta}_{DR}\)` is an augmented inverse probability weighted estimator (AIPW).

---
# Regression with Propensity Score

- Another method to form a DR estimator is to fit an outcome regression model adding some function(s) of `\(\hat{\pi}\)` as covariates.

- Suppose that our outcome regression model is `\(E[Y \vert A, L] = s(A, L, \beta)\)`, where `\(\beta\)` are parameters to be estimated.

- We then fit the augmented model `\(E[Y \vert A, L] = s(A, L, \beta) + \phi f(A, L)\)`.

- We then define `\(\tilde{Y}_{a,i} = s(a, L_i, \hat{\beta}) + \hat{\phi}f(a, L_i)\)`.

- And estimate

`$$\hat{\Delta}_{DR2} = \frac{1}{N}\sum_i \left( \tilde{Y}_{1,i} - \tilde{Y}_{0,i} \right)$$`

---
# Scharfstein Estimator

- Scharfstein et al (1999) show that including `\(\frac{A_i}{\hat{\pi}_i}\)` as a covariate is sufficient to achieve double robustness for estimation of `\(E[Y(1)]\)`.

- If we want to estimate both `\(E[Y(1)]\)` and `\(E[Y(0)]\)`, we can add two covariates, `\(\frac{A_{i}}{\hat{\pi}_i}\)` and `\(\frac{1-A_i}{1- \hat{\pi}_i}\)`.

---
# Bang and Robins Estimator

- Bang and Robins (2005) show that we do not actually need two separate covariates.

- The single covariate `\(f(A, L) = \frac{A_i}{\hat{\pi}_i} - \frac{1-A_i}{1-\hat{\pi}_i}\)` also gives a DR estimator.

- This estimator is more efficient than the Scharfstein et al estimator when only the outcome model is correct.

<!-- # Double Robust Estimation in NHEFS -->

---
# Bias

- The bias of all doubly robust methods is a function of the product of the bias of `\(1/\hat{\pi}(L)\)` and the bias of `\(\hat{b}(L) = \hat{E}[Y \vert A = a, L]\)`.

- The large sample bias of `\(\hat{\Delta}_{DR}\)` is

`$$E \left[\pi(L)\left(\frac{1}{\pi(L)} - \frac{1}{\pi^*(L)} \right)\left(b(L) - b^*(L) \right) \right]$$`

- `\(b^*\)` and `\(\pi^*\)` are the limiting expectations of `\(\hat{b}\)` (i.e. `\(\hat{Y}\)`) and `\(\hat{\pi}\)`.

- Doubly robust methods are often less biased than IP weighting or outcome regression.
- Kang and Schaffer (2007) demonstrate a simulated example where the bias of the DR estimator is larger than the bias of the OLS estimate when both models are misspecified. --- # Variance - Lunceford and Davidian (2004) derive large-sample variances for the IPW estimators we have seen as well as for `\(\hat{\Delta}_{DR}\)`. - `\(\hat{\Delta}_{DR}\)` has smaller variance than the IPW estimators. - The variance of the IPW estimators is smaller if we estimate `\(\hat{\pi}\)` than if we use the true `\(\hat{\pi}\)`. - Estimating `\(\hat{\pi}\)` does not impact the variance of `\(\hat{\Delta}_{DR}\)` asymptotically. - The sandwich variance estimates for IPW estimators are more stable than plug-in estimates of the large sample variance. - `\(\hat{\Delta}_{S}\)`, and `\(\hat{\Delta}_{SR}\)` both converge to normal distributions but the variance is hard to express and depends on the density of the propensity score. --- # Censoring - If `\(L\)` are sufficient to adjust for both censoring and confounding, then any of the methods we have discussed so far are sufficient to adjust for selection bias. - Many of them were developed originally in the context of accounting for missing data rather than accounting for confounding. --- # Kang and Schaffer (2007) - A missing data setting where our goal is to estimate `\(E[Y(C = 0)]\)`. - Data are generated as `$$Y_i = \alpha_0 + \sum_{j = 1}^4 \alpha_i Z_{j,i}\qquad \pi_i = P[C_i = 0 ] = expit(\sum_{j = 1}^{4} \beta_j Z_{j,i})$$` - Rather than observing `\(Z_1, \dots, Z_4\)`, we observe non-linear transformations `$$X_{1,i} = e^{Z_{1,i}/2} \qquad X_{2,i} = \frac{Z_{2,i}}{1 + e^{Z_{1,i}}} + 10 \\\ X_{3,i} = (Z_{1,i} Z_{3,i}/25 + 0.6)^3 \qquad X_{4,i} = (Z_2 + Z_4 + 20)^2$$` - On average, half the data are missing. - `\(E[Y(C = 0)] = 210\)`, `\(E[Y \vert C = 0] = 220\)` and `\(E[Y \vert C = 1] = 200\)` --- # Kang and Schaffer (2007) - Relationships between outcome and observed covariates look about linear. <center> <img src="img/6_ksfig2.png" width="80%" /> </center> --- # IPW vs Subclassification <center> <img src="img/6_kstab1.png" width="80%" /> <img src="img/6_kstab2.png" width="80%" /> </center> --- # OLS and Bias Corrected OLS <center> <img src="img/6_kstab3.png" width="80%" /> <img src="img/6_kstab5.png" width="80%" /> </center> --- # Scharfstein Estimator and Flexible `\(\pi\)`-Model <center> <img src="img/6_kstab7.png" width="80%" /> <img src="img/6_kstab8.png" width="80%" /> </center> --- # Matching -- - Recall our previous discussion of matching for a single variable. - For binary or discrete `\(L\)`, for each person with `\(A = 1\)`, we select a control with matching `\(L\)` value. - We leave out any samples that cannot be matched. - Using this new matched sample, we can compute the average treatment effect among the treated. - Alternatively, we could match to the untreated population or any other distribution of `\(L\)`. - Today, we assume we are matching to the treated population. - Content in this section draws heavily from Stuart (2010). --- # Matching when `\(L\)` is High Dimensional - For high dimensional `\(L\)`, we probably cannot find exact matches. - Instead, we can define a distance measure and match cases with "close" controls. - We could define our distance using the full vector of `\(L\)` + E.g the Mahalanobis distance `\(D_{i,j} = (L_i - L_j)^\top \Sigma^{-1} (L_i - L_j)\)` + `\(\Sigma\)` is the sample covariance of `\(L\)` in controls if matching to the treated. 
- Or we could define our distance based on the propensity score: + Absolute difference `\(D_{i,j} = \vert \hat{\pi}_i - \hat{\pi}_j \vert\)` + Difference in logits `\(D_{i,j} = \vert logit(\hat{\pi}_i) - logit(\hat{\pi}_j )\vert\)` <!-- # Target Population --> <!-- - If every treated individual is matched to one (or more) untreated individuals, the target population is the population of the treated. --> <!-- - If some treated individuals are left out because they cannot be matched, the population is now harder to describe. --> <!-- - We are estimating the effect in a subset of the population defined by propensity score. --> <!-- - It may be more natural to restrict the population based on sets of individuals in `\(L\)` (e.g. smoking intensity). --> --- # Nearest Neighbor Matching <img src="6_propensity_scores_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" /> --- # Nearest Neighbor Matching <img src="6_propensity_scores_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" /> --- # NN Matchig with Replacement <img src="6_propensity_scores_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" /> --- # Matching with Replacement - Matching with replacement generally improves the quality of matches. - It is good to check that we are not relying on only a small number of controls. - If we match with replacement, we need to use weights in the analysis. + If one control is matched to two cases, it should count as two controls. + Typically, control weights are scaled so that the total weight assigned to all controls is equal to the total number of controls included. --- # NN Matching with a Caliper <img src="6_propensity_scores_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" /> --- # Nearest Neighbor Matching - Nearest neighbor matching with no caliper can lead to bad matches. - Adding a caliper leads us to drop some cases. - King, Lucas, and Nielsen (2017) use the terminology *feasible* and *non-feasible* set. - The sample average treatment effect in the treated (SATT) is a weighted average of the FSATT and the NFSATT. + If some of the sample is non-feasible we must either assume that the FSATT is the same as the SATT or + Be willing to extrapolate estimated values of `\(\hat{Y}(0)\)` into regions of confounder space with no observed controls. --- # 1:k Matching <img src="6_propensity_scores_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" /> --- # 1:k Matching - Increasing the number of matches will decrease the quality of matches. - But will increase the number of controls we retain in the analysis. - Another bias-variance trade-off. --- # Nearest Neighbor Variations - Optimal matching: Rather than "greedily" choose the match closest to each case one at a time, try to minimize a global distance measure. - Matching with replacement removes the need for optimal matching. - Variable ratio matching allows different individuals to have different numbers of matches. (Requires weighting in analysis) --- # Coarsened Exact Matching - Divide continuous covariates into bins (for matching only). - Perform exact matching on binned values: - Cases and controls within a matching bin are weighted according to their frequency. + If a bin contains two cases and three controls, each control receives weight `\(2/3\)`. 
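- A minimal sketch of this weighting rule (matching to the treated), with a hypothetical binary treatment `A` and a single continuous covariate `x`:

```r
set.seed(2)
A <- rbinom(200, 1, 0.4)                 # hypothetical treatment indicator
x <- rnorm(200, mean = 0.5 * A)          # hypothetical covariate
# Coarsen x into quintile bins (used for matching only)
bins <- cut(x, breaks = quantile(x, probs = seq(0, 1, 0.2)), include.lowest = TRUE)
n1 <- tapply(A, bins, sum)               # treated per bin
n0 <- tapply(1 - A, bins, sum)           # controls per bin
ok <- (n1 > 0 & n0 > 0)[bins]            # drop units in bins missing either group
# Treated keep weight 1; controls are re-weighted to the treated distribution
w <- ifelse(A == 1, 1, (n1 / n0)[bins]) * ok
```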
---
# Coarsened Exact Matching

<img src="6_propensity_scores_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />

---
# Subclassification

- The subclassification estimator `\(\hat{\Delta}_S\)` is a coarsened exact matching estimator using the propensity score.

- The only difference between the `\(\hat{\Delta}_S\)` we saw previously and a CEM estimator is the relative weight of cases and controls.

- The `\(\hat{\Delta}_S\)` estimator weights the estimate within each stratum by the proportion of total individuals in that stratum.

- We could instead weight by the proportion of treated individuals in order to estimate the ATT.

---
# Full Matching

- Rather than pre-selecting the number of classes, use the maximum number that allows us to have at least one case and one control in each class.

- We identify subclasses such that:

- Every subclass contains at least one case and one control.
  + Can be 1:1, 1:many, or many:1

- A measure of global distance is minimized over all possible full matches.

- Full matching can be performed on the PS or a different measure of distance.

---
# Full Matching

<img src="6_propensity_scores_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />

---
# Analysis Using the Matched Sample

- Matching is a form of pruning and/or re-weighting the data to achieve balance.

- Once the data are balanced, the analysis method does not need to account for matched pairs.
  + We can forget the pairings and simply treat the weighted data like data from a randomized trial.
  + i.e. Just run the analysis we would have run anyway but with weights.

---
# Analysis Using the Matched Sample

- If we are confident that matching has achieved balance, we could simply estimate the difference in means.

- For additional robustness, we could fit an outcome regression model using the weighted matched sample.
  + Fitting an outcome regression model in a matched sample is analogous to the double robustness strategy.
  + If the data still have some imbalance after matching, but the outcome regression is correct, inference will be valid.

---
# Outcome Regression after Full Matching

- Recall that in subclassification, we could fit an outcome regression within each subclass (the `\(\hat{\Delta}_{SR}\)` estimator).

- In full matching, we can't fit a regression within each subclass.

- Instead we fit a model to all of the data with shared coefficients for covariates but subclass-specific coefficients for treatment.

`$$E[Y_{i,k} \vert A, L] = \beta_{0,k} + \beta_{1,k}A_{i,k} + \gamma L_{i,k}$$`

- We then average subclass effects to obtain

`$$\hat{\beta} = \sum_{k} \frac{n_k}{N} \hat{\beta}_{1,k}$$`

---
# Variance Estimation

- Rubin and Thomas (1996) and Rubin and Stuart (2006) argue that we don't need to account for uncertainty in estimating propensity scores because not doing so will be conservative.

- Another alternative is the bootstrap.

- Abadie and Imbens (2006) provide a large sample estimator of the variance for 1:k matching with replacement.

---
# Implementation

- Unlike some of our other methods, matching is not trivial to implement.

- The R package `Matching` implements NN matching strategies, including a genetic algorithm to automatically find a balanced match.
  + Includes Abadie and Imbens standard errors for treatment effects.
  + Provides a function to generate balance statistics.

- The R package `MatchIt` includes more matching strategies than `Matching`, including full matching, coarsened exact matching, and others.
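---
# Example: Matching with `MatchIt`

A minimal sketch (not from the original slides) of how a nearest neighbor match on the propensity score might look in `MatchIt`, reusing the propensity model from the subclassification example; the caliper value is illustrative and `dat` is the NHEFS data frame used earlier.

```r
library(MatchIt)
m <- matchit(qsmk ~ sex + race + age + I(age^2) + as.factor(education) +
               smokeintensity + I(smokeintensity^2) + smokeyrs + I(smokeyrs^2) +
               as.factor(exercise) + as.factor(active) + wt71 + I(wt71^2),
             data = dat, method = "nearest", distance = "glm", link = "logit",
             caliper = 0.1)
summary(m)           # balance statistics before and after matching
md <- match.data(m)  # matched data set with a weights column
# Weighted difference in means in the matched sample (an ATT estimate)
fit <- lm(wt82_71 ~ qsmk, data = md, weights = weights)
coef(fit)["qsmk"]
```

The `Matching` package's `Match()` and `MatchBalance()` functions play similar roles.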
---
# Assessing Matches

- If matching is successful, then all variables in `\(L\)` should be balanced between cases and controls in the matched population.

- To determine if matching is adequate to control confounding, we need to check that this is true.

- Ideally we could compare the full multivariate distribution of `\(L\)` between cases and controls, but this is too complicated.

- Instead we look at each variable individually.
  + Stuart (2010) says we should also look at products and squares.

---
# Assessing Matches

- Rubin (2001) suggests three balance measures:

1. Standardized difference in means of the propensity score. (Should be < 0.25)

1. Ratio of the variances of the propensity score in treated and control. (Should be between 0.5 and 2)

1. For each covariate, ratio of the variance after regressing out the propensity score. (Should be between 0.5 and 2)

---
# Visually Assessing Balance

- Q-Q plots comparing quantiles in treated to quantiles in untreated.

<center>
<img src="img/6_matchqq.png" width="80%" />
</center>

---
# Visually Assessing Balance

- Q-Q plots comparing quantiles in treated to quantiles in untreated.

<center>
<img src="img/6_matchjitter.png" width="80%" />
</center>

---
# p-Values are Unsuitable to Assess Balance

- Balance is an in-sample property, so it doesn't make sense to test a super-population hypothesis.

- The `\(p\)`-value may increase after matching simply because the power has decreased due to throwing out samples.

---
# Assessing Match Quality

- Both `MatchIt` and `Matching` have built-in functions for reporting various measures of balance.

- If balance is bad, we should try again with a different strategy.

- As long as we don't consult the outcome, this isn't data dredging.

- More balance will always reduce bias.

---
# Balance vs Independence

- We sometimes treat the completely randomized trial as the "gold standard" for causal inference.
  + In a randomized trial, everyone has the same propensity score.
  + Treatment is independent of confounders *on average*.

- In any finite randomized trial, some important predictors of `\(Y\)` might be imbalanced between treatment and control groups.

- This imbalance will lead to confounding.
  + Recall that we saw this before: this is why it is better to estimate the propensity score than to use the true propensity score.

---
# Balance vs Independence

- There is an even better design than the completely randomized trial -- the *fully blocked experiment*.

- In the fully blocked experiment, participants are put into categories based on their values of confounders.

- Within each block, treatment is assigned randomly to half the participants.

- This design guarantees balance for the variables defining blocks, eliminating bias if these variables are sufficient to control for confounding.

- Matching is attempting to emulate the fully blocked design.

- Inverse probability weighting is attempting to emulate the completely randomized design.
  + Exact matching is equivalent to using a fully saturated model for the propensity score or a fully saturated model for outcome regression.

---
# Balance and Model Dependence

- With perfectly balanced data, the model we fit doesn't matter.

- `\(X\)` and `\(A\)` are exactly orthogonal in sample.

- Any outcome model plus standardization will give us the same estimate of the ATE as the difference in means estimator.
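- A tiny sketch of this point with a hypothetical, exactly balanced sample (every `x` value appears once in each treatment group), showing that the estimated coefficient on `\(A\)` does not change as the model changes:

```r
set.seed(3)
x0 <- rnorm(100)
dat_b <- data.frame(x = c(x0, x0), a = rep(c(1, 0), each = 100))  # exact balance in x
dat_b$y <- 1 + 0.5 * dat_b$a + dat_b$x + 0.3 * dat_b$x^2 + rnorm(200)
coef(lm(y ~ a, data = dat_b))["a"]               # difference in means
coef(lm(y ~ a + x, data = dat_b))["a"]           # same estimate
coef(lm(y ~ a + x + I(x^2), data = dat_b))["a"]  # same estimate again
```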
<img src="6_propensity_scores_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" />

---
# Balance and Model Dependence

- The more imbalance in the data, the more our results will depend on modeling choices.

- Estimates of the ATE in the data below using standardization are:
  + `\(E[Y \vert A, X] = \beta_0 + \beta_1 A + \beta_2 X\)`: 0.34
  + `\(E[Y \vert A, X] = \beta_0 + \beta_1 A + \beta_2 X + \beta_3 X^2\)`: 0.13
  + `\(E[Y \vert A, X] = \beta_0 + \beta_1 A + \beta_2 X + \beta_3 X^2 + \beta_4 X A + \beta_5 X^2A\)`: 0.22

<img src="6_propensity_scores_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" />

---
# The Propensity Score Paradox

- King and Nielsen (2019) point out that the propensity score is insufficient for optimizing on balance.

- Even exact matching on the propensity score would only leave us with a completely randomized study.

- The propensity score can't get us to the equivalent of a fully blocked design.

---
# The Propensity Score Paradox

- Suppose we perform 1:1 matching on the propensity score and then begin to prune observations.
  + We repeatedly remove the pair of points that are the worst match.

- How will balance change as we have fewer and fewer observations?

--

- The balance will likely improve after the initial matching compared to the original sample.

- After a certain point, balance will begin to get worse.

---
# Propensity Score Paradox

- Imagine that we already have a completely randomized sample and that `\(X\)` has finite support.

- Suppose we start randomly removing pairs of cases and controls.

- The data will get sparser and therefore less balanced.

- Abadie and Imbens (2006) show that if points in `\(K\)` dimensions are randomly sampled from a region with bounded support, then the distance between any point and its nearest neighbor is order `\(n^{-1/K}\)`.

<!-- + As the number of treated goes down, each point will be on average farther from its closest neighbor of the opposite treatment. -->
<!-- + So balance is decreasing. -->

<center>
<img src="img/6_knfig2b.png" width="80%" />
</center>

---
# Propensity Score Paradox

<center>
<img src="img/6_knfig3.png" width="95%" />
</center>

---
# Bias Caused by Imbalance

- Theory tells us that completely randomized trials are unbiased *on average*.

- So propensity score matched analyses should be unbiased on average if the propensity score model is correct.

- However, King and Nielsen (2019) argue that lack of balance creates model dependency.

- In practice, researchers never fit just one model.

- Even researchers with the best intentions may harbor subconscious biases about the outcome they expect that guide their modeling choices.

---
# Bias Caused by Imbalance

<center>
<img src="img/6_knfig2a.png" width="95%" />
</center>

---
# Summary of Propensity Score Methods

- IP Weighting:
  + Asymptotically unbiased if the `\(\pi\)` model is correct.
  + Can have high variance if weights are large.

- Subclassification:
  + More robust than IP weighting and with lower variance.
  + Biased asymptotically.

- Subclassification regression:
  + Adds robustness by incorporating outcome regression.
  + Technically not doubly robust; not consistent if the `\(b\)` model is misspecified.

---
# Summary of Propensity Score Methods

- Doubly robust methods:
  + Consistent if either the `\(\pi\)` or the `\(b\)` model is correct.
  + Many asymptotically equivalent DR estimators.
  + Can have big performance differences in finite samples.

- Matching:
  + May or may not use the propensity score.
  + Seeks to create perfect balance, emulating a fully blocked design.
  + Better balance = lower bias.
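---
# Appendix: IP Weighting vs. Subclassification in NHEFS

A closing sketch (not from the original slides) comparing the IP weighted (Horvitz-Thompson) estimate with the subclassification estimate computed by `delta_s()` earlier; it reuses the same propensity score model and assumes `dat` is the NHEFS data frame used above, restricted to units with observed `wt82_71`.

```r
ps_fit <- glm(qsmk ~ sex + race + age + I(age^2) + as.factor(education) +
                smokeintensity + I(smokeintensity^2) + smokeyrs + I(smokeyrs^2) +
                as.factor(exercise) + as.factor(active) + wt71 + I(wt71^2),
              family = binomial(), data = dat)
pi_hat <- predict(ps_fit, type = "response")
A <- dat$qsmk; Y <- dat$wt82_71
delta_ht <- mean(A * Y / pi_hat - (1 - A) * Y / (1 - pi_hat))  # Horvitz-Thompson estimate
c(IPW = delta_ht, Subclassification = delta_s(dat))
```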