L7: Matching

class: center, middle, inverse, title-slide

.title[
# L7: Matching
]
.author[
### Jean Morrison
]
.institute[
### University of Michigan
]
.date[
### 2023-02-06 (updated: 2023-02-05)
]

---

`$\newcommand{\ci}{\perp\!\!\!\perp}$`

## Lecture Plan

1. Matching

- Content in this lecture is from Stuart (2010) and King and Nielsen (2019).

---

## Matching

- In our discussion of the ATT, we mentioned that matching was one strategy for estimating the ATT.

- Suppose that `$Y(a) \ci A \vert L$` where `$A$`, `$Y$` and `$L$` are binary.

- For each person with `$A = 1$`, we select a control with matching `$L$` value.

- We leave out any samples that cannot be matched.

- The new matched sample has `$2\cdot n_1$` units with half receiving `$A =1$` and the other half receiving `$A = 0$` .

- In this sample we can estimate

`$$\begin{split}
E[Y(1) | A = 1] = &\frac{1}{n_1}\sum_{i = 1}^{2n_1} A_i Y_i\\
E[Y(0) \vert A = 1] =& \frac{1}{n_1}\sum_{i = 1}^{2n_1}(1-A_i)Y_i
\end{split}$$`

---
## Matching to Target the ATU or ATE

- The method we just described is called **exact pair** matching.
  + **Exact** because the covariates match exactly within pairs. 
  + **Pair** because we created one pair for each case.
  
- If we want to estimate the ATU, we can simply switch the reference group and find once case for each control.

- As we have described the procedure so far, if we are targeting the ATT, we need to have `$n_0 \geq n_1$`. 
  + If we are targeting the ATU, we neet `$n_1 \geq n_0$`.
  
- We haven't said what to do if there is no match.

- The method doesn't easily extend to targeting the ATE.

- We will see some variations that handle these situations.

---

## Matching when `$L$` is High Dimensional

- For high dimensional `$L$`, we probably cannot find exact matches.

- Instead, we can define a distance measure and match cases with "close" controls.

- We could define our distance using the full vector of `$L$`
  + E.g the Mahalanobis distance `$D_{i,j} = (L_i - L_j)^\top \Sigma^{-1} (L_i - L_j)$`
  + `$\Sigma$` is the sample covariance of `$L$` in controls if matching to the treated.

- Or we could define our distance based on the propensity score: 
  + Absolute difference `$D_{i,j} = \vert \hat{\pi}_i - \hat{\pi}_j \vert$`
  + Difference in logits `$D_{i,j} = \vert logit(\hat{\pi}_i) - logit(\hat{\pi}_j )\vert$`

---

## Nearest Neighbor Matching

- Without replacement: Match each case to the **un-matched** control that is closest to it. 
  + Will depend on the order matches are assigned.

- With replacement: Match each case to the control that is closest to it. 
  + We may re-use some controls. 
  
---

## Nearest Neighbor Matching

---

## Nearest Neighbor Matching

---

## NN Matching with Replacement

---
## Matching with Replacement

- Matching with replacement generally improves the quality of matches.

- It is good to check that we are not relying on only a small number of controls.

- If we match with replacement, we need to use weights in the analysis.
  + If one control is matched to two cases, it should count as two controls. 
  
---
## Weighted Matching Estimators
  
- When there are matching weights, we will estimate the counterfactuals as weighted averages:

`$$\begin{split}
E[Y(1) | A = 1] = &\frac{\sum_{i = 1}^{n} A_i Y_i W_i}{\sum_{i = 1}^n A_iW_i}\\
E[Y(0) \vert A = 1] =& \frac{\sum_{i = 1}^{n} (1-A_i) Y_i W_i}{\sum_{i = 1}^n (1-A_i)W_i}
\end{split}$$`

- In our nearest neighbor with replacement example, there are 10 treated units who all receive weight 1 so `$\sum_{i = 1}^n A_i W_i = 10$`.

- There are 8 controls in the matched sample. 6 of them are used once and get a weight of 1 and two are used twice and get a weight of 2, so `$\sum_{i = 1}^n (1-A_i)W_i = 10$`.

- Note that the sum here is over the matched sample, not the original sample.

- We didn't have to remember the pairings to compute the estimates.

---
## NN Matching with a Caliper

- Exclude matches that are farther apart than a set distance.

---
## NN Matching with a Caliper

---
## Nearest Neighbor Matching

- Nearest neighbor matching with no caliper can lead to bad matches.

- Adding a caliper leads us to drop some cases.

- King, Lucas, and Nielsen (2017) use the terminology **feasible (F)** and **non-feasible (NF)** set.

- Excluding the NF samples changes the estimand, we are no longer estimating the ATT (or the ATU).

- Greifer and Stuart describe the estimand estimated by the caliper matching esimator as the ATO: The average treatment in the **equipoise** set. 
  + The equipoise set is those who are in feasible regions of covariate space, i.e. those that are shared by both treated and untreated units. 
  
---

## 1:k Matching

- Match each case with `$k$` nearest controls, with or without replacement

---

## 1:k Matching with Replacement

---
## 1:k Matching

- Increasing the number of matches will decrease the quality of matches.

- But will increase the number of controls we retain in the analysis.

- This results in a bias-variance trade-off:
  + Decreased match quality = increased bias
  + Increased number of controls = decreased variance

---
## Nearest Neighbor Variations

- Optimal matching: Rather than "greedily" choose the match closest to each case one at a time, try to minimize a global distance measure.

- Matching with replacement removes the need for optimal matching.

- Variable ratio matching allows different individuals to have different numbers of matches. (Requires weighting in analysis)

---
## Coarsened Exact Matching

- Divide continuous covariates into bins.

- Perform exact matching on binned values:

- Treated and controls in the same bin are weighted according to the frequency of treated/controls in that bin. 
  + If a bin contains two treated and three controls, each control receives weight `$2/3$` and each treated receives weight 1.

---

## Coarsened Exact Matching

---
## Subclassification

- The subclassification estimator `$\hat{\Delta}_S$` is a coarsened exact matching estimator using the propensity score.

- The only difference between the  `$\hat{\Delta}_S$` we saw previously and a CEM estimator is the relative weight of cases and controls.

- The `$\hat{\Delta}_S$` estimator weights the estimate within each stratum by the proportion of total individuals in that stratum.

- Weighting by the proportion of treated individuals gives an estimate of the ATT.

---
## Full Matching

- Rather than pre-selecting the number of classes, use the maximum number that allow us to have at least one case and one control in each class.

- We identify subclasses such that:

- Every subclass contains at least one case and one control.
  + Can be 1:1, 1:many, or many:1

- A measure of global distance is minimized over all possible full matches.

- The full subclassification estimator is like a full matching estimator matching on the PS.

---
## Full Matching

---
## Outcome Regression after Full Matching

- Recall that in subclassification, we could fit an outcome regression within each subclass (the `$\hat{\Delta}_{SR}$` estimator).

- In full matching, we can't fit a regression within each subclass.

- Instead we fit a model to all data shared coefficients for covariates but subclass specific coefficients for treatment.

`$$E[Y_{i,k} \vert A, L] = \beta_{0,k} + \beta_{1,k}A_{i,k} + \gamma L_{i,k}$$`
- We then average subclass effects to obtain 
`$$\hat{\beta} = \frac{n_k}{N}\sum_{k} \hat{\beta}_{1,k}$$`

---
## Variance Estimation

- Rubin and Thomas (1996) and Rubin and Stuart (2006) argue that, if we are matching on PS, we don't need to account for uncertainty in in estimating the PS because not doing so will be conservative.

- The usual robust sandwich variance (`vcovHC`) will not be accurate, but we can use cluster robust sandwich estimation (e.g. using `gee` or `vcovCL`).

- Another alternative is the bootstrap.

- Abadie and Imbens (2006) provide a large sample estimator of the variance for 1:k matching with replacement.

---
## Implementation

- Unlike some of our other methods, matching is not trivial to implement.

- The R package `Matching` implements NN matching strategies including a genetic algorithm to automatically find a balanced match. 
  + Includes Abadie and Imbens standard errors for treatment effects. 
  + Provides a function to generate balance statistics.

- R package `MatchIt` includes more matching strategies than `Matching` including full matching, coarsened exact matching, and others.

---
## Assessing Matches

- If matching is successful, then all variables in `$L$` should be balanced between cases and controls in the matched population.

- To determine if matching is adequate to control confounding, we need to check that this is true.

- Ideally we could compare the full multivariate distribution of `$L$` between cases and controls, but this is too complicated.

- Instead we look at each variable individually. 
  + Stuart (2010) says we should also look at products and squares. 
  
---
## Assessing Matches

- Rubin (2001) suggests three balance measures:

1. Standardized difference in means of the propensity score.(Should be  < 0.25)
1. Ratio of the variances of the propensity score in treated and control. (Should be between 0.5 and 2)
1. For each covariate, ratio of the variance after regressing out the propensity score. (Should be between 0.5 and 2)

---
## Visually Assessing Balance

- Q-Q plots comparing quantiles in treated to quantiles in untreated.

</center>

---
## Visually Assessing Balance

- Q-Q plots comparing quantiles in treated to quantiles in untreated.

</center>

---
## p-Values are Unsuitable to Assess Balance

- Balance is an in-sample property, it doesn't make sense to test a super-population hypothesis.

- The `$p$`-value may increase after matching simply because the power has decreased due to throwing out samples.

---

## Assessing Match Quality

- Both `MatchIt` and `Matching` have built in functions for reporting various measures of balance.

- If balance is bad, we should try again with a different strategy.

- As long as we don't consult the outcome, this isn't data dredging.

- More balance will always reduce bias.

---
## Balance vs Independence

- We sometimes treat the completely randomized trial as the "gold standard" for causal inference.

+ In a randomized trial, everyone has the same propensity score. 
  + Treatment is independent of counfounders *on average*. 
  
- In any finite randomized trial, some important predictors of `$Y$` might be imbalanced between treatment and control groups.

- This imbalance will lead to confounding.
  + Recall that we saw this before, this is why it is better to estimate the propensity score than to use the true propensity score. 
  
  
---
## Balance vs Independence
  
- There is an even better design than the completely randomized trial -- the **fully blocked experiment**.

- In the fully blocked experiment, participants are put into categories based on their values of confounders.

- Within each block treatment is assigned randomly to half the participants.

- This design guarantees balance for the variables defining blocks, eliminating bias if these variables are sufficient to control for confounding.

- Matching is attempting to emulate the fully blocked design.

- Inverse probability weighting is attempting to emulate the completely randomized design. 
  + Exact matching is equivalent to using a fully saturated model for the propensity score or a fully saturated model for outcome regression.

---
## Balance and Model Dependence

- With perfectly balanced data, the model we fit doesn't matter.

- `$X$` and `$A$` are exactly orthogonal in sample.

- **Any** outcome model plus standardization will give us the same estimate of the ATE as the difference in means estimator.

<img src="7_matching_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />
---
## Balance and Model Dependence

- The more imbalance in the data, the more our results will depend on modeling choices.

- Estimates of the ATE in the data below using standardization are:
  + `$E[Y \vert A, X] = \beta_0 + \beta_1 A + \beta_2 X$`: 0.34
  + `$E[Y \vert A, X] = \beta_0 + \beta_1 A + \beta_2 X + \beta_3 X^2$`: 0.13
  + `$E[Y \vert A, X] = \beta_0 + \beta_1 A + \beta_2 X + \beta_3 X^2 + \beta_4 X A + \beta_5 X^2A$`: 0.22

<img src="7_matching_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />
---
## Assessing Balance in NHEFS

- To keep things simple, we will focus on age, weight in 1971, and the estimated propensity score.

---
## IP Weighted Cumulative Densities

- I calculated the weighted eCDF, weighted by `$W_i = \frac{A_i}{\hat{\pi}_i} + \frac{1-A_i}{1-\hat{\pi}_i}$`

- For the un-weighted CDF, of some variable `$X$`, we compute `$$\hat{P}[ X \leq x] = \frac{1}{n}\sum_{i = 1}^n I(X_i \leq x)$$`

- For the weighted CDF, we use `$$\hat{P}[ X \leq x] = \frac{1}{\sum_{i = 1}^n W_i}\sum_{i = 1}^n I(X_i \leq x)W_i$$`

---
## IP Weighted Cumulative Densities

- Inverse probability weighting is effective at achieving balance for these variables.

---
## NHEFS Matching

```r
library(geepack)
a_form <- formula(qsmk ~ sex + race + age +
                      as.factor(education) + smokeintensity + smokeyrs +
                      as.factor(exercise) + as.factor(active) + wt71 )

nhefs_match <- matchit(formula = a_form,  data = dat, method = "full", distance = "mahalanobis", estimand = "ATE" )
md <- match.data(nhefs_match)
## Compute weighted difference in means 
y1 <- with(md, sum(qsmk*wt82_71*weights)/sum(qsmk*weights)) 
y0 <- with(md, sum((1-qsmk)*wt82_71*weights)/sum((1-qsmk)*weights)) 
round(y1 - y0, digits = 2)
```

```
## [1] 3.29
```

```r
## Cluster robust standard errors
f1 <- geeglm(wt82_71 ~ qsmk, weights = weights, id = subclass, corstr = "independence", data = md)
ci <- round(f1$coefficients[2] + c(-1, 1)*qnorm(0.975)*sqrt(vcov(f1)[2,2]), digits = 2)
cat(round(f1$coefficients[2], digits = 2), " (", ci, ")")
```

```
## 3.29  ( 2.23 4.35 )
```

---
## Balance After Matching

- To me it looks like IP weighting achieved slightly better balance in these three variables than full matching with Mahalanobis distance.

- Use `summary(nhefs_match)` to see a summary of all variables.

- We could try different matching methods until we were satisfied with the balance.

---
## The Propensity Score Paradox

- King and Nielsen (2019) point out that the propensity score is insufficient for optimizing on balance.

- Even exact matching on the propensity score would only leave us with a completely randomized study.

- The propensity score can't get us to the equivalent of a fully blocked design, but matching on the full covariate vector can.

---

## The Propensity Score Paradox

- Suppose we perform 1:1 matching on the propensity score and then begin to prune observations - the equivalent of imposing a caliper on our matches.

- We would think that as the caliper gets smaller, balance would improve.

- This is correct but only up to a point. After we have removed all points that are not in shared covariate space we have arrived at the fully randomized trial.

- If we keep narrowing the caliper, we will remove points about at random. Balance will not improve but the sample will get smaller.

</center>

---
## Bias Caused by Imbalance

- Theory tells us that completely randomized trials are unbiased *on average*.

- So propensity score matched analyses should be unbiased on average if the propensity score model is correct.

- However, King and Nielsen (2019) argue that lack of balance creates model dependency.

- In practice, researchers never fit just one model.

- Even researchers with the best intentions may harbor subconscious biases about the outcome they expect that guide their modeling choices.

---
## Bias Caused by Imbalance

- In this graph King and Nielsen are demonstrating that remaining imbalance results in high model dependence.

- On the right they show the largest coefficient we could get by fitting 512 different models.

- Using Mahalanobis distance match eventually results in a very balanced data set as the caliper shrinks.

- When there is perfect balance, there is no model dependency. PS Matching cannot achieve this.

</center>

---

## Summary of Methods So Far

- IP Weighting:
  + Asymptotically unbiased if `$\pi$` model is correct. 
  + Can have high variance if weights are large. 
  
- Standardization/G-Computation:
  + Asymptotically unbiased if outcome model is correct. 
  + Sometimes more efficient than IP weighting if both methods use correct model.

---

## Summary of Methods So Far

- Subclassification:
  + More robust than IP weighting and with lower variance. 
  + Biased asymptotically.

- Subclassification regression
  + Includes robustness by incorporating outcome regression.
  + Technically not doubly robust, not consistent if `$b$` model misspecified. 
  
---
## Summary of Propensity Score Methods

- Doubly robust methods:
  + Consistent if either `$\pi$` or outcome models are correct. 
  + Many assymptotically equivalent DR estimators. 
  + Can have big performance differences in finite samples. 
  
- Matching:
  + May or may not use propensity score. 
  + Seeks to create perfect balance, emulating a fully matched block design. 
  + Better balance = lower bias