class: center, middle, inverse, title-slide

.title[
# L6: IP Weighting and G-Computation
]
.author[
### Jean Morrison
]
.institute[
### University of Michigan
]
.date[
### Lecture on 2025-02-10 (updated: 2025-02-19)
]

---
`\(\newcommand{\ci}{\perp\!\!\!\perp}\)`

## Lecture Plan

1. Intro
2. Inverse Probability Weighting
  - IP weighting for binary and continuous treatment
  - Marginal structural models
  - Censoring weights
3. G-Computation

---
# 1. Introduction

---
## G-Formula

- The formula we saw previously as the standardization formula is also called the g-formula
$$ E[Y(a)] = \sum_l E[Y \vert A = a, L = l]P[L = l] $$

- We showed in L1 that an alternate expression of the g-formula is
$$ E[Y(a)] = E\left[ \frac{I(A = a) Y}{P(A = a \vert L)}\right] $$

---
## Non-Parametric G-Formula Estimates

- In L1-L5, we always assumed that `\(L\)` and `\(A\)` were discrete and data were abundant.

- This allows us to estimate the components of these formulas non-parametrically as
$$
`\begin{split}
&\hat{E}[Y \vert A = a, L = l] = \frac{\sum_{i} 1_{A = a, L = l} Y_i}{\sum_i 1_{A = a, L = l}}\\
&\hat{P}[L = l] = \frac{\sum_i 1_{L = l}}{n}
\end{split}`
$$
and
$$
`\begin{split}
\hat{P}[A = a \vert L = l] = \frac{\sum_{i} 1_{A = a, L = l}}{\sum_i 1_{L = l}}
\end{split}`
$$

- We showed in HW1 that if we plug these estimates into their respective formulas, the standardization and IP weighting formulas give exactly the same estimate.

---
## Models

- In most cases, `\(L\)` is multivariate and we don't have enough data to use the non-parametric estimates.

- In these cases, we will need to substitute parametric estimates.

- What are some ways you know to estimate `\(E[Y \vert A = a, L = l]\)`?

- What are some ways you know to estimate `\(P[ A = a \vert L = l]\)`?

---
## Models

- Formally, a **model** is a class of probability distributions `\(\mathcal{M} = \lbrace P_\theta : \theta \in \Theta \rbrace\)`.

- `\(\theta\)` is a (possibly infinite) vector of parameters and `\(\Theta\)` is the parameter space.

- A model is called **parametric** when `\(\theta\)` is finite dimensional and `\(\mathcal{M}\)` is not the set of all possible probability distributions.

- If `\(\theta\)` is infinite dimensional, `\(\mathcal{M}\)` may be called **non-parametric** or **semi-parametric**.
  + Semi-parametric models have a finite, structured component and an infinite component.

- In this lecture we will talk about parametric models.

---
## Parametric G-Formula Estimates

- We can use the G-formula to obtain estimates of `\(E[Y(a)]\)` using parametric estimates of the components of the formulas.

- If we use parametric estimates, we will generally **not** get numerically identical estimates from the two versions of the formula.
  + As a result, the G-formula provides us with two distinct estimation methods.

- G-computation (also called standardization): Estimate `\(E[Y \vert A, L]\)` and plug into the first version of the g-formula.

- IP weighting: Estimate `\(P(A =a \vert L)\)`, the **propensity score**, and plug into the second version of the g-formula.

---
# 2. Inverse Probability Weighting

---
## NHEFS Data

- Cigarette smokers aged 25-74 were recruited around 1971 and given a survey.

- Ten years later, participants were given a follow-up survey.

- We are interested in estimating the effect of quitting smoking on weight change on the additive scale.

---
## Variables

- Our exposure is binary (whether or not a person quit smoking between the first and second survey).

- The outcome is the change in weight in kg.
- What other features would you want to know in order to estimate the causal effect?

--

- For now we will assume that the following variables are sufficient:
  + Sex (0: male, 1: female)
  + Age (in years)
  + Race (0: white, 1: other)
  + Education (5 categories)
  + Intensity of smoking (cigarettes per day)
  + Duration of smoking (years)
  + Physical activity in daily life (3 categories)
  + Recreational exercise (3 categories)
  + Weight (kg)

---
## Limitations

- For now we will focus on individuals with complete data.
  + There are some people who did not fill in the second survey.
  + For these people we know covariates but don't know either the exposure or the outcome.
  + Additionally, there are 63 people who did fill in the second survey but whose weight is unknown.
  + More on this later.

- Individuals may have quit at different times.
  + We could think of `\(A\)` as a time-varying treatment (coming up later).

---
## Simple Analysis

- The simplest analysis is to compare the mean weight change between quitters and non-quitters.

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> Quit Smoking </th>
   <th style="text-align:right;"> Average Weight Change </th>
   <th style="text-align:right;"> N </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1.98 </td>
   <td style="text-align:right;"> 1163 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 4.53 </td>
   <td style="text-align:right;"> 403 </td>
  </tr>
</tbody>
</table>

- On average, quitters gained about 4.5 kg over the 10-year period while non-quitters only gained about 2 kg.

- This is the unadjusted association, `\(E[Y \vert A = 1] - E[Y \vert A = 0]\)`.

- We learned previously that this is only an estimate of the causal effect if `\(Y(a) \ci A\)`.

- Since we don't think that is true in these data, this is probably not a good estimate of the causal effect.
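
---
## Computing the Unadjusted Association

- For reference, a minimal sketch of how this unadjusted comparison could be computed (assuming the complete-case data frame `dat`, with the quit-smoking indicator `qsmk` and weight change `wt82_71` used throughout these slides):


``` r
library(dplyr)

# Mean weight change and sample size by quitting status
dat %>%
  group_by(qsmk) %>%
  summarize(avg_wt_change = mean(wt82_71), n = n())

# The same contrast expressed as an unadjusted regression coefficient
lm(wt82_71 ~ qsmk, data = dat)
```

- The coefficient on `qsmk` is the unadjusted difference `\(E[Y \vert A = 1] - E[Y \vert A = 0]\)`, with no adjustment for confounding.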
--- ## Quitters v Non-Quitters <table> <thead> <tr> <th style="text-align:left;"> Variable </th> <th style="text-align:left;"> Did Not Quit Smoking </th> <th style="text-align:left;"> Quit Smoking </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Total </td> <td style="text-align:left;"> 1163 </td> <td style="text-align:left;"> 403 </td> </tr> <tr> <td style="text-align:left;"> Age, years </td> <td style="text-align:left;"> 42.8 (11.8) </td> <td style="text-align:left;"> 46.2 (12.2) </td> </tr> <tr> <td style="text-align:left;"> Female </td> <td style="text-align:left;"> 53.4% (621) </td> <td style="text-align:left;"> 45.4% (183) </td> </tr> <tr> <td style="text-align:left;"> Non-White </td> <td style="text-align:left;"> 14.6% (170) </td> <td style="text-align:left;"> 8.9% (36) </td> </tr> <tr> <td style="text-align:left;"> Cigarettes/day </td> <td style="text-align:left;"> 21.2 (11.5) </td> <td style="text-align:left;"> 18.6 (12.4) </td> </tr> <tr> <td style="text-align:left;"> Years smoking </td> <td style="text-align:left;"> 24.1 (11.7) </td> <td style="text-align:left;"> 26 (12.7) </td> </tr> <tr> <td style="text-align:left;"> Weight, kg </td> <td style="text-align:left;"> 70.3 (15.2) </td> <td style="text-align:left;"> 72.4 (15.6) </td> </tr> <tr> <td style="text-align:left;"> College </td> <td style="text-align:left;"> 9.9% (115) </td> <td style="text-align:left;"> 15.4% (62) </td> </tr> <tr> <td style="text-align:left;"> Little exercise </td> <td style="text-align:left;"> 37.9% (441) </td> <td style="text-align:left;"> 40.7% (164) </td> </tr> <tr> <td style="text-align:left;"> Inactive life </td> <td style="text-align:left;"> 8.9% (104) </td> <td style="text-align:left;"> 11.2% (45) </td> </tr> </tbody> </table> --- ## Estimating Weights - In our data, `\(L\)` is a vector of nine measurements. - We cannot compute `\(P[A = 1 \vert L ]\)` among every (or any) stratum of `\(L\)` because every participant has their own unique value of `\(L\)`. - Instead, we will fit a parametric model! --- ## Weights Estimated by Logistic Regression - We will use a logistic regression model to predict `\(P[A \vert L]\)` - The goal is to get the best possible estimate of the probability of treatment given `\(L\)`, so we should fit a flexible model. - Once we fit the model, we can estimate the probability of quitting for each person as `\(\pi_i = P[A = 1 \vert L = L_i]\)`. + These are the probability-scale fitted values from the logistic model. + Recall, `\(\pi_i\)` is the **propensity score**. - The weight for individual `\(i\)` is `\(\frac{1}{\pi_i}\)` if `\(A_i = 1\)` (the person actually did quit) or `\(\frac{1}{1-\pi_i}\)` if `\(A_i = 0\)` (the person did not quit). --- ## Weights Estimated by Logistic Regression ``` r fit <- glm(qsmk ~ sex + race + age + I(age ^ 2) + as.factor(education) + smokeintensity + I(smokeintensity ^ 2) + smokeyrs + I(smokeyrs ^ 2) + as.factor(exercise) + as.factor(active) + wt71 + I(wt71 ^ 2), family = binomial(), data = dat) dat$w <- ifelse(dat$qsmk == 0, 1/(1 - predict(fit, type = "response")), 1/(predict(fit, type = "response"))) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 1.054 1.230 1.373 1.996 1.990 16.700 ``` - Why is the mean value close to 2? --- ## Positivity - *Structural violations* of positivity occur when it is impossible for people with some levels of confounders to receive a particular level of treatment. + If there are structural violations, we cannot use IP weighting or standardization. 
  + We need to restrict our inference to strata of `\(L\)` where positivity holds.

- *Random violations* of positivity occur when certain combinations of `\(L\)` and `\(A\)` are missing from our data by chance.
  + Using a parametric propensity model, we can smooth over unobserved covariate values.
  + Using IP weighting with a parametric model, we are implicitly assuming that all violations of positivity are random and not structural.
  + We still need to be careful about predicting outside the range of the observed data.

---
## Assessing Positivity in Propensity Scores

- If positivity holds, propensity scores should be bounded away from 0 and 1:
  + Scores very close to 0 or very close to 1 suggest that there are some strata of `\(L\)` that have no chance to receive one of the two treatments.
  + We might get scores close to 0 or 1 if there is perfect separation by one or a combination of confounders.
  + Scores close to 0 or 1 suggest there could be structural positivity violations.

- Propensity scores should have approximately the same range in both groups.
  + Non-overlapping ranges indicate that there are some regions of confounder space with only cases or only controls in our study.
  + If we believe the PS model, we can trust the weights and assume this is due to random positivity violations.
  + In some approaches based on propensity scores, this will cause a problem (matching and subclassification).

---
# Propensity Scores and Positivity

<img src="5_modeling1_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---
## Horvitz-Thompson Estimator

- The *Horvitz-Thompson estimator* is one of two options for estimating `\(E[Y(a)]\)` using our estimated weights.

- The Horvitz-Thompson estimator just plugs our estimated weights into our previous formula.

$$ \hat{E}[Y(a)] = \hat{E}\left[\frac{I(A = a)Y}{P(A= a\vert L)} \right] = \frac{1}{n}\sum_{i = 1}^n I(A_i = a)Y_i W_i $$

---
## Hajek Estimator

- The *Hajek estimator* fits the linear model

$$ E[Y \vert A] = \beta_0 + \beta_1 A $$

by weighted least squares, weighting individuals by our estimated IP weights.

- It is equivalent to

`$$\frac{ \hat{E}\left[\frac{I(A = a)Y}{P(A =a\vert L)} \right]}{ \hat{E}\left[\frac{I(A = a)}{P(A =a\vert L)} \right]} = \frac{\sum_{i = 1}^n I(A_i = a)Y_i W_i}{\sum_{i = 1}^n I(A_i = a) W_i}$$`

---
## Hajek Estimator

- The Hajek estimator is an unbiased estimator of

`$$\frac{ E\left[\frac{I(A = a)Y}{P(A =a\vert L)} \right]}{ E\left[\frac{I(A = a)}{P(A = a\vert L)} \right]}$$`

- If positivity holds, then
$$
`\begin{split}
E\left[\frac{I(A = a)}{P(A = a\vert L)} \right] = & E_l \left[ E\left[\frac{I(A = a)}{P(A = a\vert L = l )} \Big\vert L \right] \right]\\
 = & E_l \left[ \frac{E[I(A = a) \vert L = l]}{P(A = a\vert L = l)} \right]\\
 = & E_l \left[ \frac{P(A = a \vert L = l)}{P(A = a\vert L = l)}\right] = 1
\end{split}`
$$

- The Hajek estimator is guaranteed to be between 0 and 1 for dichotomous outcomes. The Horvitz-Thompson estimator is not.

---
## IP Weighted Effect Estimate in NHEFS

- To compute the Hajek estimator, we fit a weighted linear regression:


``` r
f_w_lm <- lm(wt82_71 ~ qsmk, data = dat, weights = w)
f_w_lm
```

```
## 
## Call:
## lm(formula = wt82_71 ~ qsmk, data = dat, weights = w)
## 
## Coefficients:
## (Intercept)         qsmk  
##       1.780        3.441
```

- We estimate that if nobody had quit smoking, the average weight gain would have been 1.8 kg.

- If everyone had quit smoking, the average weight gain would have been 5.2 kg.

- The average causal effect of quitting smoking on weight gain is 3.4 kg.
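
---
## Horvitz-Thompson and Hajek by Hand

- As a check on the weighted regression above, a minimal sketch of computing both estimators directly from the estimated weights (assuming the data frame `dat` and the weights `w` created earlier; `ht()` and `hajek()` are just illustrative helper names):


``` r
# Horvitz-Thompson: weighted sum divided by the full sample size n
ht <- function(a) with(dat, sum((qsmk == a) * wt82_71 * w) / nrow(dat))

# Hajek: weighted sum divided by the sum of the weights in arm a
hajek <- function(a) with(dat, sum((qsmk == a) * wt82_71 * w) /
                            sum((qsmk == a) * w))

c(HT = ht(1) - ht(0), Hajek = hajek(1) - hajek(0))
```

- The Hajek contrast reproduces the `qsmk` coefficient from the weighted least squares fit; the Horvitz-Thompson contrast generally differs because the weights within each arm do not sum exactly to `\(n\)`.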
---
## Estimating the Variance of the Estimate

- The raw variance estimate from weighted least squares does not account for uncertainty in the weights.

- We need to account for having estimated the IP weights.

- Options:
  1. Derive the variance analytically
  2. Bootstrap the variance
  3. Use a robust sandwich variance estimate.

- Option 3 is conservative but easier than options 1 and 2.

- Option 3 can be achieved by fitting with GEE using an independent working correlation matrix or using the `HC0` option in `vcovHC`.

---
## Variance Estimate in the NHEFS Data


``` r
library("geepack")
f_w_gee <- geeglm(wt82_71 ~ qsmk, data = dat, weights = w,
                  id = seqn, corstr = "independence")
beta <- coef(f_w_gee)
SE <- coef(summary(f_w_gee))[, 2]
lci <- beta - qnorm(0.975) * SE
uci <- beta + qnorm(0.975) * SE
cbind(beta, lci, uci)
```

```
##                 beta      lci      uci
## (Intercept) 1.779978 1.339514 2.220442
## qsmk        3.440535 2.410587 4.470484
```

``` r
library(sandwich)
# Equivalent alternative using sandwich
v_wlm_robust <- vcovHC(f_w_lm, type = "HC0")
all.equal(v_wlm_robust, vcov(f_w_gee))
```

```
## [1] TRUE
```

---
## Marginal Structural Models

- In the IP weighting strategy, we create a pseudo-population in which `\(A\)` is independent of `\(L\)` and then fit the model
`$$E_{ps}[Y \vert A ] = \beta_0 + \beta_1 A$$`

- If `\(Y(a) \ci A \vert L\)` in the sample and `\(A \ci L\)` in the pseudo-population, then `\(Y(a) \ci A\)` in the pseudo-population.
  + This means that `\(E_{ps}[Y \vert A] = E[Y(a)]\)`.

- So the parameter `\(\beta_1\)` is interpretable as the average causal effect `\(E[Y(1)] - E[Y(0)]\)`.

--

- Our model for `\(E_{ps}[Y \vert A]\)` is actually a **marginal structural model (MSM)**:
`$$E[Y(a)] = \beta_0 + \beta_1 a$$`

- This model is *marginal* because it is not conditional on any covariates.

- It is structural because it is a model for a counterfactual `\(E[Y(a)]\)`.

---
## Modeling Continuous Treatments

- With a binary exposure, there is only one possible marginal structural model.
  + `\(E[Y(a)] = \beta_0 + \beta_1 a\)` is a fully saturated model.

- For continuous (or highly polytomous) exposures, we need to use a parametric model instead.

- In the NHEFS data, let `\(A\)` be the change in smoking intensity ( `\(A = 10\)` indicates that a person increased their smoking by 10 cigarettes per day).

- We can propose a model for our *dose-response curve*
$$ E[Y(a)] = \beta_0 + \beta_1 a + \beta_2 a^2 $$

---
## Weights for Continuous Treatment

- To use IP weighting with a continuous treatment, we need to estimate the conditional pdf of `\(A\)`, `\(f(A \vert L)\)`.

- Estimating a pdf is a notoriously hard problem.

- One solution is to assume that `\(f(A \vert L)\)` is normal with constant variance across participants, but this is a very strong assumption.

- We can then estimate the mean of `\(f(A \vert L)\)` via linear regression and the variance of `\(f(A \vert L)\)` as the empirical residual variance from the linear regression.

- IP weighted estimates for continuous treatments can be very sensitive to these distributional assumptions.

---
## Stabilized Weights

- Using weights `\(W^{A} = 1/f(A \vert L)\)`, we create a pseudo-population in which each person is matched by someone exactly like them who received the opposite treatment.

- Our pseudo-population is twice as big as our actual sample, so the mean of `\(W^A\)` is 2.

- In the pseudo-population, the frequency of treatment `\(A = 1\)` is 50% for every stratum of `\(L\)`, so in the pseudo-population, `\(A \ci L\)`.
- This also means that, in the pseudo-population `\(Y(a) \ci A\)` (unconditionally), which is why we can use the weighted mean to estimate `\(E[Y(a)]\)`. --- ## Stabilized Weights - We could have created other pseudo-populations in which `\(A\)` is independent of `\(L\)`. - For example, to create a pseudo-population the same size as our observed data with treatment probability of 0.5, we could multiply all of the weights by 0.5. - To create a pseudo-population the same size as our observed data with treatment probability of 0.25, we would multiple the weights for people with `\(A = 1\)` by 0.25 and the weights for people with `\(A = 0\)` by 0.75. - As long as we scale the weights by some quantity that doesn't depend on `\(L\)`, we create a pseudo-population in which `\(A \ci L\)`. --- ## Stabilized Weights - If we look at the formula for the Hajek estimator, we can see that multiplying the weights by a factor that only depends on `\(A_i\)` will not change the estimate. `$$\frac{ \hat{E}\left[\frac{I(A = a)Y}{P(A =a\vert L)} \right]}{ \hat{E}\left[\frac{I(A = a)}{P(A =a\vert L)} \right]} = \frac{\sum_{i = 1}^n I(A_i = a)Y_i W_i}{\sum_{i = 1}^n I(A_i = a) W_i}$$` - This is because the MSM for a binary treatment is saturated. --- ## Stabilized Weights - For a non-saturated MSM, like we constructed for smoking intensity, multiplying `\(W_i\)` by a constant depending on `\(A_i\)` **will** change the estimate, but *not* the expected value of the estimate. - As long as we multiply `\(W_i\)` by something that doesn't depend on `\(L\)`, we will get an unbiased estimate of the structural parameters. - But different choices of multipliers will result in more or less efficient estimates. - It turns out that the most efficient estimate is achieved with **stabilized weights** $$ SW^A = \frac{f(A)}{f(A \vert L)} $$ - These weights correspond to a pseudo-population in which the treatment has the same marginal distribution as in our observed data. --- ## Stabilized Weights - We saw that for binary `\(A\)`, we don't expect using stabilized weights to change the estimate. Let's confirm this using our data. - For our binary treatment example, we can compute `\(f(1)\)` as the proportion of quitters in the data. ``` r p1 <- mean(dat$qsmk); p1 ``` ``` ## [1] 0.2573436 ``` --- ## Stabilized Weights - Next we compute stabilized weights by multiplying our original weights by `\(P(A = A_i)\)`: ``` r dat <- dat %>% mutate(pa = ifelse(dat$qsmk == 0, 1-p1, p1), sw= pa*w) summary(dat$sw) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.3312 0.8665 0.9503 0.9988 1.0793 4.2977 ``` - The expected value of `\(SW^A\)` is 1 because the pseudo-population corresponding to the stabilized weights is the same size as the observed data. --- ## Estimation Using the Stabilized Weights - We can use the stabilized weights in exactly the same way we used the non-stabilized weights. ``` r f_sw_gee <- geeglm(wt82_71 ~ qsmk,data = dat, weights = sw, id = seqn,corstr = "independence") beta <- coef(f_sw_gee) SE <- coef(summary(f_sw_gee))[, 2] lcl <- beta - qnorm(0.975) * SE ucl <- beta + qnorm(0.975) * SE cbind(beta, lcl, ucl) ``` ``` ## beta lcl ucl ## (Intercept) 1.779978 1.339514 2.220442 ## qsmk 3.440535 2.410587 4.470484 ``` - As expected, these are exactly the results we saw with the non-stabilized weights. 
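
---
## Checking the Invariance to Rescaling

- As a quick check of the claim that rescaling the weights by a factor that depends only on `\(A\)` does not change the (saturated) Hajek estimate, a minimal sketch (assuming `dat`, `w`, and `sw` from the previous slides; `w_rescaled` is just an illustrative name):


``` r
# Rescale the weights by arbitrary constants within each treatment arm
dat$w_rescaled <- ifelse(dat$qsmk == 1, 0.25 * dat$w, 0.75 * dat$w)

# All three weighted fits should return identical coefficients
rbind(
  unstabilized = coef(lm(wt82_71 ~ qsmk, data = dat, weights = w)),
  stabilized   = coef(lm(wt82_71 ~ qsmk, data = dat, weights = sw)),
  rescaled     = coef(lm(wt82_71 ~ qsmk, data = dat, weights = w_rescaled))
)
```

- Each set of coefficients is determined by the weighted outcome means within the two arms, and rescaling the weights within an arm cancels out of a weighted mean.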
---
## Estimating Weights for the Continuous Treatment

- If we return to the continuous treatment example, to estimate stabilized weights we need to estimate `\(f(A)\)` in addition to `\(f(A \vert L)\)`.

- We will assume that `\(f(A)\)` is normal with constant variance.

- We can estimate the mean of `\(f(A)\)` as the overall sample mean of `\(A\)` and the variance as the sample variance of `\(A\)`.

- Again, we are making some very strong distributional assumptions.

---
## Estimating Weights for the Continuous Treatment

- To estimate the mean of `\(f(A \vert L)\)`, we fit a linear model for change in smoking intensity including:
  + sex, race, education, exercise, life activity
  + age and age squared
  + baseline smoking intensity and baseline smoking intensity squared
  + baseline weight and baseline weight squared

- We can use the residual variance from this model as an estimate of the variance of `\(f(A \vert L)\)`.

- Once we have the mean and variance of `\(f(A \vert L)\)` and `\(f(A)\)`, we can estimate `\(f(A = A_i \vert L = L_i)\)` and `\(f(A = A_i)\)` for individual `\(i\)` using `dnorm`.

---
## Estimating Weights for the Continuous Treatment

- We will limit to only those who smoked 25 or fewer cigarettes per day at baseline ( `\(N = 1,162\)` ).


``` r
dat.s <- filter(dat, smokeintensity <= 25)

## Fit model for E[A | L]
fal_fit <- lm(smkintensity82_71 ~ as.factor(sex) + as.factor(race) + age +
                I(age ^ 2) + as.factor(education) + smokeintensity +
                I(smokeintensity ^ 2) + smokeyrs + I(smokeyrs ^ 2) +
                as.factor(exercise) + as.factor(active) + wt71 + I(wt71 ^ 2),
              data = dat.s)
fal_mean <- predict(fal_fit, type = "response")

## compute f(A = Ai | L = Li) for each individual
fal <- dnorm(dat.s$smkintensity82_71, fal_mean, summary(fal_fit)$sigma)
```

---
## Estimating Weights for the Continuous Treatment

- To estimate `\(f(A)\)` we use the same procedure, except that the mean of `\(f(A)\)` is just the average change in smoking intensity.


``` r
## Estimate f(A)
fa_fit <- lm(smkintensity82_71 ~ 1, data = dat.s)
fa_mean <- predict(fa_fit, type = "response")
fa <- dnorm(dat.s$smkintensity82_71, fa_mean, summary(fa_fit)$sigma)

## compute stabilized and non-stabilized weights
dat.s$sw_cont <- fa / fal
dat.s$w_cont <- 1/fal
```

---
## Estimating Weights for the Continuous Treatment

- Let's compare the distribution of the stabilized and non-stabilized weights:


``` r
summary(dat.s$sw_cont)
```

```
##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.1938  0.8872  0.9710  0.9968  1.0545  5.1023
```

``` r
summary(dat.s$w_cont)
```

```
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     25.4     26.4     30.0   1078.1     47.5 428611.4
```

- The non-stabilized weights are much more variable!

---
## Fitting the Continuous Treatment Model

- Next, we fit the marginal structural model

$$ E[Y(a)] = \beta_0 + \beta_1 a + \beta_2 a^2$$

using exactly the same strategy we used before.
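
- A minimal sketch of this fit (assuming the restricted data `dat.s` and the weights `sw_cont` and `w_cont` computed above, with robust standard errors from `geeglm` as before; `msm_cont` is just an illustrative name):


``` r
msm_cont <- geeglm(wt82_71 ~ smkintensity82_71 + I(smkintensity82_71 ^ 2),
                   data = dat.s, weights = sw_cont, id = seqn,
                   corstr = "independence")
beta <- coef(msm_cont)
SE <- coef(summary(msm_cont))[, 2]
cbind(beta, lcl = beta - qnorm(0.975) * SE, ucl = beta + qnorm(0.975) * SE)
```

- The table below shows the resulting estimates using the stabilized and the non-stabilized weights.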
<table> <thead> <tr> <th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Stabilized Weights</div></th> <th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Non-Stabilized Weights</div></th> </tr> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> Estimate </th> <th style="text-align:left;"> 95% CI </th> <th style="text-align:left;"> </th> <th style="text-align:left;"> Estimate </th> <th style="text-align:left;"> 95% CI </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Intercept </td> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> (1.4, 2.6) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 7.7 </td> <td style="text-align:left;"> (1, 14) </td> </tr> <tr> <td style="text-align:left;"> Smk Int </td> <td style="text-align:left;"> -0.11 </td> <td style="text-align:left;"> (-0.17, -0.047) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 0.1 </td> <td style="text-align:left;"> (-0.21, 0.42) </td> </tr> <tr> <td style="text-align:left;"> Smk Int Squared </td> <td style="text-align:left;"> 0.0027 </td> <td style="text-align:left;"> (-0.002, 0.0074) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> -0.01 </td> <td style="text-align:left;"> (-0.024, 0.0039) </td> </tr> </tbody> </table> --- ## Continuous Model Results <img src="5_modeling1_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /> - Do you have any complaints about this plot? --- ## Issues with the Continuous Treatment Model - We had to make strong assumptions about the distribution of `\(A \vert L\)`. - There are structural positivity issues. - It is not possible for everyone to reduce their smoking by 80 cigarettes per day. In fact, we limited to ony those who smoke less than 25 cigarettes at baseline. --- ## Effect Modification in Marginal Structural Models - We might be interested in estimating the causal effect within strata of `\(V\)`. - To do this, we can propose a marginal structural model that includes `\(V\)` $$ E[Y(a) \vert V] = \beta_0 + \beta_1 a + \beta_2 a V + \beta_3 V $$ - If `\(V\)` and `\(A\)` are both dichotomous, this model is saturated. - It is not really fully marginal any more, but we still call it a marginal structural model. --- ## Interpreting Parameters in the Effect Modification Model - Suppose that `\(V\)` is sex (0: Male, 1: Female) and `\(A\)` is quitting/not quitting. $$ E[Y(a) \vert V] = \beta_0 + \beta_1 a + \beta_2 a V + \beta_3 V $$ - `\(\beta_1\)` is the causal effect of quitting for men. - `\(\beta_1+ \beta_2\)` is the causal effect of quitting for women. - What is `\(\beta_3\)`? -- - `\(\beta_3\)` is not a causal parameter because we haven't made any exchangeability assumptions about `\(V\)`. - `\(\beta_3\)` is the difference between `\(E[Y(0)]\)` in females and `\(E[Y(0)]\)` in males but is not the causal effect of sex. 
---
## Stabilized Weights for the EM Model

- We have two choices for stabilized weights
$$ \frac{f(A)}{f(A \vert L)} \qquad \text{or} \qquad \frac{f(A\vert V)}{f(A \vert L)} $$

- The second choice is available because we only need our stabilizing factor to not depend on `\(L\)` *within* levels of `\(V\)`.

- Which choice will be more efficient?

--

- Conditioning on `\(V\)` in the numerator will make the numerator closer to the denominator.

- The second choice of weights will be less variable, and therefore more efficient.

---
## Stabilized Weights for the EM Model

- We need to compute `\(f(A \vert V)\)` in our data.

- We can just use the stratified means.


``` r
fav_fit <- lm(qsmk ~ sex, data = dat)
dat$pav <- ifelse(dat$qsmk == 0,
                  1 - predict(fav_fit, type = "response"),
                  predict(fav_fit, type = "response"))
dat <- mutate(dat, sw_em = pav * w)
summary(dat$sw_em)
```

```
##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.2930  0.8751  0.9554  0.9989  1.0803  3.8011
```

---
## Fitting the Effect Modification Model


``` r
msm_em <- geeglm(wt82_71 ~ qsmk * sex, data = dat, weights = sw_em,
                 id = seqn, corstr = "independence")
beta <- coef(msm_em)
SE <- coef(summary(msm_em))[, 2]
lcl <- beta - qnorm(0.975) * SE
ucl <- beta + qnorm(0.975) * SE
cbind(beta, lcl, ucl)
```

```
##                     beta        lcl       ucl
## (Intercept)  1.784446876  1.1771686 2.3917252
## qsmk         3.521977634  2.2341453 4.8098100
## sex         -0.008724784 -0.8883884 0.8709389
## qsmk:sex    -0.159478525 -2.2097583 1.8908013
```

- We have no strong evidence of effect modification by sex.

---
## Unweighted Regression vs IP Weighting

- What if we had conditioned on every variable in `\(L\)`?

- The stabilized weights would be 1, so we would just be running a regular regression.

- If the model for `\(E[Y(a) \vert L]\)` is correct, then the regression coefficient is interpretable as a *conditional* causal parameter.

- So why not just use the "699 approach" and fit a regression model with all the confounders in it and interpret the coefficients?

--

- We might really want the marginal causal effect rather than the conditional causal effect.

- It might be easier to specify a good model for `\(P[A = a \vert L ]\)` than for `\(E[Y(a) \vert L]\)`.

---
## Censoring and Missing Data With IP Weights

- In our analysis, we have removed 63 individuals who filled in the survey in 1982 but who have a missing value for weight.

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> Quit Smoking </th>
   <th style="text-align:right;"> Number Missing </th>
   <th style="text-align:right;"> Percent Missing </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 38 </td>
   <td style="text-align:right;"> 3.2 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 25 </td>
   <td style="text-align:right;"> 5.8 </td>
  </tr>
</tbody>
</table>

- We have more missing data for people who quit smoking than for people who did not, so we could have selection bias.

---
## Weights for Censoring

- We learned in L3 that, to correct for selection bias, we need to find a set of variables `\(L_c\)` such that `\(Y \ci C \mid A, L_c\)`.

- We then need to estimate `\(W^C = 1/P[C = 0 \vert A, L_c]\)`.

- We could also stabilize our censoring weights and use
`$$SW^C = \frac{P[C = 0 \vert A]}{P[C = 0 \vert A, L_c]}$$`

- In this example, we will let `\(L_c\)` be the same set of variables we used to control for confounding, but we didn't have to make this choice.
- The total weights will be `\(SW^C \cdot SW^A\)`.

---
## Weights for Censoring

- We will estimate `\(P(C=0 \vert A, L)\)` using a logistic regression.

- The `dat.u` data frame includes data for all participants.


``` r
dat.u$cens <- ifelse(is.na(dat.u$wt82), 1, 0)

fcal_fit <- glm(cens ~ as.factor(qsmk) + as.factor(sex) + as.factor(race) +
                  age + I(age ^ 2) + as.factor(education) + smokeintensity +
                  I(smokeintensity ^ 2) + smokeyrs + I(smokeyrs ^ 2) +
                  as.factor(exercise) + as.factor(active) + wt71 + I(wt71 ^ 2),
                family = binomial(), data = dat.u)
fc0al <- 1 - predict(fcal_fit, type = "response")
```

---
## Weights for Censoring

- We can estimate the numerator of the stabilized weight, `\(P(C = 0 \vert A)\)`, using the average within levels of `\(A\)`.


``` r
fca_fit <- glm(cens ~ as.factor(qsmk), family = binomial(), data = dat.u)
fc0a <- 1 - predict(fca_fit, type = "response")
sw.c <- fc0a/fc0al
dat$sw.c <- sw.c[!is.na(dat.u$wt82)]
dat$sw.comb <- dat$sw*dat$sw.c
```

``` r
summary(dat$sw.c)
```

```
##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.9442  0.9782  0.9858  0.9990  1.0035  1.7180
```

``` r
summary(dat$sw.comb)
```

```
##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.3390  0.8593  0.9442  0.9974  1.0749  4.1829
```

---
## Combined Weights Results

- Below are results using the combined weights for both censoring and confounding:

<table>
 <thead>
  <tr>
   <th style="text-align:left;">  </th>
   <th style="text-align:left;"> Estimate </th>
   <th style="text-align:left;"> 95% CI </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Intercept </td>
   <td style="text-align:left;"> 1.69 </td>
   <td style="text-align:left;"> (1.23, 2.14) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Quit Smoking </td>
   <td style="text-align:left;"> 3.4 </td>
   <td style="text-align:left;"> (2.36, 4.45) </td>
  </tr>
</tbody>
</table>

- Compared with the previous results without censoring weights:

<table>
 <thead>
  <tr>
   <th style="text-align:left;">  </th>
   <th style="text-align:left;"> Estimate </th>
   <th style="text-align:left;"> 95% CI </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Intercept </td>
   <td style="text-align:left;"> 1.78 </td>
   <td style="text-align:left;"> (1.34, 2.22) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Quit Smoking </td>
   <td style="text-align:left;"> 3.44 </td>
   <td style="text-align:left;"> (2.41, 4.47) </td>
  </tr>
</tbody>
</table>

---
## Survey Weights

- Typically, nationally representative surveys won't use a completely random sampling scheme.
  + They will try to over-sample smaller groups.
  + This allows them to get better estimates for less common covariate values.

- Over-sampling introduces selection bias that needs to be corrected.

- If the survey was done well, a specific over-sampling rule was used.
  + For example, you might select census tracts randomly according to a covariate-informed probability and then select houses randomly within each census tract.

- This allows the survey administrators to compute `\(P[ \text{In Survey} \vert L]\)` for every participant.
  + Being in the survey is the same as *not* being censored, so we can think of this as `\(P[ C = 0 \vert L]\)`.

---
## Survey Weights

- Usually, survey administrators will compute `\(P[ C = 0 \vert L]\)` and distribute `\(1/P[C = 0 \vert L]\)` as "survey weights".

- Our censoring weight procedure requires `\(P[ C = 0 \vert A = a, L]\)`, not `\(P[C = 0 \vert L]\)`.

- However, if `\(L\)` includes every factor affecting study inclusion, then `\(C \ci A \vert L\)`, so `\(P[C = 0 \vert L] = P[ C = 0 \vert A = a, L]\)`.
- This means we can use the same set of weights regardless of what exposure we are looking at. - If you have survey data that come with survey weights, be sure to read about how these are calculated. + Most likely, if you want to estimate a parameter for the larger population, you will need to use the survey weights. --- # 3. G-Computation (Standardization) --- ## Standardization - In standardization, we compute an estimate of `\(E[Y \vert A = a, L = l]\)` and plug it into the g-formula: $$ E[Y(a)] = \sum_l E[Y \vert A = a, L = l]P[L = l] $$ - If `\(L\)` is continuous, we need to integrate $$ E[Y(a)] = \int_l E[Y \vert A = a, L = l]f(l) dl $$ --- ## Outcome Modeling - In the standardization formula, we need to estimate `\(E[Y \vert A = a, L = l]\)`. - For our smoking example, we will fit a linear regression ``` r f_y <-glm( wt82_71 ~ qsmk + sex + race + age + I(age * age) + as.factor(education) + smokeintensity + I(smokeintensity * smokeintensity) + smokeyrs + I(smokeyrs * smokeyrs) + as.factor(exercise) + as.factor(active) + wt71 + I(wt71 * wt71) + qsmk * smokeintensity, data = dat.u) summary(f_y)$coefficients %>% head() ``` ``` ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -1.588165662 4.313035924 -0.3682245 0.7127562582 ## qsmk 2.559594093 0.809148613 3.1633177 0.0015901758 ## sex -1.430271659 0.468957553 -3.0498958 0.0023281370 ## race 0.560109604 0.581888778 0.9625716 0.3359131882 ## age 0.359635262 0.163318764 2.2020450 0.0278094360 ## I(age * age) -0.006100955 0.001726136 -3.5344591 0.0004206537 ``` --- ## Outcome Modeling - This first step of fitting the outcome model is just fitting a regular linear regression. + In a 699-type class, this is probably where you would stop. - We saw in our IP weighting discussion that, if we believe the model, the coefficient for treatment gives a *conditional* causal effect estimate within strata of `\(L\)`. - If there were no interactions between `\(A\)` and `\(L\)` in the model, the coefficient on `\(A\)` would also give the marginal treatment effect. + This is only true for a linear model. It would not be true in a model with a non-linear link. - An outcome model with no interactions assumes **homogeneity**, i.e. the average treatment effect is the same in every level of `\(L\)`. + Typically, homogeneity is considered to be a very strong assumption. --- ## Outcome Modeling - Our outcome model included an interaction between `\(A\)` and smoking intensity. - In this model, the coefficient on `\(A\)` is the average treatment effect within strata of smoking intensity. - To estimate the average (marginal) treatment effect, we need to average over the distribution of smoking intensity, or more generally the distribution of `\(L\)`. --- ## Standardizing (G-Computation) - Fortunately, we do not need to compute `\(f(L)\)`. - Using the definition of expectation, we can write $$ \int_l E[Y \vert A = a, L = l]f(l)dl = E\left[ E\left[Y \vert A = a, L = l \right]\right] $$ - So we can estimate the standardized mean as $$ \frac{1}{n}\sum_{i=1}^n\hat{E}[Y \vert A = a, L = L_i] $$ --- ## Standardizing - For each person, we need to compute two values `\(\hat{E}[Y \vert A = 0, L_i]\)` and `\(\hat{E}[Y\vert A = 1, L_i]\)` - These are predicted values that we can get out of our previous regression. - We predict each person's outcome with `\(A = 0\)` and with `\(A = 1\)`. 
``` r # Make two copies of the data dat0 <- dat1 <- dat.u # In one copy everyone got treatment 0 dat0$qsmk <- 0 # In the second copy everyone got treatment 1 dat1$qsmk <- 1 Y0 <- predict(f_y, newdata = dat0, type = "response") %>% mean() Y1 <- predict(f_y, newdata = dat1, type = "response") %>% mean() cat(round(Y1, digits = 2), "-", round(Y0, digits = 2), "=", round(Y1-Y0, digits = 2)) ``` ``` ## 5.18 - 1.66 = 3.52 ``` --- ## Censoring - If the set of variables in our model for `\(E[Y \vert A, L]\)` are sufficient to adjust for censoring (meaning that they block all paths between `\(Y\)` and `\(C\)`), then we don't need to do anything extra to adjust for censoring. - However, we do need to predict `\(Y(a)\)` for *all individuals* with covariate data, not just those with non-missing outcomes. - If `\(L\)` is not sufficient to adjust for censoring, we may need to add additional variables specifically to deal with censoring. --- ## Estimating the Variance - To compute the variance of the estimate, we need to bootstrap. - We will repeatedly re-sample our data with replacement. - Each time, we re-fit the outcome model and recompute the standardized effect. - The standard deviation of our bootstrap samples consistently estimates the standard error of our estimate. --- ## Bootstrapping the Variance - Step 1: Write a function to compute the standardized mean ``` r # Step 1: Write a function compute the standardized mean std_mean_func <- function(d){ boot_fit <-glm( wt82_71 ~ qsmk + sex + race + age + I(age * age) + as.factor(education) + smokeintensity + I(smokeintensity * smokeintensity) + smokeyrs + I(smokeyrs * smokeyrs) + as.factor(exercise) + as.factor(active) + wt71 + I(wt71 * wt71) + qsmk * smokeintensity, data = d) d0 <- d1 <- d d0$qsmk <- 0 d1$qsmk <- 1 Y0 <- predict(boot_fit, newdata = d0, type = "response") %>% mean() Y1 <- predict(boot_fit, newdata = d1, type = "response") %>% mean() return(Y1-Y0) } ``` --- ## Bootstrapping the Variance - Step 2: Repeatedly re-sample the data and compute the standardized mean in the re-sampled data set. ``` r set.seed(1) samples <- replicate(n = 500, expr = sample(seq(nrow(dat.u)), size = nrow(dat.u), replace = TRUE)) res <- apply(samples, 2, FUN = function(ix){ boot_dat <- dat.u[ix,] boot_est <- std_mean_func(boot_dat) return(boot_est) }) se <- sd(res) eff <- Y1-Y0 ci <- eff + c(-1, 1)*qnorm(0.975)*se cat(round(eff, digits = 2), "(", round(ci, digits = 2), ")") ``` ``` ## 3.52 ( 2.58 4.46 ) ``` --- ## Effect Modification in G-Computation - If we want to estimate the average causal effect stratified by `\(V\)`, we can simply stratify by `\(V\)` before averaging our predicted values. - Here we look at effect modification by sex: ``` r dat.u$Y0 <- predict(f_y, newdata = dat0, type = "response") dat.u$Y1 <- predict(f_y, newdata = dat1, type = "response") dat.u %>% group_by(sex) %>% summarize(Y0 = mean(Y0), Y1 = mean(Y1)) ``` ``` ## # A tibble: 2 × 3 ## sex Y0 Y1 ## <dbl> <dbl> <dbl> ## 1 0 1.66 5.30 ## 2 1 1.66 5.06 ``` - We could get bootstrap confidence intervals for these estimates as well. --- ## IP Weighting vs G-Computation - If all of our models are fully saturated, IP weighting and standardization give exactly the same result (recall HW1). - With parametric models, the two methods rely on different assumptions and are not the same. - In the IP weighted analysis, we need to trust our model for `\(f(A \vert L)\)`. - In the standardized analysis, we need to trust our model for `\(E[Y \vert A, L]\)`. 
- In either case, if our parametric models are wrong, our predictions will be biased and so our estimators will be biased.

- It can be a good idea to compute both and compare.
  + If results are very different, at least one of the two models is wrong.

---
## IP Weighting vs G-Computation

- Chatton et al (2020) present simulation results comparing IPW, G-computation, and two methods we haven't talked about yet.

- In their simulation scenario, G-computation always had a lower MSE than IPW.

- For IPW, they calculated variance using robust SEs (as we did earlier) and found slight overcoverage (~97% for a 95% target).

- They used parametric simulation to compute the variance of the G-computation estimate. This gave close to target coverage but slightly under in some cases (~94%).

---
## Chatton et al Results

<img src="5_modeling1_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" />

---
## IP Weighting vs G-Computation

- Snowden et al (2010) describe G-Computation as rarely used in epidemiology due to lack of awareness.

- However, G-Computation can be more precise than IP weighting in some circumstances.
  + IP weighting can be inefficient when weights are very large.

- Vansteelandt and Keiding find that IPW performs better when the covariate distribution is very different in cases and controls.
  + Depending on the model used for G-Computation, the dominant group may have too much influence over estimates in these regions.
  + G-Computation may underestimate the uncertainty about the counterfactuals in these regions.

- Some suggest that we shouldn't be trying to estimate the counterfactuals in these regions at all, since this requires extrapolating into regions with low data density.
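
---
## Comparing the Two Estimates in the NHEFS Data

- As a quick illustration of the "compute both and compare" advice, a minimal sketch placing the two point estimates from these slides side by side (assuming the objects `f_w_gee`, `Y1`, and `Y0` created earlier):


``` r
c(ip_weighting  = unname(coef(f_w_gee)["qsmk"]),
  g_computation = Y1 - Y0)
```

- Here the two approaches give similar answers (about 3.4 kg vs 3.5 kg), which is reassuring but does not guarantee that either model is correct.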