class: center, middle, inverse, title-slide .title[ # L5: IP Weighting and G-Computation ] .author[ ### Jean Morrison ] .institute[ ### University of Michigan ] .date[ ### Lecture on 2024-02-07 (updated: 2024-04-10) ] --- `\(\newcommand{\ci}{\perp\!\!\!\perp}\)` ## Lecture Plan 1. Intro 2. Inverse Probability Weighting - IP weighting for binary and continuous treatment - Marginal structural models - Censoring weights 3. G-Computation --- # 1. Introduction --- ## Modeling - So far we have learned two strategies for estimating `\(E[Y(a)]\)` from data when we need to adjust for confounding or selection bias. - In IP weighting we re-weight our observations by `\(1/P[A = A_i \vert L_i]\)`, the probability that they received the treatment they actually got, given confounders `\(L_i\)`. - In standardization, we first stratify the data by `\(L\)`, compute `\(E[Y \vert A = a, L]\)` within each stratum, and then take a weighted sum of these averages. --- ## Models - Formally, a **model** is a class of probability distributions `\(\mathcal{M} = \lbrace P_\theta : \theta \in \Theta \rbrace\)`. - `\(\theta\)` is a (possibly infinite) vector of parameters and `\(\Theta\)` is the parameter space. - A model is called **parametric** when `\(\theta\)` is finite dimensional and `\(\mathcal{M}\)` is not the set of all possible probability distributions. - If `\(\theta\)` is infinite dimensional, `\(\mathcal{M}\)` may be called **non-parametric** or **semi-parametric**. + Semi-parametric models have a finite-dimensional, structured component and an infinite-dimensional component. - In this lecture we will talk about parametric models. --- ## Our Models So Far - So far in examples, we have always used **saturated** models. + Saturated models contain all possible probability distributions for the observed data. - We assume that `\(A\)`, `\(Y\)`, and any confounders have a discrete number of levels and that we have enough data to estimate quantities like `\(E[Y \vert A, L]\)` in each stratum of `\(A\)` and `\(L\)`. - This has allowed us to focus on **identification**, i.e., whether we could estimate a parameter if we had infinite data. - However, in the real world, we don't get infinite data and it is impractical to assume we can always estimate `\(E[Y \vert A, L]\)` in every stratum of `\(A\)` and `\(L\)`. - We will need **modeling** to go further. --- ## G-Formula - The formula we saw previously as the "stratification" formula is also called the g-formula $$ E[Y(a)] = \sum_l E[Y \vert A = a, L = l]P[L = l] $$ - We showed in L1 that an alternate expression of the g-formula is $$ E[Y(a)] = E\left[ \frac{I(A = a) Y}{f(A \vert L)}\right] $$ - We can use either of these expressions to obtain parametric estimates of `\(E[Y(a)]\)`. - In G-computation (also called standardization) we estimate `\(E[Y \vert A, L]\)` and plug into the first expression. - In IP weighting we estimate `\(f(A \vert L)\)` (the propensity score) and use the second expression. --- # 2. Inverse Probability Weighting --- ## NHEFS Data - Cigarette smokers aged 25-74 were recruited around 1971 and given a survey. - Ten years later, participants were given a follow-up survey. - We are interested in estimating the effect of quitting smoking on weight change on the additive scale. --- ## Variables - Our exposure is binary (whether or not a person quit smoking between the first and second survey). - The outcome is the change in weight in kg. - What other features would you want to know in order to estimate the causal effect?
-- - For now we will assume that the following variables are sufficient: + Sex (0: male, 1: female) + Age (in years) + Race (0: white, 1: other) + Education (5 categories) + Intensity of smoking (cigarettes per day) + Duration of smoking (years) + Physical activity in daily life (3 categories) + Recreational exercise (3 categories) + Weight (kg) --- ## Limitations - For now we will focus on individuals with complete data. + There are some people who did not fill in the second survey. + For these people we know covariates but don't know either the exposure or the outcome. + Additionally, there are 63 people who did fill in the second survey but whose weight is unknown. + More on this later. - Individuals may have quit at different times. + We could think of `\(A\)` as a time-varying treatment (coming up later). --- ## Simple Analysis - The simplest analysis is to compare the mean weight change between quitters and non-quitters. <table> <thead> <tr> <th style="text-align:right;"> Quit Smoking </th> <th style="text-align:right;"> Average Weight Change (kg) </th> <th style="text-align:right;"> N </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1.98 </td> <td style="text-align:right;"> 1163 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 4.53 </td> <td style="text-align:right;"> 403 </td> </tr> </tbody> </table> - On average, quitters gained about 4.5 kg over the 10-year period while non-quitters only gained about 2 kg. --- ## Quitters v Non-Quitters <table> <thead> <tr> <th style="text-align:left;"> Variable </th> <th style="text-align:left;"> Did Not Quit Smoking </th> <th style="text-align:left;"> Quit Smoking </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Total </td> <td style="text-align:left;"> 1163 </td> <td style="text-align:left;"> 403 </td> </tr> <tr> <td style="text-align:left;"> Age, years </td> <td style="text-align:left;"> 42.8 (11.8) </td> <td style="text-align:left;"> 46.2 (12.2) </td> </tr> <tr> <td style="text-align:left;"> Female </td> <td style="text-align:left;"> 53.4% (621) </td> <td style="text-align:left;"> 45.4% (183) </td> </tr> <tr> <td style="text-align:left;"> Non-White </td> <td style="text-align:left;"> 14.6% (170) </td> <td style="text-align:left;"> 8.9% (36) </td> </tr> <tr> <td style="text-align:left;"> Cigarettes/day </td> <td style="text-align:left;"> 21.2 (11.5) </td> <td style="text-align:left;"> 18.6 (12.4) </td> </tr> <tr> <td style="text-align:left;"> Years smoking </td> <td style="text-align:left;"> 24.1 (11.7) </td> <td style="text-align:left;"> 26 (12.7) </td> </tr> <tr> <td style="text-align:left;"> Weight, kg </td> <td style="text-align:left;"> 70.3 (15.2) </td> <td style="text-align:left;"> 72.4 (15.6) </td> </tr> <tr> <td style="text-align:left;"> College </td> <td style="text-align:left;"> 9.9% (115) </td> <td style="text-align:left;"> 15.4% (62) </td> </tr> <tr> <td style="text-align:left;"> Little exercise </td> <td style="text-align:left;"> 37.9% (441) </td> <td style="text-align:left;"> 40.7% (164) </td> </tr> <tr> <td style="text-align:left;"> Inactive life </td> <td style="text-align:left;"> 8.9% (104) </td> <td style="text-align:left;"> 11.2% (45) </td> </tr> </tbody> </table> - Entries are mean (SD) for continuous variables and % (n) for categorical variables. --- ## Estimating Weights - In our data, `\(L\)` is a vector of nine measurements. - We cannot compute `\(P[A = 1 \vert L ]\)` in every (or any) stratum of `\(L\)` because every participant has their own unique value of `\(L\)` (see the quick check below).
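- As a quick illustration (a sketch, assuming the confounder column names used in the propensity score model on the next slides), we can count how many distinct covariate patterns the nine confounders define:

```r
library(dplyr)
# Count distinct covariate patterns among the nine confounders.
# With continuous covariates like baseline weight, nearly every
# participant ends up in a stratum of their own.
dat %>%
  distinct(sex, age, race, education, smokeintensity,
           smokeyrs, exercise, active, wt71) %>%
  nrow()
```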
- We have to fit a parametric model. - What would you do? --- ## Weights Estimated by Logistic Regression - We will use a logistic regression model to predict `\(P[A \vert L]\)` - The goal is to get the best possible estimate of the probability of treatment given `\(L\)`, so we should fit a flexible model. - Once we fit the model, we can estimate the probability of quitting for each person as `\(\pi_i = P[A = 1 \vert L = L_i]\)`. These are the probability-scale fitted values from the logistic model. - The weight for individual `\(i\)` will be equal to `\(\frac{1}{\pi_i}\)` if `\(A_i = 1\)` (the person actually did quit) or `\(\frac{1}{1-\pi_i}\)` if `\(A_i = 0\)` (the person did not quit). --- ## Weights Estimated by Logistic Regression ```r fit <- glm(qsmk ~ sex + race + age + I(age ^ 2) + as.factor(education) + smokeintensity + I(smokeintensity ^ 2) + smokeyrs + I(smokeyrs ^ 2) + as.factor(exercise) + as.factor(active) + wt71 + I(wt71 ^ 2), family = binomial(), data = dat) dat$w <- ifelse(dat$qsmk == 0, 1/(1 - predict(fit, type = "response")), 1/(predict(fit, type = "response"))) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 1.054 1.230 1.373 1.996 1.990 16.700 ``` - Why is the mean value close to 2? --- ## Positivity - *Structural violations* of positivity occur when it is impossible for people with some levels of confounders to receive a particular level of treatment. + If we have structural violations, we cannot use IP weighting or standardization. + We need to restrict our inference to relevant strata of `\(L\)`. - *Random violations* of positivity occur when certain combinations of `\(L\)` and `\(A\)` are missing from our data by chance. + We are using a parametric model, so we are able to smooth over unobserved covariate values. + We are able to predict for strata that were not observed in our data. + We should be careful about predicting outside the range of the observed data. --- ## Assessing Positivity in Propensity Scores - If positivity holds, propensity scores should be bounded away from 0 and 1: + Scores very close to 0 or very close to 1 suggest that there are some strata of `\(A\)` and `\(L\)` that have no chance to receive one of the two treatments. + We might get scores close to 0 or 1 if there is perfect separation by one or a combination of confounders. + Scores close to 0 or 1 suggest there could be structural positivity violations. - Propensity scores should have approximately the same range in both groups. + Non-overlapping ranges indicate that there are some regions of confounder space with only cases or only controls in our study. + If we believe the PS model, we can trust the weights and assume this is due to random positivity violations. + In some approaches based on propensity scores, this will cause a problem (matching and subclassification). --- # Propensity Scores and Positivity <img src="5_modeling1_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /> --- ## Horvitz-Thompson Estimator - The *Horvitz-Thompson estimator* is one of two options for estimating `\(E[Y(a)]\)` using our estimated weights. - The Horvitz-Thompson estimator just plugs our estimated weights into our previous formula. $$ \hat{E}[Y(a)] = \hat{E}\left[\frac{I(A = a)Y}{f(A\vert L)} \right] = \frac{1}{n}\sum_{i = 1}^n I(A_i = a)Y_i W_i $$ --- ## Hajek Estimator - The *Hajek estimator* fits the linear model $$ E[Y \vert A] = \theta_0 + \theta_1 A $$ by weighted least squares, weighting individuals by our estimated IP weights. 
- It is equivalent to `$$\frac{ \hat{E}\left[\frac{I(A = a)Y}{f(A\vert L)} \right]}{ \hat{E}\left[\frac{I(A = a)}{f(A\vert L)} \right]} = \frac{\sum_{i = 1}^n I(A_i = a)Y_i W_i}{\sum_{i = 1}^n I(A_i = a) W_i}$$` --- ## Hajek Estimator - The Hajek estimator is unbiased for `$$\frac{ E\left[\frac{I(A = a)Y}{f(A\vert L)} \right]}{ E\left[\frac{I(A = a)}{f(A\vert L)} \right]}$$` - If positivity holds, then $$ E\left[\frac{I(A = a)}{f(A\vert L)} \right] = 1 $$ - The Hajek estimator is guaranteed to be between 0 and 1 for dichotomous outcomes. The Horvitz-Thompson estimator is not. --- ## IP Weighted Effect Estimate in NHEFS - This code computes the Hajek estimator: ```r f_w_lm <- lm(wt82_71 ~ qsmk, data = dat, weights = w) f_w_lm ``` ``` ## ## Call: ## lm(formula = wt82_71 ~ qsmk, data = dat, weights = w) ## ## Coefficients: ## (Intercept) qsmk ## 1.780 3.441 ``` - We estimate that if nobody had quit smoking, the average weight gain would have been 1.8 kg. - If everyone had quit smoking, the average weight gain would have been 5.2 kg. - The average causal effect of quitting smoking on weight gain is 3.4 kg. --- ## Estimating the Variance of the Estimate - The raw variance estimate from weighted least squares does not account for uncertainty in the weights. - We need to account for having estimated the IP weights. - Options: 1. Derive the variance analytically 2. Bootstrap the variance 3. Use a robust sandwich variance estimate. - Option 3 is conservative but easier than options 1 and 2. - Option 3 can be achieved by fitting with GEE using an independent working correlation matrix or by using the `HC0` option in `vcovHC`. --- ## Variance Estimate in the NHEFS Data ```r library("geepack") f_w_gee <- geeglm(wt82_71 ~ qsmk, data = dat, weights = w, id = seqn, corstr = "independence") beta <- coef(f_w_gee) SE <- coef(summary(f_w_gee))[, 2] lcl <- beta - qnorm(0.975) * SE ucl <- beta + qnorm(0.975) * SE cbind(beta, lcl, ucl) ``` ``` ## beta lcl ucl ## (Intercept) 1.779978 1.339514 2.220442 ## qsmk 3.440535 2.410587 4.470484 ``` ```r library(sandwich) # Equivalent alternative using sandwich v_wlm_robust <- vcovHC(f_w_lm, type = "HC0") all.equal(v_wlm_robust, vcov(f_w_gee)) ``` ``` ## [1] TRUE ``` --- ## Stabilized Weights - Using weights `\(W^{A} = 1/f(A \vert L)\)` we create a pseudo-population in which (heuristically) each person is matched by someone exactly like them who received the opposite treatment. - Our pseudo-population is twice as big as our actual sample, so the mean of `\(W^A\)` is 2. - In the pseudo-population, the frequency of treatment `\(A = 1\)` is 50%. - We could have created other pseudo-populations. --- ## Stabilized Weights - Our requirement is that, in the pseudo-population, the probability of treatment is independent of `\(L\)`. - But different people could have different probabilities of treatment. - To create stabilized weights, we construct a pseudo-population in which the probability of receiving each treatment is the same as the frequency of the treatment in our original sample. $$ SW^A = \frac{f(A)}{f(A \vert L)} $$ --- ## Stabilized Weights - In our data, `\(A\)` is binary, so we can compute `\(f(1)\)` as the proportion of quitters in the data. ```r p1 <- mean(dat$qsmk); p1 ``` ``` ## [1] 0.2573436 ``` - In the pseudo-population created by the stabilized weights, each person in our data set corresponds to 26% of a treated person and 74% of an untreated person. ```r dat <- dat %>% mutate(pa = ifelse(dat$qsmk == 0, 1-p1, p1), sw = pa*w) summary(dat$sw) ``` ``` ## Min. 1st Qu.
Median Mean 3rd Qu. Max. ## 0.3312 0.8665 0.9503 0.9988 1.0793 4.2977 ``` - The expected value of `\(SW^A\)` is 1 because the pseudo-population is the same size as the observed data. --- ## Estimation Using the Stabilized Weights ```r f_sw_gee <- geeglm(wt82_71 ~ qsmk, data = dat, weights = sw, id = seqn, corstr = "independence") beta <- coef(f_sw_gee) SE <- coef(summary(f_sw_gee))[, 2] lcl <- beta - qnorm(0.975) * SE ucl <- beta + qnorm(0.975) * SE cbind(beta, lcl, ucl) ``` ``` ## beta lcl ucl ## (Intercept) 1.779978 1.339514 2.220442 ## qsmk 3.440535 2.410587 4.470484 ``` - These are exactly the results we saw with the unstabilized weights. --- ## Why Use Stabilized Weights - Differences between stabilized and non-stabilized weights only occur when the model for `\(E[Y \vert A]\)` is not saturated. - When the model is not saturated, stabilized weights typically result in greater efficiency. --- ## Marginal Structural Models - In the IP weighting strategy we create a pseudo-population in which `\(A\)` is independent of `\(L\)` and then fit the model `$$E_{ps}[Y \vert A ] = \theta_0 + \theta_1 A$$` - If we believe our conditional exchangeability assumption `\(Y(a) \ci A \vert L\)` and the propensity score model, then in the pseudo-population, `\(E_{ps}[Y \vert A = a] = E[Y(a)]\)`. - So the parameter `\(\theta_1\)` is interpretable as the average causal effect `\(E[Y(1)] - E[Y(0)]\)`. -- - We have proposed a model for the average counterfactual: `$$E[Y(a)] = \beta_0 + \beta_1 a$$` -- - This model is *marginal* because we have marginalized (averaged) over the values of all other covariates. - It is structural because it is a model for a counterfactual `\(E[Y(a)]\)`. --- ## Modeling Continuous Treatments - With a binary exposure, we could construct a saturated model. - For continuous (or highly polytomous) exposures, we can't do that and will have to use a parametric model instead. - In the NHEFS data, let `\(A\)` be the change in smoking intensity ( `\(A = 10\)` indicates that a person increased their smoking by 10 cigarettes per day). - We will limit to only those who smoked 25 or fewer cigarettes per day at baseline ( `\(N = 1,162\)` ). - We can propose a model for our *dose-response curve* $$ E[Y(a)] = \beta_0 + \beta_1 a + \beta_2 a^2 $$ --- ## Estimating Weights for the Continuous Treatment - To use stabilized weights, we need to estimate `\(f(A \vert L)\)` and `\(f(A)\)`. - Both of these are PDFs, which are hard to estimate! - We will assume that both `\(f(A \vert L)\)` and `\(f(A)\)` are normal with constant variance across participants. + These assumptions are almost certainly wrong. - We will estimate the mean of `\(f(A \vert L)\)` via linear regression. - IP weighted estimates for continuous treatments can be very sensitive to the choice of weights. --- ## Estimating Weights for the Continuous Treatment - We fit a linear model for change in smoking intensity including: + sex, race, education, exercise, life activity + age and age squared + baseline smoking intensity and baseline smoking intensity squared + baseline weight and baseline weight squared - The fitted values of this model give us `\(E[A \vert L]\)`. - We can use the variance of the residuals as an estimate of the variance of `\(A \vert L\)`. - We then compute `\(f(A \vert L)\)` using `dnorm`, which gives the stabilized weight written out below.
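- Written out explicitly (with `\(\phi(x; \mu, \sigma^2)\)` denoting the normal density), the estimated stabilized weight for person `\(i\)` is a ratio of two normal densities evaluated at their observed change in smoking intensity, which is exactly what the `dnorm` calls on the next slides compute: $$ \widehat{SW}^A_i = \frac{\phi\left(A_i;\ \hat{\mu},\ \hat{\sigma}^2\right)}{\phi\left(A_i;\ \hat{\mu}(L_i),\ \hat{\sigma}^2_{A \vert L}\right)} $$ where `\(\hat{\mu}(L_i)\)` is the fitted value from the linear model for `\(A\)` given `\(L\)`, `\(\hat{\sigma}_{A \vert L}\)` is its residual standard deviation, and `\(\hat{\mu}\)`, `\(\hat{\sigma}\)` come from the intercept-only model.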
--- ## Estimating Weights for the Continuous Treatment ```r dat.s <- filter(dat, smokeintensity <= 25) fal_fit <- lm(smkintensity82_71 ~ as.factor(sex) + as.factor(race) + age + I(age ^ 2) + as.factor(education) + smokeintensity + I(smokeintensity ^ 2) + smokeyrs + I(smokeyrs ^ 2) + as.factor(exercise) + as.factor(active) + wt71 + I(wt71 ^ 2), data = dat.s) fal_mean <- predict(fal_fit, type = "response") fal <- dnorm(dat.s$smkintensity82_71, fal_mean, summary(fal_fit)$sigma) ``` --- ## Estimating Weights for the Continuous Treatment - To estimate `\(f(A)\)` we use the same procedure, except that the mean, `\(E[A]\)`, is just the average change in smoking intensity. ```r fa_fit <- lm(smkintensity82_71 ~ 1, data = dat.s) fa_mean <- predict(fa_fit, type = "response") fa <- dnorm(dat.s$smkintensity82_71, fa_mean, summary(fa_fit)$sigma) dat.s$sw_cont <- fa / fal dat.s$w_cont <- 1/fal ``` --- ## Estimating Weights for the Continuous Treatment - For comparison, we compute both the stabilized weights and the non-stabilized weights. ```r summary(dat.s$sw_cont) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.1938 0.8872 0.9710 0.9968 1.0545 5.1023 ``` ```r summary(dat.s$w_cont) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 25.4 26.4 30.0 1078.1 47.5 428611.4 ``` - The non-stabilized weights are much more variable! --- ## Fitting the Continuous Treatment Model - We then fit the marginal structural model $$ E[Y(a)] = \beta_0 + \beta_1 a + \beta_2 a^2$$ using exactly the same strategy we used before. <table> <thead> <tr> <th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Stabilized Weights</div></th> <th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Non-Stabilized Weights</div></th> </tr> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> Estimate </th> <th style="text-align:left;"> 95% CI </th> <th style="text-align:left;"> </th> <th style="text-align:left;"> Estimate </th> <th style="text-align:left;"> 95% CI </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Intercept </td> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> (1.4, 2.6) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 7.7 </td> <td style="text-align:left;"> (1, 14) </td> </tr> <tr> <td style="text-align:left;"> Smk Int </td> <td style="text-align:left;"> -0.11 </td> <td style="text-align:left;"> (-0.17, -0.047) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 0.1 </td> <td style="text-align:left;"> (-0.21, 0.42) </td> </tr> <tr> <td style="text-align:left;"> Smk Int Squared </td> <td style="text-align:left;"> 0.0027 </td> <td style="text-align:left;"> (-0.002, 0.0074) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> -0.01 </td> <td style="text-align:left;"> (-0.024, 0.0039) </td> </tr> </tbody> </table> --- ## Continuous Model Results <img src="5_modeling1_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /> --- ## Issues with the Continuous Treatment Model - We had to make strong assumptions about the distribution of `\(A \vert L\)` (a quick residual check is sketched below).
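+ A minimal diagnostic sketch, reusing the `fal_fit` model from the code above: check whether the residuals look roughly normal with constant spread.

```r
# Rough check of the constant-variance normal assumption for A | L,
# using residuals from the model for change in smoking intensity.
r <- resid(fal_fit)
par(mfrow = c(1, 2))
hist(r, breaks = 30, main = "Residuals of A ~ L", xlab = "Residual")
qqnorm(r); qqline(r)
```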
- There are structural positivity issues -- it is not possible for everyone to reduce their smoking by 80 cigarettes per day. --- ## Effect Modification in Marginal Structural Models - We might be interested in estimating the causal effect within strata of `\(V\)`. - We can propose a marginal structural model that includes `\(V\)` $$ E[Y(a) \vert V] = \beta_0 + \beta_1 a + \beta_2 a V + \beta_3 V $$ - If `\(V\)` and `\(A\)` are both dichotomous, this model is saturated. - It is not really fully marginal any more, but we still call it a marginal structural model. --- ## Interpreting Parameters in the Effect Modification Model - Suppose that `\(V\)` is sex (0: Male, 1: Female) and `\(A\)` is quitting/not quitting. $$ E[Y(a) \vert V] = \beta_0 + \beta_1 a + \beta_2 a V + \beta_3 V $$ - `\(\beta_1\)` is the causal effect of quitting for men. - `\(\beta_1 + \beta_2\)` is the causal effect of quitting for women. - What is `\(\beta_3\)`? -- - `\(\beta_3\)` is not a causal parameter because we haven't made any exchangeability assumptions about `\(V\)`. - `\(\beta_3\)` is the difference between `\(E[Y(0)]\)` in females and `\(E[Y(0)]\)` in males, but it is not the causal effect of sex. --- ## Stabilized Weights for the EM Model - We have two choices for stabilized weights $$ \frac{f(A)}{f(A \vert L)} \qquad \text{or} \qquad \frac{f(A\vert V)}{f(A \vert L)} $$ - Which will be more efficient? -- - Conditioning on `\(V\)` in the numerator will make the numerator closer to the denominator. - The second choice of weights will be less variable. - So the second choice of weights is more efficient. --- ## Stabilized Weights for the EM Model - We need to compute `\(f(A \vert V)\)` in our data. - We can just use the stratified means. ```r fav_fit <- lm(qsmk ~ sex, data = dat) dat$pav <- ifelse(dat$qsmk == 0, 1-predict(fav_fit, type = "response"), predict(fav_fit, type = "response")) dat <- mutate(dat, sw_em = pav*w) summary(dat$sw_em) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.2930 0.8751 0.9554 0.9989 1.0803 3.8011 ``` --- ## Fitting the Effect Modification Model ```r msm_em <- geeglm( wt82_71 ~ qsmk*sex, data = dat, weights = sw_em, id = seqn, corstr = "independence") beta <- coef(msm_em) SE <- coef(summary(msm_em))[, 2] lcl <- beta - qnorm(0.975) * SE ucl <- beta + qnorm(0.975) * SE cbind(beta, lcl, ucl) ``` ``` ## beta lcl ucl ## (Intercept) 1.784446876 1.1771686 2.3917252 ## qsmk 3.521977634 2.2341453 4.8098100 ## sex -0.008724784 -0.8883884 0.8709389 ## qsmk:sex -0.159478525 -2.2097583 1.8908013 ``` - We have no strong evidence of effect modification by sex. --- ## Unweighted Regression vs IP Weighting - What if we had conditioned on every variable in `\(L\)`? - The stabilized weights would be 1, so we would just be running a regular regression. - If we believe in the model for `\(E[Y(a) \vert L]\)`, then the regression coefficient is interpretable as a (conditional) causal parameter. - Using the marginal model with IP weighting, we need to believe that our estimates of `\(f(A \vert L)\)` are accurate. - In some cases, solving a prediction problem is easier than guessing the correct structural model for all confounders. --- ## Censoring and Missing Data With IP Weights - In our analysis, we have removed 63 individuals who filled in the survey in 1982 but who have a missing value for weight.
<table> <thead> <tr> <th style="text-align:right;"> Quit Smoking </th> <th style="text-align:right;"> Number Missing </th> <th style="text-align:right;"> Percent Missing </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 38 </td> <td style="text-align:right;"> 3.2 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 25 </td> <td style="text-align:right;"> 5.8 </td> </tr> </tbody> </table> - We have more missing data for people who quit smoking than for people who did not, so we could have selection bias. --- ## Weights for Censoring - We learned in L4 that, to correct for selection bias, we need to find a set of variables `\(L_c\)` such that `\(Y(C = 0) \ci C \vert L_c\)`. - We then need to estimate `\(W^C = 1/P[C = 0 \vert A, L_c]\)`, or the stabilized version `$$SW^C = \frac{P[C = 0 \vert A]}{P[C = 0 \vert A, L_c]}$$` - We will use the same set of variables that we used to compute the weights for confounding. - The total weights will be `\(SW^C \cdot SW^A\)`. --- ## Weights for Censoring - We will estimate `\(f(C=0 \vert A, L)\)` using a logistic model adjusting for + the exposure + sex, race, age, education, exercise, life activity + linear and quadratic terms for smoking intensity, duration, and baseline weight. - The `dat.u` data frame includes data for all participants. ```r dat.u$cens <- ifelse(is.na(dat.u$wt82), 1, 0) fcal_fit <- glm( cens ~ as.factor(qsmk) + as.factor(sex) + as.factor(race) + age + I(age ^ 2) + as.factor(education) + smokeintensity + I(smokeintensity ^ 2) + smokeyrs + I(smokeyrs ^ 2) + as.factor(exercise) + as.factor(active) + wt71 + I(wt71 ^ 2), family = binomial(), data = dat.u ) fc0al <- 1 - predict(fcal_fit, type = "response") ``` --- ## Weights for Censoring - We can estimate the numerator of the stabilized weight, `\(f(C = 0 \vert A)\)`, using the average within levels of `\(A\)`. ```r fca_fit <- glm(cens ~ as.factor(qsmk), family = binomial(), data = dat.u) fc0a <- 1 - predict(fca_fit, type = "response") sw.c <- fc0a/fc0al dat$sw.c <- sw.c[!is.na(dat.u$wt82)] dat$sw.comb <- dat$sw*dat$sw.c ``` ```r summary(dat$sw.c) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.9442 0.9782 0.9858 0.9990 1.0035 1.7180 ``` ```r summary(dat$sw.comb) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3390 0.8593 0.9442 0.9974 1.0749 4.1829 ``` --- ## Combined Weights Results - Below are results using the combined weights for both censoring and confounding. <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> Estimate </th> <th style="text-align:left;"> 95% CI </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Intercept </td> <td style="text-align:left;"> 1.69 </td> <td style="text-align:left;"> (1.23, 2.14) </td> </tr> <tr> <td style="text-align:left;"> Quit Smoking </td> <td style="text-align:left;"> 3.4 </td> <td style="text-align:left;"> (2.36, 4.45) </td> </tr> </tbody> </table> - Compared with the previous results: <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> Estimate </th> <th style="text-align:left;"> 95% CI </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Intercept </td> <td style="text-align:left;"> 1.78 </td> <td style="text-align:left;"> (1.34, 2.22) </td> </tr> <tr> <td style="text-align:left;"> Quit Smoking </td> <td style="text-align:left;"> 3.44 </td> <td style="text-align:left;"> (2.41, 4.47) </td> </tr> </tbody> </table> --- ## Survey Weights - Typically, nationally representative surveys won't use a completely random sampling scheme. + They will try to over-sample smaller groups. + This allows them to get better estimates for less common covariate values. - Over-sampling introduces selection bias that needs to be corrected. - If the survey was done well, a specific over-sampling rule was used. + For example, you might select census tracts randomly according to a covariate-informed probability and then select houses randomly within each census tract. - This allows the survey administrators to compute `\(P[C = 0 \vert L]\)` for every participant. --- ## Survey Weights - If `\(L\)` includes every factor affecting study inclusion, then `\(P[C = 0 \vert A, L] = P[C = 0 \vert L]\)` for any exposure `\(A\)`. - So we could use the same set of weights regardless of what exposure we are looking at. - Usually, survey administrators will compute `\(P[ C = 0 \vert L]\)` and distribute `\(1/P[C = 0 \vert L]\)` as "survey weights". + Survey weights are censoring weights! - If you have survey data that come with survey weights, be sure to read about how these are calculated. + Most likely, if you want to estimate a parameter for the larger population, you will need to use the survey weights. --- # 3. G-Computation (Standardization) --- ## Standardization - In standardization, we compute an estimate of `\(E[Y \vert A = a, L = l]\)` and plug it into the g-formula: $$ E[Y(a)] = \sum_l E[Y \vert A = a, L = l]P[L = l] $$ - If `\(L\)` is continuous, we need to integrate $$ E[Y(a)] = \int_l E[Y \vert A = a, L = l]f(l) dl $$ --- ## Outcome Modeling - In the standardization formula, we need to estimate `\(E[Y \vert A = a, L = l]\)`. - To do this, we fit a linear regression: ```r f_y <- glm( wt82_71 ~ qsmk + sex + race + age + I(age * age) + as.factor(education) + smokeintensity + I(smokeintensity * smokeintensity) + smokeyrs + I(smokeyrs * smokeyrs) + as.factor(exercise) + as.factor(active) + wt71 + I(wt71 * wt71) + qsmk * smokeintensity, data = dat.u) summary(f_y)$coefficients %>% head() ``` ``` ## Estimate Std.
Error t value Pr(>|t|) ## (Intercept) -1.588165662 4.313035924 -0.3682245 0.7127562582 ## qsmk 2.559594093 0.809148613 3.1633177 0.0015901758 ## sex -1.430271659 0.468957553 -3.0498958 0.0023281370 ## race 0.560109604 0.581888778 0.9625716 0.3359131882 ## age 0.359635262 0.163318764 2.2020450 0.0278094360 ## I(age * age) -0.006100955 0.001726136 -3.5344591 0.0004206537 ``` --- ## Outcome Modeling - This first step of fitting the outcome model is just fitting a regular linear regression. - If we believe the model, then the coefficient for treatment gives a *conditional* causal effect estimate within strata of `\(L\)`. - If there were no interactions between `\(A\)` and `\(L\)` in the model, the coefficient on `\(A\)` would give the marginal treatment effect. + This is true using a linear model but would not be true in a model with a non-linear link. - An outcome model with no interactions assumes **homogeneity**, i.e., that the average treatment effect is the same in every level of `\(L\)`. + Typically, homogeneity is considered to be a very strong assumption. --- ## Outcome Modeling - Our outcome model included an interaction between `\(A\)` and smoking intensity. - In this model, the treatment effect varies across strata of smoking intensity: the coefficient on `\(A\)` is the effect when smoking intensity is zero, and the interaction coefficient describes how the effect changes with smoking intensity. - To estimate the average (marginal) treatment effect, we need to average over the distribution of smoking intensity, or more generally the distribution of `\(L\)`. --- ## Standardizing (G-Computation) - Fortunately, we do not need to compute `\(f(L)\)`. - Using the definition of expectation, we can write $$ \int_l E[Y \vert A = a, L = l]f(l)dl = E\left[ E\left[Y \vert A = a, L \right]\right] $$ - So we can estimate the standardized mean as $$ \frac{1}{n}\sum_{i=1}^n\hat{E}[Y \vert A = a, L_i] $$ --- ## Standardizing - For each person, we need to compute two values: `\(\hat{E}[Y \vert A = 0, L_i]\)` and `\(\hat{E}[Y\vert A = 1, L_i]\)`. - These are predicted values that we can get out of our previous regression. - We predict each person's outcome with `\(A = 0\)` and with `\(A = 1\)`. ```r # Make two copies of the data dat0 <- dat1 <- dat.u # In one copy everyone got treatment 0 dat0$qsmk <- 0 # In the second copy everyone got treatment 1 dat1$qsmk <- 1 Y0 <- predict(f_y, newdata = dat0, type = "response") %>% mean() Y1 <- predict(f_y, newdata = dat1, type = "response") %>% mean() cat(round(Y1, digits = 2), "-", round(Y0, digits = 2), "=", round(Y1-Y0, digits = 2)) ``` ``` ## 5.18 - 1.66 = 3.52 ``` --- ## Censoring - If the set of variables in our model for `\(E[Y \vert A, L]\)` is sufficient to adjust for censoring (meaning that it blocks all paths between `\(Y(C = 0)\)` and `\(C\)`), then we don't need to do anything extra to adjust for censoring. - The trick is to predict `\(Y(a)\)` for **all individuals** with covariate data, not just those with non-missing outcomes. - If `\(L\)` is not sufficient to adjust for censoring, we may need to add additional variables specifically to deal with censoring. - Recall that the s-backdoor criterion gives us a way to select variables to condition on. --- ## Estimating the Variance - To compute the variance of the estimate, we need to bootstrap. - We will repeatedly re-sample our data with replacement. - Each time, we re-fit the outcome model and recompute the standardized effect. - The standard deviation of our bootstrap samples consistently estimates the standard error of our estimate.
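- The next two slides implement this by hand; equivalently, the resampling can be delegated to the `boot` package. A minimal sketch (assuming the `std_mean_func` helper defined on the next slide):

```r
library(boot)
# Bootstrap the standardized effect; the statistic refits the
# outcome model and recomputes the effect on each resample.
b <- boot(data = dat.u,
          statistic = function(d, i) std_mean_func(d[i, ]),
          R = 500)
sd(b$t[, 1])              # bootstrap standard error
boot.ci(b, type = "perc") # percentile confidence interval
```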
--- ## Bootstrapping the Variance - Step 1: Write a function to compute the standardized mean ```r # Step 1: Write a function to compute the standardized mean std_mean_func <- function(d){ boot_fit <- glm( wt82_71 ~ qsmk + sex + race + age + I(age * age) + as.factor(education) + smokeintensity + I(smokeintensity * smokeintensity) + smokeyrs + I(smokeyrs * smokeyrs) + as.factor(exercise) + as.factor(active) + wt71 + I(wt71 * wt71) + qsmk * smokeintensity, data = d) d0 <- d1 <- d d0$qsmk <- 0 d1$qsmk <- 1 Y0 <- predict(boot_fit, newdata = d0, type = "response") %>% mean() Y1 <- predict(boot_fit, newdata = d1, type = "response") %>% mean() return(Y1-Y0) } ``` --- ## Bootstrapping the Variance - Step 2: Repeatedly re-sample the data and compute the standardized mean in the re-sampled data set. ```r set.seed(1) samples <- replicate(n = 500, expr = sample(seq(nrow(dat.u)), size = nrow(dat.u), replace = TRUE)) res <- apply(samples, 2, FUN = function(ix){ boot_dat <- dat.u[ix,] boot_est <- std_mean_func(boot_dat) return(boot_est) }) se <- sd(res) eff <- Y1-Y0 ci <- eff + c(-1, 1)*qnorm(0.975)*se cat(round(eff, digits = 2), "(", round(ci, digits = 2), ")") ``` ``` ## 3.52 ( 2.58 4.46 ) ``` --- ## Effect Modification in G-Computation - If we want to estimate the average causal effect stratified by `\(V\)`, we can simply stratify by `\(V\)` before averaging our predicted values. ```r dat.u$Y0 <- predict(f_y, newdata = dat0, type = "response") dat.u$Y1 <- predict(f_y, newdata = dat1, type = "response") dat.u %>% group_by(sex) %>% summarize(Y0 = mean(Y0), Y1 = mean(Y1)) ``` ``` ## # A tibble: 2 × 3 ## sex Y0 Y1 ## <dbl> <dbl> <dbl> ## 1 0 1.66 5.30 ## 2 1 1.66 5.06 ``` --- ## IP Weighting vs G-Computation - If all of our models are fully saturated, IP weighting and standardization give exactly the same result (recall HW1). - With parametric models, the two methods rely on different assumptions and are not the same. - In the IP weighted analysis, we need to trust our model for `\(f(A \vert L)\)`. - In the standardized analysis, we need to trust our model for `\(E[Y \vert A, L]\)`. - In either case, if our parametric models are wrong, our predictions will be biased and so our estimators will be biased. - It may be wise to compute both and compare. --- ## IP Weighting vs G-Computation - Chatton et al (2020) present simulation results comparing IPW, G-computation, and two methods we haven't talked about yet. - In their simulation scenario, G-computation always had a lower MSE than IPW. - For IPW, they calculated variance using robust SEs (as we did earlier) and found slight over-coverage (~97% for a 95% target). - They used parametric simulation to compute the variance of the G-computation estimate. This gave close to target coverage but slightly under in some cases (~94%). --- ## Chatton et al Results <img src="5_modeling1_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" /> --- ## IP Weighting vs G-Computation - Snowden et al (2010) describe G-Computation as rarely used in epidemiology due to lack of awareness. - However, G-Computation can be more precise than IP weighting in some circumstances. + IP weighting can be inefficient when weights are very large. - Vansteelandt and Keiding find that IPW performs better when the covariate distribution is very different in cases and controls. + Depending on the model used for G-Computation, the dominant group may have too much influence over estimates in these regions.
+ G-Computation may under-estimate the uncertainty about the counterfactuals in these regions. - Some suggest that we shouldn't be trying to estimate the counterfactuals in these regions at all, since this requires extrapolating into regions with low data density.
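--- ## NHEFS Estimates Side by Side - Following the advice above to compute both and compare, here is a minimal sketch collecting the IP weighted (Hajek) estimate and the g-computation estimate obtained earlier in the lecture (reusing `f_w_gee`, `Y1`, and `Y0` from the slides above):

```r
# The two NHEFS estimates of the effect of quitting smoking on weight
# change: about 3.4 kg (IP weighting) and 3.5 kg (g-computation).
c(ipw   = unname(coef(f_w_gee)["qsmk"]),
  gcomp = Y1 - Y0)
```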