Resampling and Re-Scaling Summary and Individual Level Data
Source:vignettes/resampling.Rmd
resampling.Rmd
library(GWASBrewer)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)
set.seed(1)Introduction
In some applications, we may need to generate multiple data sets for
the same set of effect sizes. This would mimic performing multiple GWAS
on the same trait. There are two functions,
resample_sumstats and resample_inddata that
facilitate this. This vignette will demonstrate these functions as well
as discuss important considerations when the simulation involves
sampling GWAS for multiple ancestries for the same trait.
To get started, we will generate a small sim_mv object.
For a review of that function, see the “Simulating Data” vignette. For
this vignette, we will keep things very simple by having only 12
variants with very large effect sizes. Of course, this is not realistic,
but it will let us see exactly what is going on. In our example, we will
have two traits. The first trait has an effect of 0.2 on the second
trait. Recalling that sim_mv always sets the variance of
each trait to 1, this means, that trait 1 explains 4% (\(0.2^2 = 0.04\)) of the variance of trait 2.
We will also use an LD pattern. Here we just specify the LD pattern for
five variants. This will be repeated 2.2 times to cover 12 variants. We
will specify N so that there is no sample overlap.
set.seed(1000)
my_ld1 <- matrix(c( 1.00, 0.60, 0.40, -0.1, -0.07,
0.60, 1.00, 0.60, -0.1, -0.07,
0.40, 0.60, 1.00, -0.1, -0.07,
-0.10, -0.10, -0.10, 1.0, 0.90,
-0.07, -0.07, -0.07, 0.9, 1.00), nrow = 5, byrow = T)
af1 <- c(0.35, 0.3, 0.4, 0.72, 0.75)
G <- matrix(c(0, 0.2, 0, 0), nrow = 2, byrow =TRUE) # matrix of causal effects
orig_dat <- sim_mv(N = c(10000, 20000),
J = 12,
h2 = c(0.2, 0.3),
pi = 0.4,
G = G,
R_LD = list(my_ld1),
af = af1,
est_s = TRUE)
#> SNP effects provided for 12 SNPs and 2 traits.Resampling Summary Statistics or Individual Level Data from the Same Population
In the simplest scenario, we want to simulate performing a new GWAS or collecting new individual level data for the original set of traits in the original population, meaning that the LD pattern and allele frequencies are the same.
Resampling Summary Statistics from the Same Population
To resample summary statistics, we can use
resample_sumstats. Here, we will regenerate data assuming
that samples for the new GWAS are totally overlapping.
N1 <- matrix(50000, nrow = 2, ncol = 2)
new_dat1 <- resample_sumstats(dat = orig_dat,
N = N1,
R_LD = list(my_ld1),
af = af1,
est_s = TRUE)
#> I will assume that the environmental variance is the same in the old and new population.
#> I will assume that environmental correlation is the same in the old and new population. Note that this could result in different overall trait correlations.
#> SNP effects provided for 12 SNPs and 2 traits.The new_dat1 object is another object of class
sim_mv. The new and old data have the same effect sizes so
beta_joint, beta_marg,
direct_SNP_effects_joint, and
direct_SNP_effects_marg are all the same between the two
objects. They also have the same causal effect structure so the trait
effect matrices direct_trait_effects and
total_trait_effects are the same. However, the summary
statistic matrices beta_hat and s_estimate are
different due to random sampling variation. The true standard errors,
se_beta_hat are also different because the sample size is
different. Finally, since there is sample overlap in the samples for
new_dat1 but not in orig_dat, the
R matrices which give the correlation of effect sizes
across traits are different.
all.equal(orig_dat$beta_joint, new_dat1$beta_joint)
#> [1] TRUE
all.equal(orig_dat$beta_marg, new_dat1$beta_marg)
#> [1] TRUE
head(orig_dat$beta_hat)
#> [,1] [,2]
#> [1,] -0.11039241 0.004995358
#> [2,] -0.18785203 0.002766487
#> [3,] -0.11483385 0.018581987
#> [4,] 0.22595316 0.039173114
#> [5,] 0.19996902 0.037651726
#> [6,] -0.01467208 -0.304974486
head(new_dat1$beta_hat)
#> [,1] [,2]
#> [1,] -0.116203623 0.0092874402
#> [2,] -0.175604708 0.0220111130
#> [3,] -0.110532292 -0.0007815694
#> [4,] 0.251954283 0.0446050591
#> [5,] 0.231799766 0.0426176263
#> [6,] 0.005154759 -0.3105532234
head(orig_dat$se_beta_hat)
#> [,1] [,2]
#> [1,] 0.01482499 0.01048285
#> [2,] 0.01543033 0.01091089
#> [3,] 0.01443376 0.01020621
#> [4,] 0.01574852 0.01113589
#> [5,] 0.01632993 0.01154701
#> [6,] 0.01482499 0.01048285
head(new_dat1$se_beta_hat)
#> [,1] [,2]
#> [1,] 0.006629935 0.006629935
#> [2,] 0.006900656 0.006900656
#> [3,] 0.006454972 0.006454972
#> [4,] 0.007042952 0.007042952
#> [5,] 0.007302967 0.007302967
#> [6,] 0.006629935 0.006629935
head(orig_dat$s_estimate)
#> [,1] [,2]
#> [1,] 0.01487042 0.01046558
#> [2,] 0.01514898 0.01089173
#> [3,] 0.01437689 0.01018576
#> [4,] 0.01556285 0.01122621
#> [5,] 0.01620126 0.01160266
#> [6,] 0.01489534 0.01026832
head(new_dat1$s_estimate)
#> [,1] [,2]
#> [1,] 0.006602645 0.006619905
#> [2,] 0.006844339 0.006885392
#> [3,] 0.006414776 0.006430850
#> [4,] 0.006920035 0.007005126
#> [5,] 0.007188453 0.007256993
#> [6,] 0.006635957 0.006485925Resampling Individual Level Data from the Same Population
We can also generate individual level data. For this we use the
resample_inddata function. This function can be used to
produce three types of output:
- Genotypes only
- Genotypes and phenotypes
- Phenotypes only
resample_inddata uses the hapsim package to
generate individual level data. The sample size argument,
N, in resample_inddata can accept three types
of input: scalar, vector, or data frame. These formats are the same as
those used by sim_mv, so if N is a scalar or
vector, resample_inddata will assume there is no sample
overlap. This means that if we use N = 10 with two traits,
we will get out genotype information for 20 individuals, of which half
have phenotype 1 information and half have phenotype 2 information. To
specify sample overlap, we can use the data frame format. As a reminder,
a sample size data frame should have columns named trait_1,
… trait_[M] and N. The trait_[x]
columns will be interpreted as logicals and the N column
should give the number of samples in each combination of studies.
Generating genotypes only
If resample_inddata is used to generate genotypes only,
we are not really “resampling” anything because there is no need to
input trait data. In this case, we only need to supply the total number
of individuals, the number of variants, and the LD pattern if desired.
Here we will generate data for 15 individuals with an LD pattern
matching the one used in orig_dat. The returned data
includes the genotype matrix and a vector of population allele
frequencies.
genos_only <- resample_inddata(N = 15,
J = 12,
R_LD = list(my_ld1),
af = af1)
#> Generating genotype matrix only.
names(genos_only)
#> [1] "X" "af"
dim(genos_only$X)
#> [1] 15 12
length(genos_only$af)
#> [1] 12
genos_only$X
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
#> S1 2 1 1 1 1 0 0 0 2 2 1 1
#> S2 1 0 0 2 2 1 1 0 1 1 2 2
#> S3 0 0 0 0 1 1 1 1 1 1 2 2
#> S4 0 0 0 1 1 0 0 0 0 0 2 2
#> S5 0 0 0 2 2 0 1 0 1 2 2 1
#> S6 1 1 1 0 0 0 0 1 1 1 1 1
#> S7 0 0 1 0 0 1 0 1 1 1 2 2
#> S8 1 1 1 2 2 1 0 1 2 2 0 2
#> S9 0 0 0 1 1 1 0 0 2 2 1 1
#> S10 1 1 1 2 2 1 1 1 2 2 1 1
#> S11 1 0 0 1 1 0 0 1 1 1 2 2
#> S12 1 2 2 1 1 1 1 2 2 2 2 2
#> S13 0 0 0 2 1 0 0 1 1 1 2 2
#> S14 0 0 0 2 2 1 1 1 2 2 1 1
#> S15 0 0 0 2 2 1 0 2 0 1 2 2Generating Genotypes and Phenotypes
To generate both genotypes and phenotypes, we need to include a
sim_mv object that contains effect sizes. For our example,
we will use the data frame sample size format to indicate that three
individuals have both phenotypes measured and seven have only one or the
other measured. Note that we don’t need to include the J
argument because the sim_mv object contains all the
necessary information. The vector of M phenotypes for each
individual is simulated as \(Y_i = Y_{G,i} +
Y_{E,i}\) where \(Y_{G,i}\) is
the vector of genetic components computed from the genotypes and effect
sizes stored in the original sim_mv object and \(Y_{E,i}\) is a vector of environmental
components sampled from a normal distribution with variance given by the
Sigma_E matrix in the original sim_mv
object.
N <- data.frame("trait_1" = c(1, 1, 0), "trait_2" = c(0, 1, 1), "N" = c(3, 3, 4))
new_ind_dat1 <- resample_inddata(N = N,
dat = orig_dat,
R_LD = list(my_ld1),
af = af1,
calc_sumstats = FALSE)
#> Generating both genotypes and phenotypes.
#> SNP effects provided for 12 SNPs and 2 traits.
#> I will assume that the environmental variance is the same in the old and new population.
#> I will assume that environmental correlation is the same in the old and new population. Note that this could result in different overall trait correlations.Genotype data are stored in new_ind_dat1$X and phenotype
data are in new_ind_dat1$Y.
names(new_ind_dat1)
#> [1] "X" "Y" "af" "Sigma_G" "Sigma_E"
#> [6] "pheno_sd" "h2" "trait_corr" "beta_joint"
new_ind_dat1$X
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
#> S1 0 1 1 1 1 0 0 0 1 1 1 2
#> S2 1 1 1 2 2 1 1 1 1 1 1 1
#> S3 1 1 2 1 1 0 0 0 1 1 1 1
#> S4 0 0 0 1 1 1 1 1 1 1 1 1
#> S5 0 0 0 1 1 1 1 1 2 2 1 1
#> S6 0 0 0 2 2 1 1 0 2 2 2 1
#> S7 0 0 0 2 2 2 1 1 1 1 0 0
#> S8 0 0 0 1 1 0 0 1 2 2 2 2
#> S9 2 2 1 0 0 0 1 1 2 2 2 2
#> S10 0 1 1 2 2 1 1 1 1 1 2 1
new_ind_dat1$Y
#> y_1 y_2
#> 1 -0.64246004 NA
#> 2 1.26150568 NA
#> 3 -0.05425048 NA
#> 4 0.47374118 0.77563114
#> 5 -0.66257053 -0.64414580
#> 6 1.34406827 -0.08414463
#> 7 NA 0.42914109
#> 8 NA -2.21483702
#> 9 NA -1.61651866
#> 10 NA 0.55091258The returned object also includes some additional information
including Sigma_G, Sigma_E, and
beta_joint which have the same meaning as the objects with
the same name in a sim_mv object. If we had set
calc_sumstats to TRUE, the object would also
include effect size and standard error estimates for the association
between each variant and each trait.
Generating Phenotypes Only
Finally, we can use resample_inddata to calculate
phenotypes for a previously generated set of genotypes from a
sim_mv object. To do this, we supply the genotypes matrix
to the genos argument. If this is supplied, no genotypes
matrix will be returned. Even though the genotype matrix is supplied, we
also need to supply the R_LD and af arguments
for the population LD and allele frequencies. These are used to
correctly calculate the genetic variance-covariance matrix. Note that it
is important that the total number of individuals implied by the sample
size argument match the number of rows in the genotypes matrix. The
following call will generate an error because we supplied 15 genotypes
but the sample size data frame only includes 10 individuals.
phenos_only <- resample_inddata(N = N,
dat = orig_dat,
genos = genos_only$X,
R_LD = list(my_ld1),
af = af1,
calc_sumstats = FALSE)
#> Generating phenotypes only.
#> Error in check_matrix(genos, "genos", ntotal, J): Expected genos to have 10 rows, found 15We can fix the poblem by only including the first 10 rowsof the genotype matrix.
phenos_only <- resample_inddata(N = N,
dat = orig_dat,
genos = genos_only$X[1:10,],
R_LD = list(my_ld1),
af = af1,
calc_sumstats = FALSE)
#> Generating phenotypes only.
#> SNP effects provided for 12 SNPs and 2 traits.
#> I will assume that the environmental variance is the same in the old and new population.
#> I will assume that environmental correlation is the same in the old and new population. Note that this could result in different overall trait correlations.
names(phenos_only)
#> [1] "Y" "af" "Sigma_G" "Sigma_E" "pheno_sd"
#> [6] "h2" "trait_corr" "beta_joint"
phenos_only$Y
#> y_1 y_2
#> 1 1.22836882 NA
#> 2 -0.42435626 NA
#> 3 -0.62076763 NA
#> 4 -0.86060862 0.5125080
#> 5 0.09021464 -1.0949420
#> 6 -0.69423142 -1.4815503
#> 7 NA -1.1507476
#> 8 NA 0.1640274
#> 9 NA -0.4314887
#> 10 NA -0.8675082Resampling Data from a Different Population
Both resample_sumstats and resample_inddata
can be used as above with different LD and allele frequencies than used
to create the original object. However, there are some special
considerations about trait variance and scaling that we will discuss in
this section.
Understanding Effect Size Units
To understand resampled data that use different LD or allele
frequencies than the original data, it is important to understand the
units of effect sizes produced by sim_mv.
sim_mv always assumes that phenotype variance is equal to
1, so the interpretation of the effects in beta_joint is
the change in phenotype in units of phenotype SD per change in genotype
in units of either allele or genotype SD. To see the genotype unit, we
can check the geno_scale object. For example,
orig_dat$geno_scale
#> [1] "allele"By default, sim_mv will use the per-allele scale if
allele frequencies are available and otherwise use the per-SD scale. We
can also check the the phenotypes are scaled to unit variance in
pheno_sd which will always be a vector of 1’s if the object
was produced by sim_mv.
orig_dat$pheno_sd
#> [1] 1 1Changing LD and Allele Frequencies
The resampling functions in GWASBrewer assume that
effect sizes in the new population are the same as in the old population
on the genotype and phenotype scale given. If you are changing
populations, it is much more sensible to assume that effect sizes are
constant on the per-allele scale than on the per-genotype SD scale. This
means that if you want to regenerate data from a different population,
it is a good idea to start with input data that have
geno_scale equal to “allele”.
One consequence of differing allele frequencies and LD structure is that the total genetic variance will be different in the new population than in the old. By default, the resampling functions will assume that the environmental variance is the same in the two populations. This means that the overall variance of the phenotype in the new population will probably not be 1. The resampling functions will not rescale the phenotype in the new population because this would mean that the phenotypes in the two populations had different units and were not comparable. However, you can rescale the output after the fact using if you would like the phenotypes to have unit variance.
We can see this in action by resampling summary statistics using different allele frequencies and LD pattern than used originally.
af2 <- rep(0.1, 5)
my_ld2 <- matrix(0.3, nrow = 5, ncol = 5)
diag(my_ld2) <- 1
new_dat2 <- resample_sumstats(dat = orig_dat,
N = N,
R_LD = list(my_ld2),
af = af2)
#> Genetic variance in the new population differs from the genetic variance in the old population.
#> I will assume that the environmental variance is the same in the old and new population.
#> I will assume that environmental correlation is the same in the old and new population. Note that this could result in different overall trait correlations.
#> Note that the phenotype in the new population has a different variance from the phenotype in the old population.
#> I will keep the phenotype on the same scale as the original data, so effect sizes in the old and new object are comparable. If you would like to rescale the phenotype to have variance 1, use rescale_sumstats.
#> SNP effects provided for 12 SNPs and 2 traits.Notice that the function produces a message to let us know that the
phenotype has a different variance in the new population. We can check
by looking at pheno_sd. The genetic covariance matrix,
heritability, and overall trait correlation will also be different.
However, the environmental covariance will be the same.
new_dat2$pheno_sd
#> [1] 0.9649663 0.9164662
orig_dat$pheno_sd
#> [1] 1 1
new_dat2$h2
#> [1] 0.04683348 0.11545049
orig_dat$h2
#> [1] 0.1124495 0.2570578
new_dat2$Sigma_G
#> [,1] [,2]
#> [1,] 0.043609461 0.008093447
#> [2,] 0.008093447 0.096968051
orig_dat$Sigma_G
#> [,1] [,2]
#> [1,] 0.11244947 0.01934268
#> [2,] 0.01934268 0.25705783
new_dat2$Sigma_E
#> [,1] [,2]
#> [1,] 0.8875505 0.1736201
#> [2,] 0.1736201 0.7429422
orig_dat$Sigma_E
#> [,1] [,2]
#> [1,] 0.8875505 0.1736201
#> [2,] 0.1736201 0.7429422Note that the features that are kept constant are the environmental
variance Sigma_E and the per-allele joint effects,
beta_joint.
Changing Environmental Variance or Covariance
Although the default behavior is to keep Sigma_E
constant, we can change this using the arguments
new_env_var, new_h2, new_R_E and
new_R_obs. These are options for both
resample_sumstats and resample_inddata. The
new_env_var and new_h2 arguments provide
different ways of specifying the environmental variance so only one of
these can be used at a time. new_h2 gives the heritability
in the new population while new_env_var directly gives the
environmental variance. The parameters new_R_E and
new_R_obs are alternate ways of specifying environmental
correlation. new_R_E directly specifies the enviromental
correlation while new_R_obs gives the total trait
correlation. Only one of these two parameters can be supplied.
new_R_E <- diag(2)
new_R_E[1,2] <- new_R_E[2,1] <- 0.4
new_dat3 <- resample_sumstats(dat = orig_dat,
N = N,
R_LD = list(my_ld2),
af = af2,
new_env_var = c(0.9, 1.3),
new_R_E = new_R_E)
#> Genetic variance in the new population differs from the genetic variance in the old population.
#> Note that the phenotype in the new population has a different variance from the phenotype in the old population.
#> I will keep the phenotype on the same scale as the original data, so effect sizes in the old and new object are comparable. If you would like to rescale the phenotype to have variance 1, use rescale_sumstats.
#> SNP effects provided for 12 SNPs and 2 traits.
new_dat3$Sigma_E
#> [,1] [,2]
#> [1,] 0.9000000 0.4326662
#> [2,] 0.4326662 1.3000000
cov2cor(new_dat3$Sigma_E)
#> [,1] [,2]
#> [1,] 1.0 0.4
#> [2,] 0.4 1.0
new_dat3$h2
#> [1] 0.04621558 0.06941322Note that specifying new_h2 will always result in
exactly the heritability specified. This is in contrast with the
h2 argument of sim_mv which gives the expected
heritability.
new_dat4 <- resample_sumstats(dat = orig_dat,
N = N,
R_LD = list(my_ld2),
af = af2,
new_h2 = c(0.15, 0.25),
new_R_E = new_R_E)
#> Genetic variance in the new population differs from the genetic variance in the old population.
#> Note that the phenotype in the new population has a different variance from the phenotype in the old population.
#> I will keep the phenotype on the same scale as the original data, so effect sizes in the old and new object are comparable. If you would like to rescale the phenotype to have variance 1, use rescale_sumstats.
#> SNP effects provided for 12 SNPs and 2 traits.
new_dat4$h2
#> [1] 0.15 0.25
new_dat4$Sigma_E
#> [,1] [,2]
#> [1,] 0.2471203 0.1072480
#> [2,] 0.1072480 0.2909042Rescaling Effect Size Units
If for any reason we want to change the units of effects in an object
produced by sim_mv or resmaple_sumstats, we
can use rescale_sumstats. For example, in the code below,
we rescale the effects in orig_dat to be on the per-SD
scale. Note that by doing this, we will delete the allele frequency
information in the snp_info table. This is because allele
frequency is irrelevant for per-SD scaled effects. To convert back to
per-allele scale, we would need to supply the allele frequency.
orig_rescale1 <- rescale_sumstats(dat = orig_dat,
output_geno_scale = "sd")
orig_rescale1$beta_joint
#> [,1] [,2]
#> [1,] 0.000000000 0.0000000000
#> [2,] -0.097136125 0.0188220320
#> [3,] 0.000000000 0.0000000000
#> [4,] 0.146918495 0.0293836989
#> [5,] 0.000000000 0.0000000000
#> [6,] 0.000000000 0.0061492991
#> [7,] -0.003777444 -0.0007554889
#> [8,] 0.000000000 -0.5047783058
#> [9,] 0.000000000 0.0000000000
#> [10,] 0.000000000 0.0000000000
#> [11,] 0.000000000 0.0000000000
#> [12,] -0.280286489 -0.0560572978
orig_rescale1$geno_scale
#> [1] "sd"
orig_dat$beta_joint
#> [,1] [,2]
#> [1,] 0.000000000 0.000000000
#> [2,] -0.149884295 0.029043026
#> [3,] 0.000000000 0.000000000
#> [4,] 0.231374881 0.046274976
#> [5,] 0.000000000 0.000000000
#> [6,] 0.000000000 0.009116327
#> [7,] -0.005828723 -0.001165745
#> [8,] 0.000000000 -0.728584727
#> [9,] 0.000000000 0.000000000
#> [10,] 0.000000000 0.000000000
#> [11,] 0.000000000 0.000000000
#> [12,] -0.432491442 -0.086498288
## back to per-allele scale
orig_rescale2 <- rescale_sumstats(dat = orig_rescale1,
output_geno_scale = "allele",
af = orig_dat$snp_info$AF)
orig_rescale2$beta_joint
#> [,1] [,2]
#> [1,] 0.000000000 0.000000000
#> [2,] -0.149884295 0.029043026
#> [3,] 0.000000000 0.000000000
#> [4,] 0.231374881 0.046274976
#> [5,] 0.000000000 0.000000000
#> [6,] 0.000000000 0.009116327
#> [7,] -0.005828723 -0.001165745
#> [8,] 0.000000000 -0.728584727
#> [9,] 0.000000000 0.000000000
#> [10,] 0.000000000 0.000000000
#> [11,] 0.000000000 0.000000000
#> [12,] -0.432491442 -0.086498288We can also change the scale of the outcomes.
orig_rescale3 <- rescale_sumstats(dat = orig_dat,
output_geno_scale = "allele",
output_pheno_sd = c(1.4, 0.8))There are a few effects of changing the phenotype scale. First the effect sizes are different. In this case the effect sizes for trait 1 have been multiplied by 1.4 and the effect sizes for trait 2 have been multiplied by 0.8. Second, the genetic and environmental covariance matrices have been scaled appropriately.
orig_rescale3$Sigma_G
#> [,1] [,2]
#> [1,] 0.22040097 0.02166381
#> [2,] 0.02166381 0.16451701
orig_rescale3$Sigma_E
#> [,1] [,2]
#> [1,] 1.7395990 0.1944545
#> [2,] 0.1944545 0.4754830
orig_rescale3$Sigma_G + orig_rescale3$Sigma_E
#> [,1] [,2]
#> [1,] 1.9600000 0.2161183
#> [2,] 0.2161183 0.6400000Note that the heritability, orig_rescale3$h2 has not
changed but is no longer equal to the diagonal of Sigma_G.
The trait correlation in trait_corr is also the same. The
final difference is in the total and direct trait effects matrix. In our
case, trait 1 has been multiplied by 1.4 and trait 2 was multiplied by
0.8. On the original scale, a one unit increase in trait 1 caused a 0.2
unit increase in trait 2. So on the new scale, a 1 unit increase in
trait 1 causes a \(0.2 \cdot (0.8/1.4) =
0.114\) unit increase in trait 2.
Note that the causal relationship between the traits implies the following relationship between effect sizes:
with(orig_dat, all.equal(direct_SNP_effects_joint[,2] + total_trait_effects[1,2]*direct_SNP_effects_joint[,1], beta_joint[,2]))
#> [1] TRUEThis is the basis of methods like Mendelian randomization. We can see that after scaling, this relationship is still true.