Simulate multivariate GWAS data
Usage
sim_mv(
N,
J,
h2,
pi,
G = 0,
est_s = FALSE,
R_obs = NULL,
R_E = NULL,
R_LD = NULL,
af = NULL,
snp_effect_function = "normal",
snp_info = NULL,
sporadic_pleiotropy = TRUE,
pi_exact = FALSE,
h2_exact = FALSE
)
Arguments
- N
GWAS sample size. N can be a scalar, vector, or matrix. If N is a scalar, all GWAS have the same sample size and there is no overlap between studies. If N is a vector, each element of N specifies the sample size of the corresponding GWAS and there is no overlap between studies. If N is a matrix, N_ii specifies the sample size of study i and N_ij specifies the number of samples present in both study i and study j. The elements of N must be positive but non-integer values will not generate an error.
- J
Number of variants to simulate
- h2
A scalar or vector giving the heritability of each trait.
- pi
A scalar or vector giving the expected proportion of direct effect SNPs for each trait.
- G
Matrix of direct effects. Rows correspond to the 'from' trait and columns correspond to the 'to' trait, so
G[1,2]
is the direct effect of trait 1 on trait 2. G should have 0 on the diagonal.- est_s
If TRUE, return estimates of se(`beta_hat`). If FALSE, the exact standard error of `beta_hat` is returned. Defaults to FALSE.
- R_obs
Total observational correlation between traits. R_obs won't impact summary statistics unless there is sample overlap. See Details for default behavior.
- R_E
Total correlation of the environmental components only. R_E and R_obs are alternative methods of specifying trait correlation. Use only one of these two options. R_E may be phased out in the future.
- R_LD
Optional list of LD blocks. R_LD should have class
list
. Each element of R_LD can be either a) a matrix, b) a sparse matrix (classdsCMatrix
) or c) an eigen decomposition (classeigen
). All elements should be correlation matrices, meaning that they have 1 on the diagonal and are positive definite. See Details and vignettes.- af
Optional vector of allele frequencies. If R_LD is not supplied, af can be a scalar, vector or function. If af is a function it should take a single argument (n) and return a vector of n allele frequencies (See Examples). If R_LD is supplied, af must be a vector with length equal to the size of the supplied LD pattern (See Examples).
- snp_effect_function
Optional function to generate variant effects.
snp_effect_function
can be a single function or list of functions of length equal to the number of traits (see Details).- snp_info
Optional
data.frame
of variant information to be passed to variant effect functions. IfR_LD
is specified,snp_info
should have number of rows equal to the size of the supplied LD pattern. Otherwisesnp_info
should haveJ
rows.- sporadic_pleiotropy
Allow sporadic pleiotropy between traits. Defaults to TRUE.
- pi_exact
If TRUE, the number of direct effect SNPs for each trait will be exactly equal to
round(pi*J)
.- h2_exact
If TRUE, the heritability of each trait will be exactly `h2`.
- return_dat
Useful development option, not recommend for general users.
Value
A list with the following elements:
Simulated effect estimates and standard errors are contained in matrices
+ beta_hat
: Effect estimates for each trait
+ se_beta_hat
Standard error of effect estimates, equal to sqrt(1/N_m *Var(G_j)).
+ s_estimate
Estimate of se_beta_hat
. Present only if est_s = TRUE
.
Four matrices contain direct and total marginal and joint SNP-trait associations:
+ direct_SNP_effects_marg
and direct_SNP_effects_joint
give direct effects of SNPs on traits. These are the
same if there is no LD. If there is LD, direct_SNP_effects_marg
is the direct component of the expected marginal association.
+ beta_marg
and beta_joint
give the total SNP effects (direct and indirect). These are the
same if there is no LD. If there is LD, beta_marg
is the total expected marginal association. i.e. beta_marg
is the
expected value of beta_hat
.
Four matrices describe the covariance of traits and the row correlation of effect estimates
+ Sigma_G
Genetic variance-covariance matrix. Diagonal elements are equal to heritability.
+ Sigma_E
Environmental variance-covariance matrix.
+ trait_corr
Population correlation of traits. trait_corr = Sigma_G + Sigma_E
.
+ R
Row correlation of beta_hat - beta_marg
, equal to trait_corr
scaled by the overlap proportion matrix.
Finally,
+ snp_info
Is a data frame with variant information. If R_LD
was omitted, snp_info
contains only the allele frequency of
each variant. If R_LD
was included, snp_info
also contains block and replicate information as well as any information supplied to the
snp_info
input parameter.
Details
This function generates GWAS summary statistics from a linear SEM specified by the matrix G. The previous "xyz" mode is now deprecated. If you are used to using this function in xyz mode, you can generate the corresponding G using the xyz_to_G function (see examples).
G should be square (#traits by #traits) matrix with 0s on the diagonal.
All traits have variance 1, so G[i,j]^2
is the proportion of variance of trait j explained by the direct effect of trait i.
You will get an error if you specify a cyclic DAG. There are also some combinations of DAG and heritability that are impossible and
will give an error. For example in a two trait DAG where trait 1 has an effect of \(x\) on trait 2, and trait 1 has a heritability
of \(h^2_1\), trait 2 must have a heritability of at least \((x^2)*(h^2_1)\).
To generate data with LD, supply R_LD
which can be a list of matrices, sparse matrices, or eigen-decompositions.
These matrices are interpreted as blocks in a block-diagonal LD matrix. The af
argument must be provided
and should be a vector with length equal to the total size of the LD pattern (the sum of the sizes of each block).
R_LD
does not need to have the same size as J
. The LD pattern will be repeated or subset as necessary
to generate the desired number of variants.
If R_obs
is NULL (default value), we assume that direct environmental effects on each trait are independent and all
environmental correlation results from the relationships specified in G
. Alternatively R_obs
can be any positive
definite correlation matrix.
The snp_effect_function
argument supplies a custom function to generate standardized direct variant effects. By default,
standardized direct effects are generated from a normal distribution. This means that in the default mode, all variants
have equal expected heritability explained regardless of allele frequency or other features. The function(s) passed to
snp_effect_function
will override this default. This function should take three arguments, n
, sd
, and snp_info
.
The output should be a vector of length n with expected total sum of squares equal to sd^2
. Note: Even though the
final effects returned will usually be on the per-allele scale, snp_effect_function
should
return standardized (per-sd) effect sizes.
Examples
# Two traits with no causal relationship and some environmental correlation
# specify completely overlapping GWAS
N <- matrix(1000, nrow = 2, ncol =2)
R_obs <- matrix(c(1, 0.3, 0.3, 1), nrow = 2, ncol = 2)
dat <- sim_mv(G = 2, N = N, J = 20000, h2 = c(0.4, 0.3), pi = 1000/20000,
R_obs = R_obs)
#> SNP effects provided for 20000 SNPs and 2 traits.
dat$R # This is the true correlation of the estimation error of beta_hat
#> [,1] [,2]
#> [1,] 1.0 0.3
#> [2,] 0.3 1.0
cor(dat$beta_hat - dat$beta_marg) # Should be similar to dat$R
#> [,1] [,2]
#> [1,] 1.0000000 0.2935818
#> [2,] 0.2935818 1.0000000
# The af argument can be a scalar, vector, or function.
dat <- sim_mv(N = N, J = 20000, h2 = c(0.4, 0.3), pi = 1000/20000,
G = 2, R_obs = R_obs, af = function(n){rbeta(n = n, 1, 5)})
#> SNP effects provided for 20000 SNPs and 2 traits.
# A very simple example with LD
# Use a pattern of two small blocks of LD
A1 <- matrix(0.7, nrow = 10, ncol = 10)
diag(A1) <- 1
A2 <- matrix(0.1, nrow = 6, ncol = 6)
diag(A2) <- 1
# If using LD, af should have the same size as the LD pattern
af <- runif(n = 16)
dat <- sim_mv(N = N, J = 20000, h2 = c(0.4, 0.3), pi = 1000/20000,
G = 2, R_obs = R_obs, R_LD = list(A1, A2), af = af)
#> SNP effects provided for 20000 SNPs and 2 traits.
# Use xyz_to_G to generate G from xyz specification
myG <- xyz_to_G(tau_xz = c(0.2, -0.3), tau_yz = c(0.1, 0.25),
dir_xz = c(1, -1), dir_yz = c(1,1), gamma = 0)
# If N is a scalar or a vector, there is no sample overlap
dat <- sim_mv(N = 10000, J = 20000, h2 = rep(0.4, 4),
pi = c(500, 500, 1000, 1000)/20000,
G = myG)
#> SNP effects provided for 20000 SNPs and 4 traits.
plot(dat$beta_marg[,3], dat$beta_marg[,1])
abline(0, dat$total_trait_effects[3,1])