--- title: "Application of PhenotypeSimulator for a power analysis study" author: "Hannah Meyer" date: "`r Sys.Date()`" output: pdf_document: fig_caption: yes toc: true toc_depth: 2 bibliography: SimulationAndLinearModel.bib csl: plos-genetics.csl vignette: > %\VignetteIndexEntry{Simulation and linear model fit} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo = FALSE, message=FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>") library("PhenotypeSimulator") library("ggplot2") ``` This vignette demonstrates the usage and application of *PhenotypeSimulator*. It illustrates the workflow for generating a genetic and phenotypic dataset used for a simple model comparison. First, it shows an example of how to simulate genome-wide SNPs from a re-sampling based approach ([Hapgen2](http://mathgen.stats.ox.ac.uk/genetics_software/hapgen/hapgen2.html)). The genotypes can then be used with *PhenotypeSimulator* to generate multiple correlated phenotypes with defined genetic and non-genetic fixed effects as well as correlation induced by kinship structure. The simulated phenotypes, genotypes, non-genetic covariates and kinship are then used with a linear mixed model framework ( [GEMMA](http://www.xzlab.org/software/GEMMAmanual.pdf)) to investigate the power of multivariate (all phenotypes are analysed jointly) and univariate models (each phenotype is analysed separately) for genome-wide association studies (GWAS). Chunks that demonstrate the usage of external software in bash (Hapggen2 and GEMMA) and chunks that depend on the fully simulated genotypes or GWAS results are simply echoed and not evaluated. Chunks that show summary statistics and correlation plots are evaluated when building this vignette and use the summary data provided in the `extdata` folder of the package (the .Rmd version of the vignette shows the chunks saving the summary data). ## Genotype simulation via Hapgen2 Hapgen2 @Su2011 is a resampling-based genotype simulation tool. As such it needs a reference data set from which to sample genotypes. In the following, the 1000Genomes data from the [impute2 webpage](https://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated.html) are used as the reference dataset. We first download the data, unzip them and filter for central European samples (CEU): ```{bash, eval=FALSE} #hapdir=~/data/hmeyer/Supplementary/CEU.0908.impute.files dir=~/data/hmeyer/Supplementary mkdir $dir cd $dir wget https://mathgen.stats.ox.ac.uk/impute/ALL_1000G_phase1integrated_v3_impute.tgz tar ALL_1000G_phase1integrated_v3_impute.tgz hapdir=$dir/ALL_1000G_phase1integrated_v3_impute cd $hapdir for chr in `seq 1 22`; do gunzip ALL_1000G_phase1integrated_v3_chr${chr}_impute.hap.gz gunzip ALL_1000G_phase1integrated_v3_chr${chr}_impute.legend.gz done ``` To find the column numbers of the CEU samples, we search in the samples file for CEU and see that they are in columns 413-497. We then extract these columns from the all .hap files: ```{bash, eval=FALSE} grep -Fn CEU ALL_1000G_phase1integrated_v3.sample for chr in `seq 1 22`; do cut -f 413-497 -d " " ALL_1000G_phase1integrated_v3_chr${chr}_impute.hap \ > chr${chr}.ceu_subset.hap done ``` Hapgen2 and its user manual can be found and downloaded from [here](http://mathgen.stats.ox.ac.uk/genetics_software/hapgen/hapgen2.html). The following code chunk simulates genotypes for 1000 control individuals (no cases) on chromosomes chr1 to chr22. Subsequently, the number of variants on chr5, chr7 and chr11 are counted. These chromosomes were arbitrarily chosen to contain the causal variants. We will use those numbers later as input for *PhenotypeSimulator* (providing the number of variants on the causal chromosomes is optional, but will speed up the simulation procedure). ```{bash, eval=FALSE} for chr in `seq 1 22`; do # Can't get hapgen to work witout the -dl flag (should be possible since #v2.0.2) so create dummy variable, that can be passed on dummyDL=`sed -n '2'p $hapdir/ALL_1000G_phase1integrated_v3_chr${chr}_impute.legend | cut -d ' ' -f 2` hapgen2 -m $hapdir/genetic_map_chr${chr}_combined_b37.txt \ -l $hapdir/ALL_1000G_phase1integrated_v3_chr${chr}_impute.legend \ -h $hapdir/chr${chr}.ceu_subset.hap \ -o $hapdir/genotypes_chr${chr}_hapgen \ -n 1000 0 \ -dl $dummyDL 0 0 0 \ -no_haps_output done chr5=`wc -l $hapdir/genotypes_chr5_hapgen.controls.gen |cut -d " " -f 1` chr7=`wc -l $hapdir/genotypes_chr7_hapgen.controls.gen |cut -d " " -f 1` chr11=`wc -l $hapdir/genotypes_chr11_hapgen.controls.gen |cut -d " " -f 1` ``` For the estimation of the kinship between the individuals we make use of [PLINK](https://www.cog-genomics.org/plink2) @Chang2015. By specifying --oxford-single-chr, we indicate that the input format is the 'oxford' format (.gen/.sample format, see [here](http://www.stats.ox.ac.uk/~marchini/software/gwas/file_format.html)) and convert the data into plink format (--make-bed). We then strictly prune the genotypes ( --indep-pairwise and --extract) and save the chromosome-wide output file names into a file. This file is used for merging the chromosome-wide files into a single, genome-wide genotype file. Finally, we use this pruned, genome-wide SNP file to estimate the genetic relationship of the samples. ```{bash, eval=FALSE} cd $hapdir # convert to plink format and prune SNPs for chr in `seq 1 22`; do plink --data genotypes_chr${chr}_hapgen.controls \ --oxford-single-chr $chr \ --make-bed \ --out genotypes_chr${chr}_hapgen.controls plink --bfile genotypes_chr${chr}_hapgen.controls \ --indep-pairwise 50kb 5 1 \ --out genotypes_chr${chr}_hapgen.controls plink --bfile genotypes_chr${chr}_hapgen.controls \ --extract genotypes_chr${chr}_hapgen.controls.prune.in \ --make-bed \ --out genotypes_chr${chr}_hapgen.controls.pruned echo -e "genotypes_chr${chr}_hapgen.controls.pruned" >> file_list done # Merge chromsome-wide files into a single, genome-wide file plink --merge-list file_list --make-bed --out genotypes_genome_hapgen.controls for chr in `seq 1 22`; do plink --bfile genotypes_chr${chr}_hapgen.controls.pruned \ --exclude genotypes_genome_hapgen.controls-merge.missnp \ --make-bed \ --out genotypes_chr${chr}_hapgen.controls.nodup echo -e "genotypes_chr${chr}_hapgen.controls.nodup" >> file_list_nodup done plink --merge-list file_list_nodup --allow-no-sex \ --make-bed \ --out genotypes_genome_hapgen.controls # compute kinship plink --bfile genotypes_genome_hapgen.controls\ --make-rel square \ --out genotypes_genome_hapgen.controls.grm ``` ```{r, eval=FALSE} # load kinship data, add small value to diagonal for numerical stability and # do an eigendecomposition indir <- "~/data/hmeyer/Supplementary/CEU.0908.impute.files" kinship <- read.table(paste(indir, "/genotypes_genome_hapgen.controls.grm.rel",sep=""), sep="\t", header=FALSE) kinship <- as.matrix(kinship) diag(kinship) <- diag(kinship) + 1e-4 kinship_decomposed <- eigen(kinship) write.table(kinship, paste(indir, "/genotypes_genome_hapgen.controls.grm.rel", sep=""), sep="\t", col.names=FALSE, row.names=FALSE) write.table(kinship_decomposed$values, paste(indir, "/genotypes_eigenvalues", sep=""), sep="\t", col.names=FALSE, row.names=FALSE) write.table(kinship_decomposed$vectors, paste(indir, "/genotypes_eigenvectors", sep=""), sep="\t", col.names=FALSE, row.names=FALSE) ``` ## Phenotype simulation via PhenotypeSimulator We simulate a phenotype set consisting of three traits for 1000 samples with ten genetic variant effects shared across all traits (`theta=1`; SNPs randomly sampled from the simulated genotypes) and four non-genetic covariate effects. Additional correlation between the phenotypes is simulated, induced by i) the genetic structure of the individuals (infinitesimal genetic effect through estimated kinship), ii) a correleated non-genetic effect (i.e. correlated measurement noise) and iii) an observational noise effect. For each effect component its proportion of the total genetic variance is specified (the NrSNPsChromsome in the `runSimulation` function are the values we obtained from "chr5=`wc -l $hapdir/genotypes_chr5_hapgen.controls.gen |cut -d " " -f 1`" etc. above). The non-genetic covariates consist of four independent components, two following a binomial and two following a normal distribution. The genetic variant effect was simulated to only show shared effects across traits. The remainder of the components were simulated with 80\% of their variance shared across all traits, while the rest remained independent. The total genetic variance accounted to 40\% leaving 60\% of variance explained by the noise terms. ```{r, eval=FALSE} indir <- "/homes/hannah/data/hmeyer/Supplementary/CEU.0908.impute.files" # specify directory to save data; if it doesn't exist yet, create i.e. needs # writing permissions datadir <- '/homes/hannah/tmp/phenotypeSimulator' if (!dir.exists(datadir)) dir.create(datadir, recursive=TRUE) # specify filenames and parameters totalGeneticVar <- 0.4 totalSNPeffect <- 0.1 h2s <- totalSNPeffect/totalGeneticVar kinshipfile <- paste(indir, "/genotypes_genome_hapgen.controls.grm.rel", sep="") genoFilePrefix <- paste(indir, "/genotypes_", sep="") genoFileSuffix <- "_hapgen.controls.gen" # simulate phenotype with three phenotype components simulation <- runSimulation(N = 1000, P = 3, cNrSNP=10, seed=43, kinshipfile = kinshipfile, format = "oxgen", genoFilePrefix = genoFilePrefix, genoFileSuffix = genoFileSuffix, chr = c(5,7,11), mBetaGenetic = 0, sdBetaGenetic = 0.2, theta=1, NrSNPsOnChromosome=c(533966, 503129, 428863), genVar = totalGeneticVar, h2s = h2s, phi = 0.6, delta = 0.2, rho=0.2, NrFixedEffects = 2, NrConfounders = c(2, 2), distConfounders = c("bin", "norm"), probConfounders = 0.2, genoDelimiter=" ", kinshipDelimiter="\t", kinshipHeader=FALSE, verbose = TRUE ) ``` ```{r, eval=FALSE, echo=FALSE} savedir <- paste(system.file("extdata", package="PhenotypeSimulator"), "/resultsSimulationAndLinearModel", sep="") saveRDS(simulation$phenoComponentsFinal, paste(savedir, "/simulation_phenoComponentsFinal.rds", sep="")) ``` Figure \ref{fig:heatmaps} shows heatmaps of the trait-trait correlation (Pearson correlation) of the final phenotype $\mathbf{Y}$ and its five phenotype components: genetic variant effects $\mathbf{XB}$, non-genetic covariate effects $\mathbf{WA}$, relatedness effects $\mathbf{U}$ (infinitesimal genetic effects), non-genetic correlated effects $\mathbf{T}$ and observational noise $\mathbf{\Psi}$. ```{r, echo=FALSE} savedir <- paste(system.file("extdata", package="PhenotypeSimulator"), "/resultsSimulationAndLinearModel", sep="") simulation <- list( phenoComponentsFinal=readRDS(paste(savedir, "/simulation_phenoComponentsFinal.rds", sep=""))) ``` ```{r heatmaps, fig.keep='all', fig.height=3, fig.width=7, eval=TRUE, fig.cap="\\label{fig:heatmaps}Heatmaps of the trait-by-trait correlation (Pearson correlation) of the simulated phenotype and its five phenotype components."} cor_phenotype <- reshape2::melt(cor(simulation$phenoComponentsFinal$Y)) cor_phenotype$type <- "Y" cor_genBg <- reshape2::melt(cor(simulation$phenoComponentsFinal$Y_genBg)) cor_genBg$type <- "U" cor_genFixed <- reshape2::melt(cor(simulation$phenoComponentsFinal$Y_genFixed)) cor_genFixed$type <- "XB" cor_noiseFixed <- reshape2::melt(cor(simulation$phenoComponentsFinal$Y_noiseFixed)) cor_noiseFixed$type <- "WA" cor_noiseBg <- reshape2::melt(cor(simulation$phenoComponentsFinal$Y_noiseBg)) cor_noiseBg$type <- "Psi" cor_correlatedBg <- reshape2::melt(cor(simulation$phenoComponentsFinal$Y_correlatedBg)) cor_correlatedBg$type <- "T" cor_components <- rbind(cor_phenotype, cor_genBg, cor_genFixed, cor_noiseBg, cor_noiseFixed, cor_correlatedBg) cor_components$type <- factor(cor_components$type, levels=c("Y", "WA", "XB", "U", "T", "Psi"), labels=c("bold(Y)", "bold(WA)", "bold(XB)", "bold(U)", "bold(T)", "bold(Psi)")) colnames(cor_components) <- c("TraitA", "TraitB", "Correlation", "type") # For prettier plotting cor_components$TraitA <- gsub("_", " ", cor_components$TraitA) cor_components$TraitB <- gsub("_", " ", cor_components$TraitB) p_corr <- ggplot(data=cor_components, aes(x=TraitA, y=TraitB, fill=Correlation)) p_corr <- p_corr + geom_tile() + facet_grid(~ type, labeller = label_parsed) + scale_x_discrete(expand = c(0,0)) + scale_y_discrete(expand = c(0,0)) + scale_fill_gradient2(low="#F2AD00", high="#5BBCD6" , mid="white", limits = c(-1,1), breaks = c(-1, -0.5, 0, 0.5, 1), labels = c(-1, -0.5, 0, 0.5, 1), guide=guide_colorbar(direction = "horizontal", title.position="top", title.hjust =0.5)) + xlab(" ") + ylab(" ") + theme_bw() + coord_fixed() + theme(axis.text.x = element_text(angle=90, vjust=0.5), legend.position = "bottom", strip.background = element_rect(fill="white", color="white"), axis.text =element_text(size=8), legend.title =element_text(size=8), legend.text =element_text(size=8), strip.text =element_text(size=8)) print(p_corr) ``` ```{r, echo=FALSE, eval=FALSE} ggsave(plot=p_corr, filename=paste(savedir, "/correlation_phenotype.pdf", sep=""), units="mm", width=86, height=60) ``` The phenotypes, kinship, non-genetic covariates and the intermediate phenotype components (such as shared and independendent genetic variant effects) are saved to datadir. In addition to saving all components in .csv format, we also specify gemma as an output format. This option will create phenotype, kinship and covariate files that can directly be used as input for GEMMA. ```{r, eval=FALSE} # Save phenotypes, kinship and non-genetic covariates in csv and # GEMMA-specific format outdirectory <- savePheno(simulation, directory = datadir, intercept_gemma=TRUE, format=c("csv", "gemma"), saveIntermediate=TRUE) ``` As the simulated genotype dataset was large, we chose to sample from the number of variants on the causal chromosomes and did not read the entire genotype data set into memory. As such, the genotypes themselves are not yet in GEMMA specific format. The following code chunk (obtained from the [GEMMA user manual](http://www.xzlab.org/software/GEMMAmanual.pdf)) reformats the oxford format (.gen/.sample), to the GEMMA (or BIMBAM format): ```{bash, eval=FALSE} for chr in `seq 1 22`; do cat $hapdir/genotypes_chr${chr}_hapgen.controls.gen | awk -v s=1000 \ '{ printf $2 "," $4 "," $5; for(i=1; i<=s; i++) printf "," $(i*3+3)*2+$(i*3+4); printf "\n"}' \ > $hapdir/genotypes_chr${chr}_hapgen.controls.gemma done ``` ## Genome-wide association study via GEMMA One application of *PhenotypeSimulator* is generating phenotypes for model evaluation. As a case study, we show the use of the simulated phenotypes to evaluate the power of multivariate (all phenotypes are analysed jointly) and univariate models (each phenotype is analyses separately) for GWAS where multiple traits per individual are available. The univariate and multivariate linear mixed model as implemented in [GEMMA](http://www.xzlab.org/software/GEMMAmanual.pdf) @Zhou2014 are briefly described below. **Univariate linear mixed model. ** In the univariate linear mixed model, the $N \times 1$ vector of a single phenotype from $N$ samples is modeled as the sum of a genetic fixed effect with the $N \times 1$ vector of genetic variant $\mathbf{x}$ and its effect size $\beta$, $\mathbf{W} = (\mathbf{w}_1, \dot, \mathbf{w}_c)$ is an $N \times C$ matrix of $C$ covariates (fixed effects) including a column of 1s (implemented in *PhenotypeSimulator* via intercept_gemma=TRUE); $\alpha$ is a $C$-vector of the corresponding coefficients including the intercept; $\mathbf{u}$ is an $N$-vector of random genetic effects; $\mathbf{e}$ is an $N$-vector of observational noise. $\sigma_g$ and $\sigma_e$ are the variance of the genetic and nosie random effects; $\mathbf{K}$ is the known $N \times N$ relatedness matrix and $\mathbf{I}_n$ is an $N \times N$ identity matrix. $MVN_N$ denotes the $N$-dimensional multivariate normal distribution. $$\mathbf{y} = \mathbf{x}\beta + \mathbf{W}\alpha + \mathbf{u} + \mathbf{e}$$ with $$\mathbf{u} \sim MVN_{N}(0, \sigma_g \text{K})$$ and $$\mathbf{e} \sim MVN_{N}(0, \sigma_e I_N)$$ **Multivariate linear mixed model. ** The univariate linear mixed model can be extended to the multivariate linear mixed model with: $$\mathbf{Y} = x\beta^T + \mathbf{WA} + \mathbf{U} + \mathbf{E}$$ with $$\mathbf{U} \sim MN_{N\times P}(0, \mathbf{K}, \mathbf{V}_g)$$ and $$\mathbf{E} \sim MN_{N\times P}(0, \mathbf{I}_N, \mathbf{V}_e)$$ where $\mathbf{Y}$ is the $N \times P$ vector of $P$ phenotype from $N$ samples, $\mathbf{x}$ the $N \times 1$ vector of genetic variant and its $P \times 1$ effect size vector $\mathbf{\beta}$. $\mathbf{W} = (\mathbf{w}_1, \dot, \mathbf{w}_c)$ is an $N \times C$ matrix of $C$ covariates (fixed effects) including a column of 1s; $A$ is the $C \times P$-matrix of the corresponding coefficients including the intercept; $\mathbf{U}$ is an $N \times P$-matrix of random genetic effects; $\mathbf{E}$ is an $N \times P$-matrix of observational noise. $\mathbf{V}_g$ and $\mathbf{V}_e$ are the $P \times P$ symmetric matrices of the genetic and noise variance components; $\mathbf{K}$ is the known $N \times N$ relatedness matrix and $\mathbf{I}_n$ is an $N \times N$ identity matrix. $MN(\mu, \mathbf{V}_1, \mathbf{V}_2)$ denotes a matrix normal distribution with mean = $\mu$, the column covariance $\mathbf{V}_1$ and the row covariance $\mathbf{V}_2$. We use these models for the multi-trait analyses of the three simulated phenotypes (multivariate linear mixed model, flags: -lmm 2 and -n 1 2 3) and the independent analyses of each of the three phenotypes (univariate linear mixed model, flags: -lmm 2 and -n 1 or -n 2 or -n 3). We disable the automatic minor allele frequency filtering by setting -maf 0. The following code chunk shows the GEMMA (version 0.96) setup to run these models on a per chromosome basis ```{bash, eval=FALSE} hapdir=/homes//hannah/data/hmeyer/Supplementary/CEU.0908.impute.files datadir=/homes/hannah/tmp/phenotypeSimulator for chr in `seq 1 22`; do # multivariate linear mixed model across all three traits gemma0.96 \ -g $hapdir/genotypes_chr${chr}_hapgen.controls.gemma \ -p $datadir/Ysim_gemma.txt \ -d $hapdir/genotypes_eigenvalues \ -u $hapdir/genotypes_eigenvectors \ -c $datadir/Covs_gemma.txt \ -maf 0 \ -km 1 \ -lmm 2 \ -n 1 2 3 \ -outdir $datadir \ -o GWAS_gemma_LMM_mt_chr$chr # univariate linear mixed model for trait 1 gemma0.96 \ -g $hapdir/genotypes_chr${chr}_hapgen.controls.gemma \ -p $datadir/Ysim_gemma.txt \ -d $hapdir/genotypes_eigenvalues \ -u $hapdir/genotypes_eigenvectors \ -c $datadir/Covs_gemma.txt \ -maf 0 \ -km 1 \ -lmm 2 \ -n 1 \ -outdir $datadir \ -o GWAS_gemma_LMM_st_trait1_chr$chr # univariate linear mixed model for trait 2 gemma0.96 \ -g $hapdir/genotypes_chr${chr}_hapgen.controls.gemma \ -p $datadir/Ysim_gemma.txt \ -d $hapdir/genotypes_eigenvalues \ -u $hapdir/genotypes_eigenvectors \ -c $datadir/Covs_gemma.txt \ -maf 0 \ -km 1 \ -lmm 2 \ -n 2 \ -outdir $datadir \ -o GWAS_gemma_LMM_st_trait2_chr$chr # univariate linear mixed model for trait 3 gemma0.96 \ -g $hapdir/genotypes_chr${chr}_hapgen.controls.gemma \ -p $datadir/Ysim_gemma.txt \ -d $hapdir/genotypes_eigenvalues \ -u $hapdir/genotypes_eigenvectors \ -c $datadir/Covs_gemma.txt \ -maf 0 \ -km 1 \ -lmm 2 \ -n 3 \ -outdir $datadir \ -o GWAS_gemma_LMM_st_trait3_chr$chr done ``` Subsequently, we extract the columns of interest for the power analysis (simply based on p-values). The last column contains the p-value of th analyses, the second column contains the genetic variant identifier. We extract these columns and write the per-chromosome files into a genome-wide file. ```{bash, eval=FALSE} columns_LMM_st=`head -1 $datadir/GWAS_gemma_LMM_st_trait3_chr22.assoc.txt|wc -w` columns_LMM_mt=`head -1 $datadir/GWAS_gemma_LMM_mt_chr22.assoc.txt | wc -w` for chr in `seq 1 22`; do tail -n +2 $datadir/GWAS_gemma_LMM_st_trait1_chr${chr}.assoc.txt | cut \ -f 2,$columns_LMM_st >> $datadir/GWAS_gemma_LMM_st_trait1_pvalues.txt tail -n +2 $datadir/GWAS_gemma_LMM_st_trait2_chr${chr}.assoc.txt | cut \ -f 2,$columns_LMM_st >> $datadir/GWAS_gemma_LMM_st_trait2_pvalues.txt tail -n +2 $datadir/GWAS_gemma_LMM_st_trait3_chr${chr}.assoc.txt | cut \ -f 2,$columns_LMM_st >> $datadir/GWAS_gemma_LMM_st_trait3_pvalues.txt tail -n +2 $datadir/GWAS_gemma_LMM_mt_chr${chr}.assoc.txt | cut \ -f 2,$columns_LMM_mt >> $datadir/GWAS_gemma_LMM_mt_pvalues.txt done ``` The quantile-quantile plots of p-values observed from the univariate and multivariate LMMs fitted to all genome-wide SNPs are depicted in Figure \ref{fig:qqplot}. ```{r, eval=FALSE} datadir="/homes/hannah/tmp/phenotypeSimulator" causalSNPs <- fread(paste(datadir, "/SNP_NrSNP10.csv", sep=""), header=TRUE, sep=",", stringsAsFactors=FALSE, data.table=FALSE) causalSNPnames <- colnames(causalSNPs) LMM_mt <- fread(paste(datadir, "/GWAS_gemma_LMM_mt_pvalues.txt", sep=""), data.table=FALSE, header=FALSE, stringsAsFactors=FALSE) LMM_st_trait1 <- fread(paste(datadir, "/GWAS_gemma_LMM_st_trait1_pvalues.txt", sep=""), data.table=FALSE, header=FALSE, stringsAsFactors=FALSE) LMM_st_trait2 <- fread(paste(datadir, "/GWAS_gemma_LMM_st_trait2_pvalues.txt", sep=""), data.table=FALSE, header=FALSE, stringsAsFactors=FALSE) LMM_st_trait3 <- fread(paste(datadir, "/GWAS_gemma_LMM_st_trait3_pvalues.txt", sep=""), data.table=FALSE, header=FALSE, stringsAsFactors=FALSE) LMM_mt <- LMM_mt[order(LMM_mt[,2]),] LMM_mt$expected <- stats::ppoints(nrow(LMM_mt)) LMM_mt$model <- "mvLMM" LMM_mt$trait <- "all" colnames(LMM_mt)[1:2] <- c("rsID", "observed") LMM_st_trait1 <- LMM_st_trait1[order(LMM_st_trait1[,2]),] LMM_st_trait1$expected <- LMM_mt$expected LMM_st_trait1$model <- "uvLMM" LMM_st_trait1$trait <- "trait1" colnames(LMM_st_trait1)[1:2] <- c("rsID", "observed") LMM_st_trait2 <- LMM_st_trait2[order(LMM_st_trait2[,2]),] LMM_st_trait2$expected <- LMM_mt$expected LMM_st_trait2$model <- "uvLMM" LMM_st_trait2$trait <- "trait2" colnames(LMM_st_trait2)[1:2] <- c("rsID", "observed") LMM_st_trait3 <- LMM_st_trait3[order(LMM_st_trait3[,2]),] LMM_st_trait3$expected <- LMM_mt$expected LMM_st_trait3$model <- "uvLMM" LMM_st_trait3$trait <- "trait3" colnames(LMM_st_trait3)[1:2] <- c("rsID", "observed") pvalues <- rbind(LMM_mt, LMM_st_trait1, LMM_st_trait2, LMM_st_trait3) pvalues$causal <- "all" pvalues$causal[which(pvalues$rsID %in% causalSNPnames)] <- "simulated causal" pvalues$causal <- factor(pvalues$causal, levels=c("all", "simulated causal")) p_qq <- ggplot(data=pvalues, aes(x=-log10(expected), y=-log10(observed))) p_qq <- p_qq + geom_point(aes(color=causal, shape=trait)) + scale_color_manual(values=c('darkgrey', '#1b9e77'), name="SNPs", guide=guide_legend(nrow = 2, title.position="top", title.hjust =0.5, override.aes = list(shape=21))) + scale_shape_manual(values=c(18, 15:17), name="Traits", guide=guide_legend(nrow = 2, override.aes = list(colour='darkgrey'), title.position="top", title.hjust =0.5)) + geom_abline(intercept=0, slope=1, col="black") + geom_point(data=dplyr::filter(pvalues, causal == "simulated causal"), color='#1b9e77', aes(shape=trait)) + xlab(expression(Expected~~-log[10](italic(p))) ) + ylab(expression(Observed~~-log[10](italic(p)))) + facet_grid(~ model) + theme_bw() + theme(strip.background = element_rect(fill="white"), axis.title = element_text(size=12), axis.text =element_text(size=10), legend.title =element_text(size=12), legend.text =element_text(size=10), strip.text =element_text(size=10), legend.position = "bottom", legend.direction = "horizontal") ``` ```{r echo=FALSE, eval=FALSE, out.width='100%'} ggsave(plot=p_qq, filename=paste(savedir, "/gwas_results_gemma_qqplot.png", sep=""), units="mm", width=86, height=100) ``` ```{r qqplot, echo=FALSE, fig.cap="\\label{fig:qqplot}Quantile-quantile plots of p-values observed from the multivariate linear mixed model (mvLMM, traits:all) and the univariate linear mixed models (uvLMM, traits: trait1/trait2/trait3) fitted to each of about eight million genome-wide SNPs (grey), including the ten SNPs for which a phenotype effect was modelled (green)." } #knitr::include_graphics("../docs/articles/Simulation-and-LinearModel_files/figure-html/qqplot-1.png") knitr::include_graphics("gwas_results_gemma_qqplot.png") ``` Counting the number of causal SNPs that both models can recover, we see that the multi-trait GWAS shows greater power compared to any of the single trait analyses. Here a total of four causal SNPs (Figure \ref{fig:qqplot}) compared the combined count of the single-trait GWAS of three was detected with a p-values of less than the commonly applied GWAS threshold of $5 \times 10^{-8}$ (Table \ref{tab:sigSNPs}). The p-values of the three univariate analyses were adjusted for multiple hypothesis testing (Bonferroni adjustment). ```{r, eval=FALSE} sigSNPs <- dplyr::filter(pvalues, causal == "simulated causal", observed < 5*10^-8) sigSNPs$observed_adjusted <- sigSNPs$observed sigSNPs$observed_adjusted[which(sigSNPs$trait != "all")] <- 3*sigSNPs$observed[which(sigSNPs$trait != "all")] sigSNPs <- dplyr::filter(sigSNPs, observed_adjusted < 5*10^-8) ``` ```{r, echo=FALSE, eval=FALSE} write.table(sigSNPs, paste(savedir, "/sigSNPs.csv", sep=""), col.names=TRUE, row.names=FALSE, quote=FALSE, sep=",") ``` ```{r sigSNPs, echo=FALSE} sigSNPs <- read.table(paste(savedir, "/sigSNPs.csv", sep=""), header=TRUE, sep=",") knitr::kable(sigSNPs[, c(1,4,5,7)], caption="\\label{tab:sigSNPs}P-values from multi-trait GWAS (model == mvLMM) and single-trait GWAS (model == uvLMM)", table.attr = "id=\"sigSNPs\"", digits=40) ``` The ability of linear (mixed) models to detect the SNPs for which a phenotype effect was modelled depends on the allele frequencies of these SNPs and the effect size @Cohen1992,@Halsey2015: the higher the effect size and/or the allele frequencies the better the power to detect the SNP effects. The p-values of all SNPs with simulated effect on the phenotypes in relation to their allele frequencies and simulated effect sizes are depicted in Figure \ref{fig:freq-beta}. There is a strong trend for SNPs with high allele frequencies and large simulated effect sizes to have low p-values. ```{r, eval=FALSE} # Read the SNPs simulated to have an effect on the phenotype and their SNP # effects SNPs <- read.csv(paste(datadir,"/SNP_NrSNP10.csv", sep=""), row.names=1) SNPeffect <- read.csv(paste(datadir,"/SNP_effects_NrSNP10.csv", sep=""), row.names=1) # HapGen simulated SNPs might be encoded as 7-25481146 , which R converts to # X7.25481146 via make.names in read.table; substitute X and . to get original # names SNPnames <- gsub("\\.", "-", gsub("X", "", colnames(SNPs))) SNPnames <- SNPnames[order(SNPnames)] simulated_withEffect <- dplyr::filter(pvalues, rsID %in% SNPnames) ``` ```{r, echo=FALSE, eval=FALSE} write.table(simulated_withEffect, paste(savedir,"/simulated_withEffect.csv", sep=""), col.names=TRUE, quote=FALSE,row.names=FALSE) ``` ```{r, echo=FALSE, eval=TRUE} SNPeffect <- read.csv(paste(savedir,"/SNP_effects_NrSNP10.csv", sep=""), row.names=1) SNPs <- read.csv(paste(savedir,"/SNP_NrSNP10.csv", sep=""), row.names=1) SNPnames <- gsub("\\.", "-", gsub("X", "", colnames(SNPs))) SNPnames <- SNPnames[order(SNPnames)] simulated_withEffect <- read.csv(paste(savedir,"/simulated_withEffect.csv", sep="")) ``` ```{r freq-beta, fig.cap="\\label{fig:freq-beta}P-values, allele frequencies and simulated effect sizes of the ten SNPs simulated to have an effect on phenotype."} # get the allele frequencies of the SNPs simulated to have an effect on the # phenotype freq <- apply(SNPs, 2, getAlleleFrequencies) minor <- apply(freq, 2, min) minor <- minor[order(gsub("X", "", names(minor)))] # format the effect size table and combine with p-value and frequency # information effects <- reshape2::melt(SNPeffect, value.name="beta", variable.name="name") effects$type <- gsub('\\..*', '', effects$name) effects$rsID <- gsub("\\.", "-", gsub('.*_', '', effects$name)) effects$trait <- rep(paste("trait", 1:3, sep=""), 10) effects <- effects[order(effects$rsID),] effects <- dplyr::filter(effects, rsID %in% simulated_withEffect$rsID) simulated_withEffect <- simulated_withEffect[order(simulated_withEffect[, 1]),] simulated_withEffect$beta <- NA simulated_withEffect$beta[simulated_withEffect$trait != "all"] <- effects$beta simulated_withEffect$freq <- rep(minor[which(SNPnames %in% effects$rsID)], each=4) # plot association p-values in relation to the absolute value of the simulated # effect sizes and the SNPs allele frequencies p <- ggplot(dplyr::filter(simulated_withEffect, trait != "all")) p <- p + geom_point(aes(x=freq, y=-log10(observed), color=abs(beta), shape=trait)) + geom_hline(yintercept=-log10(5*10^(-8)), color="darkgrey") + scale_color_gradientn(colours=c('#f0f9e8','#ccebc5','#a8ddb5','#7bccc4', '#4eb3d3','#2b8cbe','#08589e'), name=expression("| simulated" ~beta~ "|")) + scale_shape_discrete(name="Phenotype in GWAS") + xlab("Allele frequencies") + ylab(expression(-log[10](p-value))) + theme_bw() print(p) ``` ```{r echo=FALSE, eval=FALSE, out.width='100%'} ggsave(plot=p, filename=paste(savedir, "/effectsizes-freq-pvalues.pdf", sep=""), units="mm", width=150, height=120) ``` ## References