--- title: "Using PhenotypeSimulator to simulate phenotypes based on an example dataset" author: "Hannah Meyer" date: "`r Sys.Date()`" output: pdf_document: fig_caption: yes toc: true toc_depth: 2 bibliography: Vignette-example-data.bib csl: plos-genetics.csl header-includes: - \usepackage[bf]{caption} - \usepackage{float} - \floatplacement{figure}{H} vignette: > %\VignetteIndexEntry{Simulation based on example data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` *PhenotypeSimulator* allows for the simulation of high-dimensional phenotypes comprised from a number of different components: genetic variant and population- genetic effects, non-genetic covariates, correlated and observational noise effects. In this vignette, we show the usefulness of being able to specify different phenotype components with distinct properties. We demonstrate how different phenotype components relate directly to a real high-dimensional, image-derived data set of cardiac morphology. Figure \ref{fig:wallthickness} shows the average left ventricular wall thickness of 1,185 healthy volunteers measured at more than 27,000 positions in the left ventricle of the heart @deMarvao2014 @Biffi2017. ```{r wall, echo=FALSE, fig.cap="\\label{fig:wallthickness}**Left ventricular wall thickness and contributing factors.** Left ventricular wall thickness was measured at more than 27k positions in a cohort of 1,185 healthy volunteers. The average wall thickness at each position is depicted in saturated colours (the mesh depicts the right ventricle for reference). Heart wall thickness is determined by an interplay of genetic (green arrows) and non-genetic factors (grey arrows). Additional correlation between positional thickness can be observed for measures in close spatial proximity.", fig.align='center'} #knitr::include_graphics("../docs/articles/SimulationBasedOnExampleData_files/figure-html/wall-1.png") knitr::include_graphics("wallthickness-min.png") ``` These cardiac morphology measurements are likely shaped by a combination of direct genetic variant effects, population genetic effects and non-genetic covariates. In addition, other influences like image acquisition and processing can contribute to the final measurements - these effects are summarised as observational noise. Furthermore, by the nature of the measurements i.e. wall thickness of an approximately smooth ventricular wall, measurements in close spatial proximity will be strongly correlated. The following sections will show the distribution of additional cohort measurements like height and age (non-genetic covariates) and the correlation of the measurements by proximity. In analogy to those, similar components will then be simulated with *PhenotypeSimulator*. ## Cardiac morphology data The following section depicts data from 100 randomly selected participants of the 1,185 healthy volunteers of the original study. From the 27k heart wall thickness measurements, a subset of 1,000 measurements ie. phenotypes was chosen to depict the correlation pattern. ```{r load real data} library(PhenotypeSimulator) library(ggplot2) phenofile <- system.file("extdata/vignettes/pheno_small.rds", package = "PhenotypeSimulator") covsfile <- system.file("extdata/vignettes/covs.rds", package = "PhenotypeSimulator") pheno <- readRDS(phenofile) covs <- readRDS(covsfile) dim(pheno) dim(covs) colnames(covs) ``` As example covariates of the dataset, age, height, weights, sex and the number slices in the 2D cardiac MRI were chosen (see original publication for details on phenotypes @deMarvao2014). Figure \ref{fig:distribution_covs} shows the distribution of these covariates across the 100 individuals. ```{r distribution, fig.cap="\\label{fig:distribution_covs}**Distribution of covariates.** Histograms of the reduced cohort data (100 individuals) for sex, age, height, weight and number of 2D cardiac MRI slices.", fig.align='center'} covs_melt <- reshape2::melt(covs, value.name="meassure", variable.name="covariate") covs_melt$covariate <- factor(covs_melt$covariate, levels=c("Age", "Height", "Weight", "Sex", "Slices"), labels= c("Age [years]", "Height [cm]", "Weight [kg]", "Sex", "Slices")) p_cont <- ggplot(dplyr::filter(covs_melt, covariate %in% c("Age [years]", "Height [cm]", "Weight [kg]")), aes(x=meassure)) p_cont <- p_cont + geom_histogram(binwidth=1) + facet_wrap(~covariate, scales = "free") + theme_bw() + theme(axis.title.x=element_blank(), strip.background = element_rect(fill="white", color="black")) p_cat <- ggplot(dplyr::filter(covs_melt, covariate %in% c("Sex", "Slices")), aes(x=meassure)) p_cat <- p_cat + geom_bar(aes(x=meassure)) + facet_wrap(~covariate, scales = "free") + theme_bw() + theme(axis.title.x=element_blank(), strip.background = element_rect(fill="white", color="black")) combine <- cowplot::plot_grid(p_cont, p_cat, nrow=2) print(combine) ``` The correlation of the 1,000 left-ventricular wall thickness measurements is shown in Figure \ref{fig:corpheno}. While we don't know the exact influences of each covariate, the genetic effects or the observational noise, we can clearly see that there is spatial correlation effect, where positions in proximity (close to the diagonal) in general are more correlated than more distant positions. ```{r correlation phenotypes, eval=FALSE} cor_pheno <- cor(pheno) gplots::heatmap.2(cor_pheno, trace="none", dendrogram="none", keysize=1, col=colorRampPalette(c("white", "#5BBCD6")), labRow=FALSE, labCol=FALSE, breaks=seq(0,1,0.001), key.title="", key.xlab="Pearson's correlation coefficient", key.ylab="", density.info="none") ``` ```{r pheno, echo=FALSE, out.width='\\textwidth', fig.align='center',fig.cap="\\label{fig:corpheno}\\textbf{Correlation of left ventricular wall thickness measurements.}"} cor_pheno <- cor(pheno) #knitr::include_graphics("../docs/articles/SimulationBasedOnExampleData_files/figure-html/pheno-1.png") knitr::include_graphics("cor_pheno-min.png") ``` ## Simulation For the simulation of similar phenotype components as shown above, we make the assumption that the above distributions approximate a uniform (age), normal (weight, height) and binomial (sex) distribution. The slices are assumed to be uniformly drawn from four categories. In the following, we extract the relevant distribution parameters: mean and standard deviation for normal distribution, range for uniform distribution, and proportion of ones from binomial distribution. To approximate the spatial correlation, we get the 0.999 quantile of the correlation values. ```{r data parameters } age_range <- range(covs$Age) weight_mean <- mean(covs$Weight) weight_sd <- sd(covs$Weight) height_mean <- mean(covs$Height) height_sd <- sd(covs$Height) slices_categories <- length(unique(covs$Slices)) sex_proportion_male <- length(which(covs$Sex==1))/nrow(covs) correlation <- quantile(unlist(cor_pheno[lower.tri(cor_pheno)]), 0.999) ``` We use these distribution parameters to simulate non-genetic covariate effects with *PhenotypeSimulator::noiseFixedEffects*. For the simulation of the spatial correlation effect, we use the above quantile cut-off for the construction of the correlation matrix with *PhenotypeSimulator::correlatedBgEffects*. ```{r simulate data} set.seed(25) covs_simulated <- noiseFixedEffects(N=100, P=1000, NrFixedEffects = 5, NrConfounders = 1, distConfounders = c("unif", "norm", "norm", "cat_unif", "bin"), mConfounders = c(mean(age_range), weight_mean, height_mean), sdConfounders = c(age_range[2] -mean(age_range), weight_sd, height_sd), probConfounders = sex_proportion_male, catConfounders = slices_categories) background_simulated <- correlatedBgEffects(N=100, P=1000, pcorr=correlation) ``` Figure \ref{fig:distribution_simulated} and \ref{fig:corsimulated} show the simulated covariates and spatial correlation based on the parameters from the real data. While the distributions are not a perfect replication (as expected with small sample sizes and our initial assumptions about the true underlying distributions of the covariates), they will serve as good approximations for simulating similar datasets as the one we based the simulations on. ```{r simulation, fig.cap="\\label{fig:distribution_simulated}**Distribution of simulated covariates.**", fig.align='center'} covs_simulated_melt <- reshape2::melt(covs_simulated$cov, value.name="meassure", variable.name="covariate") covs_simulated_melt$covariate <- factor(covs_simulated_melt$covariate, levels=c("sharedConfounder1_unif1", "sharedConfounder2_norm1", "sharedConfounder3_norm1", "sharedConfounder5_bin1", "sharedConfounder4_cat_unif1"), labels= c("Age simulated", "Height simulated", "Weight simulated", "Sex simulated", "Slices simulated")) p_cont <- ggplot(dplyr::filter(covs_simulated_melt, covariate %in% c("Age simulated", "Height simulated", "Weight simulated")), aes(x=meassure)) p_cont <- p_cont + geom_histogram(binwidth=1) + facet_wrap(~covariate, scales = "free") + theme_bw() + theme(axis.title.x=element_blank(), strip.background = element_rect(fill="white", color="black")) p_cat <- ggplot(dplyr::filter(covs_simulated_melt, covariate %in% c("Sex simulated", "Slices simulated")), aes(x=meassure)) p_cat <- p_cat + geom_bar(aes(x=meassure)) + facet_wrap(~covariate, scales = "free") + theme_bw() + theme(axis.title.x=element_blank(), strip.background = element_rect(fill="white", color="black")) combine <- cowplot::plot_grid(p_cont, p_cat, nrow=2) print(combine) ``` ```{r correlation simulated background, eval=FALSE} gplots::heatmap.2(cor(background_simulated$correlatedBg), trace="none", dendrogram="none", keysize=1, col=colorRampPalette(c("white", "#5BBCD6")), labRow=FALSE, labCol=FALSE, breaks=seq(0,1,0.001), key.title="", key.xlab="Pearson's correlation coefficient", key.ylab="", density.info="none") ``` ```{r background, load simulated background, echo=FALSE, out.width='\\textwidth', fig.align='center', fig.cap="\\label{fig:corsimulated}\\textbf{Correlation of simulated spatial proximity.}"} #knitr::include_graphics("../docs/articles/SimulationBasedOnExampleData_files/figure-html/background-1.png") knitr::include_graphics("cor_simulated-min.png") ``` \newpage ## References