--- title: "Genotype simulation supported by *PhenotypeSimulator*" author: "Hannah Meyer" date: "`r Sys.Date()`" output: pdf_document bibliography: genotype-simulation.bib csl: plos-genetics.csl vignette: > %\VignetteIndexEntry{Supported Genotype Simulation} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo = FALSE, message=FALSE} library(knitr) knitr::opts_chunk$set(collapse = TRUE, comment = "#>") ``` There are a number of different strategies to generate genotype data for genetic association studies. 1. **Simple bi-allelic SNPs without structure**: In the most simple case and assuming bi-allelic SNPs, each SNP is simulated from a binomial distribution with two trials and probability equal to the given allele frequencies. This simple approach does not simulate any dependency between the genotypes as is observed with LD structure in the genome. Simple bi-allelic SNPs can be simulated with PhenotypeSimulator::simulateGenotypes or via PLINK @Chang2015. For large datasets (~ 1 million SNPs), simulation via PLINK is recommended. 1. **Coalescent approaches**: Coalescent methods simulate genealogical events backward in time. These events typically include the coalescence of two sequences into a single ancestral lineage, recombination within a sequence or migration between populations. A coalescent-based approach to simulate whole genome data is implemented in GENOME @Liang2007 and its output format is supported as an input format for *PhenotypeSimulator*. 1. **Forward-time approaches**: Forward-time simulation methods evolve a population forward in time, subject to arbitrary genetic and demographic factors @Peng2010. Several algorithms are available, many of which allow customisation by building the simulation scheme in R or python, hence the output format of the genotypes can be specified by the user. *PhenotypeSimulator* offers reading genotypes from delimited-files and as such, any genotypes generated by these programmes can be read (e.g. MetaSim @Strand2002 or simuPop @Peng2005). 1. **Resampling approaches**: Resampling-based approaches combine existing genotype data into the genotypes of the simulated samples, thereby retaining allele frequency and LD patterns @Wright2007. A standard resampling-based approach that uses common genotype formats (common for the [oxford genetics format](http://www.stats.ox.ac.uk/~marchini/software/gwas/file_format.html)) is Hapgen2 and an example usage is given below. Examples for genotype simulation via PLINK, GENOME and Hapgen2 are given below. The corresponding data can be found in the extdata folder. ## Simple bi-allelic SNPs without structure via PLINK Download [PLINK](https://www.cog-genomics.org/plink/1.9/) and create a folder for the PLINK output files. The example below uses PLINK to simulate 1000 SNPs, with allele frequencies between 0 and 1 for 100 controls and 0 cases. PhenotypeSimulator::readStandardGenotypes reads the resulting .bim, .fam, .bed files. ```{bash, eval=FALSE} # Serves as output directory for simulation and parameter file mkdir -p ~/PhenotypeSimulator/inst/extdata/genotypes/plink cd ~/PhenotypeSimulator/inst/extdata/genotypes/plink # Write paramter file to simulate 1000 SNPs, with allele frequencies between 0 # and 1 for 100 controls and 0 cases echo -e "1000\tSNP\t0.00\t1.00\t1.00\t1.00" > plink_sim_par.txt plink --simulate plink_sim_par.txt \ --simulate-ncases 0 \ --simulate-ncontrols 100 \ --out genotypes_plink ``` ## Coalescent simulation via Genome Download [GENOME](http://csg.sph.umich.edu/liang/genome/download.html) and create a folder for the GENOME output files. The example below uses GENOME to simulate genetic data for a population comprised of three sub-populations with 30, 30 and 40 samples. The simulated POPULATION PROFILE, SNP positions and the genotypes are all saved in genotypes_genome.txt. PhenotypeSimulator::readStandardGenotypes reads genotype information by parsing this output file and extracting the samples information following the line 'Samples:' ```{bash, eval=FALSE} mkdir -p ~/PhenotypeSimulator/inst/extdata/genotypes/genome cd ~/PhenotypeSimulator/inst/extdata/genotypes/genome # subpopulation with 30, 30 and 40 individuals each genome -pop 3 30 30 40 > genotypes_genome.txt ``` ## Resampling-based simulation via Hapgen2 Download [hapgen2](http://mathgen.stats.ox.ac.uk/genetics_software/hapgen/hapgen2.html), [hapgen example files](http://mathgen.stats.ox.ac.uk/genetics_software/hapgen/download/example/hapgen2.example.tgz) and the 1000Genomes data from the [impute2 webpage](https://mathgen.stats.ox.ac.uk/impute/impute_v1.html#Using_IMPUTE_with_the_HapMap_Data) as starting point for the resampling-based genotype simulation with hapgen2. For formating of the 1000Genomes data, have a look at this [vignette](https://hannahvmeyer.github.io/PhenotypeSimulator/articles/Simulation-and-LinearModel.html#genotype-simulation-via-hapgen2). The example files contain example haplotypes (ex.haps), legend files (ex.leg) map (ex.map) and TagSNP files (ex.tags). The following examples simulates genotype data for 100 controls and 0 cases with 1000 SNPs from chromosome 1. PhenotypeSimulator::readStandardGenotypes reads the resulting .gen and .samples files. ```{bash, eval=FALSE} # contains ex.leg file and serves as output directory mkdir -p ~/PhenotypeSimulator/inst/extdata/genotypes/hapgen cd ~/PhenotypeSimulator/inst/extdata/genotypes/hapgen # contains 1000Genomes haplotype data used for sampling hapdir=/path/to/CEU.0908.impute.files hapgen2 -m $hapdir/genetic_map_chr1_combined_b37.txt \ -l ex.leg \ -h $hapdir/chr1.ceu_subset.hap \ -o genotypes_hapgen \ -n 100 0 \ -dl 45162 0 0 0 \ -no_haps_output ``` ## References