knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(mousegwas)
This package was built for easier deployment of GWAS in mouse panels. Several mouse panels have been genotyped using different arrays like the Mouse Diversity Array (MDA) or the GigaMUGA (GMUGA), a lot of which are available through The Jackson Laboratory Mouse Phenome Database (MPD: https://phenome.jax.org/about/snp_retrievals_help). This package use the genotyping downloaded from MPD to run GWAS on mouse phenotypes using GEMMA or pyLMM and plot the results.
The phenotypes should be given as a csv file with at least one column to define the strain and other columns for the phenotypes and the covariates, a companion yaml file will describe the input data.
The yaml file should contain the following data:
phenotypes: OFA_Groom_first5m: papername: GrTime5m group: Grooming Quantity
strain: Strain
groups: - Activity - Anxiety - Grooming Pattern - Grooming Quantity
F1: - B6129SF1/J: - C57BL/6J - 129S1/SvImJ - B6129PF1/J: - C57BL/6J - 129P3/J - B6AF1/J: - C57BL/6J - A/J
covar: - Sex - BodyLength
sex: Sex
translate: - BTBRT...tf.J: BTBR T<+> Itpr3<tf>/J - 129S1.SvImJ: 129S1/SvImJ - MSM/MsJ: none - MOLF/EiJ: none
coat: - 129P3/J: albino - 129S1/SvImJ: white-bellied agouti - 129X1/SvJ: albino
*confSNPs: Confounding SNPs. A list of SNPs that will be used as covariates in the GWAS. This is useful to remove a massive selection factor. e.g.:
confSNPs: - rs32105080
The script for running the GWAS is called run_GWAS.R
Mouse panel GWAS is different than human GWAS, the LD blocks are wider, the population structure is biased and the individuals are usually homozygous. It is unclear how many individuals from each strain should be included in the analysis so we implemented an option to select a defined number of individuals from each Strain x Sex combination or to get the average value of each strain while controlling for the covariates by taking the residuals of a linear model solving each phenotype with the covariates.
This option can be specified using the parameter --downsample
or -d
with the number of individuals to select from each Strain x Sex group or 0 for average.
If the input table contains strain names that are different than the names in the genotypes table then a translation dictionary can be given in the yaml file (phenotype -> genotype). When F1 strains are phenotyped then names of the parent strains should be specified in the yaml file under the F1 term, the female parent should be the first one to determine the X chromosome.
After the table of phenotypes is ready for the individuals that will be used in the GWAS, the values are being scaled to have a mean of zero and variance of one. A quantile-quantile normalization is also optional with the --qqnorm
option. Boolean phenotypes are not being scaled.
SNPs can be filtered using the parameters --missing
for the minimal fraction of missing data and --MAF
for the minimal allele frequency threshold.
By default the software for running the GWAS will be GEMMA, if it is not available it will be downloaded. The model will be LMM with all tests (-lmm 4
). pyLMM is also an option and can be selected with -m pyLMM
or --method pyLMM
, although pyLMM should be installed independently, if the executables are not available in the default path they can be given using --pylmm/-p pylmmGWAS.py
and --pylmmkinship/-k pylmmKinship.py
.
Multivariate GWAS is implemented in GEMMA and can be used by specifying the groups of the phenotypes in the yaml file. Multivariate will be used by default on top of the univariate analysis unless disabled with --nomv
.
If multiple penotypes are being tested, their results can be combined with meta-analysis. The meta-analysis algorithm used here is METASOFT. Care should be taken when using METASOFT as it assumes the beta factors have the same signs. If two phenotypes are anti-correlated then their p-values will not be combined, only the strongest will eventually be used for the combined p-values. Deploying METASOFT can be done with h --runmetaasoft
. After running METASOFT the lambda_mean and lambda_hetero scaling parameters can be obtained from the log file and given back using the --labmda_mean
and --lambda_hetero
parameters.
Coat colors defined in the yaml file can be used either as covariates with --coat_covar
or as phenotypes with --coat_phenotype
in which case no input file should be given. Another option to consider coat color as covariates is by specifying the SNPs in coat color determining genes using the confSNPs
feature in the yaml file.
If the --shuffle
option is selected, the phenotypes values will be shuffled among all input individuals and the GWAS will be carried on as normal after that. Make sure to change --seed
to avoid running the same shuffling more than once when trying to obtain an empirical p-value.
The plotting part was adjusted for publication and can be changed as desired. It can be used with the postprocess_mv.R script with the output directory of run_GWAS.R as input (--outdir
or -o
) and the directory to put the figures in (--plotdir
or -p
).
First, the script will plot simple Manhattan plots with the threshold for significant SNP as defined by --pvalthr
($-log_{10}(\tt P-value))$). To remove SNPs in the same LD blocks, correlated SNPs are removed after the best SNP is selected, the correlation threshohld ($r^{2}$) can be changed with --peakcorr
, width of the peak region is limited by --ldpeakdist
given as mega base-pairs (Mbp).
Each parameter can be assigned a group. All phenotypes in a group can be combined together and ran as a multivariate GWAS in run_GWAS.R
if used without --nomv
. Another option to combine phenotypes is to take the minimal p-value of each SNP across all phenotypes in the group. This can be done by applying the --nomv
flag when running postprocess_mv.R
. It will also add a Manhattan plot that combines all of the phenotypes.
After the peaks were selected for each phenotype, all the peak SNPs from all of the phenotypes are being collected and the p-values of these SNPs in all the phenotypes are used for clustering the peaks using k-means algorithm. The number of clusters can be set using the --clusters
option and is defaulted for seven clusters. Each cluster gets its own color, the p-values table is being plotted as a heatmap along with the cluster colors. After assigning clusters for the peak SNPs, the Manhattan plots are being repainted with the cluster colors. NPSs correlated to the peak SNP will get the same color and their transparency will be proportional to the correlation coefficient.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.