
Objectives

Learn how to visualize missing genotypes in your genomic dataset with the function grur::missing_visualization (time = 15 min).

Workflow

The function missing_visualization in grur accepts various genomic input files and conducts identity-by-missingness (IBM) analyses using Principal Coordinates Analysis (PCoA), also called Multidimensional Scaling (MDS), and Redundancy Analysis (RDA) to highlight missing data patterns. Figures and summary tables of missing information at the marker, individual and population levels are generated. The simplest form of the function is shown below. More options are available; please see the function documentation.
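To make the idea of identity-by-missingness more concrete, here is a minimal base R sketch on simulated data. It is not grur's internal code: the toy genotypes, the Euclidean distance and the object names are all assumptions chosen for illustration. The point is that the ordination is run on the pattern of missing genotypes (1 = missing, 0 = genotyped), not on the genotypes themselves.

```r
# Toy illustration of identity-by-missingness (IBM); not grur's internal code
set.seed(42)

n.ind <- 20       # hypothetical number of individuals
n.markers <- 200  # hypothetical number of markers

# Simulated genotypes coded 0/1/2 (count of the alternate allele)
geno <- matrix(
  sample(0:2, n.ind * n.markers, replace = TRUE),
  nrow = n.ind,
  dimnames = list(paste0("ind", seq_len(n.ind)), NULL)
)

# Introduce missing genotypes: about 5% everywhere,
# plus one individual (ind1) missing at least 90% of its genotypes
geno[sample(length(geno), round(0.05 * length(geno)))] <- NA
geno["ind1", sample(n.markers, 180)] <- NA

# Missingness matrix: 1 = missing, 0 = genotyped
miss <- is.na(geno) * 1L

# Pairwise distances between individuals, based only on missingness
# (Euclidean distance is an assumption made for this sketch)
ibm.dist <- stats::dist(miss, method = "euclidean")

# PCoA (classical MDS) on the missingness distances
pcoa <- stats::cmdscale(ibm.dist, k = 2, eig = TRUE)
coords <- pcoa$points
colnames(coords) <- c("PCo1", "PCo2")

# cmdscale() centres the coordinates, so the individual farthest
# from the origin is the strongest missingness outlier
outlier <- names(which.max(rowSums(coords^2)))
outlier
```

Plotting coords (e.g. with plot(coords)) and colouring the points by population or sequencing lane is the quickest way to see whether missingness is structured by a grouping variable.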

Prepare your R workspace

Clean your desk

rm(list = ls())

Follow the instructions to install grur

Load the required libraries:

library("grur")

Set your working directory, e.g.:

setwd("~/Documents/test_missing_visualization_vignette")

Note: running code in chunks inside an R Notebook might cause problems; run it in the console instead (the default here).

Download the test data

Dataset: in this example, we use the data from Ferchaud and Hansen (2015, 2016). The code below gets the VCF directly from Dryad. You can skip this step if the file is already in your working directory.

# Download the VCF from Dryad and write it to the working directory
writeBin(
  httr::content(
    httr::GET("http://datadryad.org/bitstream/handle/10255/dryad.97237/sticklebacks_Danish.vcf?sequence=1"),
    "raw"
  ),
  "stickleback_ferchaud_2015.vcf"
)

With a VCF you also need a strata file (indicating population groupings).

To download the strata for this example: strata link
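If you need to build a strata file for your own data, the sketch below shows one way to write it from R. It assumes the two-column, tab-separated layout with the headers INDIVIDUALS and STRATA; please confirm the expected format in the function documentation. The sample IDs and the POP_B population are placeholders; in a real file the IDs should match the sample names in your VCF.

```r
# Hypothetical sample IDs; "KIB" is a population from the example data,
# "POP_B" is only a placeholder
strata <- data.frame(
  INDIVIDUALS = c("KIB_01", "KIB_02", "POP_B_01", "POP_B_02"),
  STRATA = c("KIB", "KIB", "POP_B", "POP_B"),
  stringsAsFactors = FALSE
)

# Write a tab-separated file (to a temporary folder for this sketch)
strata.file <- file.path(tempdir(), "strata.example.tsv")
utils::write.table(
  strata,
  file = strata.file,
  sep = "\t",
  quote = FALSE,
  row.names = FALSE
)

# Read it back to check the header and the number of individuals per group
check <- utils::read.delim(strata.file, stringsAsFactors = FALSE)
table(check$STRATA)
```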

Run the function

This is the simplest way to run the function:

ibm <- grur::missing_visualization(
  data = "stickleback_ferchaud_2015.vcf", 
  strata = "strata.stickleback.tsv")

The function does a few automatic filters:

* Monomorphic markers are removed
* Only common markers between strata are kept for the analysis
* Individuals and markers statistics are generated automatically
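The two filters can be sketched on a toy genotype matrix in base R. This is an illustration under stated assumptions, not the package's implementation: here a marker is called monomorphic when it shows fewer than two distinct genotypes, and "common" when it is genotyped in at least one individual of every stratum. The matrix, strata and object names are made up.

```r
# Toy genotypes (0/1/2, NA = missing): 6 individuals x 5 markers
geno <- matrix(
  c(0, 0,  1,  2,  0,
    0, 1,  1,  0, NA,
    0, 2,  0,  1,  1,
    0, 0,  2, NA,  2,
    0, 1, NA, NA,  0,
    0, 1,  1, NA,  1),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("ind", 1:6), paste0("m", 1:5))
)
strata <- c("A", "A", "A", "B", "B", "B")

# Filter 1: monomorphic markers (fewer than 2 distinct observed genotypes)
monomorphic <- apply(geno, 2, function(x) length(unique(x[!is.na(x)])) < 2)

# Filter 2: markers genotyped in at least one individual of every stratum
counts <- rowsum((!is.na(geno)) * 1L, group = strata)
common <- colSums(counts > 0) == nrow(counts)

# Keep polymorphic markers shared by all strata
keep <- !monomorphic & common
geno.filtered <- geno[, keep, drop = FALSE]
colnames(geno.filtered)
```

In this toy example, m1 is dropped because every individual has the same genotype, and m4 is dropped because it is never genotyped in stratum B.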

A new object ibm was created in your global environment. It's a list; to view its contents use:

names(ibm)

Lots of info in there... Let's focus on just a few elements. By default, the function generates a large object (a list) and also creates a folder automatically.

Visualization

To view the IBM-PCoA plot made with POP_ID grouping:

ibm$ibm.plots$ibm.strata.POP_ID

The dark green bubble from KIB is an individual with almost all of its genotypes missing. This one slipped under the authors' radar ;)

The heatmap showing missingness:

heatmap <- ibm$heatmap 
heatmap 

The vertical black line highlights the problem in the VCF: the individual missing almost all of its genotypes.

View the table summarizing missing genotypes per individual:

table.ind <- ibm$missing.genotypes.ind
table.ind
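For readers who want to see what such a per-individual summary boils down to, here is a small base R sketch on a toy genotype matrix. The column names (MISSING_GENOTYPE, MARKER_NUMBER, PERC) are assumptions made for this example and may differ from those in the table returned by the function.

```r
# Toy genotypes (0/1/2, NA = missing): 3 individuals x 4 markers
geno <- matrix(
  c( 0,  1, NA, 2,
    NA, NA, NA, 1,
     0,  0,  1, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("ind1", "ind2", "ind3"), paste0("m", 1:4))
)

# Count missing genotypes per individual and express them as a percentage
missing.ind <- data.frame(
  INDIVIDUALS = rownames(geno),
  MISSING_GENOTYPE = rowSums(is.na(geno)),
  MARKER_NUMBER = ncol(geno),
  row.names = NULL,
  stringsAsFactors = FALSE
)
missing.ind$PERC <- round(
  100 * missing.ind$MISSING_GENOTYPE / missing.ind$MARKER_NUMBER, 2
)

# Sort so that the individuals with the most missing data come first
missing.ind <- missing.ind[order(-missing.ind$PERC), ]
missing.ind
```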

To view the distribution:

ibm$missing.genotypes.ind.plots
ibm$missing.genotypes.ind.histo

Show the helper figure showing how many individuals could potentially be blacklisted based on the percentage of genotypes:

ibm$ind.genotyped.helper.plot
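The logic behind a helper figure of this kind can be sketched as a simple count over a range of thresholds. The values below are hypothetical percentages of missing genotypes, and the thresholds and column names are assumptions; the figure produced by the function may define its axes differently.

```r
# Hypothetical percentage of missing genotypes for 8 individuals
perc.missing <- c(
  ind1 = 2, ind2 = 5, ind3 = 12, ind4 = 18,
  ind5 = 25, ind6 = 33, ind7 = 48, ind8 = 96
)

# Candidate thresholds (maximum % of missing genotypes allowed)
thresholds <- seq(10, 90, by = 10)

# For each threshold, count the individuals that would be blacklisted
helper <- data.frame(
  THRESHOLD = thresholds,
  BLACKLISTED = vapply(
    thresholds,
    function(t) sum(perc.missing > t),
    integer(1)
  )
)
helper$RETAINED <- length(perc.missing) - helper$BLACKLISTED
helper
```

A table like this shows where the count of blacklisted individuals stops changing, which is usually a sensible place to set the threshold.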

All these figures are combined in the folder...

To view the distribution of missingness per marker:

ibm$missing.genotypes.markers.combined.plots

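Other figures are created; explore the list of objects and the folder, and read the documentation.

To view the distribution of FH (an individual inbreeding coefficient based on excess homozygosity; see Keller et al. 2011) and missing genotypes per individual: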

ibm$missing.genotypes.ind.fh.combined.plots 

This weird figure is caused by the outlier individual. To remove this individual, re-run missing_visualization with the argument blacklist.id and one of the several blacklists written to the working directory (e.g. blacklist.id.missing.70.tsv).
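If none of the blacklists written by the function uses the threshold you want, a custom one can be built by hand. The sketch below is only an illustration: the summary table, its column names and the single-column INDIVIDUALS layout of the output file are assumptions, so compare the result with one of the generated blacklists before using it.

```r
# Hypothetical per-individual summary (column names are assumptions)
missing.ind <- data.frame(
  INDIVIDUALS = c("ind1", "ind2", "ind3", "ind4"),
  PERC = c(3.2, 8.5, 97.1, 12.0),
  stringsAsFactors = FALSE
)

# Blacklist individuals with more than 70% missing genotypes
threshold <- 70
blacklist <- missing.ind[missing.ind$PERC > threshold, "INDIVIDUALS", drop = FALSE]

# Write the blacklist as a tab-separated file (temporary folder for this sketch)
blacklist.file <- file.path(
  tempdir(),
  paste0("my.blacklist.id.missing.", threshold, ".tsv")
)
utils::write.table(
  blacklist,
  file = blacklist.file,
  sep = "\t",
  quote = FALSE,
  row.names = FALSE
)
blacklist
```

The path stored in blacklist.file can then be passed to the blacklist.id argument when re-running missing_visualization.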

Explore the rest by yourself!

Interpretation

Do you see patterns in your plots that provide insight into the relationships that missing values might have with other variables (inspired by r4ds)?

If you see a pattern, ask yourself:

Strategies

Arguments

Filtering

References

Ferchaud A, Hansen MM (2016) The impact of selection, gene flow and demographic history on heterogeneous genomic divergence: threespine sticklebacks in divergent environments. Molecular Ecology 25(1): 238–259. http://dx.doi.org/10.1111/mec.13399

Ferchaud A, Hansen MM (2015) Data from: The impact of selection, gene flow and demographic history on heterogeneous genomic divergence: threespine sticklebacks in divergent environments. Dryad Digital Repository. http://dx.doi.org/10.5061/dryad.kp11q

Danecek P, Auton A, Abecasis G et al. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156–2158.

Purcell S, Neale B, Todd-Brown K et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics, 81, 559–575.

Keller MC, Visscher PM, Goddard ME (2011) Quantification of inbreeding due to distant ancestors and its detection using dense single nucleotide polymorphism data. Genetics, 189, 237–249.

Kardos M, Luikart G, Allendorf FW (2015) Measuring individual inbreeding in the age of genomics: marker-based measures are better than pedigrees. Heredity, 115, 63–72.

Hedrick PW, Garcia-Dorado A (2016) Understanding inbreeding depression, purging, and genetic rescue. Trends in Ecology and Evolution, 31, 940–952. doi:10.1016/j.tree.2016.09.005



thierrygosselin/grur documentation built on Oct. 28, 2020, 5:48 p.m.