missing_visualization: Visualize missing genotypes in genomic data set


View source: R/missing_visualization.R

Description

Use this function to visualize patterns of missing data.

Usage

missing_visualization(
  data,
  strata = NULL,
  strata.select = "POP_ID",
  distance.method = "euclidean",
  ind.missing.geno.threshold = c(2, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90),
  filename = NULL,
  parallel.core = parallel::detectCores() - 1,
  write.plot = TRUE,
  ...
)

Arguments

data

14 options for input (diploid data only): VCFs (SNPs or Haplotypes, to make the vcf population ready), plink (tped, bed), stacks haplotype file, genind (library(adegenet)), genlight (library(adegenet)), gtypes (library(strataG)), genepop, DArT, and a data frame in long/tidy or wide format. To verify that radiator detects your file format, use detect_genomic_format (see example below). The formats are documented in the Input genomic datasets section of tidy_genomic_data.

DArT and VCF data: radiator was not meant to generate alleles and genotypes if you are using a VCF file with no genotype (only genotype likelihood: GL or PL). Neither is radiator able to magically generate a genind object from a SilicoDArT dataset. Please look at the first few lines of your dataset to understand its limits before asking radiator to convert or filter your dataset.

strata

(optional) The strata file is a tab-delimited file with a minimum of 2 column headers: INDIVIDUALS and STRATA. If a strata file is specified, the strata argument takes precedence over the population groupings (POP_ID) used internally. The STRATA column can be any hierarchical grouping. For the missing_visualization function, use additional columns in the strata file to store metadata in which you want to look for patterns of missingness, e.g. lanes, chips, sequencers, etc. Note that the STRATA column needs at least 2 different values for the function to work. Default: strata = NULL.
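As a minimal sketch of the layout described above, the following base R code builds and writes a small strata file. The individual names, population labels and the extra LANES and SEQUENCER columns are made up for illustration; only the INDIVIDUALS and STRATA headers come from this documentation.

```r
# Hypothetical strata table: INDIVIDUALS and STRATA are the required headers;
# LANES and SEQUENCER are made-up metadata columns for illustration.
strata <- data.frame(
  INDIVIDUALS = c("ind_01", "ind_02", "ind_03", "ind_04"),
  STRATA = c("pop_A", "pop_A", "pop_B", "pop_B"),
  LANES = c("lane_1", "lane_2", "lane_1", "lane_2"),
  SEQUENCER = c("seq_X", "seq_X", "seq_Y", "seq_Y"),
  stringsAsFactors = FALSE
)

# The STRATA column needs at least two distinct values.
stopifnot(length(unique(strata$STRATA)) > 1)

# Write a tab-delimited file with a header row, no row names and no quotes.
strata.file <- file.path(tempdir(), "population.map.strata.tsv")
write.table(strata, file = strata.file, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read the file back to check the layout.
strata.check <- read.delim(strata.file, stringsAsFactors = FALSE)
```

The path held in strata.file could then be supplied to the strata argument.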

strata.select

(optional, character) Use this argument to select the column(s) of the strata file used to generate the PCoA-IBM plot. To visualize more than 1 column, use a character vector, e.g. strata.select = c("POP_ID", "LANES", "SEQUENCER", "WATERSHED") to test 4 grouping columns inside the strata file. Default: strata.select = "POP_ID".

distance.method

(character) The distance measure to be used. This must be one of "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski". The function uses dist. Default: distance.method = "euclidean".
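To illustrate what the distance method controls, here is a minimal base R sketch of an identity-by-missingness style analysis: genotypes are recoded to missing/not missing, pairwise distances are computed with dist, and a PCoA is run with cmdscale. The toy matrix is made up, and the recoding and the cmdscale step are assumptions about the general technique, not a copy of the function's internal code.

```r
# Toy genotype matrix: 4 individuals (rows) x 5 markers (columns).
# NA marks a missing genotype. All values are made up.
geno <- matrix(
  c(0,  1,  NA, 2,  0,
    0,  NA, NA, 2,  1,
    1,  1,  0,  NA, 0,
    NA, 1,  0,  NA, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("ind_", 1:4), paste0("snp_", 1:5))
)

# Recode to missingness: 1 = missing, 0 = genotyped.
miss <- 1 * is.na(geno)

# Pairwise distances between individuals with stats::dist().
d.euclidean <- dist(miss, method = "euclidean")
d.binary <- dist(miss, method = "binary")

# Principal coordinates analysis on the Euclidean distance matrix.
pcoa <- cmdscale(d.euclidean, k = 2, eig = TRUE)
pcoa$points
```

With method = "binary", markers genotyped in both individuals are ignored, so only markers missing in at least one of the two individuals contribute to the distance.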

ind.missing.geno.threshold

(numeric vector) Percentage of missing genotypes allowed per individual (used to create the blacklists). Default: ind.missing.geno.threshold = c(2, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90).
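The idea of one blacklist per threshold can be sketched in base R as follows. The data are made up, the list names are hypothetical, and the strict "greater than" comparison is an assumption; the function's own rule at the boundary may differ.

```r
# Toy genotype matrix: 4 individuals x 10 markers, all genotyped at first.
geno <- matrix(
  1, nrow = 4, ncol = 10,
  dimnames = list(paste0("ind_", 1:4), paste0("snp_", 1:10))
)

# Introduce made-up missing genotypes: 0, 1, 3 and 6 missing markers.
geno["ind_2", 1] <- NA
geno["ind_3", 1:3] <- NA
geno["ind_4", 1:6] <- NA

# Percentage of missing genotypes per individual.
pct.missing <- 100 * rowSums(is.na(geno)) / ncol(geno)

# One blacklist per threshold: individuals whose missingness exceeds it.
# Assumption: a strict ">" comparison is used here.
thresholds <- c(2, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90)
blacklists <- lapply(thresholds, function(th) {
  names(pct.missing)[pct.missing > th]
})
names(blacklists) <- paste0("threshold_", thresholds)  # hypothetical names

blacklists[["threshold_10"]]
```

Lower thresholds give longer blacklists, so comparing them shows how many individuals each cutoff would remove.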

filename

(optional) Name of the tidy data set, written to the directory created by the function.

parallel.core

(optional) The number of cores used for parallel execution during import. Default: parallel.core = parallel::detectCores() - 1.

write.plot

(optional, logical) When write.plot = TRUE, the function writes the plots to the directory it creates, except the heatmap, which takes longer to generate. To save the heatmap, do it manually following the example below. Default: write.plot = TRUE.

...

(optional) Advanced mode that allows passing further arguments for fine-tuning the function. Also used for legacy arguments (see advanced mode or special sections below).

Details

filename

The function uses write.fst to write the tidy data frame to the directory. The file extension appended to the filename provided is .rad. The file is written with the fst package (Lightning Fast Serialization of Data Frames for R). To read the tidy data file back into R, use read.fst.
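A minimal round-trip sketch with the fst package, assuming it is installed. The toy data frame, its column names and the file name are made up for illustration; only write.fst, read.fst and the .rad extension come from the text above.

```r
# Run only when the fst package is available.
if (requireNamespace("fst", quietly = TRUE)) {
  # Made-up tidy data frame; the column names are illustrative only.
  tidy <- data.frame(
    MARKERS = c("snp_1", "snp_2", "snp_3"),
    INDIVIDUALS = c("ind_01", "ind_01", "ind_01"),
    GT = c("001001", "001002", NA),
    stringsAsFactors = FALSE
  )

  # Write with the .rad extension, then read the file back.
  rad.file <- file.path(tempdir(), "toy.tidy.rad")
  fst::write.fst(tidy, rad.file)
  tidy.back <- fst::read.fst(rad.file)
}
```

The file extension does not matter to read.fst; the file is read as fst regardless of the .rad suffix.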

Value

A list with several objects: the principal coordinates and eigenvalues of the PCoA, the identity-by-missingness (IBM) plot, and several summary tables and plots of missing information per individual, population and marker. Blacklisted ids are also included. Whitelists of markers with different levels of missingness are generated automatically. A heatmap showing missing values in black and genotypes in grey provides a general overview of the missing data. The heatmap usually takes a long time to generate, so it is included as an object in the list but not written to the folder.

Author(s)

Thierry Gosselin thierrygosselin@icloud.com and Eric Archer eric.archer@noaa.gov

References

Legendre, P. and Legendre, L. (1998) Numerical Ecology, 2nd English edition. Amsterdam: Elsevier Science BV.

Keller MC, Visscher PM, Goddard ME (2011) Quantification of inbreeding due to distant ancestors and its detection using dense single nucleotide polymorphism data. Genetics, 189, 237–249.

Kardos M, Luikart G, Allendorf FW (2015) Measuring individual inbreeding in the age of genomics: marker-based measures are better than pedigrees. Heredity, 115, 63–72.

Hedrick PW, Garcia-Dorado A (2016) Understanding inbreeding depression, purging, and genetic rescue. Trends in Ecology and Evolution, 31, 940–952. doi:10.1016/j.tree.2016.09.005

Examples

## Not run: 
# Using a VCF file, the simplest form of the function:
ibm.koala <- missing_visualization(
  data = "batch_1.vcf",
  strata = "population.map.strata.tsv"
)

# To see what's inside the list
names(ibm.koala)

# To view the heatmap:
ibm.koala$heatmap

# To save the heatmap
# move to the appropriate directory
ggplot2::ggsave(
  filename = "heatmap.missing.pdf",
  plot = ibm.koala$heatmap,
  width = 15, height = 20,
  dpi = 600, units = "cm", useDingbats = FALSE
)

# To view the IBM analysis plot:
ibm.koala$ibm_plot

## End(Not run)

thierrygosselin/grur documentation built on Oct. 28, 2020, 5:48 p.m.