summary_pstacks: Summarize STACKS pstacks files

Description Usage Arguments Value References See Also Examples

View source: R/summary_pstacks.R

Description

This function reads the output of STACKS pstacks folder and summarise the information.

The information shown can help to choose individuals to include/exclude from the catalog, the next step in STACKS pipeline (see example).

Usage

1
2
3
4
5
summary_pstacks(
  pstacks.folder,
  parallel.core = parallel::detectCores() - 1,
  verbose = FALSE
)

Arguments

pstacks.folder

(character). The path to the pstacks output files.

parallel.core

(integer, optional) The number of core used for parallel execution. Default: parallel::detectCores() - 1.

verbose

(logical, optional) Make the function a little more chatty during execution. Default: verbose = FALSE.

Value

The function returns a summary (data frame) containing:

  1. INDIVIDUALS: the sample id

  2. LOCUS_NUMBER: the number of locus

  3. BLACKLIST_PSTACKS: the number of locus blacklisted by pstacks

  4. FOR_CATALOG: the number of locus available to generate the catalog

  5. BLACKLIST_ARTIFACT: the number of artifact genotypes (> 2 alleles, see details)

  6. FILTERED: the number of locus after artifacts are removed

  7. HOMOZYGOSITY: the number of homozygous genotypes

  8. HETEROZYGOSITY: the number of heterozygous genotypes

  9. MEAN_NUMBER_SNP_LOCUS: the mean number of SNP/locus (excluding the artifact locus)

  10. MAX_NUMBER_SNP_LOCUS: the max number of SNP/locus observed for this individual (excluding the artifact locus)

  11. NUMBER_LOCUS_4SNP: the number of locus with 4 or more SNP/locus (excluding the artifact locus)

References

Catchen JM, Amores A, Hohenlohe PA et al. (2011) Stacks: Building and Genotyping Loci De Novo From Short-Read Sequences. G3, 1, 171-182.

Catchen JM, Hohenlohe PA, Bassham S, Amores A, Cresko WA (2013) Stacks: an analysis tool set for population genomics. Molecular Ecology, 22, 3124-3140.

See Also

pstacks cstacks run_cstacks

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
## Not run: 
pstacks.summary <- stackr::summary_pstacks(
pstacks.folder = "output_pstacks", verbose = TRUE)

# To get the number of snp/locus for a specific individual:
snp.info <- pstacks.summary %>%
dplyr::filter(INDIVIDUALS == "ID2") %>%
dplyr::select(INDIVIDUALS, SNP_LOCUS) %>%
tidyr::unnest(.)

#Similarly, for the blacklisted locus (artifactual):
blacklist <- pstacks.summary %>%
dplyr::filter(INDIVIDUALS == "ID2") %>%
dplyr::select(INDIVIDUALS, BLACKLIST_ARTIFACT) %>%
tidyr::unnest(.)

# Decide individuals to include in catalog:

# Make the pstacks.summary dataframe population wise by including pop info.
# For this we need a strata file ... it's similar to a population map, but with headers.
# It's a 2 columns files with INDIVIDUALS and STRATA as column header.
strata <- readr::read_tsv("strata.octopus.tsv", col_types = "cc") %>%
dplyr::rename(POP_ID = STRATA)

# join the dataframes
individuals.catalog <- dplyr::left_join(pstacks.summary, strata, by = "INDIVIDUALS") %>%
dplyr::arrange(desc(FILTERED))

# look at the FILTERED column
hist(individuals.catalog$FILTERED)
boxplot(individuals.catalog$FILTERED)
mean(individuals.catalog$FILTERED)

# lets say we want to filter out individuals below 45000 markers
individuals.catalog.filtered <- individuals.catalog %>%
dplyr::filter(FILTERED > 45000)

# number of ind per pop
dplyr::select(individuals.catalog.filtered, INDIVIDUALS, POP_ID) %>%
dplyr::group_by(POP_ID) %>% dplyr::tally(.)

# choose 20% of individuals per pop to represent the catalog

individuals.catalog.select <- individuals.catalog.filtered %>%
dplyr::group_by(POP_ID) %>%
dplyr::sample_frac(tbl = ., size = 0.2, replace = FALSE) %>%
dplyr::ungroup(.) %>%
dplyr::arrange(desc(FILTERED)) %>%
dplyr::distinct(INDIVIDUALS, POP_ID)

# Using desc (from large to lower number of markers) ensure that
# the catalog is populated with samples with the highest number of loci first.
# This speed up the process in cstacks!


# Create a file with individuals and pop to use in
# cstacks or run_cstacks

readr::write_tsv(
x = individuals.catalog.select,
path = "population.map.octopus.samples.catalog.tsv",
col_names = FALSE) # stacks doesn't want column header

## End(Not run)

thierrygosselin/stackr documentation built on Nov. 11, 2020, 11 a.m.