View source: R/summary_pstacks.R
summary_pstacks | R Documentation |
This function reads the output of STACKS pstacks folder and summarise the information.
The information shown can help to choose individuals to include/exclude from the catalog, the next step in STACKS pipeline (see example).
summary_pstacks(
pstacks.folder,
parallel.core = parallel::detectCores() - 1,
verbose = FALSE
)
pstacks.folder |
(character). The path to the pstacks output files. |
parallel.core |
(integer, optional) The number of core used for parallel
execution.
Default: |
verbose |
(logical, optional) Make the function a little more chatty during
execution.
Default: |
The function returns a summary (data frame) containing:
INDIVIDUALS: the sample id
LOCUS_NUMBER: the number of locus
BLACKLIST_PSTACKS: the number of locus blacklisted by pstacks
FOR_CATALOG: the number of locus available to generate the catalog
BLACKLIST_ARTIFACT: the number of artifact genotypes (> 2 alleles, see details)
FILTERED: the number of locus after artifacts are removed
HOMOZYGOSITY: the number of homozygous genotypes
HETEROZYGOSITY: the number of heterozygous genotypes
MEAN_NUMBER_SNP_LOCUS: the mean number of SNP/locus (excluding the artifact locus)
MAX_NUMBER_SNP_LOCUS: the max number of SNP/locus observed for this individual (excluding the artifact locus)
NUMBER_LOCUS_4SNP: the number of locus with 4 or more SNP/locus (excluding the artifact locus)
Catchen JM, Amores A, Hohenlohe PA et al. (2011) Stacks: Building and Genotyping Loci De Novo From Short-Read Sequences. G3, 1, 171-182.
Catchen JM, Hohenlohe PA, Bassham S, Amores A, Cresko WA (2013) Stacks: an analysis tool set for population genomics. Molecular Ecology, 22, 3124-3140.
## Not run:
pstacks.summary <- stackr::summary_pstacks(
pstacks.folder = "output_pstacks", verbose = TRUE)
# To get the number of snp/locus for a specific individual:
snp.info <- pstacks.summary %>%
dplyr::filter(INDIVIDUALS == "ID2") %>%
dplyr::select(INDIVIDUALS, SNP_LOCUS) %>%
tidyr::unnest(.)
#Similarly, for the blacklisted locus (artifactual):
blacklist <- pstacks.summary %>%
dplyr::filter(INDIVIDUALS == "ID2") %>%
dplyr::select(INDIVIDUALS, BLACKLIST_ARTIFACT) %>%
tidyr::unnest(.)
# Decide individuals to include in catalog:
# Make the pstacks.summary dataframe population wise by including pop info.
# For this we need a strata file ... it's similar to a population map, but with headers.
# It's a 2 columns files with INDIVIDUALS and STRATA as column header.
strata <- readr::read_tsv("strata.octopus.tsv", col_types = "cc") %>%
dplyr::rename(POP_ID = STRATA)
# join the dataframes
individuals.catalog <- dplyr::left_join(pstacks.summary, strata, by = "INDIVIDUALS") %>%
dplyr::arrange(desc(FILTERED))
# look at the FILTERED column
hist(individuals.catalog$FILTERED)
boxplot(individuals.catalog$FILTERED)
mean(individuals.catalog$FILTERED)
# lets say we want to filter out individuals below 45000 markers
individuals.catalog.filtered <- individuals.catalog %>%
dplyr::filter(FILTERED > 45000)
# number of ind per pop
dplyr::select(individuals.catalog.filtered, INDIVIDUALS, POP_ID) %>%
dplyr::group_by(POP_ID) %>% dplyr::tally(.)
# choose 20% of individuals per pop to represent the catalog
individuals.catalog.select <- individuals.catalog.filtered %>%
dplyr::group_by(POP_ID) %>%
dplyr::sample_frac(tbl = ., size = 0.2, replace = FALSE) %>%
dplyr::ungroup(.) %>%
dplyr::arrange(desc(FILTERED)) %>%
dplyr::distinct(INDIVIDUALS, POP_ID)
# Using desc (from large to lower number of markers) ensure that
# the catalog is populated with samples with the highest number of loci first.
# This speed up the process in cstacks!
# Create a file with individuals and pop to use in
# cstacks or run_cstacks
readr::write_tsv(
x = individuals.catalog.select,
file = "population.map.octopus.samples.catalog.tsv",
col_names = FALSE) # stacks doesn't want column header
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.