import.snpR.data | R Documentation |
import.snpR.data converts genotypes and metadata to the snpRdata class,
which stores raw genotype data, sample and locus specific metadata, useful
data summaries, repeatedly internally used tables, calculated summary
statistics, and sliding-window statistic data.
import.snpR.data(
genotypes,
snp.meta = NULL,
sample.meta = NULL,
mDat = "NN",
chr.length = NULL,
...,
header_cols = 0,
rows_per_individual = 2,
marker_names = FALSE,
fix_overlaps = TRUE,
verbose = FALSE,
.pass_filters = FALSE,
.skip_filters = FALSE
)
genotypes |
data.frame, S4 object from another supported package, or filename. If a data.frame, raw genotypes in a two-character format ("GG", "GA", "CT", "NN"), where SNPs are in rows and individual samples are in columns. Otherwise, see documentation for allowed S4 objects and files. |
snp.meta |
data.frame, default NULL. Metadata for each SNP, must have a number of rows equal to the number of SNPs in the dataset. If NULL, a single "snpID" column will be added. |
sample.meta |
data.frame, default NULL. Metadata for each individual sample, must have a number of rows equal to the number of samples in the dataset. If NULL, a single "sampID" column will be added. |
mDat |
character, default "NN", matching the encoding of missing genotypes in the data provided to the genotypes argument. |
chr.length |
numeric, default NULL. If a path to a .ms file is provided, specifies chromosome lengths. Note that a single value assumes that each chromosome is of equal length whereas a vector of values gives the length for each chromosome in order. |
... |
Additional arguments passed to fread when reading in genotype data. |
header_cols |
numeric, default 0. Number of header columns containing SNP metadata. Used if a tab delimited or STRUCTURE input file is provided. |
rows_per_individual |
numeric (1 or 2), default 2. Number of rows used for each individual. For structure input files only. |
marker_names |
logical, default FALSE. If TRUE, assumes that a header row of marker names is present. For structure input files only. |
fix_overlaps |
Logical, default TRUE. If TRUE, overlapping positions will be checked and fixed during 'ms' file import. |
verbose |
Logical, default FALSE. If TRUE, will print a few status updates and checks. |
.pass_filters |
Internal, probably not for user use. Used to pass filtering history when sub-setting when this function is called internally. |
.skip_filters |
Internal, probably not for user use. Used to skip re-filtering during sub-setting when this function is called internally. |
The snpRdata class is built to contain SNP genotype data for use by functions
in the snpR package. It inherits from the S3 class data.frame, in which the
genotypes are stored, and can be manipulated identically. It also stores
sample and locus specific metadata, genomic summary information, and results
from most snpR functions. Genotypes are stored in the "character" format, as
output by format_snps. Missing data is noted with "NN".
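As a minimal sketch of the layout described above, the following base R code builds a small genotype table with SNPs in rows and samples in columns, plus matching metadata. The object and column names (genos, snp_meta, sample_meta, chr, position, pop) are made up for illustration; only the two-character format and the "NN" missing code come from this page. The import call itself is left commented out because it requires snpR to be installed.

```r
# Toy genotypes: SNPs in rows, samples in columns, "NN" marks missing calls.
genos <- data.frame(
  ind1 = c("GG", "CT", "NN"),
  ind2 = c("GA", "CC", "AA"),
  ind3 = c("AA", "NN", "AT"),
  stringsAsFactors = FALSE
)

# Illustrative metadata: one row per SNP and one row per sample.
snp_meta <- data.frame(chr = c("chr1", "chr1", "chr2"),
                       position = c(101, 2050, 77))
sample_meta <- data.frame(pop = c("popA", "popA", "popB"))

# The metadata dimensions must match the genotype table.
stopifnot(nrow(snp_meta) == nrow(genos),
          nrow(sample_meta) == ncol(genos))

# Count missing genotypes per SNP using the missing data code.
n_missing <- rowSums(genos == "NN")

# With snpR installed, the import would then look like:
# x <- snpR::import.snpR.data(genos, snp.meta = snp_meta,
#                             sample.meta = sample_meta, mDat = "NN")
```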
Inputs can be provided either as pre-existing R objects (in several different
formats) or as paths to files. In both cases, snpR will attempt to
guess the data format from the object class, the first genotype, or the
file extension (as appropriate).
Supports automatic import of several types of files. Options:
.vcf or .vcf.gz: Variant Call Format (vcf) files, supported
via vcfR. If not otherwise provided, snp metadata is
taken from the fixed fields in the VCF and sample metadata from the sample
IDs. Note that this only imports SNPs with called genotypes!
.ms: Files in the ms format, as provided by many commonly used simulation tools.
.txt, NN: SNP genotypes stored as actual base calls (e.g. "AA", "CT").
.txt, 0000: SNP genotypes stored as four numeric characters (e.g. "0101", "0204").
.txt, snp_tab: SNP genotypes stored with genotypes in each cell, but only a single nucleotide noted if homozygote and two nucleotides separated by a space if heterozygote (e.g. "T", "T G").
.txt, sn: SNP genotypes stored with genotypes in each cell as 0 (homozygous allele 1), 1 (heterozygous), or 2 (homozygous allele 2).
.genepop or .gen: genepop file format, with genotypes stored as either 4 or 6 numeric characters. Works only with bi-allelic data. Genotypes will be converted (internally) to NN: the first allele (numerically) will be coded as A, the second as C.
.fstat: FSTAT file format, with genotypes stored as either 4 or 6 numeric characters. Works only with bi-allelic data. Genotypes will be converted (internally) to NN: the first allele (numerically) will be coded as A, the second as C.
.bed/.fam/.bim: PLINK .bed, .fam, and .bim files, via read_plink. If
any of these file types is provided, snpR (via read_plink) will look for
the other file types automatically. Sample metadata should be contained in
the .fam file and SNP metadata in the .bim file, so sample or snp metadata
provided here will be ignored.
.str: STRUCTURE import files in either 1 or 2 rows per
individual as defined by the rows_per_individual argument.
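The genepop and FSTAT entries above describe recoding numeric alleles to letters: the numerically first allele becomes A and the second becomes C. The sketch below illustrates that rule for a single locus in base R. The helper recode_locus is hypothetical and is not the code snpR uses internally; it also assumes that a genotype made entirely of zeros (for example "0000") marks missing data, which is an assumption not stated on this page.

```r
# Hypothetical helper illustrating the A/C recoding rule for one locus.
# x: character vector of numeric genotypes, e.g. "0102" (digits = 2)
#    or "001002" (digits = 3).
recode_locus <- function(x, digits = 2) {
  missing_code <- strrep("0", 2 * digits)  # assumed missing genotype code
  a1 <- substr(x, 1, digits)
  a2 <- substr(x, digits + 1, 2 * digits)
  is_missing <- x == missing_code

  # Observed alleles, ordered numerically.
  alleles <- unique(c(a1[!is_missing], a2[!is_missing]))
  alleles <- alleles[order(as.integer(alleles))]
  if (length(alleles) > 2) {
    stop("more than two alleles at this locus; only bi-allelic data is handled")
  }

  # First allele -> "A", second allele -> "C".
  key <- c("A", "C")[seq_along(alleles)]
  names(key) <- alleles

  out <- paste0(key[a1], key[a2])
  out[is_missing] <- "NN"
  out
}

recoded <- recode_locus(c("0101", "0102", "0202", "0000"))
```

Because the mapping is built per locus, the same letter can stand for different numeric alleles at different loci, which is why the result should be read as an internal encoding rather than as real base calls.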
Additional arguments can be provided to import.snpR.data that will be passed
to fread when reading in genotype data.
Sample and snp metadata can also be provided via file path, and will be read
in using fread with the default settings.
If these settings are not correct, please read in the metadata manually and
provide it to import.snpR.data.
Supports automatic conversions from some other popular S4 object types. Options:
genind: genind objects from adegenet. Note that there is no need to import
genepop objects, since the equivalent statistics are calculated automatically
when functions are called with facets. Sample and SNP IDs as well as, when
possible, pop IDs will be taken from the genind object. This data will be
added to, but will not replace, data provided to the SNP or sample.meta
arguments. Note that only SNP data is currently allowed; data with more than
two alleles for any locus will return an error.
genlight: genlight objects from adegenet. Sample and SNP IDs, SNP positions,
SNP chromosomes, and pop IDs will be taken from the genlight object if
possible. This data will be added to, but will not replace, data provided to
the SNP or sample.meta arguments.
vcfR: vcfR objects from vcfR. If not provided,
snp metadata is taken from the fixed fields in the VCF and sample metadata
from the sample IDs. Note that this only imports SNPs with called
genotypes!
Genotypes, metadata, and results are stored in slots and are directly accessible with the @ operator. Slots are as follows:
sample.meta: sample metadata (population, family, phenotype, etc.).
snp.meta: SNP metadata (SNP ID, chromosome, linkage group, position, etc.).
facet.meta: internal metadata used to track facets that have been previously applied to the dataset.
mDat: missing data format.
snp.form: number of characters per SNP.
genotables: a list containing tabulated genotypes (gs), allele counts (as), and missing data (wm). facet.meta contains the corresponding metadata.
facets: vector of the facets that have been added to the data.
facet.type: classes of the added facets (snp, sample, complex, or .base).
stats: data.frame containing all calculated non-pairwise single-snp statistics and metadata.
window.stats: data.frame/table containing all non-pairwise statistics calculated for sliding windows.
pairwise.stats: data.frame/table containing all pairwise (fst) single-snp statistics.
pairwise.window.stats: data.frame/table containing all pairwise statistics calculated for sliding windows.
sample.stats: data.frame/table containing statistics calculated for each individual sample.
pairwise.LD: nested list containing linkage disequilibrium data (see calc_pairwise_ld for more information).
window.bootstraps: data.frame/table containing all calculated bootstraps for sliding window statistics.
sn: list containing "sn", the sn-formatted data, and "type", the type of interpolation used.
calced_stats: named list of named character vectors that tracks the calculated statistics for each facet (see calc_genetic_distances for more information).
genetic_distances: nested list containing genetic distance data.
names: column names for genotypes.
row.names: row names for genotypes.
.Data: list of vectors containing raw genotype data.
.S3Class: notes the inherited S3 object class.
Note that most of these slots are used primarily internally.
All calculated data can be accessed using the get.snpR.stats function. See its documentation.
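To make the slot layout above more concrete, here is a small base R sketch of an S4 class that inherits from data.frame and carries extra slots. The class toyGenoData is a hypothetical stand-in defined only for this illustration; it is not the real snpRdata definition and has just two of the slots listed above. It does show where the names, row.names, .Data, and .S3Class slots come from: they are provided by the inherited data.frame.

```r
library(methods)

# Hypothetical toy class for illustration only, not the snpRdata class.
setClass("toyGenoData",
         contains = "data.frame",
         slots = c(sample.meta = "data.frame", mDat = "character"))

genos <- data.frame(ind1 = c("GG", "NN"),
                    ind2 = c("GA", "CC"),
                    stringsAsFactors = FALSE)

# The unnamed data.frame fills the inherited part; named arguments fill slots.
toy <- new("toyGenoData", genos,
           sample.meta = data.frame(pop = c("popA", "popB")),
           mDat = "NN")

toy@mDat         # missing data code
toy@sample.meta  # sample metadata
toy@names        # column names of the genotype table
toy@.Data        # list of vectors holding the raw genotypes
```

For a real snpRdata object, direct slot access works the same way, although get.snpR.stats remains the documented route for retrieving calculated results.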
William Hemstrom
# import example data as a snpRdata object
# produces data identical to that contained in the stickSNPs example dataset.
genos <- stickRAW[,-c(1:2)]
snp_meta <- stickRAW[,1:2]
sample_meta <- data.frame(pop = substr(colnames(stickRAW)[-c(1:2)], 1, 3),
                          fam = rep(c("A", "B", "C", "D"),
                                    length = ncol(stickRAW) - 2),
                          stringsAsFactors = FALSE)
import.snpR.data(genos, snp.meta = snp_meta, sample.meta = sample_meta,
                 mDat = "NN")
# from an adegenet genind object
ex.genind <- adegenet::df2genind(t(stickRAW[,-c(1:2)]),
                                 ncode = 1, NA.char = "N") # get genind data
# note, will add whatever metadata data is in the genind object to the
# snpRdata object.
# Could be run without the snp or sample metadatas.
import.snpR.data(ex.genind, snp_meta, sample_meta)
# from an adegenet genlight object
num <- format_snps(stickSNPs, "sn", interpolate = FALSE)
genlight <- methods::as(t(num[,-c(1:2)]), "genlight")
## run the conversion, could be run without the snp or sample metadatas.
dat <- import.snpR.data(genlight)
## Not run:
# from a file:
# note that the drop argument is passed to data.table::fread!
dat <- import.snpR.data(system.file("extdata", "stick_NN_input.txt",
                                    package = "snpR"), drop = 1:2)
# if wanted, snp and sample metadata could be provided as usual.
# from plink:
# make plink data
format_snps(stickSNPs, "plink", outfile = "plink_test", chr = "chr")
# read plink
dat <- import.snpR.data("plink_test.bed")
## End(Not run)