import.snpR.data: Import genotype and metadata into a snpRdata object.

View source: R/snpRdata.R

import.snpR.data (R Documentation)

Description

import.snpR.data converts genotype data and metadata to the snpRdata class, which stores raw genotype data, sample- and locus-specific metadata, useful data summaries, tables that are reused internally, calculated summary statistics, and sliding-window statistics.

Usage

import.snpR.data(
  genotypes,
  snp.meta = NULL,
  sample.meta = NULL,
  mDat = "NN",
  chr.length = NULL,
  ...,
  header_cols = 0,
  rows_per_individual = 2,
  marker_names = FALSE,
  verbose = FALSE,
  .pass_filters = FALSE,
  .skip_filters = FALSE
)

Arguments

genotypes

data.frame, supported S4 object from another package, or filename. If a data.frame, raw genotypes in a two-character format ("GG", "GA", "CT", "NN"), where SNPs are in rows and individual samples are in columns. Otherwise, see the "File import" and "Conversions from other S4 objects" sections for allowed files and S4 objects.

snp.meta

data.frame, default NULL. Metadata for each SNP. Must have a number of rows equal to the number of SNPs in the dataset. If NULL, a single "snpID" column will be added.

sample.meta

data.frame, default NULL. Metadata for each individual sample. Must have a number of rows equal to the number of samples in the dataset. If NULL, a single "sampID" column will be added.

mDat

character, default "NN", matching the encoding of missing genotypes in the data provided to the genotypes argument.

chr.length

numeric, default NULL. Chromosome lengths, used only if a path to a .ms file is provided. A single value assumes that all chromosomes are of equal length, whereas a vector of values gives the length of each chromosome in order.

...

Additional arguments passed to fread if a genotype file name is passed that is not a vcf or ms file.

header_cols

numeric, default 0. Number of header columns containing SNP metadata. Used if a tab delimited or STRUCTURE input file is provided.

rows_per_individual

numeric (1 or 2), default 2. Number of rows used for each individual. For structure input files only.

marker_names

logical, default FALSE. If TRUE, assumes that a header row of marker names is present. For structure input files only.

verbose

logical, default FALSE. If TRUE, will print a few status updates and checks.

.pass_filters

Internal, probably not for user use. Used to pass filtering history when sub-setting when this function is called internally.

.skip_filters

Internal, probably not for user use. Used to skip re-filtering during sub-setting when this function is called internally.

Details

The snpRdata class is built to contain SNP genotype data for use by functions in the snpR package. It inherits from the S3 class data.frame, in which the genotypes are stored, and can be manipulated identically. It also stores sample and locus specific metadata, genomic summary information, and results from most snpR functions. Genotypes are stored in the "character" format, as output by format_snps. Missing data is noted with "NN".
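
As a minimal sketch of the layout described above, the following builds a small genotype data.frame (SNPs in rows, samples in columns, "NN" for missing calls) together with matching metadata. All object names and values here (genos, snp_meta, sample_meta, the chr/position/pop columns) are made up for illustration, and the final import call only runs if snpR is installed.

```r
# Toy genotypes: 3 SNPs (rows) by 4 samples (columns), two characters per call.
genos <- data.frame(
  ind1 = c("GG", "CT", "AA"),
  ind2 = c("GA", "NN", "AC"),
  ind3 = c("AA", "CC", "CC"),
  ind4 = c("GA", "CT", "NN"),
  stringsAsFactors = FALSE
)

# One metadata row per SNP and one per sample (hypothetical columns).
snp_meta <- data.frame(chr = c("chr1", "chr1", "chr2"),
                       position = c(101, 2050, 77),
                       stringsAsFactors = FALSE)
sample_meta <- data.frame(pop = c("ASP", "ASP", "CLF", "CLF"),
                          stringsAsFactors = FALSE)

# Sanity checks that mirror the requirements in the argument descriptions.
stopifnot(nrow(snp_meta) == nrow(genos),
          nrow(sample_meta) == ncol(genos),
          all(nchar(as.matrix(genos)) == 2))

# Count missing calls using the same code that would be passed as mDat.
n_missing <- sum(genos == "NN")

# Only attempt the import when snpR is available.
if (requireNamespace("snpR", quietly = TRUE)) {
  dat <- snpR::import.snpR.data(genos, snp.meta = snp_meta,
                                sample.meta = sample_meta, mDat = "NN")
}
```

Checking the dimensions before importing tends to catch the most common mismatch, which is metadata with one row too many or too few.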

Inputs can be provided either as pre-existing R objects (in several different formats) or as paths to files. In both cases, snpR will attempt to guess the data format from the object class, the first genotype, or the file extension (as appropriate).

File import

Supports automatic import of several types of files. Options:

  • .vcf or .vcf.gz: Variant Call Format (vcf) files, supported via vcfR. If not otherwise provided, snp metadata is taken from the fixed fields in the VCF and sample metadata from the sample IDs. Note that this only imports SNPs with called genotypes!

  • .ms: Files in the ms format, as provided by many commonly used simulation tools.

  • .txt, NN: SNP genotypes stored as actual base calls (e.g. "AA", "CT").

  • .txt, 0000: SNP genotypes stored as four numeric characters (e.g. "0101", "0204").

  • .txt, snp_tab: SNP genotypes stored with genotypes in each cell, but only a single nucleotide noted if homozygote and two nucleotides separated by a space if heterozygote (e.g. "T", "T G").

  • .txt, sn: SNP genotypes stored with genotypes in each cell as 0 (homozygous allele 1), 1 (heterozygous), or 2 (homozygous allele 2).

  • .genepop: genepop file format, with genotypes stored as either 4 or 6 numeric characters. Works only with bi-allelic data. Genotypes will be converted (internally) to NN: the first allele (numerically) will be coded as A, the second as C.

  • .fstat: FSTAT file format, with genotypes stored as either 4 or 6 numeric characters. Works only with bi-allelic data. Genotypes will be converted (internally) to NN: the first allele (numerically) will be coded as A, the second as C.

  • .bed/.fam/.bim: PLINK .bed, .fam, and .bim files, via read_plink. If any of these file types is provided, snpR (via read_plink) will look for the other file types automatically. Sample metadata should be contained in the .fam file and SNP metadata in the .bim file, so any sample or SNP metadata provided here will be ignored.

  • .str: STRUCTURE import files in either 1 or 2 rows per individual as defined by the rows_per_individual argument.
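
To make the difference between the text encodings above concrete, here is a small base-R sketch that writes the same four genotypes in the NN, sn, and snp_tab styles. It is an illustration only and not how snpR converts data internally. In particular, treating the alphabetically later nucleotide as "allele 2" is an assumption made for this example.

```r
# One SNP, four samples, in the "NN" encoding; "NN" marks a missing call.
nn <- c(s1 = "GG", s2 = "GA", s3 = "AA", s4 = "NN")
mDat <- "NN"

# Split each genotype into its two alleles.
a1 <- substr(nn, 1, 1)
a2 <- substr(nn, 2, 2)
is_missing <- nn == mDat

# Assumption for this sketch: alleles are ordered alphabetically,
# so "A" is allele 1 and "G" is allele 2.
alleles <- sort(unique(c(a1[!is_missing], a2[!is_missing])))

# "sn" style: number of copies of allele 2 (0, 1, or 2), NA if missing.
sn <- (a1 == alleles[2]) + (a2 == alleles[2])
sn[is_missing] <- NA

# "snp_tab" style: one nucleotide for homozygotes,
# two nucleotides separated by a space for heterozygotes.
snp_tab <- ifelse(a1 == a2, a1, paste(a1, a2))
snp_tab[is_missing] <- NA

print(data.frame(NN = nn, sn = sn, snp_tab = snp_tab))
```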

Additional arguments can be provided to import.snpR.data that will be passed to fread when reading in genotype data.

Sample and snp metadata can also be provided via file path, and will be read in using fread with the default settings. If these settings are not correct, please read in the metadata manually and provide to import.snpR.data.
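
As a sketch of reading metadata manually, the following writes a small sample metadata file whose layout (semicolon-separated, with a leading comment line) a default read might not handle as intended, then reads it with explicit settings. The file contents and column names are made up, and base R's read.table is used here instead of fread only to keep the example self-contained.

```r
# Write a toy metadata file: ";" separated, with a comment line on top.
meta_path <- tempfile(fileext = ".csv")
writeLines(c("# sample metadata, one row per sample",
             "pop;fam",
             "ASP;A",
             "ASP;B",
             "CLF;A"),
           meta_path)

# Read it with explicit settings rather than relying on defaults.
sample_meta <- utils::read.table(meta_path, header = TRUE, sep = ";",
                                 comment.char = "#",
                                 stringsAsFactors = FALSE)

# The result is a plain data.frame with one row per sample, which can be
# passed as the sample.meta argument of import.snpR.data.
str(sample_meta)
```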

Conversions from other S4 objects

Supports automatic conversions from some other popular S4 object types. Options:

  • genind: genind objects from adegenet. Note that there is no need to import genepop objects, since the equivalent statistics are calculated automatically when functions are called with facets. Sample and SNP IDs as well as, when possible, pop IDs will be taken from the genind object. This data will be added to, but will not replace, data provided to the snp.meta or sample.meta arguments. Note that only SNP data is currently allowed: data with more than two alleles at any locus will return an error.

  • genlight: genlight objects from adegenet. Sample and SNP IDs, SNP positions, SNP chromosomes, and pop IDs will be taken from the genlight object if possible. This data will be added to, but will not replace, data provided to the snp.meta or sample.meta arguments.

  • vcfR: vcfR objects from vcfR. If not provided, snp metadata is taken from the fixed fields in the VCF and sample metadata from the sample IDs. Note that this only imports SNPs with called genotypes!

Slots

Genotypes, metadata, and results are stored in slots and are directly accessible with the @ operator. Slots are as follows:

  • sample.meta: sample metadata (population, family, phenotype, etc.).

  • snp.meta: SNP metadata (SNP ID, chromosome, linkage group, position, etc.).

  • facet.meta: internal metadata used to track facets that have been previously applied to the dataset.

  • mDat: missing data format.

  • snp.form: number of characters per SNP.

  • genotables: a list containing tabulated genotypes (gs), allele counts (as), and missing data (wm). facet.meta contains the corresponding metadata.

  • facets: vector of the facets that have been added to the data.

  • facet.type: classes of the added facets (snp, sample, complex, or .base).

  • stats: data.frame containing all calculated non-pairwise single-snp statistics and metadata.

  • window.stats: data.frame/table containing all non-pairwise statistics calculated for sliding windows.

  • pairwise.stats: data.frame/table containing all pairwise (fst) single-snp statistics.

  • pairwise.window.stats: data.frame/table containing all pairwise statistics calculated for sliding windows.

  • sample.stats: data.frame/table containing statistics calculated for each individual sample.

  • pairwise.LD: nested list containing linkage disequilibrium data (see calc_pairwise_ld for more information).

  • window.bootstraps: data.frame/table containing all calculated bootstraps for sliding window statistics.

  • sn: list containing "sn", the data in sn format, and "type", the type of interpolation used.

  • calced_stats: Named list of named character vectors that tracks the calculated statistics for each facet (see calc_genetic_distances for more information).

  • genetic_distances: nested list containing genetic distance data.

  • names: column names for genotypes.

  • row.names: row names for genotypes.

  • .Data: list of vectors containing raw genotype data.

  • .S3Class: notes the inherited S3 object class.

Note that most of these slots are used primarily internally.

All calculated data can be accessed using the get.snpR.stats function. See its documentation for details.
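
As a brief sketch of direct slot access, the following inspects a few of the slots listed above on the stickSNPs example dataset. It assumes that snpR is installed and that stickSNPs is available as lazy-loaded package data. The slots_of_interest vector is just a helper for this example.

```r
# Slots chosen for illustration; names are taken from the list above.
slots_of_interest <- c("sample.meta", "snp.meta", "mDat")

# Only run the inspection when snpR is available.
if (requireNamespace("snpR", quietly = TRUE)) {
  x <- snpR::stickSNPs

  # List every slot defined on the object.
  print(methods::slotNames(x))

  # Access slots directly with the @ operator...
  print(utils::head(x@sample.meta))
  print(x@mDat)

  # ...or programmatically by name with methods::slot().
  for (s in slots_of_interest) {
    cat(s, ":", class(methods::slot(x, s)), "\n")
  }
}
```

Because most slots are used primarily internally, reading them this way is mainly useful for inspection and debugging.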

Author(s)

William Hemstrom

Examples

# import example data as a snpRdata object
# produces data identical to that contained in the stickSNPs example dataset.
genos <- stickRAW[,-c(1:2)]
snp_meta <- stickRAW[,1:2]
sample_meta <- data.frame(pop = substr(colnames(stickRAW)[-c(1:2)], 1, 3), 
                          fam = rep(c("A", "B", "C", "D"), 
                                    length = ncol(stickRAW) - 2), 
                          stringsAsFactors = FALSE)
import.snpR.data(genos, snp.meta = snp_meta, sample.meta = sample_meta, 
                 mDat = "NN")

# from an adegenet genind object
ex.genind  <- adegenet::df2genind(t(stickRAW[,-c(1:2)]), 
                                  ncode = 1, NA.char = "N") # get genind data
# note, will add whatever metadata is in the genind object to the 
# snpRdata object. 
# Could be run without the snp or sample metadata.
import.snpR.data(ex.genind, snp_meta, sample_meta) 

# from an adegenet genlight object
num <- format_snps(stickSNPs, "sn", interpolate = FALSE)
genlight <- methods::as(t(num[,-c(1:2)]), "genlight")

## run the conversion, could be run without the snp or sample metadata.
dat <- import.snpR.data(genlight)

## Not run: 
# from a file:
# note that the drop argument is passed to data.table::fread!
dat <- import.snpR.data(system.file("extdata", "stick_NN_input.txt", 
                                    package = "snpR"), drop = 1:2) 
# if wanted, snp and sample metadata could be provided as usual.

# from plink:
# make plink data
format_snps(stickSNPs, "plink", outfile = "plink_test", chr = "chr")

# read plink
dat <- import.snpR.data("plink_test.bed")

## End(Not run)


hemstrow/snpR documentation built on March 20, 2024, 7:03 a.m.