read.long: Read SNP genotype data in long format
In NikNakk/snpStats: SnpMatrix and XSnpMatrix classes and methods

This function reads SNP genotype data from a file in which each line refers to a single genotype call. It replaces the earlier function read.snps.long.

read.long(file, samples, snps,
            fields = c(snp = 1, sample = 2, genotype = 3, confidence = 4,
                       allele.A = NA, allele.B = NA),
            split = "\t| +", gcodes, no.call = "", threshold = NULL,
            lex.order = FALSE, verbose = FALSE)

`file`	Name(s) of file(s) to be read (can be gzipped)
`samples`	Either a vector of sample identifiers, or the number of samples to be read. If a single file is to be read and this argument is omitted, the file will be scanned initially and all samples will be included
`snps`	Either a vector of SNP identifiers, or the number of SNPs to be read. If a single file is to be read and this argument is omitted, the file will be scanned initially and all SNPs will be included
`fields`	A named vector giving the locations of the required fields. See Details below
`split`	A regular expression specifying how the input line will be split into fields. The default value specifies separation of fields by a TAB character, or by one or more blanks
`gcodes`	When the genotype is read as a single field, this argument specifies how it is handled. See Details below.
`no.call`	The string which indicates "no call" for either a genotype or (when the genotype is read as two allele fields) an allele
`threshold`	A vector of length 2 giving the lower and upper acceptable limits for the confidence score
`lex.order`	If `TRUE`, the alleles at each locus will be in lexicographical order. Otherwise, ordering of alleles is arbitrary, depending on the order in which they are encountered
`verbose`	If `TRUE`, this turns on output from the function. Otherwise only error and warning messages are produced

Each line in the input file represents a single call and is split into fields using the function strsplit. The required fields are extracted according to the fields argument. This must contain the locations of the sample and snp identifier fields and either the location of a genotype field or the locations of two allele fields.
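As a rough illustration of the splitting step, the base R sketch below applies the default split expression to one made-up input line and picks out fields by position. The line contents and the variable names are assumptions for the example; this is not the function's internal code.

```r
# One hypothetical input line: fields separated by a TAB or by one or more blanks.
line <- "rs123\tNA12878  AG\t0.98"

# Split with the same regular expression as the default `split` argument.
parts <- strsplit(line, split = "\t| +")[[1]]

# Field locations, in the style of the `fields` argument.
fields <- c(snp = 1, sample = 2, genotype = 3, confidence = 4)

# Pick out each required field by position and label it.
rec <- setNames(parts[fields], names(fields))
print(rec)
```

Because the default expression treats a TAB and a run of blanks as alternative separators, a file that mixes the two on one line still splits into the same fields.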

If the samples and snps arguments contain vectors of character strings, a SnpMatrix is created with these row and column names and the genotype values are "cherry-picked" from the input file. If either, or both, of these arguments are specified simply as numbers, then these numbers determine the dimensions of the SnpMatrix created. In this case samples and/or SNPs are included in the SnpMatrix on a first-come-first-served basis. If either or both of these arguments are omitted, a preliminary scan of the input file is carried out to find the missing sample and/or SNP identifiers. In this scan, when a sample or SNP identifier differs from that in the previous line, but is identical to one previously found, then all the relevant identifiers are assumed to have been found. This implies that the file must be sorted, in some consistent order, by sample and by SNP (although either one of these may vary fastest).
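The stopping rule of the preliminary scan can be sketched in base R as follows. The helper scan_ids is hypothetical and only mirrors the rule as described above; it is not taken from the package source.

```r
# Collect distinct identifiers in order of appearance. Stop as soon as an
# identifier differs from the previous line but has already been seen.
scan_ids <- function(ids) {
  found <- character(0)
  prev <- NA_character_
  for (id in ids) {
    if (!identical(id, prev)) {
      if (id %in% found) break  # seen before: assume the set is complete
      found <- c(found, id)
    }
    prev <- id
  }
  found
}

# A file sorted by sample, with SNP varying fastest.
snp_col    <- c("rs1", "rs2", "rs3", "rs1", "rs2", "rs3")
sample_col <- c("s1", "s1", "s1", "s2", "s2", "s2")
print(scan_ids(snp_col))
print(scan_ids(sample_col))

# An unsorted SNP column: the scan stops at the second "rs1" and misses "rs3".
print(scan_ids(c("rs1", "rs2", "rs1", "rs3")))
```

The last call shows why a consistent sort order matters: an identifier that reappears early ends the scan before later identifiers are seen.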

If the genotype is to be read as a single field, the genotype element of the fields argument must be set to the appropriate value, and the allele.A and allele.B elements should be set to NA. Its handling is controlled by the gcodes argument. If this is missing or NA, then the genotype is assumed to be represented by a two-character field, the two characters representing the two alleles. If gcodes is a single string, then it is assumed to contain a regular expression which will split the genotype field into two allele fields. Otherwise, gcodes must be a vector of length three, specifying the three genotype codes in the order "AA", "AB", "BB".
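The three ways of interpreting a single genotype field can be illustrated with a small base R helper. The function split_genotype and the example codes are hypothetical, chosen only to mirror the three cases described above; the function's actual implementation may differ.

```r
# Interpret one genotype field according to a gcodes-style argument.
split_genotype <- function(g, gcodes = NA) {
  if (length(gcodes) == 1 && is.na(gcodes)) {
    # Case 1: two-character field, one character per allele.
    c(substr(g, 1, 1), substr(g, 2, 2))
  } else if (length(gcodes) == 1) {
    # Case 2: a regular expression that splits the field into two alleles.
    strsplit(g, gcodes)[[1]]
  } else {
    # Case 3: three codes given in the order "AA", "AB", "BB".
    stopifnot(length(gcodes) == 3)
    c("AA", "AB", "BB")[match(g, gcodes)]
  }
}

print(split_genotype("AG"))                     # two-character field
print(split_genotype("A/G", "/"))               # split on a separator
print(split_genotype("1", c("0", "1", "2")))    # three fixed codes
```

In the third case no allele codes are recovered from the file, only the genotype class, which fits with the function returning a bare SnpMatrix in that situation.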

If the two alleles of the genotype are to be read from two separate fields, the genotype element should be set to NA and the allele.A and allele.B elements set to the appropriate values. The gcodes argument should be missing or set to NA.

If the genotype is read as a single field matching one of three specified codes, the function returns an object of class SnpMatrix. Otherwise it returns a list whose first element is the SnpMatrix object and whose second element is a data frame containing the allele codes, with the SNP identifiers as row names. Note that allele codes only occur in this data frame if they occur in a genotype which was accepted. Thus, monomorphic SNPs have allele.B coded as NA, and SNPs which never pass confidence score filters have both alleles coded as NA.
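A minimal usage sketch follows, assuming a toy file laid out to match the default fields argument (SNP, sample, two-character genotype, confidence). The data are invented, the call is guarded so that the snippet still runs when snpStats is not installed, and the list elements are accessed by position because their names are not stated here.

```r
# Hypothetical toy data: 2 samples x 3 SNPs, sorted by sample.
calls <- data.frame(
  snp        = rep(c("rs1", "rs2", "rs3"), times = 2),
  sample     = rep(c("s1", "s2"), each = 3),
  genotype   = c("AA", "AG", "CC", "AG", "GG", "CT"),
  confidence = c(0.99, 0.98, 0.97, 0.99, 0.95, 0.96)
)

# Write one call per line, TAB-separated, with no header row.
path <- tempfile(fileext = ".txt")
write.table(calls, path, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

if (requireNamespace("snpStats", quietly = TRUE)) {
  # samples and snps are omitted, so the file is scanned for identifiers.
  res <- snpStats::read.long(path)
  if (is.list(res)) {
    genotypes <- res[[1]]  # the SnpMatrix
    alleles   <- res[[2]]  # data frame of allele codes, one row per SNP
    print(dim(genotypes))
    print(alleles)
  }
}
```

Since gcodes is not supplied, the genotype is read as a two-character field, so the list form of the return value is expected rather than a bare SnpMatrix.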

Unlike read.snps.long, this function is written entirely in R and may not be particularly fast. However, it imposes no restrictions on the allele codes recognized.

Homozygous genotypes are assumed to be represented in the input file by coding both alleles to the same value. No special provision is made to read XSnpMatrix objects; such data should first be read as a SnpMatrix and then coerced to an XSnpMatrix using new or as.

David Clayton dc208@cam.ac.uk

SnpMatrix-class, XSnpMatrix-class