setGenData: Creates a 'genData' object from a plaintext file.


Description

setGenData assumes that the plaintext file (fileIn) contains records of individuals in rows, and phenotypes, covariates and markers in columns. Columns 1:nColSkip are used to populate the @pheno slot of a genData object, and the remaining columns are used to fill the @geno slot. If the first row contains a header (header=TRUE), it is used to determine the variable names for @pheno and the marker names for @map and @geno.

Genotypes are stored in a distributed matrix (dMatrix). By default a column-distributed matrix (cDMatrix) is used for @geno, but the user can change this with the distributed.by argument. The number of chunks is either specified by the user (nChunks) or determined internally so that each ff_matrix object has fewer than .Machine$integer.max/1.2 cells.

setGenData creates a folder (folderOut) that contains the binary flat files (geno_*.bin) and the genData object (typically named genData.RData). Optionally (if returnData is TRUE) it also returns the genData object to the environment. The filenames of the ff_matrix objects are saved as relative names. Therefore, to access the data in @geno, the working directory must be the folder where these files are saved (folderOut), or the object must be loaded using the loadGenData function included in the package.
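The chunking rule can be illustrated with a little arithmetic. The sketch below is not the package's internal code: estimateChunks is a hypothetical helper, and it assumes that a column-distributed chunk holds all n rows and a subset of the columns (and the reverse for a row-distributed matrix). It only shows how a chunk count could be chosen so that every chunk stays below .Machine$integer.max/1.2 cells.

```r
# Hypothetical helper, not part of the package: estimates a chunk count that
# keeps every chunk below the cell limit quoted in the description.
estimateChunks <- function(n, p, distributed.by = "columns") {
  maxCells <- .Machine$integer.max / 1.2  # cell limit per ff_matrix object
  if (distributed.by == "columns") {
    # Assumption: each chunk holds all n rows and a subset of the p columns.
    perChunk <- floor(maxCells / n)
    nChunks <- ceiling(p / perChunk)
  } else {
    # Assumption: each chunk holds all p columns and a subset of the n rows.
    perChunk <- floor(maxCells / p)
    nChunks <- ceiling(n / perChunk)
  }
  nChunks
}

estimateChunks(n = 1000, p = 50000)     # 5e7 cells: a single chunk suffices
estimateChunks(n = 10000, p = 1000000)  # 1e10 cells: several chunks needed
```

Passing nChunks explicitly to setGenData bypasses whatever the function computes internally, which may differ from this estimate.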

Usage

setGenData(fileIn, header, dataType, distributed.by = "columns", n = NULL,
  p = NULL, folderOut = paste("genData_", sub("\\.[[:alnum:]]+$", "",
  basename(fileIn)), sep = ""), returnData = TRUE, na.strings = "NA",
  nColSkip = 6, idCol = 2, verbose = FALSE, nChunks = NULL,
  dimorder = if (distributed.by == "rows") 2:1 else 1:2)
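Two of the defaults in the signature above are expressions rather than constants. The following base-R snippet evaluates them outside the function to show what they produce; the input path and the defaultDimorder wrapper are hypothetical and exist only for illustration.

```r
# Evaluate the default folderOut expression for a sample input path.
fileIn <- "data/mice.ped"  # hypothetical path, used only for illustration
folderOut <- paste("genData_", sub("\\.[[:alnum:]]+$", "",
  basename(fileIn)), sep = "")
folderOut  # "genData_mice": directory and final extension are dropped

# The default dimorder depends on distributed.by.
# defaultDimorder is a hypothetical wrapper around the default expression.
defaultDimorder <- function(distributed.by) {
  if (distributed.by == "rows") 2:1 else 1:2
}
defaultDimorder("columns")  # 1 2
defaultDimorder("rows")     # 2 1
```

Only the last extension is stripped, so an input named mice.txt.gz would give genData_mice.txt.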

Arguments

fileIn

The path to the plaintext file.

header

If TRUE, the file contains a header.

dataType

The coding of genotypes. Use 'character' for A/C/G/T or 'integer' for numeric coding.

distributed.by

If "columns", a column-distributed matrix (cDMatrix) is created; if "rows", a row-distributed matrix (rDMatrix) is created.

n

The number of individuals.

p

The number of markers.

folderOut

The path to the folder in which the binary files are saved.

returnData

If TRUE, the function returns a genData object.

na.strings

The character string used to denote missing values.

nColSkip

The number of columns to be skipped to reach the genotype information in the file.

idCol

The index of the ID column.

verbose

If TRUE, progress updates will be posted.

nChunks

The number of chunks to create.

dimorder

The physical layout of the chunks.
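To make the arguments concrete, here is a minimal sketch that writes a toy input file in the layout described above and previews, in base R, which columns would go to @pheno and which to @geno. The file name, column names and values are invented for illustration, and using the ID column as row names is an assumption about what idCol is for. The final call runs only if a package named dMatrix (the name taken from this page's footer) is installed.

```r
# Write a toy file: 6 leading columns, then 4 marker columns (all invented).
fileIn <- file.path(tempdir(), "toy.txt")
toy <- data.frame(
  FID = c("f1", "f2", "f3"),
  IID = c("id1", "id2", "id3"),
  PAT = 0, MAT = 0, SEX = c(1, 2, 1),
  PHENO = c(1.2, NA, 0.7),
  mrk1 = c(0L, 1L, 2L), mrk2 = c(1L, NA, 0L),
  mrk3 = c(2L, 2L, 1L), mrk4 = c(0L, 0L, 1L)
)
write.table(toy, fileIn, quote = FALSE, row.names = FALSE, na = "NA")

# Base-R preview of the column split; this is not how the package reads files.
nColSkip <- 6
idCol <- 2
raw <- read.table(fileIn, header = TRUE, na.strings = "NA")
pheno <- raw[, seq_len(nColSkip)]              # columns 1:nColSkip
geno <- as.matrix(raw[, -seq_len(nColSkip)])   # remaining marker columns
rownames(geno) <- as.character(raw[[idCol]])   # assumption: IDs label the rows
dim(pheno)
dim(geno)

# The corresponding call, attempted only if the package is available.
if (requireNamespace("dMatrix", quietly = TRUE)) {
  gd <- dMatrix::setGenData(fileIn, header = TRUE, dataType = "integer",
    n = 3, p = 4, folderOut = file.path(tempdir(), "genData_toy"),
    nColSkip = nColSkip, idCol = idCol)
}
```

Writing to tempdir() keeps the example from leaving files in the working directory; for real data, folderOut would normally point to a permanent location.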

Value

If returnData is TRUE, a genData object is returned.


gdlc/dMatrix documentation built on May 17, 2019, 12:12 a.m.