create.gpData: Create genomic prediction data object
In synbreed: Framework for the Analysis of Genomic Prediction Data using R

This function combines all raw data sources in a single, unified data object of class gpData. This is a list with elements for phenotypic, genotypic, marker map, pedigree and further covariate data. All elements are optional.

create.gpData(
  pheno = NULL,
  geno = NULL,
  map = NULL,
  pedigree = NULL,
  family = NULL,
  covar = NULL,
  reorderMap = TRUE,
  map.unit = "cM",
  repeated = NULL,
  modCovar = NULL,
  na.string = "NA",
  cores = 1
)

`pheno`	`data.frame` with individuals organized in rows and traits organized in columns. For unrepeated measures unique `rownames` should identify individuals. For repeated measures, the first column identifies individuals and a second column indicates repetitions (see also argument `repeated`).
`geno`	`matrix` with individuals organized in rows and markers organized in columns. Genotypes may be coded arbitrarily. Missing values should be coded as `NA`. Columns or rows containing only missing values are not allowed. Unique `rownames` identify individuals and unique `colnames` identify markers. If no `rownames` are available, they are taken from element `pheno` (if available and if the dimension matches). If no `colnames` are given, the `rownames` of `map` are used if the dimension matches.
`map`	`data.frame` with one row for each marker and two columns (named `chr` and `pos`). The first column gives the chromosome (`numeric` or `character` but not `factor`) and the second column the position on the chromosome in centimorgan or the physical distance relative to the reference sequence in basepairs. Unique `rownames` indicate the marker names, which should match the marker names in `geno`. Note that the order and number of markers need not be identical to those in `geno`. If they differ, gaps in the map are filled with `NA` to ensure the same number and order as in element `geno` of the resulting `gpData` object.
`pedigree`	Object of class `pedigree`.
`family`	`data.frame` assigning individuals to families, with the names of individuals in `rownames`. This information can be used to replace missing values with function `codeGeno`.
`covar`	`data.frame` with further covariates, e.g. sex or age, for all individuals that appear in any of `pheno`, `geno` or `pedigree$ID`. `rownames` must be specified to identify individuals. Typically this element is not specified by the user.
`reorderMap`	`logical`. Should markers in `geno` and `map` be reordered by chromosome number and position within chromosome according to `map` (default = `TRUE`)?
`map.unit`	`character`. Unit of position in `map`, i.e. 'cM' for genetic distance or 'bp' for physical distance (default = 'cM').
`repeated`	Name of the column in `pheno` that identifies the replications of the phenotypic values (e.g. `repeated = "rep"`). The unique values of this column become the names of the third dimension of the `pheno` array in the `gpData` object. This argument is only required for repeated measurements.
`modCovar`	`vector` with `colnames` which identify columns with covariables in `pheno`. This argument is only required for repeated measurements.
`na.string`	`character` or vector of `characters`. Values that code for `NA` in your `geno` object. Use this when missing values were read from a file as character strings rather than as `NA`. More than one missing-value code can be given in a vector. Default is `"NA"`.
`cores`	`numeric`. Number of cores to use.
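
The effect described for `na.string` can be pictured with a small base R sketch. This is only an illustration of the idea, not the package's implementation; the matrix `geno_raw` and the codes `"--"` and `"N"` are made up for the example.

```r
# Hypothetical genotype matrix in which missing calls were read as strings
geno_raw <- matrix(
  c("AA", "--", "BB",
    "AB", "AA", "N",
    "BB", "AB", "AA"),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("a", "b", "c"), c("M1", "M2", "M3"))
)

# Codes that stand for a missing value (what one would pass as na.string)
na_codes <- c("--", "N")

# Replace every listed code with a real NA
geno_clean <- geno_raw
geno_clean[geno_clean %in% na_codes] <- NA

# Count the missing genotypes after recoding
n_missing <- sum(is.na(geno_clean))
```

With `create.gpData`, the same codes would be supplied as `na.string = c("--", "N")` instead of recoding by hand.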

The class gpData is designed to provide a unified framework for data related to genomic prediction analysis. Every data source can be omitted. In this case, the corresponding argument must be NULL. By default (argument reorderMap), markers in geno are ordered by their position in map. Individuals are ordered in alphabetical order.
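
The ordering rules above can be sketched in base R as follows. This is an assumption-laden illustration of the described behaviour (markers sorted by chromosome and position, individuals sorted alphabetically), not the code used inside `create.gpData`; all object names are invented for the example.

```r
# Hypothetical unsorted map: marker names are the row names
map <- data.frame(
  chr = c(2, 1, 1, 2),
  pos = c(5, 30, 10, 1),
  row.names = c("M1", "M2", "M3", "M4")
)

# Hypothetical marker matrix with individuals in arbitrary order
geno <- matrix(
  0:11, nrow = 3,
  dimnames = list(c("c", "a", "b"), rownames(map))
)

# Sort markers by chromosome, then by position within chromosome
ord <- order(map$chr, map$pos)
map_sorted <- map[ord, ]

# Put individuals in alphabetical order and markers in map order
geno_sorted <- geno[order(rownames(geno)), rownames(map_sorted)]
```

After this step the columns of `geno_sorted` and the rows of `map_sorted` refer to the same markers in the same order.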

An object of class gpData can contain different subsets of individuals or markers in the elements pheno, geno and pedigree. In this case the id in covar comprises all individuals that appear in at least one of pheno, geno or pedigree. Two additional columns in covar named phenotyped and genotyped are automatically generated to identify individuals that appear in the corresponding element of the gpData object.
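
A minimal base R sketch of how such a `covar` table relates to the three identifier sets is given below. The column names `id`, `phenotyped` and `genotyped` follow the description above; the identifier vectors are hypothetical, and the construction is an illustration rather than the package's own code.

```r
# Hypothetical identifier sets from the three data sources
pheno_ids <- c("a", "b", "c")
geno_ids  <- c("b", "c", "d")
ped_ids   <- c("a", "b", "c", "d", "e")

# All individuals that appear in at least one source
all_ids <- sort(unique(c(pheno_ids, geno_ids, ped_ids)))

# Flag which individuals have phenotypes and which have genotypes
covar <- data.frame(
  id = all_ids,
  phenotyped = all_ids %in% pheno_ids,
  genotyped  = all_ids %in% geno_ids,
  stringsAsFactors = FALSE
)
```

Here individual `e` appears only in the pedigree, so both flags are `FALSE` for it.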

Object of class gpData, which is a list with the following elements:

`covar`	`data.frame` with information on individuals
`pheno`	`array` (individuals x traits x replications) with phenotypic data
`geno`	`matrix` marker matrix containing genotypic data. Columns (markers) are in the same order as in `map` (if `reorderMap = TRUE`).
`pedigree`	object of class `pedigree`
`map`	`data.frame` with columns 'chr' and 'pos' and markers sorted by 'pos' within 'chr'
`phenoCovars`	`array` with phenotypic covariates
`info`	`list` with additional information on data (coding of data, unit in `map`). From synbreed version 0.11-11 on, the function `codeGeno` adds here the package version that was used to do the coding. There are differences in codings between version 0.10-11 and 0.11-0!
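
The individuals x traits x replications layout of `pheno` can be pictured with the following base R sketch, which reshapes a long table with an identifier column and a replication column into a three-dimensional array. The data and all object names are hypothetical, and the loop is only one possible way to build such an array, not necessarily how the package does it.

```r
# Hypothetical repeated measures: 3 individuals, 2 traits, 2 replications
n <- 3
pheno_long <- data.frame(
  ID = rep(letters[1:n], 2),
  rep = rep(1:2, each = n),
  Yield = c(10, 11, 12, 20, 21, 22),
  Height = c(1, 2, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)

traits <- c("Yield", "Height")
ids <- sort(unique(pheno_long$ID))
reps <- sort(unique(pheno_long$rep))

# Empty array: individuals x traits x replications
pheno_arr <- array(
  NA_real_,
  dim = c(length(ids), length(traits), length(reps)),
  dimnames = list(ids, traits, as.character(reps))
)

# Fill one slice per replication, matching rows by individual name
for (r in reps) {
  block <- pheno_long[pheno_long$rep == r, ]
  pheno_arr[block$ID, , as.character(r)] <- as.matrix(block[, traits])
}
```

The unique values of the replication column become the names of the third dimension, which mirrors the description of the `repeated` argument.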

If row names or column names are missing in one element, they are substituted from other elements (assuming the same order of individuals/markers) and a warning specifying the assumptions is issued. Please check these assumptions carefully.

Valentin Wimmer and Hans-Juergen Auinger with contributions by Peter VandeHaar

codeGeno, summary.gpData, gpData2data.frame

library(synbreed)
set.seed(123)
# 9 plants with 2 traits
n <- 9 # only for n > 6
pheno <- data.frame(Yield = rnorm(n, 200, 5), Height = rnorm(n, 100, 1))
rownames(pheno) <- letters[1:n]

# marker matrix
geno <- matrix(sample(c("AA", "AB", "BB", NA),
  size = n * 12, replace = TRUE,
  prob = c(0.6, 0.2, 0.1, 0.1)
), nrow = n)
rownames(geno) <- letters[n:1]
colnames(geno) <- paste("M", 1:12, sep = "")

# genetic map
# one SNP is not mapped (M5) and will therefore be removed
map <- data.frame(chr = rep(1:3, each = 4), pos = rep(1:12))
map <- map[-5, ]
rownames(map) <- paste("M", c(1:4, 6:12), sep = "")

# simulate pedigree
ped <- simul.pedigree(3, c(3, 3, n - 6))

# combine in one object
gp <- create.gpData(pheno, geno, map, ped)
summary(gp)


# 9 plants with 2 traits, 3 replications
n <- 9
pheno <- data.frame(
  ID = rep(letters[1:n], 3), rep = rep(1:3, each = n),
  Yield = rnorm(3 * n, 200, 5), Height = rnorm(3 * n, 100, 1)
)

# combine in one object
gp2 <- create.gpData(pheno, geno, map, repeated = "rep")
summary(gp2)