prepare.data: Prepare Data
In introgress: methods for analyzing introgression between divergent lineages

Description Usage Arguments Details Value Author(s) References See Also Examples

This function is meant to be used with individuals from an admixed population. The function determines the number of alleles inherited from each of two parental populations at each locus. The counts are based on genotype data from specified parental populations, which must be supplied. This function works with both co-dominant and dominant (or haploid) data.

prepare.data(admix.gen=NULL, loci.data=NULL,
             parental1=NULL, parental2=NULL,
             pop.id=TRUE, ind.id=TRUE, fixed=FALSE,
             sep.rows=FALSE, sep.columns=FALSE)

`admix.gen`	a matrix, array or data frame with genotype data
`loci.data`	a matrix or array providing marker information.
`parental1`	a matrix or two-dimensional array if `fixed=FALSE`, a single character if `fixed = TRUE`.
`parental2`	a matrix or two-dimensional array if `fixed=FALSE`, a single character if `fixed=TRUE`.
`pop.id`	a logical specifying whether `admix.gen` includes a row specifying sampling localities.
`ind.id`	a logical specifying whether `admix.gen` includes a row specifying individual identifications.
`fixed`	a logical specifying whether all loci scored exhibit fixed differences between the parental populations.
`sep.rows`	a logical specifying whether genotypes at a locus are recorded using two rows.
`sep.columns`	a logical specifying whether genotypes at a locus are recorded using two columns.

Genotypic data for individuals are provided in admix.gen, a data object with genotypes for each individual at each locus in the format ‘A/D’ or ‘110/114’ for co-dominant data, ‘A’ or ‘hap1b’ for haploid data, and ‘0’ or ‘1’ for dominant data. In other words, for co-dominant and haploid data alleles can be encoded by any simple character string. Each row should contain data for a locus and columns should correspond to individuals. Missing data should be entered as ‘NA/NA’ or ‘NA’ for co-dominant and haploid / dominant data, respectively.

Alternatively, in admix.gen genotypic data for an individual can be split between two rows (sep.rows = TRUE) or two columns (sep.columns = TRUE). These options are similar to those of the data format for the program structure (Pritchard et al. 2000, Falush et al. 2003), with the difference that admix.gen is transposed relative to the input for structure. Thus, after reading in a structure file, the data matrix can be transposed with rawdata <- t(rawdata) before passing the matrix to prepare.data. If genotype data are split across columns or rows, and they include haploid or dominant markers, the second allele for these markers should be recorded as NA.

If pop.id = TRUE and ind.id = TRUE the first row of admix.gen should give the population identification (i.e. sampling locality) of each individual and the second row should provide a unique individual identification; genotype information would then begin on row three.

loci.data is a matrix or array data object where each row provides information on one locus. The first column gives a unique locus name (e.g. "locus3"), and the second column specifies whether the locus is co-dominant ("C" or "c"), haploid ("H" or "h"), or dominant ("D" or "d"). These first two columns in loci.data are required. The third column, which is optional, is a numeric value specifying the linkage groups for the marker. If present, this column is used in the mk.image function for plotting. The fourth column, which is also optional, is a numeric value specifying both the linkage group and location on the linkage group (e.g. 3.70, for a marker at 70 cM on linkage group 3). This last column could be used to generate a different order in which to utilize marker data from admix.gen in other functions in the package (specified in the marker.order argument to mk.image and clines.plot). Each column in loci.data should have a heading (the second column should be named "type").

If the parental populations exhibit fixed differences for all markers scored (i.e. fixed = TRUE) then parental1 and parental2 should give the character used to specify alleles derived from parental populations one and two, respectively (e.g. parental1 = "p1" and parental2 = "p2"). If parental populations exhibit fixed differences at all loci, the count matrix produced by prepare.data is simply a count of the number of alleles inherited from parental population 1 for each individual at each locus (0, 1, or 2 for co-dominant marker data; 0 or 1 for dominant or haploid marker data).

If the parental populations do not exhibit fixed differences at all loci scored (i.e. fixed = FALSE) then parental1 and parental2 should be matrix data objects providing genotype data for individuals sampled from each of the parental populations. These data objects should be in the same format as the genotype.data data object, with the difference that they should not contain rows for individual and population identifications at the top. prepare.data uses the parental data objects to calculate allele frequencies at each locus for both of the parental populations. Alleles are then binned into allelic classes with maximum (equal to the observed) frequency differentials between parental populations (δ, Gregorius and Roberds 1986). These allelic classes serve as the basis for estimating the count matrix, which is in the same format as described above. In the absence of fixed differences the counts are of alleles from the allelic class associated with population 1 and the frequency of allelic classes in the parental species can be used to account for uncertainty in the ancestry of particular alleles.

See Gompert and Buerkle (2009a, 2009b) for additional details.

A list with the following components:

`Individual.data`	a matrix with `pop.id` and `ind.id` data if they were supplied.
`Count.matrix`	the count matrix; each row corresponds to a locus and each column represents an individual.
`Combos.to.use`	`NULL` if `fixed = TRUE`, otherwise this provides the allelic class data needed for `genomic.clines`.
`Parental1.allele.freq`	matrix of allele frequencies calculated for parental population 1 where each row is a locus and each column is an allele.
`Parental2.allele.freq`	matrix of allele frequencies calculated for parental population 2 where each row is a locus and each column is an allele.
`Alleles`	a matrix specifying the names of the alleles in the same order as they are given in `Parental1.allele.freq` and `Parental2.allele.freq` for each locus.
`Admix.gen`	the matrix of genotype data for the admixed population; each row corresponds to a locus and each column represents an individual.

Zachariah Gompert zgompert@uwyo.edu, C. Alex Buerkle buerkle@uwyo.edu

Falush D., Stephens M., and Pritchard J. K. (2003) Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics, 164, 1567-1587.

Gompert Z. and Buerkle C. A. (2009) A powerful regression-based method for admixture mapping of isolation across the genome of hybrids. Molecular Ecology, 18, 1207-1224.

Gompert Z. and Buerkle C. A. (2009) introgress: a software package for mapping components of isolation in hybrids. Molecular Ecology Resources, in preparation.

Gregorius H. R. and Roberds J. H. (1986) Measurement of genetical differentiation among subpopulations. Theoretical and Applied Genetics, 71, 826-834.

Pritchard J. K., Stephens M., and Donnelly P. (2000) Inference of population structure using multilocus genotype data. Genetics, 155, 945-959.

delta, mk.image, genomic.clines, clines.plot

## Not run: 
## load simulated data
## markers have fixed differences, with
## alleles coded as 'P1' and 'P2'
data(AdmixDataSim1)
data(LociDataSim1)

## use prepare.data to produce introgress.data
introgress.data<-prepare.data(admix.gen=AdmixDataSim1,
                              loci.data=LociDataSim1,
                              parental1="P1", parental2="P2",
                              pop.id=FALSE, ind.id=FALSE, fixed=TRUE)


## End(Not run)