data.prep: Prepare data for hybrid index and genomic cline estimation.
In ribailey/gghybrid: Evolutionary Analysis of Hybrids and Hybrid Zones

data.prep

R Documentation

Prepare data for hybrid index and genomic cline estimation.

Description

Prepare data for hybrid index and genomic cline estimation.

Usage

data.prep(
  data,
  loci,
  alleles,
  marker.info = NULL,
  S0 = NULL,
  S1 = NULL,
  sourceAbsent = FALSE,
  precols,
  INDLABEL.name = "INDLABEL",
  POPID.name = "POPID",
  max.S.MAF = 0.5,
  min.diff = 0,
  min.allele.copies.S0 = 0,
  min.allele.copies.S1 = 0,
  AF.CIoverlap = TRUE,
  return.genotype.table = FALSE,
  return.locus.table = FALSE,
  prior.prop_1 = c(1, 1)
)

Arguments

`data`	A `data` object produced by `read.data`, or a custom `data.table` in the same format.
`loci`	A `loci` object produced by `read.data`, or a custom character vector with locus names.
`alleles`	An `alleles` object produced by `read.data`, or a custom `data.table` in the same format.
`marker.info`	A `marker.info` object uploaded using `read.data`, or a custom `data.table` in the same format, with one row per locus and any number of columns. Default is `NULL`.
`S0`	Character vector. Identifiers of the test subjects representing the first parental reference set (“Source 0”). Allele frequencies in this set represent a hybrid index of 0. By default `data.prep` searches for these identifiers in the `POPID` column of the `data` object, but a different column can be designated using POPID.name. Default is `NULL`, and parental reference samples are not needed if parental allele frequencies are included in a `marker.info` file in the correct format.
`S1`	Character vector. Identifiers of the test subjects representing the second parental reference set (“Source 1”). Allele frequencies in this set represent a hybrid index of 1. By default `data.prep` searches for these identifiers in the `POPID` column of the `data` object, but a different column can be designated using POPID.name. Default is `NULL`, and parental reference samples are not needed if parental allele frequencies are included in a `marker.info` file in the correct format.
`sourceAbsent`	Logical. Whether source (parental reference) samples are absent from the dataset. If `TRUE`, the parental allele frequencies will be taken from the `marker.info` object uploaded using `read.data`. Default is `FALSE`.
`precols`	Character vector. A `precols` object produced by `read.data`, or a custom character vector with names of all pre-marker columns in the `data` object. No default.
`INDLABEL.name`	Character string. The name of the column in `data` designated as `INDLABEL`. Default is “INDLABEL”.
`POPID.name`	Character string. The name of the column in `data` designated as `POPID`. If the `data` object contains only one pre-marker column, such as `INDLABEL`, this must be specified as the POPID.name to avoid an error. The main purpose of this option is to declare which column contains the identifiers for parental reference samples, and any non-marker column can be used. Although it must be specified at least by the default, it is ignored if there are no parental reference samples in the dataset. Default is “POPID”.
`max.S.MAF`	Numeric. Locus filtering option. Removes loci for which the smaller of the two parental minor allele frequencies is greater than this value. Included because only one parental reference minor allele frequency needs to be close to zero for the locus to be informative. Default is `0.5` (no filtering).
`min.diff`	Numeric. Locus filtering option. Removes loci for which the difference in parental reference allele frequency is less than this value. Default is `0` (no filtering).
`min.allele.copies.S0`	Numeric. Locus filtering option based on sample size. Removes loci for which the number of allele copies in the `S0` parental reference set, excluding missing data, is less than this value. This option is only available when parental reference samples `S0` and `S1` are included in the dataset. Default is `0` (no filtering).
`min.allele.copies.S1`	Numeric. Locus filtering option based on sample size. Removes loci for which the number of allele copies in the `S1` parental reference set, excluding missing data, is less than this value. This option is only available when parental reference samples `S0` and `S1` are included in the dataset. Default is `0` (no filtering).
`AF.CIoverlap`	Logical. Recommended locus filtering option, based on posterior credible intervals of parental allele frequencies. If `FALSE`, removes loci for which there is overlap between the allele frequency 95 percent credible intervals of the two parental reference populations. If parental reference sets are identified, the posterior credible intervals are calculated internally. If this option is to be used on parental allele frequencies uploaded in a `marker.info` file, that file must contain columns entitled ‘S0.prop_1_cred.int.upper’ and ‘S1.prop_1_cred.int.lower’, to indicate some kind of confidence limits for the parental allele frequencies. Only the upper limit is needed for `S0`, and only the lower limit is needed for `S1`. Default is `TRUE` (no filtering).
`return.genotype.table`	Logical. Whether to create a genotype table with individual genotypes for each locus scored from `0` to `PLOIDY`, based on their number of copies of the designated `S1` allele (the allele with relatively higher frequency in the `S1` parental reference set). Default is `FALSE`.
`return.locus.table`	Logical. Whether to return a table with one row per locus containing a wider array of locus-level data, such as parental allele frequency credible intervals and number of non-missing allele copies in each parental reference set. Default is `FALSE` to save memory, but this could be useful in many circumstances.
`prior.prop_1`	Numeric vector. Values of `shape1` and `shape2` parameters for a beta-distributed prior for parental allele frequencies (proportions) of the `S1_allele`. Default is `shape1=shape2=1` (uniform distribution).

Details

The primary output of data.prep is a long-form data.table with 1 row per non-missing allele copy in the data set, to be used in downstream hybrid index and genomic cline estimation. It also optionally produces (1) a locus table, with one row per locus and containing useful locus-level data (see ‘Value’), and (2) a genotype table, with genotypes scored according to the number of alleles from parental reference set S1. This may be useful for example in estimating admixture-induced (‘ancestry’) linkage disequilibria. data.prep uses objects produced by read.data as input, or custom objects in the same format.

Even when all filtering options are left at default values, loci with exactly zero allele frequency difference between S0 and S1 will be removed. Therefore, it is often likely that the resulting analysis table will contain fewer loci than were loaded by read.data.

The optional marker.info file should be a file with one row per marker and any number of columns (including the marker name in a column named ‘locus’), each for a variable of interest for grouping markers, such as chromosome, map location, gene annotations, sliding window identifiers etc. These variables can then be used downstream for statistical comparison of hybrid index or clines among groups of loci. In the absence of parental reference samples, certain column names and information must be included in the marker.info file. See examples here and in read.data.

The locus data object includes posterior mean, mode and upper and lower 95 percent credible intervals for frequency of the Source 1 allele for each locus, assuming beta-distributed proportions. Note that if the locus is fixed in a parental reference set, its posterior mode will be 0 or 1 and will fall outside the credible intervals due to the influence of the non-zero prior shape parameters. If prior shape parameters are set to zero (equivalent to no prior), the best posterior estimate is the mean rather than the mode, but in this case credible intervals cannot be estimated when the locus is fixed for one allele.

Value

list containing at least two components: (1) data.prep, a data.table and data.frame long-form version of the imported data table, with one row per non-missing allele copy and the data columns required for analysis, listed below; (2) Nloci.postfilter, a numeric scalar indicating the number of loci remaining after filtering. If parental reference samples are identified, a third output is returned, sourceInfo, a data.table and data.frame listing all the unique entries in the declared ‘POPID.name’ column (default ‘POPID’) and indicating whether they represent ‘S0’, ‘S1’ or ‘TEST’. The two optional outputs are: (1) geno.data, a data.table and data.frame containing genotypes for each locus scored from 0 to PLOIDY, based on their number of copies of the designated S1 allele (the allele with relatively higher frequency in the S1 parental reference set); (2) locus.data, a data.table and data.frame with one row per locus and containing useful locus-specific information, including the contents of any uploaded marker.info object.

data.prep contains the pre-marker columns and the following fields, plus any columns from a declared marker.info object:

`Source`	whether the sample is from parental reference 0 (‘S0’), parental reference 1 (‘S1’) or is a test individual (‘TEST’).
`locus`	the locus names.
`S0.prop_1`	frequency (proportion) of the `S1_allele` in `S0`.
`S1.prop_1`	frequency (proportion) of the `S1_allele` in `S1`.
`Source_allele`	numeric. Whether the allele is the `S0_allele` (‘0’) or `S1_allele` (‘1’).

locus.data additionally contains, at a minimum:

`S0_allele`	name of the allele with relatively higher frequency in `S0`.
`S1_allele`	name of the allele with relatively higher frequency in `S1`.
`Source.afdiff`	difference in allele frequency between `S0` and `S1`.
`S0.MAF`	`S0` minor allele frequency.
`S1.MAF`	`S1` minor allele frequency.
`min.MAF`	smaller of `S0.MAF` and `S1.MAF`.

If parental reference samples are present, locus.data further contains:

`S0.N_1`	number of non-missing copies of the `S1_allele` in `S0`.
`S1.N_1`	number of non-missing copies of the `S1_allele` in `S1`.
`S0.prop_0`	frequency (proportion) of the `S0_allele` in `S0`.
`S1.prop_0`	frequency (proportion) of the `S0_allele` in `S1`.
`N_allele_copies.S0`	total number of non-missing allele copies in `S0`.
`N_allele_copies.S1`	total number of non-missing allele copies in `S1`.
`S0.prop_1_shape1`	shape1 parameter estimate for the posterior beta distribution of the `S1_allele` frequency in `S0`.
`S0.prop_1_shape2`	shape2 parameter estimate for the posterior beta distribution of the `S1_allele` frequency in `S0`.
`S0.prop_1_postmode`	mode of the beta-distributed posterior for the frequency of the `S1_allele` in `S0`.
`S0.prop_1_postmean`	mean of the beta-distributed posterior for the frequency of the `S1_allele` in `S0`.
`S0.prop_1_cred.int.lower`	lower 95 percent credible interval (2.5 percent quantile of the posterior beta distribution) of the frequency of the `S1_allele` in `S0`.
`S0.prop_1_cred.int.upper`	upper 95 percent credible interval (97.5 percent quantile of the posterior beta distribution) of the frequency of the `S1_allele` in `S0`.
`S1.prop_1_shape1`	shape1 parameter estimate for the posterior beta distribution of the `S1_allele` frequency in `S1`.
`S1.prop_1_shape2`	shape2 parameter estimate for the posterior beta distribution of the `S1_allele` frequency in `S1`.
`S1.prop_1_postmode`	mode of the beta-distributed posterior for the frequency of the `S1_allele` in `S1`.
`S1.prop_1_postmean`	mean of the beta-distributed posterior for the frequency of the `S1_allele` in `S1`.
`S1.prop_1_cred.int.lower`	lower 95 percent credible interval (2.5 percent quantile of the posterior beta distribution) of the frequency of the `S1_allele` in `S1`.
`S1.prop_1_cred.int.upper`	upper 95 percent credible interval (97.5 percent quantile of the posterior beta distribution) of the frequency of the `S1_allele` in `S1`.
`S1.prop_0_shape1`	shape1 parameter estimate for the posterior beta distribution of the `S0_allele` frequency in `S1`.
`S1.prop_0_shape2`	shape2 parameter estimate for the posterior beta distribution of the `S0_allele` frequency in `S1`.
`S0.prop_0_shape1`	shape1 parameter estimate for the posterior beta distribution of the `S0_allele` frequency in `S0`.
`S0.prop_0_shape2`	shape2 parameter estimate for the posterior beta distribution of the `S0_allele` frequency in `S0`.

Author(s)

Richard Ian Bailey richardianbailey@gmail.com

Examples


## Not run: 
###############################################################
#Read in and prepare a file with no parental reference samples#
###############################################################

#Create an input data file and save to the working directory.
cat("INDLABEL POPID chr1:001 chr1:002","ind1 pop1 A A","ind1 pop1 A B","ind2 pop1 B B","ind2 pop1 B B","ind3 pop2 NA A","ind3 pop2 B A", file = "ex.data", sep = "\n")

#Create a marker.info file with the necessary parental reference information and save to the working directory.
cat("locus refAllele alternateAllele S0.prop_r S1.prop_r type","chr1:001 A B 0.1 0.9 intronic","chr1:002 B A 0.8 0.2 exonic", file = "ex_marker_info2.data", sep = "\n")

#Read in the data and marker.info file.
dat <- read.data(file="ex.data",nprecol=2,NUMINDS=3,NUMLOCI=2,MISSINGVAL=NA,marker.info.file = "ex_marker_info2.data", sourceAbsent = TRUE)

#Run data.prep.
prepdata=data.prep(
 data=dat$data,               #part of the read.data output object#
 loci=dat$loci,               #part of the read.data output object#
 sourceAbsent = TRUE,         #Must be set to TRUE in the absence of parental reference samples. Default is FALSE#
 marker.info=dat$marker.info, #must be specified if sourceAbsent = TRUE; make sure it contains the headers and their 
                              #contents as specified above when 'sourceAbsent=TRUE'#
 alleles=dat$alleles,         #part of the read.data output object#
 #S0,                         #character vector identifying parental reference S0 samples. Leave blank in the absence of parental reference samples#
 #S1,                         #character vector identifying parental reference S1 samples. Leave blank in the absence of parental reference samples#
 precols=dat$precols,         #part of the read.data output object#
 return.genotype.table=TRUE,  #This returns an table, with loci in columns, of diploid (or other ploidy) genotypes, coded as number of copies of the 
                              #allele with higher frequency in S1 than S0#
 return.locus.table=TRUE      #Returns a table with one row per locus and all the individual marker data, including data already uploaded plus values 
                              #calculated within this function#
)

#####################################################################
#Read in and prepare a file that includes parental reference samples#
#####################################################################

#Make a new data input file with more entries, including S0 and S1 samples, in plink structure format but with missing data recoded to NA. This includes 
#a MAPDISTANCE row below the marker names.
cat("chr1:001 chr1:002","-3 3 2 4",
 "ind1 pop1 A A A B","ind2 pop1 B B B B","ind3 pop2 NA B A A", 
 "ind4 pop3 A A A A","ind5 pop3 A B A A","ind6 pop3 A A A A",
 "ind7 pop3 NA A A A","ind8 pop4 A NA B B","ind9 pop4 B B B B",
 "ind10 pop4 B B B B","ind11 pop5 A A B B",
 file = "ex3.data", sep = "\n")

#pop3 represents S0; pop4 and pop5 represent S1.

#Run read.data and data.prep.
dat3 <- read.data(file="ex3.data",MISSINGVAL=NA,NUMINDS=11,nprecol=2,ONEROW=1,MAPDISTANCE=1,precol.headers=0)

prepdata= data.prep(
 data=dat3$data,
 loci=dat3$loci,
 sourceAbsent = FALSE,                         #default#
 #marker.info=dat$marker.info,                 #a marker.info file is not obligatory when parental reference samples are declared#
 alleles=dat3$alleles,
 S0="pop3",                                    #first parental reference set; must be specified when sourceAbsent = FALSE#
 S1=c("pop4","pop5"),                          #second parental reference set; must be specified when sourceAbsent = FALSE#
 precols=dat3$precols,
 return.genotype.table=TRUE,                   #default is FALSE to save memory#
 return.locus.table=TRUE                       #default is FALSE to save memory#
)

#########################################
#Now completely exclude the POPID column#
#########################################

#Just to show that the functions work in the absence of this column#

#First with no parental reference samples. Modify the plink format 'ex3.data' above to remove the POPID column, 
#and use the existing marker info file, 'ex_marker_info2.data'#

cat("chr1:001 chr1:002", "-3 3 2 4",
 "ind1 A A A B","ind2 B B B B","ind3 NA B A A", 
 "ind4 A A A A","ind5 A B A A","ind6 A A A A",
 "ind7 NA A A A","ind8 A NA B B","ind9 B B B B","ind10 B B B B","ind11 A A B B",
 file = "ex4.data", sep = "\n")

#you must specify 'POPID=0' (meaning the column is absent) in read.data.
dat4 <- read.data(file="ex4.data",POPID=0,nprecol=1,NUMINDS=11,NUMLOCI=2,MISSINGVAL=NA,marker.info.file="ex_marker_info2.data",sourceAbsent=TRUE,ONEROW=1,MAPDISTANCE=1,precol.headers=0)

#Now prepare the data for analysis, this time we're not creating geno.data or locus.data objects.
prepdata= data.prep(
 data=dat4$data,
 loci=dat4$loci,
 sourceAbsent = TRUE,         #default is FALSE#
 marker.info=dat4$marker.info,#must be specified if sourceAbsent = TRUE#
 alleles=dat4$alleles,
 #S0,                         #leave blank in the absence of parental reference samples#
 #S1,                         #leave blank in the absence of parental reference samples#
 POPID.name="INDLABEL",     #***must be specified if there's no POPID column, to avoid an error***#
 precols=dat4$precols
)

#Now including parental reference samples, again with no POPID column.

#Create and save a marker info file. Not obligatory in the presence of parental reference samples.
cat("locus type","chr1:001 intronic","chr1:002 exonic", file = "ex_marker_info.data", sep = "\n")

#Read data in and set sourceAbsent=FALSE (the default setting) so that the code doesn't search for the obligatory columns and throw up an error.
dat5 <- read.data(file="ex4.data",POPID=0,nprecol=1,NUMINDS=11,NUMLOCI=2,MISSINGVAL=NA,marker.info.file="ex_marker_info.data",
 sourceAbsent=FALSE,ONEROW=1,MAPDISTANCE=1,precol.headers=0)

#Prepare the analysis object.
prepdata= data.prep(
 data=dat5$data,
 loci=dat5$loci,
 sourceAbsent = FALSE,                               #default is FALSE#
 marker.info=dat5$marker.info,                       #this now contains only the locus (obligatory) and type columns#
 alleles=dat5$alleles,
 S0=c("ind4","ind5","ind6","ind7"),                  #must be specified when sourceAbsent = FALSE#
 S1=c("ind8","ind9","ind10","ind11"),                #must be specified when sourceAbsent = FALSE#
 POPID.name = "INDLABEL",                            #***must be specified if there's no POPID column; the S0 and S1 sample 
                                                     #references must be present in the column specified here***#
 precols=dat5$precols,
 return.genotype.table=TRUE,                         #Optional#
 return.locus.table=TRUE                             #Optional#
)

#################################################################################
#Locus filtering based on parental reference allele frequencies and sample sizes#
#################################################################################

#Upload a real dataset (Italian sparrows), included as part of the package.
dat=read.data("RB_Italy_ggcline_precol_headers_haploidND2.csv",nprecol=2,MISSINGVAL=NA,NUMINDS=569)

#The file contains data for 77 diploid markers for 569 male sparrow individuals.
#Run data.prep including the different filtering options. Try hashing out or changing these options to see how it affects the number of 
#loci in the resulting analysis table (see prepdata$Nloci.postfilter in the data.prep object).
prepdata=data.prep(
 data=dat$data,
 loci=dat$loci,
 alleles=dat$alleles,
 S0=c("Kralove","Oslo"), #POPID names for the first parental reference set#
 S1="LesinaSPANISH",     #POPID names for the second parental reference set#
 precols=dat$precols,
###Filtering below###
 max.S.MAF = 0.2,               #minor allele frq must be below this value in at least one of S0 or S1#
 min.diff = 0.2,                #loci with parental allele frequency difference < this value will be removed#
 min.allele.copies.S0 = 30,     #loci with fewer non-missing allele copies in the S0 parental reference set will be removed#
 min.allele.copies.S1 = 12,     #in this dataset the S1 sample size is much smaller than S0, so I'm being less strict with filtering for S1 sample size#
 AF.CIoverlap = FALSE,          #***RECOMMENDED*** filtering option if parental reference samples are included - removes all loci for which there is 
                                #overlap between S0 and S1 in the Bayesian posterior 95% credible intervals of allele frequency# 
)

unlink("ex_marker_info.data")#Tidy up#
unlink("ex_marker_info2.data")#Tidy up#
unlink("ex.data")#Tidy up#
unlink("ex3.data")#Tidy up#
unlink("ex4.data")#Tidy up#

## End(Not run)

ribailey/gghybrid documentation built on Feb. 2, 2024, 12:53 a.m.