data.prep: Prepare data for hybrid index and genomic cline estimation.

View source: R/data.prep.R

data.prepR Documentation

Prepare data for hybrid index and genomic cline estimation.

Description

Prepare data for hybrid index and genomic cline estimation.

Usage

data.prep(
  data,
  loci,
  alleles,
  marker.info = NULL,
  S0 = NULL,
  S1 = NULL,
  sourceAbsent = FALSE,
  precols,
  INDLABEL.name = "INDLABEL",
  POPID.name = "POPID",
  max.S.MAF = 0.5,
  min.diff = 0,
  min.allele.copies.S0 = 0,
  min.allele.copies.S1 = 0,
  AF.CIoverlap = TRUE,
  return.genotype.table = FALSE,
  return.locus.table = FALSE,
  prior.prop_1 = c(1, 1)
)

Arguments

data

A data object produced by read.data, or a custom data.table in the same format.

loci

A loci object produced by read.data, or a custom character vector with locus names.

alleles

An alleles object produced by read.data, or a custom data.table in the same format.

marker.info

A marker.info object uploaded using read.data, or a custom data.table in the same format, with one row per locus and any number of columns. Default is NULL.

S0

Character vector. Identifiers of the test subjects representing the first parental reference set (“Source 0”). Allele frequencies in this set represent a hybrid index of 0. By default data.prep searches for these identifiers in the POPID column of the data object, but a different column can be designated using POPID.name. Default is NULL, and parental reference samples are not needed if parental allele frequencies are included in a marker.info file in the correct format.

S1

Character vector. Identifiers of the test subjects representing the second parental reference set (“Source 1”). Allele frequencies in this set represent a hybrid index of 1. By default data.prep searches for these identifiers in the POPID column of the data object, but a different column can be designated using POPID.name. Default is NULL, and parental reference samples are not needed if parental allele frequencies are included in a marker.info file in the correct format.

sourceAbsent

Logical. Whether source (parental reference) samples are absent from the dataset. If TRUE, the parental allele frequencies will be taken from the marker.info object uploaded using read.data. Default is FALSE.

precols

Character vector. A precols object produced by read.data, or a custom character vector with names of all pre-marker columns in the data object. No default.

INDLABEL.name

Character string. The name of the column in data designated as INDLABEL. Default is “INDLABEL”.

POPID.name

Character string. The name of the column in data designated as POPID. If the data object contains only one pre-marker column, such as INDLABEL, this must be specified as the POPID.name to avoid an error. The main purpose of this option is to declare which column contains the identifiers for parental reference samples, and any non-marker column can be used. Although it must be specified at least by the default, it is ignored if there are no parental reference samples in the dataset. Default is “POPID”.

max.S.MAF

Numeric. Locus filtering option. Removes loci for which the smaller of the two parental minor allele frequencies is greater than this value. Included because only one parental reference minor allele frequency needs to be close to zero for the locus to be informative. Default is 0.5 (no filtering).

min.diff

Numeric. Locus filtering option. Removes loci for which the difference in parental reference allele frequency is less than this value. Default is 0 (no filtering).

min.allele.copies.S0

Numeric. Locus filtering option based on sample size. Removes loci for which the number of allele copies in the S0 parental reference set, excluding missing data, is less than this value. This option is only available when parental reference samples S0 and S1 are included in the dataset. Default is 0 (no filtering).

min.allele.copies.S1

Numeric. Locus filtering option based on sample size. Removes loci for which the number of allele copies in the S1 parental reference set, excluding missing data, is less than this value. This option is only available when parental reference samples S0 and S1 are included in the dataset. Default is 0 (no filtering).

AF.CIoverlap

Logical. Recommended locus filtering option, based on posterior credible intervals of parental allele frequencies. If FALSE, removes loci for which there is overlap between the allele frequency 95 percent credible intervals of the two parental reference populations. If parental reference sets are identified, the posterior credible intervals are calculated internally. If this option is to be used on parental allele frequencies uploaded in a marker.info file, that file must contain columns entitled ‘S0.prop_1_cred.int.upper’ and ‘S1.prop_1_cred.int.lower’, to indicate some kind of confidence limits for the parental allele frequencies. Only the upper limit is needed for S0, and only the lower limit is needed for S1. Default is TRUE (no filtering).

return.genotype.table

Logical. Whether to create a genotype table with individual genotypes for each locus scored from 0 to PLOIDY, based on their number of copies of the designated S1 allele (the allele with relatively higher frequency in the S1 parental reference set). Default is FALSE.

return.locus.table

Logical. Whether to return a table with one row per locus containing a wider array of locus-level data, such as parental allele frequency credible intervals and number of non-missing allele copies in each parental reference set. Default is FALSE to save memory, but this could be useful in many circumstances.

prior.prop_1

Numeric vector. Values of shape1 and shape2 parameters for a beta-distributed prior for parental allele frequencies (proportions) of the S1_allele. Default is shape1=shape2=1 (uniform distribution).

Details

The primary output of data.prep is a long-form data.table with 1 row per non-missing allele copy in the data set, to be used in downstream hybrid index and genomic cline estimation. It also optionally produces (1) a locus table, with one row per locus and containing useful locus-level data (see ‘Value’), and (2) a genotype table, with genotypes scored according to the number of alleles from parental reference set S1. This may be useful for example in estimating admixture-induced (‘ancestry’) linkage disequilibria. data.prep uses objects produced by read.data as input, or custom objects in the same format.

Even when all filtering options are left at default values, loci with exactly zero allele frequency difference between S0 and S1 will be removed. Therefore, it is often likely that the resulting analysis table will contain fewer loci than were loaded by read.data.

The optional marker.info file should be a file with one row per marker and any number of columns (including the marker name in a column named ‘locus’), each for a variable of interest for grouping markers, such as chromosome, map location, gene annotations, sliding window identifiers etc. These variables can then be used downstream for statistical comparison of hybrid index or clines among groups of loci. In the absence of parental reference samples, certain column names and information must be included in the marker.info file. See examples here and in read.data.

The locus data object includes posterior mean, mode and upper and lower 95 percent credible intervals for frequency of the Source 1 allele for each locus, assuming beta-distributed proportions. Note that if the locus is fixed in a parental reference set, its posterior mode will be 0 or 1 and will fall outside the credible intervals due to the influence of the non-zero prior shape parameters. If prior shape parameters are set to zero (equivalent to no prior), the best posterior estimate is the mean rather than the mode, but in this case credible intervals cannot be estimated when the locus is fixed for one allele.

Value

list containing at least two components: (1) data.prep, a data.table and data.frame long-form version of the imported data table, with one row per non-missing allele copy and the data columns required for analysis, listed below; (2) Nloci.postfilter, a numeric scalar indicating the number of loci remaining after filtering. If parental reference samples are identified, a third output is returned, sourceInfo, a data.table and data.frame listing all the unique entries in the declared ‘POPID.name’ column (default ‘POPID’) and indicating whether they represent ‘S0’, ‘S1’ or ‘TEST’. The two optional outputs are: (1) geno.data, a data.table and data.frame containing genotypes for each locus scored from 0 to PLOIDY, based on their number of copies of the designated S1 allele (the allele with relatively higher frequency in the S1 parental reference set); (2) locus.data, a data.table and data.frame with one row per locus and containing useful locus-specific information, including the contents of any uploaded marker.info object.

data.prep contains the pre-marker columns and the following fields, plus any columns from a declared marker.info object:

Source

whether the sample is from parental reference 0 (‘S0’), parental reference 1 (‘S1’) or is a test individual (‘TEST’).

locus

the locus names.

S0.prop_1

frequency (proportion) of the S1_allele in S0.

S1.prop_1

frequency (proportion) of the S1_allele in S1.

Source_allele

numeric. Whether the allele is the S0_allele (‘0’) or S1_allele (‘1’).

locus.data additionally contains, at a minimum:

S0_allele

name of the allele with relatively higher frequency in S0.

S1_allele

name of the allele with relatively higher frequency in S1.

Source.afdiff

difference in allele frequency between S0 and S1.

S0.MAF

S0 minor allele frequency.

S1.MAF

S1 minor allele frequency.

min.MAF

smaller of S0.MAF and S1.MAF.

If parental reference samples are present, locus.data further contains:

S0.N_1

number of non-missing copies of the S1_allele in S0.

S1.N_1

number of non-missing copies of the S1_allele in S1.

S0.prop_0

frequency (proportion) of the S0_allele in S0.

S1.prop_0

frequency (proportion) of the S0_allele in S1.

N_allele_copies.S0

total number of non-missing allele copies in S0.

N_allele_copies.S1

total number of non-missing allele copies in S1.

S0.prop_1_shape1

shape1 parameter estimate for the posterior beta distribution of the S1_allele frequency in S0.

S0.prop_1_shape2

shape2 parameter estimate for the posterior beta distribution of the S1_allele frequency in S0.

S0.prop_1_postmode

mode of the beta-distributed posterior for the frequency of the S1_allele in S0.

S0.prop_1_postmean

mean of the beta-distributed posterior for the frequency of the S1_allele in S0.

S0.prop_1_cred.int.lower

lower 95 percent credible interval (2.5 percent quantile of the posterior beta distribution) of the frequency of the S1_allele in S0.

S0.prop_1_cred.int.upper

upper 95 percent credible interval (97.5 percent quantile of the posterior beta distribution) of the frequency of the S1_allele in S0.

S1.prop_1_shape1

shape1 parameter estimate for the posterior beta distribution of the S1_allele frequency in S1.

S1.prop_1_shape2

shape2 parameter estimate for the posterior beta distribution of the S1_allele frequency in S1.

S1.prop_1_postmode

mode of the beta-distributed posterior for the frequency of the S1_allele in S1.

S1.prop_1_postmean

mean of the beta-distributed posterior for the frequency of the S1_allele in S1.

S1.prop_1_cred.int.lower

lower 95 percent credible interval (2.5 percent quantile of the posterior beta distribution) of the frequency of the S1_allele in S1.

S1.prop_1_cred.int.upper

upper 95 percent credible interval (97.5 percent quantile of the posterior beta distribution) of the frequency of the S1_allele in S1.

S1.prop_0_shape1

shape1 parameter estimate for the posterior beta distribution of the S0_allele frequency in S1.

S1.prop_0_shape2

shape2 parameter estimate for the posterior beta distribution of the S0_allele frequency in S1.

S0.prop_0_shape1

shape1 parameter estimate for the posterior beta distribution of the S0_allele frequency in S0.

S0.prop_0_shape2

shape2 parameter estimate for the posterior beta distribution of the S0_allele frequency in S0.

Author(s)

Richard Ian Bailey richardianbailey@gmail.com

Examples


## Not run: 
###############################################################
#Read in and prepare a file with no parental reference samples#
###############################################################

#Create an input data file and save to the working directory.
cat("INDLABEL POPID chr1:001 chr1:002","ind1 pop1 A A","ind1 pop1 A B","ind2 pop1 B B","ind2 pop1 B B","ind3 pop2 NA A","ind3 pop2 B A", file = "ex.data", sep = "\n")

#Create a marker.info file with the necessary parental reference information and save to the working directory.
cat("locus refAllele alternateAllele S0.prop_r S1.prop_r type","chr1:001 A B 0.1 0.9 intronic","chr1:002 B A 0.8 0.2 exonic", file = "ex_marker_info2.data", sep = "\n")

#Read in the data and marker.info file.
dat <- read.data(file="ex.data",nprecol=2,NUMINDS=3,NUMLOCI=2,MISSINGVAL=NA,marker.info.file = "ex_marker_info2.data", sourceAbsent = TRUE)

#Run data.prep.
prepdata=data.prep(
 data=dat$data,               #part of the read.data output object#
 loci=dat$loci,               #part of the read.data output object#
 sourceAbsent = TRUE,         #Must be set to TRUE in the absence of parental reference samples. Default is FALSE#
 marker.info=dat$marker.info, #must be specified if sourceAbsent = TRUE; make sure it contains the headers and their 
                              #contents as specified above when 'sourceAbsent=TRUE'#
 alleles=dat$alleles,         #part of the read.data output object#
 #S0,                         #character vector identifying parental reference S0 samples. Leave blank in the absence of parental reference samples#
 #S1,                         #character vector identifying parental reference S1 samples. Leave blank in the absence of parental reference samples#
 precols=dat$precols,         #part of the read.data output object#
 return.genotype.table=TRUE,  #This returns an table, with loci in columns, of diploid (or other ploidy) genotypes, coded as number of copies of the 
                              #allele with higher frequency in S1 than S0#
 return.locus.table=TRUE      #Returns a table with one row per locus and all the individual marker data, including data already uploaded plus values 
                              #calculated within this function#
)

#####################################################################
#Read in and prepare a file that includes parental reference samples#
#####################################################################

#Make a new data input file with more entries, including S0 and S1 samples, in plink structure format but with missing data recoded to NA. This includes 
#a MAPDISTANCE row below the marker names.
cat("chr1:001 chr1:002","-3 3 2 4",
 "ind1 pop1 A A A B","ind2 pop1 B B B B","ind3 pop2 NA B A A", 
 "ind4 pop3 A A A A","ind5 pop3 A B A A","ind6 pop3 A A A A",
 "ind7 pop3 NA A A A","ind8 pop4 A NA B B","ind9 pop4 B B B B",
 "ind10 pop4 B B B B","ind11 pop5 A A B B",
 file = "ex3.data", sep = "\n")

#pop3 represents S0; pop4 and pop5 represent S1.

#Run read.data and data.prep.
dat3 <- read.data(file="ex3.data",MISSINGVAL=NA,NUMINDS=11,nprecol=2,ONEROW=1,MAPDISTANCE=1,precol.headers=0)

prepdata= data.prep(
 data=dat3$data,
 loci=dat3$loci,
 sourceAbsent = FALSE,                         #default#
 #marker.info=dat$marker.info,                 #a marker.info file is not obligatory when parental reference samples are declared#
 alleles=dat3$alleles,
 S0="pop3",                                    #first parental reference set; must be specified when sourceAbsent = FALSE#
 S1=c("pop4","pop5"),                          #second parental reference set; must be specified when sourceAbsent = FALSE#
 precols=dat3$precols,
 return.genotype.table=TRUE,                   #default is FALSE to save memory#
 return.locus.table=TRUE                       #default is FALSE to save memory#
)

#########################################
#Now completely exclude the POPID column#
#########################################

#Just to show that the functions work in the absence of this column#

#First with no parental reference samples. Modify the plink format 'ex3.data' above to remove the POPID column, 
#and use the existing marker info file, 'ex_marker_info2.data'#

cat("chr1:001 chr1:002", "-3 3 2 4",
 "ind1 A A A B","ind2 B B B B","ind3 NA B A A", 
 "ind4 A A A A","ind5 A B A A","ind6 A A A A",
 "ind7 NA A A A","ind8 A NA B B","ind9 B B B B","ind10 B B B B","ind11 A A B B",
 file = "ex4.data", sep = "\n")

#you must specify 'POPID=0' (meaning the column is absent) in read.data.
dat4 <- read.data(file="ex4.data",POPID=0,nprecol=1,NUMINDS=11,NUMLOCI=2,MISSINGVAL=NA,marker.info.file="ex_marker_info2.data",sourceAbsent=TRUE,ONEROW=1,MAPDISTANCE=1,precol.headers=0)

#Now prepare the data for analysis, this time we're not creating geno.data or locus.data objects.
prepdata= data.prep(
 data=dat4$data,
 loci=dat4$loci,
 sourceAbsent = TRUE,         #default is FALSE#
 marker.info=dat4$marker.info,#must be specified if sourceAbsent = TRUE#
 alleles=dat4$alleles,
 #S0,                         #leave blank in the absence of parental reference samples#
 #S1,                         #leave blank in the absence of parental reference samples#
 POPID.name="INDLABEL",     #***must be specified if there's no POPID column, to avoid an error***#
 precols=dat4$precols
)

#Now including parental reference samples, again with no POPID column.

#Create and save a marker info file. Not obligatory in the presence of parental reference samples.
cat("locus type","chr1:001 intronic","chr1:002 exonic", file = "ex_marker_info.data", sep = "\n")

#Read data in and set sourceAbsent=FALSE (the default setting) so that the code doesn't search for the obligatory columns and throw up an error.
dat5 <- read.data(file="ex4.data",POPID=0,nprecol=1,NUMINDS=11,NUMLOCI=2,MISSINGVAL=NA,marker.info.file="ex_marker_info.data",
 sourceAbsent=FALSE,ONEROW=1,MAPDISTANCE=1,precol.headers=0)

#Prepare the analysis object.
prepdata= data.prep(
 data=dat5$data,
 loci=dat5$loci,
 sourceAbsent = FALSE,                               #default is FALSE#
 marker.info=dat5$marker.info,                       #this now contains only the locus (obligatory) and type columns#
 alleles=dat5$alleles,
 S0=c("ind4","ind5","ind6","ind7"),                  #must be specified when sourceAbsent = FALSE#
 S1=c("ind8","ind9","ind10","ind11"),                #must be specified when sourceAbsent = FALSE#
 POPID.name = "INDLABEL",                            #***must be specified if there's no POPID column; the S0 and S1 sample 
                                                     #references must be present in the column specified here***#
 precols=dat5$precols,
 return.genotype.table=TRUE,                         #Optional#
 return.locus.table=TRUE                             #Optional#
)

#################################################################################
#Locus filtering based on parental reference allele frequencies and sample sizes#
#################################################################################

#Upload a real dataset (Italian sparrows), included as part of the package.
dat=read.data("RB_Italy_ggcline_precol_headers_haploidND2.csv",nprecol=2,MISSINGVAL=NA,NUMINDS=569)

#The file contains data for 77 diploid markers for 569 male sparrow individuals.
#Run data.prep including the different filtering options. Try hashing out or changing these options to see how it affects the number of 
#loci in the resulting analysis table (see prepdata$Nloci.postfilter in the data.prep object).
prepdata=data.prep(
 data=dat$data,
 loci=dat$loci,
 alleles=dat$alleles,
 S0=c("Kralove","Oslo"), #POPID names for the first parental reference set#
 S1="LesinaSPANISH",     #POPID names for the second parental reference set#
 precols=dat$precols,
###Filtering below###
 max.S.MAF = 0.2,               #minor allele frq must be below this value in at least one of S0 or S1#
 min.diff = 0.2,                #loci with parental allele frequency difference < this value will be removed#
 min.allele.copies.S0 = 30,     #loci with fewer non-missing allele copies in the S0 parental reference set will be removed#
 min.allele.copies.S1 = 12,     #in this dataset the S1 sample size is much smaller than S0, so I'm being less strict with filtering for S1 sample size#
 AF.CIoverlap = FALSE,          #***RECOMMENDED*** filtering option if parental reference samples are included - removes all loci for which there is 
                                #overlap between S0 and S1 in the Bayesian posterior 95% credible intervals of allele frequency# 
)

unlink("ex_marker_info.data")#Tidy up#
unlink("ex_marker_info2.data")#Tidy up#
unlink("ex.data")#Tidy up#
unlink("ex3.data")#Tidy up#
unlink("ex4.data")#Tidy up#

## End(Not run)

ribailey/gghybrid documentation built on Feb. 2, 2024, 12:53 a.m.