data.prep | R Documentation |
Prepare data for hybrid index and genomic cline estimation.
data.prep(
data,
loci,
alleles,
marker.info = NULL,
S0 = NULL,
S1 = NULL,
sourceAbsent = FALSE,
precols,
INDLABEL.name = "INDLABEL",
POPID.name = "POPID",
max.S.MAF = 0.5,
min.diff = 0,
min.allele.copies.S0 = 0,
min.allele.copies.S1 = 0,
AF.CIoverlap = TRUE,
return.genotype.table = FALSE,
return.locus.table = FALSE,
prior.prop_1 = c(1, 1)
)
data |
A |
loci |
A |
alleles |
An |
marker.info |
A |
S0 |
Character vector. Identifiers of the test subjects representing the first parental reference set (“Source 0”).
Allele frequencies in this set represent a hybrid index of 0. By default |
S1 |
Character vector. Identifiers of the test subjects representing the second parental reference set (“Source 1”).
Allele frequencies in this set represent a hybrid index of 1. By default |
sourceAbsent |
Logical. Whether source (parental reference) samples are absent from the dataset. If |
precols |
Character vector. A |
INDLABEL.name |
Character string. The name of the column in |
POPID.name |
Character string. The name of the column in |
max.S.MAF |
Numeric. Locus filtering option. Removes loci for which the smaller of the two parental minor allele frequencies
is greater than this value. Included because only one parental reference minor allele frequency needs to be close to
zero for the locus to be informative. Default is |
min.diff |
Numeric. Locus filtering option. Removes loci for which the difference in parental reference allele frequency
is less than this value. Default is |
min.allele.copies.S0 |
Numeric. Locus filtering option based on sample size. Removes loci for which the number of allele
copies in the |
min.allele.copies.S1 |
Numeric. Locus filtering option based on sample size. Removes loci for which the number of allele
copies in the |
AF.CIoverlap |
Logical. Recommended locus filtering option, based on posterior credible intervals of parental allele frequencies.
If |
return.genotype.table |
Logical. Whether to create a genotype table with individual genotypes for each locus scored from
|
return.locus.table |
Logical. Whether to return a table with one row per locus containing a wider array of locus-level data,
such as parental allele frequency credible intervals and number of non-missing allele copies in each parental reference set.
Default is |
prior.prop_1 |
Numeric vector. Values of |
The primary output of data.prep
is a long-form data.table
with 1 row per non-missing allele copy in the
data set, to be used in downstream hybrid index and genomic cline estimation. It also optionally produces (1) a locus table,
with one row per locus and containing useful locus-level data (see ‘Value’), and (2) a genotype table, with genotypes
scored according to the number of alleles from parental reference set S1
. This may be useful for example in estimating
admixture-induced (‘ancestry’) linkage disequilibria. data.prep
uses objects produced by read.data
as input,
or custom objects in the same format.
Even when all filtering options are left at default values, loci with exactly zero allele frequency difference between S0 and S1 will
be removed. Therefore, it is often likely that the resulting analysis table will contain fewer loci than were loaded by read.data
.
The optional marker.info
file should be a file with one row per marker and any number of columns (including the marker name in
a column named ‘locus’), each for a variable of interest for grouping markers, such as chromosome, map location, gene
annotations, sliding window identifiers etc. These variables can then be used downstream for statistical comparison of hybrid index
or clines among groups of loci. In the absence of parental reference samples, certain column names and information must be included
in the marker.info
file. See examples here and in read.data
.
The locus data object includes posterior mean, mode and upper and lower 95 percent credible intervals for frequency of the Source 1 allele for each locus, assuming beta-distributed proportions. Note that if the locus is fixed in a parental reference set, its posterior mode will be 0 or 1 and will fall outside the credible intervals due to the influence of the non-zero prior shape parameters. If prior shape parameters are set to zero (equivalent to no prior), the best posterior estimate is the mean rather than the mode, but in this case credible intervals cannot be estimated when the locus is fixed for one allele.
list containing at least two components: (1) data.prep
, a data.table
and data.frame
long-form version of
the imported data table, with one row per non-missing allele copy and the data columns required for analysis, listed below; (2)
Nloci.postfilter
, a numeric scalar indicating the number of loci remaining after filtering. If parental reference samples are
identified, a third output is returned, sourceInfo
, a data.table
and data.frame
listing all the unique entries in
the declared ‘POPID.name’ column (default ‘POPID’) and indicating whether they represent ‘S0’, ‘S1’ or
‘TEST’. The two optional outputs are: (1) geno.data
, a data.table
and data.frame
containing genotypes for
each locus scored from 0
to PLOIDY
, based on their number of copies of the designated S1
allele (the allele with
relatively higher frequency in the S1
parental reference set); (2) locus.data
, a data.table
and data.frame
with one row per locus and containing useful locus-specific information, including the contents of any uploaded marker.info
object.
data.prep
contains the pre-marker columns and the following fields, plus any columns from a declared marker.info
object:
Source |
whether the sample is from parental reference 0 (‘S0’), parental reference 1 (‘S1’) or is a test individual (‘TEST’). |
locus |
the locus names. |
S0.prop_1 |
frequency (proportion) of the |
S1.prop_1 |
frequency (proportion) of the |
Source_allele |
numeric. Whether the allele is the |
locus.data
additionally contains, at a minimum:
S0_allele |
name of the allele with relatively higher frequency in |
S1_allele |
name of the allele with relatively higher frequency in |
Source.afdiff |
difference in allele frequency between |
S0.MAF |
|
S1.MAF |
|
min.MAF |
smaller of |
If parental reference samples are present, locus.data
further contains:
S0.N_1 |
number of non-missing copies of the |
S1.N_1 |
number of non-missing copies of the |
S0.prop_0 |
frequency (proportion) of the |
S1.prop_0 |
frequency (proportion) of the |
N_allele_copies.S0 |
total number of non-missing allele copies in |
N_allele_copies.S1 |
total number of non-missing allele copies in |
S0.prop_1_shape1 |
shape1 parameter estimate for the posterior beta distribution of the |
S0.prop_1_shape2 |
shape2 parameter estimate for the posterior beta distribution of the |
S0.prop_1_postmode |
mode of the beta-distributed posterior for the frequency of the |
S0.prop_1_postmean |
mean of the beta-distributed posterior for the frequency of the |
S0.prop_1_cred.int.lower |
lower 95 percent credible interval (2.5 percent quantile of the posterior beta distribution) of the frequency of the |
S0.prop_1_cred.int.upper |
upper 95 percent credible interval (97.5 percent quantile of the posterior beta distribution) of the frequency of the |
S1.prop_1_shape1 |
shape1 parameter estimate for the posterior beta distribution of the |
S1.prop_1_shape2 |
shape2 parameter estimate for the posterior beta distribution of the |
S1.prop_1_postmode |
mode of the beta-distributed posterior for the frequency of the |
S1.prop_1_postmean |
mean of the beta-distributed posterior for the frequency of the |
S1.prop_1_cred.int.lower |
lower 95 percent credible interval (2.5 percent quantile of the posterior beta distribution) of the frequency of the |
S1.prop_1_cred.int.upper |
upper 95 percent credible interval (97.5 percent quantile of the posterior beta distribution) of the frequency of the |
S1.prop_0_shape1 |
shape1 parameter estimate for the posterior beta distribution of the |
S1.prop_0_shape2 |
shape2 parameter estimate for the posterior beta distribution of the |
S0.prop_0_shape1 |
shape1 parameter estimate for the posterior beta distribution of the |
S0.prop_0_shape2 |
shape2 parameter estimate for the posterior beta distribution of the |
Richard Ian Bailey richardianbailey@gmail.com
## Not run:
###############################################################
#Read in and prepare a file with no parental reference samples#
###############################################################
#Create an input data file and save to the working directory.
cat("INDLABEL POPID chr1:001 chr1:002","ind1 pop1 A A","ind1 pop1 A B","ind2 pop1 B B","ind2 pop1 B B","ind3 pop2 NA A","ind3 pop2 B A", file = "ex.data", sep = "\n")
#Create a marker.info file with the necessary parental reference information and save to the working directory.
cat("locus refAllele alternateAllele S0.prop_r S1.prop_r type","chr1:001 A B 0.1 0.9 intronic","chr1:002 B A 0.8 0.2 exonic", file = "ex_marker_info2.data", sep = "\n")
#Read in the data and marker.info file.
dat <- read.data(file="ex.data",nprecol=2,NUMINDS=3,NUMLOCI=2,MISSINGVAL=NA,marker.info.file = "ex_marker_info2.data", sourceAbsent = TRUE)
#Run data.prep.
prepdata=data.prep(
data=dat$data, #part of the read.data output object#
loci=dat$loci, #part of the read.data output object#
sourceAbsent = TRUE, #Must be set to TRUE in the absence of parental reference samples. Default is FALSE#
marker.info=dat$marker.info, #must be specified if sourceAbsent = TRUE; make sure it contains the headers and their
#contents as specified above when 'sourceAbsent=TRUE'#
alleles=dat$alleles, #part of the read.data output object#
#S0, #character vector identifying parental reference S0 samples. Leave blank in the absence of parental reference samples#
#S1, #character vector identifying parental reference S1 samples. Leave blank in the absence of parental reference samples#
precols=dat$precols, #part of the read.data output object#
return.genotype.table=TRUE, #This returns an table, with loci in columns, of diploid (or other ploidy) genotypes, coded as number of copies of the
#allele with higher frequency in S1 than S0#
return.locus.table=TRUE #Returns a table with one row per locus and all the individual marker data, including data already uploaded plus values
#calculated within this function#
)
#####################################################################
#Read in and prepare a file that includes parental reference samples#
#####################################################################
#Make a new data input file with more entries, including S0 and S1 samples, in plink structure format but with missing data recoded to NA. This includes
#a MAPDISTANCE row below the marker names.
cat("chr1:001 chr1:002","-3 3 2 4",
"ind1 pop1 A A A B","ind2 pop1 B B B B","ind3 pop2 NA B A A",
"ind4 pop3 A A A A","ind5 pop3 A B A A","ind6 pop3 A A A A",
"ind7 pop3 NA A A A","ind8 pop4 A NA B B","ind9 pop4 B B B B",
"ind10 pop4 B B B B","ind11 pop5 A A B B",
file = "ex3.data", sep = "\n")
#pop3 represents S0; pop4 and pop5 represent S1.
#Run read.data and data.prep.
dat3 <- read.data(file="ex3.data",MISSINGVAL=NA,NUMINDS=11,nprecol=2,ONEROW=1,MAPDISTANCE=1,precol.headers=0)
prepdata= data.prep(
data=dat3$data,
loci=dat3$loci,
sourceAbsent = FALSE, #default#
#marker.info=dat$marker.info, #a marker.info file is not obligatory when parental reference samples are declared#
alleles=dat3$alleles,
S0="pop3", #first parental reference set; must be specified when sourceAbsent = FALSE#
S1=c("pop4","pop5"), #second parental reference set; must be specified when sourceAbsent = FALSE#
precols=dat3$precols,
return.genotype.table=TRUE, #default is FALSE to save memory#
return.locus.table=TRUE #default is FALSE to save memory#
)
#########################################
#Now completely exclude the POPID column#
#########################################
#Just to show that the functions work in the absence of this column#
#First with no parental reference samples. Modify the plink format 'ex3.data' above to remove the POPID column,
#and use the existing marker info file, 'ex_marker_info2.data'#
cat("chr1:001 chr1:002", "-3 3 2 4",
"ind1 A A A B","ind2 B B B B","ind3 NA B A A",
"ind4 A A A A","ind5 A B A A","ind6 A A A A",
"ind7 NA A A A","ind8 A NA B B","ind9 B B B B","ind10 B B B B","ind11 A A B B",
file = "ex4.data", sep = "\n")
#you must specify 'POPID=0' (meaning the column is absent) in read.data.
dat4 <- read.data(file="ex4.data",POPID=0,nprecol=1,NUMINDS=11,NUMLOCI=2,MISSINGVAL=NA,marker.info.file="ex_marker_info2.data",sourceAbsent=TRUE,ONEROW=1,MAPDISTANCE=1,precol.headers=0)
#Now prepare the data for analysis, this time we're not creating geno.data or locus.data objects.
prepdata= data.prep(
data=dat4$data,
loci=dat4$loci,
sourceAbsent = TRUE, #default is FALSE#
marker.info=dat4$marker.info,#must be specified if sourceAbsent = TRUE#
alleles=dat4$alleles,
#S0, #leave blank in the absence of parental reference samples#
#S1, #leave blank in the absence of parental reference samples#
POPID.name="INDLABEL", #***must be specified if there's no POPID column, to avoid an error***#
precols=dat4$precols
)
#Now including parental reference samples, again with no POPID column.
#Create and save a marker info file. Not obligatory in the presence of parental reference samples.
cat("locus type","chr1:001 intronic","chr1:002 exonic", file = "ex_marker_info.data", sep = "\n")
#Read data in and set sourceAbsent=FALSE (the default setting) so that the code doesn't search for the obligatory columns and throw up an error.
dat5 <- read.data(file="ex4.data",POPID=0,nprecol=1,NUMINDS=11,NUMLOCI=2,MISSINGVAL=NA,marker.info.file="ex_marker_info.data",
sourceAbsent=FALSE,ONEROW=1,MAPDISTANCE=1,precol.headers=0)
#Prepare the analysis object.
prepdata= data.prep(
data=dat5$data,
loci=dat5$loci,
sourceAbsent = FALSE, #default is FALSE#
marker.info=dat5$marker.info, #this now contains only the locus (obligatory) and type columns#
alleles=dat5$alleles,
S0=c("ind4","ind5","ind6","ind7"), #must be specified when sourceAbsent = FALSE#
S1=c("ind8","ind9","ind10","ind11"), #must be specified when sourceAbsent = FALSE#
POPID.name = "INDLABEL", #***must be specified if there's no POPID column; the S0 and S1 sample
#references must be present in the column specified here***#
precols=dat5$precols,
return.genotype.table=TRUE, #Optional#
return.locus.table=TRUE #Optional#
)
#################################################################################
#Locus filtering based on parental reference allele frequencies and sample sizes#
#################################################################################
#Upload a real dataset (Italian sparrows), included as part of the package.
dat=read.data("RB_Italy_ggcline_precol_headers_haploidND2.csv",nprecol=2,MISSINGVAL=NA,NUMINDS=569)
#The file contains data for 77 diploid markers for 569 male sparrow individuals.
#Run data.prep including the different filtering options. Try hashing out or changing these options to see how it affects the number of
#loci in the resulting analysis table (see prepdata$Nloci.postfilter in the data.prep object).
prepdata=data.prep(
data=dat$data,
loci=dat$loci,
alleles=dat$alleles,
S0=c("Kralove","Oslo"), #POPID names for the first parental reference set#
S1="LesinaSPANISH", #POPID names for the second parental reference set#
precols=dat$precols,
###Filtering below###
max.S.MAF = 0.2, #minor allele frq must be below this value in at least one of S0 or S1#
min.diff = 0.2, #loci with parental allele frequency difference < this value will be removed#
min.allele.copies.S0 = 30, #loci with fewer non-missing allele copies in the S0 parental reference set will be removed#
min.allele.copies.S1 = 12, #in this dataset the S1 sample size is much smaller than S0, so I'm being less strict with filtering for S1 sample size#
AF.CIoverlap = FALSE, #***RECOMMENDED*** filtering option if parental reference samples are included - removes all loci for which there is
#overlap between S0 and S1 in the Bayesian posterior 95% credible intervals of allele frequency#
)
unlink("ex_marker_info.data")#Tidy up#
unlink("ex_marker_info2.data")#Tidy up#
unlink("ex.data")#Tidy up#
unlink("ex3.data")#Tidy up#
unlink("ex4.data")#Tidy up#
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.