View source: R/geneplot_calcs.R
calc_logprob | R Documentation |
Run GenePlot calculations to obtain Log-Genotype-Probabilities for all
individuals from populations in refpopnames or includepopnames,
with respect to each of the populations in refpopnames.
calc_logprob(
dat,
refpopnames,
locnames,
includepopnames = NULL,
prior = "Rannala",
saddlepoint = T,
leave_one_out = T,
logten = T,
min_loci = 6,
quantiles = c(0.01, 1),
Ndraw = 1e+05
)
dat |
The data, in a data frame, with two columns labelled as 'id' and
'pop', and with two additional columns per locus. Missing data at any
locus should be marked as '0' for each allele.
The locus columns must be labelled in the format Loc1.a1, Loc1.a2,
Loc2.a1, Loc2.a2, etc.
Missing data must be for BOTH alleles at any locus. Missing data for
ONE allele at any locus will produce an error.
See |
refpopnames |
Character vector of reference population names, that must
match the values in the 'pop' column of |
locnames |
Character vector, names of the loci, which must match the
column names in the data, e.g. if dat has columns
id, pop, EV1.a1, EV1.a2, EV14.a1, EV14.a2, etc.
then you could use locnames = c("EV1","EV14"), etc.
The locnames do not need to be in any particular order but all of them
must be in |
includepopnames |
Character vector (default NULL) of population names to
be included in the calculations. All individuals with 'pop' value in
|
prior |
(default="Rannala") String, either "Rannala" or "Baudouin", giving the choice of prior parameter for the Dirichlet priors for the allele frequency estimates. Both options define parameter values that depend on the number of alleles at each locus, k. "Baudouin" gives slightly more weight to rare alleles than "Rannala" does, or less weight to the data, so Baudouin may be more suitable for small reference samples, but there is no major difference between them. For more details, see McMillan and Fewster (2017), Biometrics. Additional options are "Half" or "Quarter" which specify parameters 1/2 or 1/4, respectively. These options have priors whose parameters do not depend on the number of alleles at each locus, and so may be more suitable for microsatellite data with varying numbers of alleles at each locus. |
saddlepoint |
Boolean (default TRUE), indicates whether or not to use the saddlepoint method for imputing missing data/leave-one-out results. For more details, see McMillan and Fewster (2017), Biometrics. |
leave_one_out |
Boolean (default TRUE), indicates whether or not to calculate leave-one-out results for any individual from the reference pops. If TRUE, any individual from a reference population will have their Log-Genotype-Probability with respect to their own reference population calculated after temporarily removing the individual's genotype from the sample data for that reference population. The individual's Log-Genotype-Probabilities with respect to all populations they are not a member of will be calculated as normal. We STRONGLY RECOMMEND using leave_one_out=TRUE for any small reference samples (<30). |
logten |
(default TRUE) Boolean, indicates whether to use base 10 for the logarithms, or base e (i.e. natural logarithms). logten=TRUE is default because it's easier to recalculate the original non-log numbers in your head when looking at the plots. Use FALSE for natural logarithms. |
min_loci |
(default 6) is the minimum number of loci that an individual
must have (within the set of loci defined in |
quantiles |
(default c(0.01,1.00)) Vector of probabilities, specifying the quantiles of the posterior distribution to be calculated. Default plots the 1% and 100% quantiles of the Log-Genotype-Probability distributions for each of the reference populations. For example, only 1% of all possible genotypes that could arise from the given population will have Log-Genotype-Probabilities below the 1% quantile, and 99% of all possible genotypes arising from that population will have Log-Genotype-Probabilities above the 1% quantile. The 100% quantile is the maximum possible Log-Genotype-Probability that any genotype can have with respect to this population. Quantile values will be provided as attributes to the output object of calc_logprob (see the Value section.) If no quantiles are wanted, supply quantiles=NULL. |
Ndraw |
(default 100000) is only used if saddlepoint=FALSE. Defines the number of draws that will be taken from the distribution of log-posterior genotype probabilities for each reference population. These draws, i.e. simulated genotypes from the posterior distributions of the reference populations, are used when imputing the log-genotype-probabilities for individuals with missing data, or when calculating quantiles of the distribution. For more details, see McMillan and Fewster (2017), Biometrics. |
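The input format described for dat can be illustrated with a small base R sketch. The toy data, the locus names and the two helper functions (half_missing and nloci) are hypothetical and are not part of GenePlot; they only show the expected column layout, the rule that missing data must be coded as 0 for BOTH alleles at a locus, and how the number of loci with data relates to min_loci.

```r
# Hypothetical toy data: two loci (EV1, EV14), four individuals, two populations.
# ind3 is missing locus EV1, coded as 0 for BOTH alleles.
dat <- data.frame(
  id      = c("ind1", "ind2", "ind3", "ind4"),
  pop     = c("Pop1", "Pop1", "Pop2", "Pop2"),
  EV1.a1  = c(150, 152,   0, 150),
  EV1.a2  = c(152, 152,   0, 154),
  EV14.a1 = c(200, 204, 200, 202),
  EV14.a2 = c(204, 204, 202, 202),
  stringsAsFactors = FALSE
)
locnames <- c("EV1", "EV14")

# Illustrative helper (NOT a GenePlot function): TRUE where exactly one of the
# two alleles at a locus is coded as 0, i.e. the case that produces an error.
half_missing <- function(dat, locnames) {
  bad <- vapply(locnames, function(loc) {
    a1 <- dat[[paste0(loc, ".a1")]]
    a2 <- dat[[paste0(loc, ".a2")]]
    xor(a1 == 0, a2 == 0)
  }, logical(nrow(dat)))
  matrix(bad, nrow = nrow(dat), dimnames = list(dat$id, locnames))
}

# Illustrative helper (NOT a GenePlot function): number of loci with data for
# each individual, the quantity that min_loci is compared against.
nloci <- function(dat, locnames) {
  present <- vapply(locnames, function(loc) {
    dat[[paste0(loc, ".a1")]] != 0 & dat[[paste0(loc, ".a2")]] != 0
  }, logical(nrow(dat)))
  rowSums(matrix(present, nrow = nrow(dat)))
}

any(half_missing(dat, locnames))  # should be FALSE before calling calc_logprob
nloci(dat, locnames)

# With the package loaded, a call on this toy data might look like the line
# below; min_loci is lowered because the toy data has only two loci.
# res <- calc_logprob(dat, refpopnames = c("Pop1", "Pop2"),
#                     locnames = locnames, min_loci = 1)
```

With the default min_loci = 6, every individual in this two-locus toy data set would have too few loci, which is why the commented call lowers it.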
All individuals in dat
whose population label in the 'pop' column of
dat
matches one of the populations in refpopnames
or
includepopnames
will be included.
NOTE that if a population is not in includepopnames
/refpopnames
,
then any alleles private to that population will NOT be included in the
prior / posterior. Thus the posterior for a given refpop will change slightly
depending on which populations are in includepopnames
/refpopnames
.
Leave-one-out will be used when calculating the log-genotype probability for an individual with respect to their own reference population, if specified in the inputs (leave_one_out = TRUE, which is the default). WE STRONGLY RECOMMEND USING LEAVE-ONE-OUT, ESPECIALLY FOR SMALL SAMPLES (<30).
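To make the effect of leave-one-out concrete, the following is a minimal sketch of a single-locus log-genotype probability under a Dirichlet-multinomial model in the style of Rannala and Mountain (1997). It is not the package's implementation: the function lgp_locus, the allele counts and the choice of prior parameter 1/2 (the "Half" option) are assumptions made for illustration.

```r
# Illustrative single-locus log-genotype probability (NOT GenePlot's own code).
# counts:        named vector of allele counts in the reference sample
# genotype:      character vector of length 2, the individual's two alleles
# prior_par:     Dirichlet parameter added to every allele count (assumed)
# leave_one_out: if TRUE, remove the individual's own two alleles first
lgp_locus <- function(counts, genotype, prior_par,
                      leave_one_out = FALSE, logten = TRUE) {
  if (leave_one_out) {
    for (a in genotype) counts[a] <- counts[a] - 1
    stopifnot(all(counts >= 0))
  }
  alpha <- counts + prior_par   # posterior Dirichlet parameters
  A <- sum(alpha)
  a1 <- genotype[1]
  a2 <- genotype[2]
  p <- if (a1 == a2) {
    alpha[[a1]] * (alpha[[a1]] + 1) / (A * (A + 1))   # homozygote
  } else {
    2 * alpha[[a1]] * alpha[[a2]] / (A * (A + 1))     # heterozygote
  }
  if (logten) log10(p) else log(p)
}

# Hypothetical reference sample in which the individual (154/154) carries
# the only two copies of allele 154.
counts <- c("150" = 12, "152" = 26, "154" = 2)

with_self    <- lgp_locus(counts, c("154", "154"), prior_par = 0.5)
without_self <- lgp_locus(counts, c("154", "154"), prior_par = 0.5,
                          leave_one_out = TRUE)
c(with_self = with_self, without_self = without_self)
```

In this sketch the leave-one-out value is lower, because the individual's own alleles no longer count as evidence that their genotype is typical of the population. Under the usual assumption of independent loci, the log-genotype probability for an individual would be the sum of such terms over loci.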
The structure of the output from calc_logprob and/or geneplot is a data frame, with one row per individual.
The first two columns are "id" and "pop", as in the input data.
The next column (col3) is "status" which is "complete" or "impute" depending on whether the individual had data for all loci, or had some loci missing.
The next column (col4) is "nloci" which is how many loci the individual has data for.
The next columns are the final/imputed log-genotype probabilities for the individual with respect to each of the reference populations. They are named in the form "Pop1", "Pop2" etc. corresponding to the names in the refpopnames input.
Then the final columns are the "raw" log-genotype probabilities for the same pops. These are named in the form "Pop1.raw", "Pop2.raw", etc. again corresponding to the names in refpopnames.
For individuals with full data at all loci, i.e. no missing data, these two sets of columns will be the same, and give the individual's log-genotype probabilities with respect to each of the reference populations.
For individuals with missing data at some loci then the raw values are the log-genotype probabilities calculated based on the loci that *are* present in the data, and the final/imputed columns, at the start of the results data frame, are the imputed log-genotype probabilities for the full set of loci i.e. the final LGPs for the missing-data individuals are comparable to the final LGPs for the complete-data individuals.
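The column layout described above can be navigated with a few lines of base R. The data frame below is a hypothetical stand-in for the output of calc_logprob, with made-up values, built only so that the sketch runs without the package.

```r
refpopnames <- c("Pop1", "Pop2")

# Hypothetical stand-in for a calc_logprob result (values are invented).
results <- data.frame(
  id       = c("ind1", "ind2"),
  pop      = c("Pop1", "Pop2"),
  status   = c("complete", "impute"),
  nloci    = c(8, 6),
  Pop1     = c(-9.1, -12.4),   # final/imputed log-genotype probabilities
  Pop2     = c(-13.0, -10.2),
  Pop1.raw = c(-9.1, -9.6),    # raw values based on the loci present
  Pop2.raw = c(-13.0, -7.9),
  stringsAsFactors = FALSE
)

final_cols <- refpopnames
raw_cols   <- paste0(refpopnames, ".raw")

# Individuals whose final values were imputed because of missing loci.
imputed <- results$status == "impute"

# For complete-data individuals the final and raw columns should agree.
same <- rowSums(abs(as.matrix(results[, final_cols]) -
                    as.matrix(results[, raw_cols]))) == 0

results[imputed, c("id", "nloci", final_cols, raw_cols)]
```

When comparing individuals with each other, the final columns are the ones to use, since the raw values for missing-data individuals are based on fewer loci.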
--- Additional attributes of the results object ---
At the end of calc_logprob the details of the algorithm used to calculate
the results are attached as attributes to the results object.
If your call to calc_logprob or geneplot is e.g.
Pop1_vs_Pop2_results <- calc_logprob(dat, c("Pop1","Pop2"), locnames=whaleLocnames)
then you would find out the attributes using attributes(Pop1_vs_Pop2_results)$saddlepoint, etc.
Other attributes attached to the results object are:
attributes(results)$min_loci – the minimum number of loci to require
for any individual to be assigned, so any individual with fewer loci
will be excluded from analysis
attributes(results)$n_too_few – the number of individuals that have
been excluded from the analysis because they had too few loci
attributes(results)$percent_missing – the percentage of individuals
that have been excluded, out of all those in the samples listed in
allpopnames
attributes(results)$qmat – the values of the plotted quantiles
for the populations, with the % labels of the quantiles as the column names
e.g. if quantiles=c(0.05,0.99) was the input to calc_logprob then
qmat will be of the form
     | 5% | 99% |
Pop1 | xx | xx  |
Pop2 | xx | xx  |
attributes(results)$allele_freqs – the posterior estimates of the
allele frequencies for the populations, as a list, where each element
of the list corresponds to one locus (and the list elements are named
with the loci names), and at a single locus the allele frequencies are
given as a matrix with the allele type names as the columns and the
reference populations as the rows e.g. one locus example:
$TR3G2
     | 150   | 158   | 168    | 172    | 176    | 180    |
Pop1 | 0.125 | 0.125 | 12.125 | 26.125 | 22.125 | 12.125 |
Pop2 | 1.125 | 2.125 | 13.125 | 29.125 | 21.125 | 10.125 |
These are allele COUNT estimates, NOT PROPORTION estimates, so they do not need to add up to 1.
attributes(results)$allpopnames – a vector of refpopnames, followed by includepopnames
i.e. allpopnames <- c(refpopnames, includepopnames)
attributes(results)$refpopnames – vector of reference population names
attributes(results)$includepopnames – vector of included pop names for assignment
attributes(results)$saddlepoint – TRUE/FALSE for whether saddlepoint was used
attributes(results)$leave_one_out – TRUE/FALSE for whether leave_one_out was used
attributes(results)$logten – TRUE/FALSE for whether log_10 was used (TRUE) or
log_e was used (FALSE)
attributes(results)$prior – "Rannala"/"Baudouin", for whether Rannala
and Mountain or Baudouin and Lebrun prior was used (see McMillan & Fewster, 2017 Biometrics)
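Because the allele_freqs matrices hold posterior allele counts rather than proportions, a reader who wants proportions can rescale each row. The sketch below rebuilds the TRUE3G2-style example as a plain matrix (so it runs without the package) using the TR3G2 values shown above, and normalises each population's row with base R's prop.table.

```r
# The TR3G2 example from the text, rebuilt by hand as a matrix.
tr3g2 <- matrix(
  c(0.125, 0.125, 12.125, 26.125, 22.125, 12.125,
    1.125, 2.125, 13.125, 29.125, 21.125, 10.125),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Pop1", "Pop2"),
                  c("150", "158", "168", "172", "176", "180"))
)

# Rescale each row (population) so that its entries sum to 1.
props <- prop.table(tr3g2, margin = 1)
round(props, 3)
rowSums(props)

# Applied to a real results object, the same idea would be (untested sketch):
# lapply(attributes(results)$allele_freqs, prop.table, margin = 1)
```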
Log-Genotype-Probability calculations based on the method of Rannala and Mountain (1997) as implemented in GeneClass2, updated to allow for individuals with missing data and to enable accurate calculations of quantiles of the Log-Genotype-Probability distributions of the reference populations. See McMillan and Fewster (2017) for details. Coded by Rachel Fewster and Louise McMillan.
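As a rough illustration of what a quantile of the Log-Genotype-Probability distribution means, the sketch below estimates the 1% quantile by simulation, in the spirit of the saddlepoint=FALSE option with Ndraw draws. It is not the package's algorithm: the function genotype_probs, the posterior parameters in alphas and the reduced number of draws are all assumptions chosen to keep the example small.

```r
set.seed(1)

# Probabilities of every unordered genotype at one locus, given posterior
# Dirichlet parameters alpha (Dirichlet-multinomial model; illustrative only).
genotype_probs <- function(alpha) {
  A <- sum(alpha)
  p <- outer(alpha, alpha) * 2 / (A * (A + 1))      # heterozygotes
  diag(p) <- alpha * (alpha + 1) / (A * (A + 1))    # homozygotes
  p[upper.tri(p, diag = TRUE)]                      # each genotype once
}

# Hypothetical posterior parameters for three loci in one reference population.
alphas <- list(c(12.5, 26.5, 2.5), c(20.5, 20.5), c(5.5, 30.5, 4.5, 1.5))
Ndraw <- 1e4   # smaller than the package default, to keep the sketch fast

# Simulate genotypes locus by locus and sum the log10 probabilities over loci.
lgp_draws <- rowSums(sapply(alphas, function(alpha) {
  p <- genotype_probs(alpha)
  log10(sample(p, Ndraw, replace = TRUE, prob = p))
}))

# About 1% of simulated genotypes fall below q01.
q01 <- quantile(lgp_draws, 0.01)

# The 100% quantile is the largest value any genotype can attain.
q100 <- sum(sapply(alphas, function(alpha) log10(max(genotype_probs(alpha)))))
```

The saddlepoint method described by McMillan and Fewster (2017) replaces this kind of simulation with an analytic approximation.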
McMillan, L., and Fewster, R. (2017). Visualizations for genetic assignment analyses using the saddlepoint approximation method. Biometrics.
Rannala, B., and Mountain, J. L. (1997). Detecting immigration by using multilocus genotypes. Proceedings of the National Academy of Sciences 94, 9197–9201.
Piry, S., Alapetite, A., Cornuet, J.-M., Paetkau, D., Baudouin, L., and Estoup, A. (2004). GENECLASS2: A software for genetic assignment and first-generation migrant detection. Journal of Heredity 95, 536–539.