geneplot: Run GenePlot and produce results and plots for the selected...
In lfmcmillan/geneplot: Genetic Assignment and Plotting

geneplot

R Documentation

Run GenePlot and produce results and plots for the selected populations.

Description

All individuals in dat whose population matches one of refpopnames will be included. Leave-one-out will be used when calculating the log-genotype probability for individual with respect to their own reference population, unless specified in the inputs (default is use leave-one-out).

Usage

geneplot(
  dat,
  refpopnames,
  locnames,
  includepopnames = NULL,
  prior = "Rannala",
  saddlepoint = T,
  leave_one_out = T,
  logten = T,
  min_loci = 6,
  quantiles = c(0.01, 1),
  Ndraw = 1e+05,
  plotit = T,
  plot_type = switch(as.character(length(refpopnames)), `2` = "twopop", "manypop"),
  plot_bars = F,
  colvec = NA,
  shapevec = NA,
  mark_impute = F,
  txt = "points",
  use_legend = T,
  legend_pos = "bottomleft",
  xyrange = NULL,
  orderpop = NULL,
  axispop = NULL,
  axis_labels = NULL,
  short_axis_labels = F,
  grayscale_quantiles = F,
  dim1 = 1,
  dim2 = 2,
  layout_already_set = F,
  cexpts = 1.4
)

Arguments

`dat`	The data, in a data frame, with two columns labelled as 'id' and 'pop', and with two additional columns per locus. Missing data at any locus should be marked as '0' for each allele. The locus columns must be labelled in the format Loc1.a1, Loc1.a2, Loc2.a1, Loc2.a2, etc. Missing data must be for BOTH alleles at any locus. Missing data for ONE allele at any locus will produce an error. See `read_genepop_format` for details of how to import Genepop format data files into the appropriate format.
`refpopnames`	Character vector of reference population names, that must match the values in the 'pop' column of `dat`.
`locnames`	Character vector, names of the loci, which must match the column names in the data so e.g. if dat has columns id, pop, EV1.a1, EV1.a2, EV14.a1, EV14.a2, etc. then you could use 'locnames = c("EV1","EV14") etc. The locnames do not need to be in any particular order but all of them must be in `dat`.
`includepopnames`	Character vector (default NULL) of population names to be included in the calculations. All individuals with 'pop' value in `includepopnames` will have their Log-Genotype-Probabilities calculated with respect to all populations in `refpopnames`. This input parameter should be used to specify the population labels of any new/additional individuals that you want to compare to the reference pops. For example, if the reference pops are Pop1 and Pop2, and you have some new individuals which you have labelled as PopNew, then use `includepopnames=c("PopNew")` to compare those individuals to Pop1 and Pop2. You can specify the populations in any order, provided that they are all in `dat`. If NULL (default) then only the individuals from `refpopnames` will have their Log-Genotype-Probabilities calculated.
`prior`	(default="Rannala") String, either "Rannala" or "Baudouin", giving the choice of prior parameter for the Dirichlet priors for the allele frequency estimates. Both options define parameter values that depend on the number of alleles at each locus, k. "Baudouin" gives slightly more weight to rare alleles than "Rannala" does, or less weight to the data, so Baudouin may be more suitable for small reference samples, but there is no major difference between them. For more details, see McMillan and Fewster (2017), Biometrics. Additional options are "Half" or "Quarter" which specify parameters 1/2 or 1/4, respectively. These options have priors whose parameters do not depend on the number of alleles at each locus, and so may be more suitable for microsatellite data with varying numbers of alleles at each locus.
`saddlepoint`	Boolean (default TRUE), indicates whether or not to use the saddlepoint method for imputing missing data/leave-one-out results. For more details, see McMillan and Fewster (2017), Biometrics.
`leave_one_out`	Boolean (default TRUE), indicates whether or not to calculate leave-one-out results for any individual from the reference pops. If TRUE, any individual from a reference population will have their Log-Genotype-Probability with respect to their own reference population after temporarily removing the individual's genotype from the sample data for that reference population. The individual's Log-Genotype-Probabilities with respect to all populations they are not a member of will be calculated as normal. We STRONGLY RECOMMEND using leave-one-out=TRUE for any small reference samples (<30).
`logten`	(default TRUE) Boolean, indicates whether to use base 10 for the logarithms, or base e (i.e. natural logarithms). logten=TRUE is default because it's easier to recalculate the original non-log numbers in your head when looking at the plots. Use FALSE for natural logarithms.
`min_loci`	(default 6) is the minimum number of loci that an individual must have (within the set of loci defined in `locnames`) for the individual to have its Log-Genotype-Probabilities calculated. Any individuals with fewer loci than this (i.e. with too many missing loci) will not have its Log-Genotype-Probabilities plotted. Thus if the calculations are based on 9 loci i.e. locnames is length 9, and min_loci=6, then every individual included in the results has data for at least 6 of these 9 loci.
`quantiles`	(default c(0.01,1.00)) Vector of probabilities, specifying the quantiles of the posterior distribution to be calculated. Default plots the 1% and 100% quantiles of the Log-Genotype-Probability distributions for each of the reference populations. For example, only 1% of all possible genotypes that could arise from the given population will have Log-Genotype-Probabilities below the 1% quantile, and 99% of all possible genotypes arising from that population will have Log-Genotype-Probabilities above the 1% quantile. The 100% quantile is the maximum possible Log-Genotype-Probability that any genotype can have with respect to this population. Quantile values will be provided as attributes to the output object of calc_logprob (see the Value section.) If no quantiles are wanted, supply quantiles=NULL.
`Ndraw`	(default 100000) is only used if saddlepoint=FALSE. Defines the number of draws that will be taken from the distribution of log-posterior genotype probabilities for each reference population. These draws, i.e. simulated genotypes from the posterior distributions of the reference populations, are used when imputing the log-genotype-probabilities for individuals with missing data, or when calculating quantiles of the distribution. For more details, see McMillan and Fewster (2017), Biometrics.
`plotit`	(default=TRUE) is whether to produce a plot, or whether to just do the calculations and spit out the table of log-genotype probabilities. FALSE just runs calc_logprob but not plot_logprob, so is equivalent to a call to calc_logprob.
`plot_type`	(default NULL) Can be used to specify "twopop" or "manypop" plots. Defaults to "twopop" for 2 reference pops (i.e. 2 pops listed in `refpopnames` in the call to `geneplot` or `calc_logprob`) and "manypop" for >2 reference pops.
`plot_bars`	(default FALSE) Specify what type of plot to use for >2 reference populations. FALSE (default) plots PCA of the outputs from`calc_logprob` i.e. runs PCA the log-genotype-probabilities for all the reference pops and plots two of the PCs (by default, PC1 and PC2). TRUE plots multiple bar charts, one per reference pop, with all individuals as bars, coloured according to their original pop. For the bar plots, individuals that are in one of the reference pops are ordered according to their Log-Genotype-Probability with respect to their own pop, and that ordering is then used to display them in all the other bar plots as well, so that all the bar plots show the individuals in the same order.
`colvec`	(default=rep(RColorBrewer::brewer.pal(12,"Paired")[c(1:10,12)], npop)[1:npop]) Vector of colours for plotting. The colours correspond to populations specified in the order of `c(refpopnames, includepopnames)`. Thus the first element of colvec corresponds to the first element of `refpopnames`; the last element of colvec corresponds to the last element of `includepopnames`. Colours can be specified using rgb objects, hexadecimal codes, or any of the R colour names (see http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf for a PDF of R colours).
`shapevec`	Vector of shapes for the plotting points. These are named shapes from the following list: "Circle", "Square", "Diamond", "TriangleUp", "TriangleDown", "OpenSquare", "OpenCircle", "OpenTriangleUp", "Plus", "Cross", "OpenDiamond", "OpenTriangleDown", "Asterisk" which correspond to the following pch values for R plots: 21, 22, 23, 24, 25, 0, 1, 2, 3, 4, 5, 6, 8. Do not use the numbers, use the words, which will be automatically converted within plot_logprob into the appropriate codes. The elements of shapevec correspond to the populations specified in the order of `c(refpopnames, includepopnames)`. Thus the first element of shapevec corresponds to the first element of `refpopnames`; the last element of shapevec corresponds to the last element of `includepopnames`. Defaults to the list above, looping through as many times as required for all the populations.
`mark_impute`	(default FALSE) Boolean, indicates whether to mark individuals with missing data using asterisks.
`txt`	(default "points") Defines whether to plot individuals as points on the GenePlot (`"points"`), or whether to display the name of their population (`"pop"`), subpopulation (`"subpop"`), or ID (`"id"`) as text. For `"subpop"`, must have 'subpop' as one of the columns in the `dat` input to calc_logprob. Then 'subpop' will automatically be included in the `logprob_results` object.
`use_legend`	(default TRUE) Plot the legend (or FALSE for don't plot the legend).
`legend_pos`	(default "bottomleft") Define where to plot the legend, uses the same position labels as in the `legend` function e.g. "topright".
`xyrange`	(default NULL) Specify the xyrange as a vector, will be the same range for both axes. Default is slightly wider than the range of the calculated Log-Genotype-Probabilities for all individuals in the plot.
`orderpop`	Specify the plotting order for the populations. E.g. if orderpop=c("Pop4", "Pop2"), then points for individuals from Pop4 will be plotted first, then individuals from Pop2 will be plotted over the top of them, etc. Default is NULL, in which case populations are plotted in order of size, so the population with the largest number of points is plotted at the bottom, and the population with the smallest number of individuals/points is plotted over the top, so as not to be obscured.
`axispop`	is used when `length(refpopnames) == 2` i.e. when plot_type="twopop". It is of the form axispop=c(x="Pop1", y="Pop2"), meaning that the Pop1 reference population will be plotted on the x axis, and the Pop2 reference population will be plotted on the y axis. Default is NULL, which plots the first population in `refpopnames` on the x axis and the second population in `refpopnames` on the y axis.
`axis_labels`	(default NULL) Used for plots with 2 reference pops. Character vector, 2 elements, can be used to specify more readable axis labels. Defaults to the 'pop' labels in `logprob_results`.
`short_axis_labels`	(default FALSE) Used for plots with 2 reference pops. FALSE (default) gives full-length axis labels of the form "Log10 genotype probability for population Pop1" TRUE gives short-form axis labels of the form "LGP10 for population Pop1"
`grayscale_quantiles`	(default FALSE) Used for plots with 2 reference pops. FALSE (default) plots the quantile lines using colvec colours TRUE plots the quantile lines in gray (as the default colours can be quite pale, the grayscale quantile lines can be easier to see than the default coloured ones).
`dim1`	(default 1) Used for plots with more than 2 reference pops, when plot_bars=FALSE. Specifies which principal component should be plotted on x-axis.
`dim2`	(default 2) Used for plots with more than 2 reference pops, when plot_bars=FALSE. Specifies which principal component should be plotted on y-axis.
`layout_already_set`	(default=FALSE) Boolean, used for plots with more than 2 reference pops, when plot_bars=TRUE. Indicates whether the `layout` command, for arranging plots, has already been called by a higher-level function (or if the `par(mfrow)` command has been called earlier). TRUE prevents the layout command within `plot_logprob_barplot` from clashing with the higher-level command.
`cexpts`	(default 1.4) Specify the size of the points in the plot.

Details

Default is leave-one-out. We STRONGLY RECOMMEND USING LEAVE-ONE-OUT, ESPECIALLY FOR SMALL SAMPLES (<30) but also for larger samples.

includepopnames specifies which additional populations are plotted. It defaults to NULL, in which case only refpopnames will be plotted. includepopnames can specify the populations in any order. If refpopnames are missing from includepopnames, they will be added.

NOTE that if a population is not in includepopnames/refpopnames, then any alleles private to that population will NOT be included in the prior / posterior. Thus the posterior for a given refpop will change slightly depending on which populations are in includepopnames/refpopnames.

Value

The structure of the output from calc_logprob and/or geneplot is a data frame, with one row per individual.

The first two columns are "id" and "pop", as in the input data.

The next column (col3) is "status" which is "complete" or "impute" depending on whether the individual had data for all loci, or had some loci missing.

The next column (col4) is "nloci" which is how many loci the individual has data for

The next columns are the final/imputed log-genotype probabilities for the individual with respect to each of the reference populations. They are named in the form "Pop1", "Pop2" etc. corresponding to the names in the refpopnames input.

Then the final columns are the "raw" log-genotype probabilities for the same pops. These are named in the form "Pop1.raw", Pop2.raw", etc. again corresponding to the names in refpopnames.

For individuals with full data at all loci, i.e. no missing data, these two sets of columns will be the same, and give the individual's log-genotype probabilities with respect to each of the reference populations.

For individuals with missing data at some loci then the raw values are the log-genotype probabilities calculated based on the loci that *are* present in the data, and the final/imputed columns, at the start of the results data frame, are the imputed log-genotype probabilities for the full set of loci i.e. the final LGPs for the missing-data individuals are comparable to the final LGPs for the complete-data individuals.

—- Additional attributes of the results object —————————-

At the end of calc_logprob the details of the algorithm used to calculate the results are attached as attributes to the results object. If your call to calc_logprob or geneplot is e.g.

Pop1_vs_Pop2_results <- calc_logprob(dat, c("Pop1","Pop2"), locnames=whaleLocnames)

then you would find out the attributes using attributes(Pop1_vs_Pop2_results)$saddlepoint etc.

Other attributes attached to the results object are: attributes(results)$min_loci – the minimum number of loci to require for any individual to be assigned, so any individual with fewer loci will be excluded from analysis

attributes(results)$n_too_few – the number of individuals that have been excluded from the analysis because they had too few loci

attributes(results)$percent_missing – the percentage of individuals that have been excluded, out of all those in the samples listed in allpopnames

attributes(results)$qmat – the values of the plotted quantiles for the populations, with the % labels of the quantiles as the column names e.g. if quantiles=c(0.05,0.99) was the input to chart.func then qmat will be of the form

	5%	99%
Pop1	xx	xx
Pop2	xx	xx

attributes(results)$allele_freqs – the posterior estimates of the allele frequencies for the populations, as a list, where each element of the list corresponds to one locus (and the list elements are named with the loci names), and at a single locus the allele frequencies are given as a matrix with the allele type names as the columns and the reference populations as the rows e.g. one locus example:

$TR3G2

	150	158	168	172	176	180
Pop1	0.125	0.125	12.125	26.125	22.125	12.125
Pop2	1.125	2.125	13.125	29.125	21.125	10.125

These are allele COUNT estimates, NOT PROPORTION estimates, so they do not need to add up to 1.

attributes(results)$allpopnames – a vector of refpopnames, followed by includepopnames i.e. allpopnames <- c(refpopnames, includepopnames)

attributes(results)$refpopnames – vector of reference population names

attributes(results)$includepopnames – vector of included pop names for assignment

attributes(results)$saddlepoint – TRUE/FALSE for whether saddlepoint was used

attributes(results)$leave_one_out – TRUE/FALSE for whether leave_one_out was used

attributes(results)$logten – TRUE/FALSE for whether log_10 was used (TRUE) or log_e was used (FALSE)

attributes(results)$prior – "Rannala"/"Baudouin", for whether Rannala and Mountain or Baudouin and Lebrun prior was used (see McMillan & Fewster, 2017 Biometrics)

Author(s)

Log-Genotype-Probability calculations based on the method of Rannala and Mountain (1997) as implemented in GeneClass2, updated to allow for individuals with missing data and to enable accurate calculations of quantiles of the Log-Genotype-Probability distributions of the reference populations. See McMillan and Fewster (2017) for details.

References

McMillan, L. and Fewster, R. "Visualizations for genetic assignment analyses using the saddlepoint approximation method" (2017) Biometrics.

Rannala, B., and Mountain, J. L. (1997). Detecting immigration by using multilocus genotypes. Proceedings of the National Academy of Sciences 94, 9197–9201.

Piry, S., Alapetite, A., Cornuet, J.-M., Paetkau, D., Baudouin, L., and Estoup, A. (2004). GENECLASS2: A software for genetic assignment and first-generation migrant detection. Journal of Heredity 95, 536–539.

Examples

## Example dataset created directly within R (usually you would read in data from a file instead):
ratLocnames <- c("D10Rat20","D11Mgh5","D15Rat77","D16Rat81","D18Rat96",
                 "D19Mit2","D20Rat46","D2Rat234","D5Rat83","D7Rat13")
ratData <- rbind(
c("Ki001","Kai",96,128,246,280,234,250,155,165,226,232,219,231,149,149,101,127,174,176,164,182),
c("Ki002","Kai",122,126,246,276,238,238,155,165,226,232,223,231,187,187,107,121,174,174,164,164),
c("Ki003","Kai",122,122,276,280,234,234,157,165,244,244,231,231,187,187,107,107,174,174,164,182),
c("Ki004","Kai",130,130,276,280,238,238,157,165,0,0,223,231,187,187,101,111,168,176,184,184),
c("Ki009","Kai",122,122,276,276,234,236,165,165,240,244,229,231,187,187,89,101,174,176,164,164),
c("Ki010","Kai",122,122,278,280,236,236,155,165,236,244,219,231,185,187,101,101,168,174,164,164),
c("Ki011","Kai",120,128,280,282,236,238,155,165,226,236,223,231,149,149,99,101,174,174,164,164),
c("Bi01","Brok",96,126,280,280,236,250,165,165,232,246,231,231,185,187,89,89,170,176,154,164),
c("Bi02","Brok",96,126,280,280,250,262,155,155,232,232,231,233,149,185,127,127,174,174,164,166),
c("Bi03","Brok",96,126,280,280,258,262,165,165,232,232,231,231,185,187,89,127,174,174,164,164),
c("Bi04","Brok",96,126,280,280,238,262,155,155,232,232,231,233,149,185,127,127,174,174,164,164),
c("Bi05","Brok",96,122,280,280,250,258,155,155,226,244,231,231,187,187,107,127,174,176,164,164),
c("Bi06","Brok",96,96,280,280,238,262,155,155,232,232,231,231,187,187,123,127,174,174,164,164),
c("Bi11","Brok",96,96,278,280,234,250,165,165,226,240,231,231,149,187,89,99,170,170,154,164),
c("Bi12","Brok",96,96,276,280,234,250,165,165,240,240,231,231,187,187,89,99,170,174,154,164),
c("Bi13","Brok",96,126,276,276,246,250,165,165,226,244,231,231,149,187,99,99,174,174,164,164),
c("Bi14","Brok",96,126,276,276,262,262,155,165,226,244,231,231,149,187,89,107,170,174,154,164),
c("Ki092","Main",122,126,280,282,234,238,165,165,236,240,231,231,149,187,95,95,0,0,164,164),
c("Ki093","Main",122,126,282,282,238,238,165,165,236,240,231,231,149,187,95,107,166,174,164,182),
c("Ki094","Main",122,126,280,282,238,238,165,165,226,240,231,231,173,187,95,127,174,176,154,182),
c("Ki095","Main",120,126,280,280,234,236,155,165,244,246,231,231,161,187,123,127,174,174,154,154),
c("Ki097","Main",122,126,280,280,236,236,163,165,236,242,219,231,149,161,107,115,166,174,164,166),
c("Ki098","Main",96,122,276,280,236,238,155,165,242,244,233,233,149,187,99,107,174,174,164,164),
c("Ki100","Main",122,122,280,280,234,234,155,165,236,236,219,235,0,0,107,107,174,176,164,164),
c("Ki101","Main",122,126,276,280,234,238,155,155,236,244,229,231,0,0,101,101,0,0,164,182),
c("Ki102","Main",122,126,0,0,0,0,155,163,0,0,229,231,0,0,107,107,0,0,0,0),
c("Ki103","Main",122,122,280,280,234,236,163,165,0,0,231,233,0,0,99,107,0,0,164,184),
c("Ki104","Main",96,126,276,280,236,238,157,165,230,246,231,231,149,187,107,107,0,0,164,164),
c("Ki105","Main",122,126,276,280,238,250,157,165,226,244,217,231,0,0,111,121,174,174,164,164),
c("R01","Erad10",128,128,280,288,234,244,155,165,242,244,231,231,149,149,107,107,174,174,164,166),
c("R02","Erad10",128,130,276,288,238,244,155,155,228,244,223,231,149,149,101,111,174,174,164,166),
c("R03","Erad10",128,130,276,288,238,244,155,155,244,244,223,231,149,187,107,111,174,176,164,166))
ratData <- as.data.frame(ratData, stringsAsFactors=FALSE)
names(ratData) <- c("id","pop","D10Rat20.a1","D10Rat20.a2","D11Mgh5.a1","D11Mgh5.a2",
                    "D15Rat77.a1","D15Rat77.a2","D16Rat81.a1","D16Rat81.a2",
                    "D18Rat96.a1","D18Rat96.a2","D19Mit2.a1","D19Mit2.a2",
                    "D20Rat46.a1","D20Rat46.a2","D2Rat234.a1","D2Rat234.a2",
                    "D5Rat83.a1","D5Rat83.a2","D7Rat13.a1","D7Rat13.a2")

## Run GenePlot for 2 reference populations:
geneplot(dat=ratData,refpopnames=c("Kai","Main"),locnames=ratLocnames,
         prior="Baudouin", leave_one_out=TRUE,
         colvec=c("darkorchid4","steelblue"), shapevec=c("Circle","Square"),
         axis_labels=c("Log10 genotype probability for Kaikoura Island",
                       "Log10 Genotype probability for Mainland"))

## Run GenePlot for 2 reference populations, and include an extra group of
## individuals who are going to be compared to the 2 reference populations to
## see which reference population they are most similar to (note that we
## specify an additional colour and shape for the new individuals, but the
## axis labels stay the same because they correspond to the two reference
## populations):
results <- geneplot(dat=ratData,refpopnames=c("Kai","Main"),locnames=ratLocnames,
         includepopnames=c("Erad10"), prior="Baudouin", leave_one_out=TRUE,
         colvec=c("darkorchid4","steelblue","chartreuse4"),
         shapevec=c("Circle","Square","TriangleUp"),
         axis_labels=c("Log10 genotype probability for Kaikoura Island",
                       "Log10 Genotype probability for Mainland"))

## Barplot:
plot_logprob(results, plot_bars=TRUE,
         colvec=c("darkorchid4","steelblue","chartreuse4"))

## Rename Kai and Main as a single population and compare that population to Brok:
ratData2 <- ratData
ratData2$pop[which(ratData2$pop %in% c("Kai","Main"))] <- "KaiMain"
geneplot(dat=ratData2,refpopnames=c("KaiMain","Brok"),locnames=ratLocnames,
         prior="Rannala", leave_one_out=TRUE,
         colvec=c("forestgreen","darkgoldenrod"),
         shapevec=c("TriangleDown","Diamond"),
         axis_labels=c("Log10 genotype probability for Kaikoura and Mainland",
                       "Log10 Genotype probability for Broken Islands"))

## Example code for reading in a Genepop-format file
## Typical code to run locally on your home machine:
## genepopDat <- read_genepop_format("/home/data/genepop_format_example.gen",digits_per_allele=3)
## Working code to demonstrate the example, using a file in the package:
genepopDat <- read_genepop_format(system.file("extdata",
    "genepop_format_example.gen", package = "geneplot"),digits_per_allele=3)

## Extract the loci names (that were read in from the top of the file):
locnames <- genepopDat$locnames
## Separate out the data:
dat <- genepopDat$pop_data
## You could then run GenePlot on the data and locnames.
## Note that by default, data read in from Genepop format will have populations
## called Pop1, Pop2 etc. unless the individuals in that pop have non-unique
## names, in which case they will be given the ID of the first individual
## in that pop as their pop name and will be given auto-generated unique IDs.
geneplot(dat, refpopnames=c("Mahu","Taik"), includepopnames=c("Flat"), locnames=genepopDat$locnames)

lfmcmillan/geneplot documentation built on Nov. 27, 2024, 1:35 a.m.