fitTetra: Function to fit multiple mixture models to signal ratios of a...

Description Usage Arguments Details Value Author(s) References See Also Examples

Description

This function takes a data frame with allele signal ratios for multiple bi-allelic markers and samples, and fits multiple mixture models to a selected marker. It returns a list, reporting on the performance of these models, selecting the best one based on the BIC criterion, optionally plotting results.

Usage

1
2
3
4
fitTetra(marker, data, diplo = NA, select = TRUE, diploselect = TRUE, maxiter = 40, 
         maxn.bin = 200, nbin = 200, sd.threshold = 0.1, p.threshold = 0.99, 
         call.threshold = 0.6, peak.threshold = 0.85, try.HW = TRUE, dip.filter = 1, 
         sd.target =  NA, plot = "none", plot.type = "png", plot.dir = NA)

Arguments

marker

integer: specifies the marker number to analyze. "marker" is the index to the alphabetically sorted MarkerNames (see argument "data")

data

data frame for tetraploid samples, with (at least) columns "MarkerName", "SampleName", and "ratio", where ratio is the a allele signal divided by the sum of the a and b allele signals (ratio = a/(a+b)).

diplo

data frame like "data" with diploid samples. Facultative, does not affect model fitting. Diploid samples will be plotted in the plot of the best-fitting model if argument plot is "fitted" or "all". Genotypic scores for diploid samples are calculated according to the best-fitting model and are therefore from 0 (nulliplex) to 4 (quadruplex), as for the tetraploid samples.

select

boolean vector, recycled if shorter than the columns in data: indicates which rows are to be used (default: select=TRUE, i.e. keep all rows)

diploselect

as select, for diplo instead of data

maxiter

integer: the maximum number of times the nls function is called in CodomMarker (0 = no limit, default=500)

maxn.bin

integer, passed to CodomMarker, see there for explanation

nbin

integer, passed to CodomMarker, see there for explanation

sd.threshold

the maximum value allowed for the (constant) standard deviation on the arcsine - square root transformed scale, default 0.1. If the optimal model has a larger standard deviation the marker is rejected.

p.threshold

the minimum P-value required to assign a genotype to a sample; default 0.99. If the P-value for all 5 possible genotypes is less than p.threshold the sample is assigned genotype NA.

call.threshold

the minimum fraction of samples to have genotypes assigned ("called"); default 0.6. If under the optimal model the fraction of "called" samples is less than call.threshold the marker is rejected.

peak.threshold

the maximum allowed fraction of the scored samples that are in one peak; default 0.85. If any of the possible genotypes (peaks in the ratio histogram) contains more than peak.threshold of the samples the marker is rejected (because the remaining samples offer too little information for reliable model fitting)

try.HW

boolean: if TRUE (default), try models with and without a constraint on the mixing proportions according to Hardy-Weinberg equilibrium ratios. If FALSE, only try models without this constraint.

dip.filter

integer: if 1 (default), select best model only from models that do not have a dip (a lower peak surrounded by higher peaks: these are not expected under Hardy-Weinberg equilibrium or in cross progenies). If all fitted models have a dip still the best of these is selected. If 2, similar, but if all fitted models have a dip the marker is rejected. If 0, select from all fitted models including those with a dip.

sd.target

If the fitted standard deviation on the transformed scale is larger than sd.target a penalty is given (see Details); default NA i.e. no penalty is given.

plot

string, "none" (default), "fitted" or "all". If "fitted" a plot of the best fitting model and the assigned genotypes is saved with filename <marker number><marker name>.<plot.type>, preceded by "rejected_" if the marker was rejected. If "all" small images of all models are saved to files (8 per file) with filename <"plots"><marker number><A/B/C/D><marker name>. <plot.type> in addition to the plot of the best fitting model.

plot.type

string, "png" (default), "emf", "svg" or "pdf". Indicates format for saving the plots. If "emf" and the operating system is not Windows, "png" is used. If "emf" and the package "devEMF" is not installed, or if the specified format canot be produced for any other reason, "png" is used.

plot.dir

The directory where the plot files are to be saved; default NA. i.e. plot files saved in working directory.

Details

fitTetra fits a series of mixture models for the given marker by repeatedly calling CodomMarker and selects the optimal one. The models tested have four different models for the means of the mixture components: mutype 1, 2, 5 and 6 as described for CodomMarker, and one or two (depending on argument try.HW) models for the mixing proportions. These four or eight models are run using 2 or 3 different start configurations. The model with the smallest Bayesian Information Criterion (BIC) is selected, within the constraints specified by dip.filter. If sd.target is specified, the selection criterion is equal to BIC for models where (on the transformed scale) sd<sd.target, and to (sd.target/sd)*BIC where sd>sd.target (since BIC is negative, a larger sd results in a larger selection criterion which is less likely to be the minimum). The final model selected according to these criteria is then checked against call.threshold and peak.threshold and may still be rejected, in which case no fitted model is reported.

Value

a list with components:

log

a character vector with the lines of the log text

modeldata

a data frame with one row with the marker number, marker name, number of samples and (if the marker is not rejected) data of the fitted model (see below)

allmodeldata

a data frame with for each tried model one row with the marker number, marker name, number of samples and (if the marker is not rejected) data of the fitted model (see below)

scores

a data frame with the name and data for all samples (including NA's for the samples that were not selected, see parameter select): marker (same as argument marker), MarkerName, SampleName, model (a string describing the model),select (value of argument select for this data point),ratio (the given ratio from argument data), P0,P1,P2,P3,P4 (the probabilities that this sample belongs to each of the five mixture components), maxgeno (the genotype = mixture component with the highest P value), maxP (the P value for this genotype) and geno (the assigned genotype number: same as maxgeno, or NA if maxP<p.threshold). Maxgeno and geno numbers from 0 to 4: the allele dosage of the a allele.

diploscores

a data frame like scores for the samples in the data frame supplied with argument diplo. If diplo is NA also diploscores will be NA.

The modeldata and allmodeldata data frames present data on a fitted model. modeldata presents data on the selected model; allmodeldata lists all attempted. Both data frames contain the following columns:

marker

the sequential number of the marker (marker names are ordered alphabetically)

markername

the name of the marker

m

the number of the attempted or selected fit. The 8 (or 4 if try.HW is FALSE) models are tried with 2 or 3 start configurations, so m can range from 1 to 16 or 1 to 24.

model

the fitted model. Possible values are "b1", "b2", "b1,q", "b2,q", "b1 HW", "b2 HW", "b1,q HW" and "b2,q HW" where b1 and b2 indicate whether 1 or two parameters for signal background were fitted, q indicates that a quadratic term in the signal response was fitted, and HW indicates that the mixing proportions were constrained according to Hardy-Weinberg equilibrium ratios. For more details see Voorrips et al (2011)

nsamp

the number of samples for this marker for which select==TRUE, i.e. the number on which the call rate is based.

nsel

the number of these samples that have a non-NA ratio value

npar

the number of free parameters fitted

iter

the number of iterations to reach convergence

dip

0, 1 or 2, parameter passed to CodomMarker, see there for explanation.

LL

the log-likelihood of the fitted model

AIC

Akaike's Information Criterion

BIC

Bayesian Information Criterion

minsepar

a measure of the minimum peak separation. Each difference of the means of two successive mixture components is divided by the average of the standard deviations of the two components. The minimum of the four values is reported. All calculations are on the arcsine-square root transformed scale.

selcrit

The selection criterion; the model with the lowest selcrit is selected. If argument sd.target is NA selcrit is equal to BIC, else selcrit is larger than BIC, see Details for details.

meanP

For each sample the maximum probability of belonging to any mixture component is calculated. The average of these P values is reported in meanP

P80, P90, P95, P975, P99

the fraction of samples that have a probability of at least 0.8, 0.9, 0.95, 0.975 or 0.99 to belong to one of the five mixture components (by default a level of 0.99 is required to assign a genotype score to a sample)

muact0, muact1, muact2, muact3, muact4

the actual means of the samples in each of the five mixture components on the arcsine-square root transformed scale

sdact0, sdact1, sdact2, sdact3, sdact4

the actual standard deviations of the samples in each of the five mixture components on the transformed scale

mutrans0, mutrans1, mutrans2, mutrans3, mutrans4

the model means of the mixture components on the transformed scale

sdtrans0, sdtrans1, sdtrans2, sdtrans3, sdtrans4

the model standard deviations of the mixture components on transformed scale

P0, P1, P2, P3, P4

the mixing proportions of the five components

mu0, mu1, mu2, mu3, mu4

the model means of the five mixture components back-transformed to the original scale

sd0, sd1, sd2, sd3, sd4

the model standard deviations of the five mixture components back-transformed to the original scale

message

if no model was fitted, the reason is reported here

Author(s)

Roeland Voorrips: <roeland.voorrips@wur.nl>

References

Voorrips, RE, G Gort, B Vosman (2011) Genotype calling in tetraploid species from bi-allelic marker data using mixture models BMC Bioinformatics 12: 172.

See Also

saveMarkerModels CodomMarker fitTetra-package

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
data(tetra.potato.SNP)
data(diplo.potato.SNP)
df.tetra <- with(tetra.potato.SNP, data.frame(MarkerName=MarkerName, 
                 SampleName=SampleName, ratio=X_Raw/(X_Raw+Y_Raw)))
df.diplo <- with(diplo.potato.SNP, data.frame(MarkerName=MarkerName, 
                 SampleName=SampleName, ratio=X_Raw/(X_Raw+Y_Raw)))
# Single marker, multiple mixture models
unmix <- fitTetra(marker=4, data=df.tetra)
#unmix <- fitTetra(marker=4, data=df.tetra, plot='fitted')
#unmix <- fitTetra(marker=4, data=df.tetra, diplo=df.diplo, plot='all')

Example output



fitTetra documentation built on May 2, 2019, 3:52 p.m.