checkF1: Identify the best-fitting F1 segregation types

View source: R/exported_functions.R

checkF1    R Documentation

Identify the best-fitting F1 segregation types

Description

For a given set of F1 and parental samples, this function finds the best-fitting segregation type using either discrete or probabilistic input data. It can also perform a dosage shift prior to selecting the segregation type.

Usage

checkF1(
  input_type = "discrete",
  dosage_matrix,
  probgeno_df,
  parent1,
  parent2,
  F1,
  ancestors = character(0),
  polysomic,
  disomic,
  mixed,
  ploidy,
  ploidy2,
  outfile = "",
  critweight = c(1, 0.4, 0.4),
  Pvalue_threshold = 1e-04,
  fracInvalid_threshold = 0.05,
  fracNA_threshold = 0.25,
  shiftmarkers,
  parentsScoredWithF1 = TRUE,
  shiftParents = parentsScoredWithF1,
  showAll = FALSE,
  append_shf = FALSE
)

Arguments

input_type

Can be either one of 'discrete' or 'probabilistic'. For the former (default), a dosage_matrix must be supplied, while for the latter a probgeno_df must be supplied.

dosage_matrix

An integer matrix with markers in rows and individuals in columns.

probgeno_df

A data frame as read from the scores file produced by function saveMarkerModels of R package fitPoly, or alternatively, a data frame containing the following columns:

SampleName

Name of the sample (individual)

MarkerName

Name of the marker

P0

Probabilities of dosage score '0'

P1...

Probabilities of dosage score '1' etc. (up to max dosage, e.g. P4 for tetraploid population)

maxP

Maximum genotype probability identified for a particular individual and marker combination

maxgeno

Most probable dosage for a particular individual and marker combination

geno

Most probable dosage for a particular individual and marker combination, if maxP exceeds a user-defined threshold (e.g. 0.9), otherwise NA
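As an illustration of the column layout described above, the following sketch builds a tiny data frame by hand for a tetraploid (P0 to P4). The sample names, marker name, probabilities and the 0.9 threshold are all made-up values for illustration; real input would normally come from the fitPoly scores file.

```r
# Hypothetical genotype probabilities: 2 samples x 1 marker, tetraploid (P0..P4)
probs <- rbind(
  c(0.01, 0.95, 0.03, 0.01, 0.00),
  c(0.40, 0.45, 0.10, 0.05, 0.00)
)
colnames(probs) <- paste0("P", 0:4)

threshold <- 0.9  # assumed user-defined cut-off for accepting a genotype call

maxP    <- apply(probs, 1, max)
maxgeno <- apply(probs, 1, which.max) - 1L  # which.max is 1-based, dosages start at 0
geno    <- ifelse(maxP > threshold, maxgeno, NA_integer_)

probgeno_toy <- data.frame(
  SampleName = factor(c("P1", "F1_001")),
  MarkerName = factor("mrk001"),
  probs,            # expands to columns P0..P4
  maxP    = maxP,
  maxgeno = maxgeno,
  geno    = geno
)
print(probgeno_toy)
```

The second sample has no probability above the threshold, so its geno is NA while maxgeno still reports the most probable dosage.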

parent1

character vector with the sample names of parent 1

parent2

character vector with the sample names of parent 2

F1

character vector with the sample names of the F1 individuals

ancestors

character vector with the sample names of any other ancestors or other samples of interest. The dosages of these samples will be shown in the output (shifted if shiftParents is TRUE), but they are not used in the selection of the segregation type.

polysomic

if TRUE at least all polysomic segtypes are considered; if FALSE these are not specifically selected (but if e.g. disomic is TRUE, any polysomic segtypes that are also disomic will still be considered)

disomic

if TRUE at least all disomic segtypes are considered (see polysomic)

mixed

if TRUE at least all mixed segtypes are considered (see polysomic). A mixed segtype occurs when inheritance in one parent is polysomic (random chromosome pairing) and in the other parent disomic (fully preferential chromosome pairing)

ploidy

The ploidy of parent 1 (must be even, 2 (diploid) or larger).

ploidy2

The ploidy of parent 2. If omitted it is assumed to be equal to ploidy.

outfile

the tab-separated text file to which the output is written; if NA, a temporary file checkF1.tmp is created in the current working directory and deleted at the end of the run

critweight

NA or a numeric vector containing the weights of the three quality criteria; the weights do not need to sum to 1. If NA, the output will not contain a column qall_weights. Otherwise, the weights specify how qall_weights is calculated from quality parameters q1, q2 and q3.

Pvalue_threshold

a minimum threshold value for the Pvalue of the bestParentfit segtype (with a smaller Pvalue the q1 quality parameter will be set to 0)

fracInvalid_threshold

a maximum threshold for the fracInvalid of the bestParentfit segtype (with a larger fraction of invalid dosages in the F1 the q1 quality parameter will be set to 0)

fracNA_threshold

a maximum threshold for the fraction of unscored F1 samples (with a larger fraction of unscored samples in the F1 the q3 quality parameter will be set to 0)

shiftmarkers

if specified, shiftmarkers must be a data frame with columns MarkerName and shift. For the marker names that match exactly (including upper/lowercase) those in the input (either dosage_matrix or probgeno_df), the dosages are increased by the amount specified in column shift. For example, if shift is -1, dosages 2..ploidy are converted to 1..(ploidy-1) and the new dosage 0 combines the old dosages 0 and 1, for all samples. The segregation check is then performed with the shifted dosages. A shift of NA is allowed; such markers are not shifted. The sets of markers in the input (either dosage_matrix or probgeno_df) and in shiftmarkers may differ, but each marker may occur only once in shiftmarkers. A column shift is added at the end of the returned data frame.
If parameter shiftParents is TRUE, the parental and ancestor scores are shifted in the same way as the F1 scores; if FALSE, they are not shifted.
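The effect of shift = -1 described above can be sketched in a few lines of base R. This is only an illustration of the documented behaviour for that one case, not the package's internal code; the marker names are hypothetical, and the use of pmax() to merge old dosages 0 and 1 into the new class 0 is an assumption drawn from the description.

```r
# Hypothetical shiftmarkers table: one row per marker, shift = NA means "leave as is"
shiftmarkers_toy <- data.frame(
  MarkerName = c("mrk001", "mrk002", "mrk003"),
  shift      = c(-1, 0, NA),
  stringsAsFactors = FALSE
)
# Each marker may occur only once in shiftmarkers
stopifnot(!anyDuplicated(shiftmarkers_toy$MarkerName))

# Effect of shift = -1 on tetraploid dosages (assumed: values below 0 collapse to 0)
old_dosage <- c(0, 1, 2, 3, 4, NA)
shift      <- -1
new_dosage <- pmax(old_dosage + shift, 0)

print(rbind(old = old_dosage, new = new_dosage))
```

Old dosages 0 and 1 both end up in class 0, dosages 2 to 4 become 1 to 3, and missing scores stay missing.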

parentsScoredWithF1

TRUE if parents are scored in the same experiment and the same fitPoly run as the F1, else FALSE. If TRUE, their fraction missing scores and conflicts tell something about the quality of the scoring. If FALSE (e.g. when the F1 is triploid and the parents are diploid and tetraploid) the quality of the F1 scores can be independent of that of the parents.
If not specified, TRUE is assumed if ploidy2 == ploidy and FALSE if ploidy2 != ploidy

shiftParents

only used if parameter shiftmarkers is specified. If TRUE, apply the shifts also to the parental and ancestor scores. By default TRUE if parentsScoredWithF1 is TRUE

showAll

(default FALSE) if TRUE, for each segtype 3 columns are added to the returned data frame with the frqInvalid, Pvalue and matchParent values for that segtype (see the description of the return value)

append_shf

if TRUE and parameter shiftmarkers is specified, _shf is appended to all marker names where shift is not 0. This is not required for any of the functions in this package but may prevent duplicated marker names when using other software.

Details

For each marker, the function tests how well the different segregation types fit the observed parental and F1 dosages. The results are summarized by column bestParentfit (the best-fitting segregation type, taking into account both the F1 and the parental dosages) and columns qall_mult and/or qall_weights (how good the fit of the bestParentfit segtype is: 0=bad, 1=good).
Column bestfit in the results gives the segtype best fitting the F1 segregation without taking account of the parents. This bestfit segtype is used by function correctDosages, which tests for possible "shifts" in the marker models.
In case the parents are not scored together with the F1 (e.g. if the F1 is triploid and the parents are diploid and tetraploid) dosage_matrix should be edited to contain the parental as well as the F1 scores. In case the diploid and tetraploid parent are scored in the same run of function saveMarkerModels (from package fitPoly) the diploid is initially scored as nulliplex-duplex-quadruplex (dosage 0, 2 or 4); that must be converted to the true diploid dosage scores (0, 1 or 2). Similar corrections are needed with other combinations, such as a diploid parent scored together with a hexaploid population etc.
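The conversion of a diploid parent that was scored on the tetraploid scale (0, 2 or 4) back to true diploid dosages (0, 1 or 2) could look like the sketch below. The matrix and sample names are invented for illustration; only the 0/2/4 to 0/1/2 mapping comes from the text above.

```r
# Hypothetical integer dosage matrix: markers in rows, individuals in columns
dosage_toy <- matrix(
  c(0L, 2L, 4L, NA,   # diploid parent, scored on the tetraploid scale
    1L, 3L, 2L, 0L),  # a tetraploid sample, left untouched
  nrow = 4,
  dimnames = list(paste0("mrk", 1:4), c("P_dip", "P_tet"))
)

dip <- "P_dip"  # column(s) holding the diploid parent

# Guard: only nulliplex, duplex, quadruplex or missing scores are expected here
stopifnot(all(dosage_toy[, dip] %in% c(0L, 2L, 4L, NA)))

# Integer division keeps the matrix of integer type
dosage_toy[, dip] <- dosage_toy[, dip] %/% 2L

print(dosage_toy)
```

The guard makes the script stop if an odd dosage was scored for the diploid, which would indicate a scoring problem rather than something to halve silently.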

Value

A list containing two elements, checked_F1 and meta. meta is itself a list that stores the parameter settings used in running checkF1 which can be useful for later reference. The first element (checked_F1) contains the actual results: a data frame with one row per marker, with the following columns:

  • m: the sequential number of the marker (as assigned by fitPoly)

  • MarkerName: the name of the marker, with _shf appended if the marker is shifted and append_shf is TRUE

  • parent1: consensus dosage score of the samples of parent 1

  • parent2: consensus dosage score of the samples of parent 2

  • F1_0 ... F1_<ploidy>: the number of F1 samples with dosage scores 0 ... <ploidy>

  • F1_NA: the number of F1 samples with a missing dosage score

  • sample names of parents and ancestors: the dosage scores for those samples

  • bestfit: the best fitting segtype, considering only the F1 samples

  • frqInvalid_bestfit: for the bestfit segtype, the frequency of F1 samples with a dosage score that is invalid (that should not occur). The frequency is calculated as the number of invalid samples divided by the number of non-NA samples

  • Pvalue_bestfit: the chi-square test P-value for the observed distribution of dosage scores vs the expected fractions. For segtypes where only one dosage is expected (1_0, 1_1 etc) the binomial probability of the number of invalid scores is given, assuming an error rate of seg_invalidrate (hard-coded as 0.03)

  • matchParent_bestfit: indication how the bestfit segtype matches the consensus dosages of parent 1 and 2: "Unknown"=both parental dosages unknown; "No"=one or both parental dosages known and conflicting with the segtype; "OneOK"= only one parental dosage known, not conflicting with the segtype; "Yes"=both parental dosages known and combination matching with the segtype. This score is initially assigned based on only high-confidence parental consensus scores; if low-confidence dosages are confirmed by the F1, the matchParent for (only) the selected segtype is updated, as are the parental consensus scores.

  • bestParentfit: the best fitting segtype that does not conflict with the parental consensus scores

  • frqInvalid_bestParentfit, Pvalue_bestParentfit, matchParent_bestParentfit: same as the corresponding columns for bestfit. Note that matchParent_bestParentfit cannot be "No".

  • q1_segtypefit: a value from 0 (bad) to 1 (good), a measure of the fit of the bestParentfit segtype based on Pvalue, invalidP and whether bestfit is equal to bestParentfit

  • q2_parents: a value from 0 (bad) to 1 (good), based either on the quality of the parental scores (the number of missing scores and of conflicting scores, if parentsScoredWithF1 is TRUE) or on matchParents (No=0, Unknown=0.65, OneOK=0.9, Yes=1, if parentsScoredWithF1 is FALSE)

  • q3_fracscored: a value from 0 (bad) to 1 (good), based on the fraction of F1 samples that have a non-missing dosage score

  • qall_mult: a value from 0 (bad) to 1 (good), a summary quality score equal to the product q1*q2*q3. Equal to 0 if any of these is 0, hence sensitive to thresholds; a natural selection criterion would be to accept all markers with qall_mult > 0

  • qall_weights: a value from 0 (bad) to 1 (good), a weighted average of q1, q2 and q3, with weights as specified in parameter critweight. This column is present only if critweight is specified. In this case there is no "natural" threshold; a threshold for selection of markers must be obtained by inspecting XY-plots of markers over a range of qall_weights values

  • shift: if shiftmarkers is specified, a column shift is added that gives the applied shift for every marker (for unshifted markers the shift value is 0)
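To make the frqInvalid and Pvalue columns above more concrete, here is a rough sketch for one marker and one hypothetical segregation type that expects dosages 0 and 1 in a 1:1 ratio. The counts are invented, and testing only the valid classes with chisq.test() is an assumption; the package's own calculation may differ in detail.

```r
# Hypothetical F1 counts per dosage class (cf. columns F1_0 ... F1_4) and missing scores
f1_counts <- c(`0` = 48, `1` = 45, `2` = 3, `3` = 0, `4` = 0)
f1_na     <- 4

# Assumed expected fractions for the segtype under test
expected_frac <- c(`0` = 0.5, `1` = 0.5)

valid    <- names(f1_counts) %in% names(expected_frac)
n_scored <- sum(f1_counts)                 # non-NA samples only

# Invalid samples divided by non-NA samples, as described for frqInvalid_bestfit
frq_invalid <- sum(f1_counts[!valid]) / n_scored

# Chi-square goodness of fit on the valid classes (an assumption of this sketch)
p_value <- chisq.test(f1_counts[valid], p = expected_frac)$p.value

# Fraction of unscored F1 samples, cf. fracNA_threshold
frac_na <- f1_na / (n_scored + f1_na)

# Compare with the default thresholds from the Usage section
print(c(frqInvalid_ok = frq_invalid <= 0.05,
        Pvalue_ok     = p_value >= 1e-04,
        fracNA_ok     = frac_na <= 0.25))
```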

qall_mult and/or qall_weights can be used to compare the quality of the SNPs within one analysis and one F1 population but not between analyses or between different F1 populations.
If parameter showAll is TRUE there are 3 additional columns for each segtype with names frqInvalid_<segtype>, Pvalue_<segtype> and matchParent_<segtype>; see the corresponding columns for bestfit for an explanation. These extra columns are inserted directly before the bestfit column.
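The two summary scores can be reproduced from q1, q2 and q3 roughly as follows. The marker names and q values are invented, and normalising by sum(critweight) is an assumption based on the statement that the weights need not sum to 1; in real use the columns would be taken from the checked_F1 element of the result.

```r
# Hypothetical quality parameters for three markers
q <- data.frame(
  MarkerName = c("mrk1", "mrk2", "mrk3"),
  q1 = c(1, 0.80, 0),
  q2 = c(1, 0.90, 1),
  q3 = c(1, 0.95, 0.9),
  stringsAsFactors = FALSE
)
critweight <- c(1, 0.4, 0.4)  # default weights from the Usage section

# Product of the three criteria: 0 as soon as any single criterion is 0
q$qall_mult <- q$q1 * q$q2 * q$q3

# Weighted average (assumed normalisation by the sum of the weights)
q$qall_weights <- as.vector(
  as.matrix(q[, c("q1", "q2", "q3")]) %*% critweight
) / sum(critweight)

# The "natural" selection criterion mentioned above
selected <- q$MarkerName[q$qall_mult > 0]

print(q)
print(selected)
```

Marker mrk3 is rejected by qall_mult because q1 is 0, yet it still gets a non-zero qall_weights, which is why a threshold on qall_weights has to be chosen by inspection.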

Examples

## Not run: 
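# Discrete input: integer dosage matrix, markers in rows, individuals in columns
data("ALL_dosages")
chk1 <- checkF1(input_type = "discrete", dosage_matrix = ALL_dosages,
                parent1 = "P1", parent2 = "P2",
                F1 = setdiff(colnames(ALL_dosages), c("P1", "P2")),
                polysomic = TRUE, disomic = FALSE, mixed = FALSE,
                ploidy = 4)

# Probabilistic input: data frame of genotype probabilities
data("gp_df")
chk1 <- checkF1(input_type = "probabilistic", probgeno_df = gp_df,
                parent1 = "P1", parent2 = "P2",
                F1 = setdiff(levels(gp_df$SampleName), c("P1", "P2")),
                polysomic = TRUE, disomic = FALSE, mixed = FALSE,
                ploidy = 4)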

## End(Not run)

polymapR documentation built on Nov. 5, 2023, 1:09 a.m.