R/utils.outflank.r

Defines functions pTwoSidedFromChiSq pOutlierFinderChiSqNoCorr outputDFStarterNoCorr utils.outflank

Documented in utils.outflank

#' OutFLANK:  An Fst outlier approach by Mike Whitlock and Katie Lotterhos,
#' University of British Columbia.
#'
#' This function is the original implementation of Outflank by Whitlock and
#' Lotterhos. dartR simply provides a convenient wrapper around their functions
#' and an easier install being an r package (for information please refer to
#' their github repository)
#'
#' This method looks for Fst outliers from a list of Fst's for different loci.
#' It assumes that each locus has been genotyped in all populations with
#' approximately equal coverage.
#'
#' OutFLANK estimates the distribution of Fst based on a trimmed sample of Fst's.
#' It assumes that the majority of loci in the center of the distribution are
#' neutral and infers the shape of the distribution of neutral Fst using a
#' trimmed set of loci. Loci with the highest and lowest Fst's are trimmed from
#' the data set before this inference, and the distribution of Fst df/(mean Fst)
#' is assumed to'follow a chi-square distribution. Based on this inferred
#' distribution, each locus is given a q-value based on its quantile in the
#' inferred null'distribution.
#'
#' The main procedure is called OutFLANK -- see comments in that function
#' immediately below for input and output formats. The other functions here are
#' necessary and must be uploaded, but are not necessarily needed by the user
#' directly.
#'
#' Steps:
#'
#' @param FstDataFrame A data frame that includes a row for each locus, with
#' columns as follows:
#' \itemize{
#'                   \item $LocusName: a character string that uniquely names
#'                   each locus.
#'                    \item $FST: Fst calculated for this locus. (Kept here to
#'                     report the unbased Fst of the results)
#'                    \item $T1: The numerator of the estimator for Fst
#'                    (necessary, with $T2, to calculate mean Fst)
#'                    \item $T2: The denominator of the estimator of Fst
#'                    \item $FSTNoCorr: Fst calculated for this locus without
#'                    sample size correction. (Used to find outliers)
#'                    \item $T1NoCorr: The numerator of the estimator for Fst
#'                    without sample size correction (necessary, with $T2, to
#'                    calculate mean Fst)
#'                    \item $T2NoCorr: The denominator of the estimator of Fst
#'                    without sample size correction
#'                    \item $He: The heterozygosity of the locus (used to screen
#'                    out low heterozygosity loci that have a different distribution)
#'                    }
#'
#' @param LeftTrimFraction The proportion of loci that are trimmed from the
#' lower end of the range of Fst before the likelihood funciton is applied
#' [default 0.05].
#' @param RightTrimFraction The proportion of loci that are trimmed from the
#' upper end of the range of Fst before the likelihood funciton is applied
#' [default 0.05].
#' @param Hmin The minimum heterozygosity required before including calculations
#' from a locus [default 0.1].
#' @param NumberOfSamples The number of spatial locations included in the data
#' set.
#' @param qthreshold The desired false discovery rate threshold for calculating
#'  q-values [default 0.05].
#' @return
#'
#' The function returns a list with seven elements:
#' \itemize{
#' \item FSTbar: the mean FST inferred from loci not marked as outliers
#' \item FSTNoCorrbar: the mean FST (not corrected for sample size -gives an
#' upwardly biased estimate of FST)
#' \item dfInferred: the inferred number of degrees of freedom for the
#' chi-square distribution of neutral FST
#' \item numberLowFstOutliers: Number of loci flagged as having a significantly
#' low FST (not reliable)
#' \item numberHighFstOutliers: Number of loci identified as having
#' significantly high FST
#' \item results: a data frame with a row for each locus. This data frame
#' includes all the original columns in the
#'                    data set, and six new ones:
#'                    \itemize{
#'              \item $indexOrder (the original order of the input data set),
#'              \item $GoodH (Boolean variable which is TRUE if the expected
#'               heterozygosity is greater than the Hemin set by input),
#'              \item $OutlierFlag (TRUE if the method identifies the locus as
#'              an outlier, FALSE otherwise), and
#'              \item $q (the q-value for the test of neutrality for the locus)
#'              \item $pvalues (the p-value for the test of neutrality for the
#'              locus)
#'              \item $pvaluesRightTail the one-sided (right tail) p-value for
#'               a locus
#'              }
#'  }
#' @export
#' @author Bernd Gruber (bugs? Post to
#'  \url{https://groups.google.com/d/forum/dartr}); original implementation of
#'   Whitlock & Lotterhos

utils.outflank <- function(FstDataFrame,
                           LeftTrimFraction = 0.05,
                           RightTrimFraction = 0.05,
                           Hmin = 0.1,
                           NumberOfSamples,
                           qthreshold = 0.05) {
  # Setting up necessary columns in dataframe
  Fstdata <- outputDFStarterNoCorr(FstDataFrame, Hmin)

  # making working dataframe with real Fst (no NAs), storing NAs to add back
  # later
  # This also removes loci with He values lower than Hmin from the working data
  # frame
  nonkeepers <- which((is.na(Fstdata$FSTNoCorr)) | (Fstdata$He < Hmin))
  if (length(nonkeepers) > 0) {
    workingDataFrame <- Fstdata[-nonkeepers, ]
  } else {
    workingDataFrame <- Fstdata
  }

  storedDataFrameNA <- Fstdata[nonkeepers, ]

  # Finding upper and lower bounds for trimming (eliminating NAs, but not
  # negative FSTs)
  sortedDataFrame <-
    workingDataFrame[order(workingDataFrame$FSTNoCorr), ]

  NLociTotal <- length(sortedDataFrame$FSTNoCorr)
  SmallestKeeper <- ceiling(NLociTotal * LeftTrimFraction)
  LargestKeeper <- floor(NLociTotal * (1 - RightTrimFraction))
  LowTrimPoint <- sortedDataFrame$FSTNoCorr[[SmallestKeeper]]
  HighTrimPoint <- sortedDataFrame$FSTNoCorr[[LargestKeeper]]

  if (LowTrimPoint < 0) {
    writeLines(
      error(
        "ERROR: The smallest FST in the trimmed set must be > 0. Please use a larger LeftTrimFraction."
      )
    )
    return()
  }

  if (HighTrimPoint >= 1) {
    writeLines(
      error(
        "ERROR: The largest FST in the trimmed set must be < 1. Please use a larger RightTrimFraction."
      )
    )
    return()
  }

  # finding dfInferred and Fstbar iteratively
  putativeNeutralListTemp <- ifelse(workingDataFrame$FSTNoCorr > 0, TRUE, FALSE)

  oldOutlierFlag <- rep(FALSE, NLociTotal)

  # Note: All negative FST loci are maked as putative outliers, which will
  # need to be tested with the coalescent model later. In the
  # meantime, they are removed so as to not confuse the likelihood function.

  keepGoing <- TRUE
  count <- 0
  # writeLines(paste(mean(workingDataFrame$FSTNoCorr[putativeNeutralListTemp])))

  while (keepGoing) {
    count <- count + 1
    if (count > 19) {
      keepGoing <- FALSE
      writeLines(important("Exceeded iteration maximum.")) ### Try with
      # increased maximum value for count two lines above.
    }

    FstbarNoCorrTemp <- fstBarCalculatorNoCorr(workingDataFrame[putativeNeutralListTemp, ])

    dfInferredTemp <- EffectiveNumberSamplesMLE(
      workingDataFrame$FSTNoCorr[putativeNeutralListTemp],
      FstbarNoCorrTemp,
      NumberOfSamples,
      LowTrimPoint,
      HighTrimPoint
    )
    workingDataFrame <- pOutlierFinderChiSqNoCorr(
      workingDataFrame,
      FstbarNoCorrTemp,
      dfInferredTemp,
      qthreshold
    )

    #### mark all negative FSTs as outliers if lowest nonneg FST is outlier
    # (because negative Fst estimates can't be evaluated through
    #### the chi-square approach on their own)
    if (any(workingDataFrame$OutlierFlag[workingDataFrame$FSTNoCorr < LowTrimPoint])) {
      workingDataFrame$OutlierFlag[workingDataFrame$FSTNoCorr < 0] <- TRUE
    }

    #### Any loci previously marked as $OutlierFlag=TRUE remain so, even if
    # the new iteration doesn''t flag them as outliers
    #### workingDataFrame$OutlierFlag=!as.logical((!workingDataFrame$OutlierFlag)*(!oldOutlierFlag))

    # Resetting neutral list, and checking whether the outlier list has
    # stabilized
    putativeNeutralListTemp <- ifelse((!workingDataFrame$OutlierFlag), TRUE, FALSE)
    if (sum(putativeNeutralListTemp) == 0) {
      writeLines(error("No loci in neutral list...\n"))
      return(error("FAIL"))
    }

    if (identical(oldOutlierFlag, workingDataFrame$OutlierFlag)) {
      keepGoing <- FALSE
    }

    ###### if all in trimmed get IDed as outlier - return to user with
    # warning
    if (all(workingDataFrame$OutlierFlag[workingDataFrame$FSTNoCorr < LowTrimPoint])) {
      writeLines(
        error(
          "All loci with Fst below the lower (lefthand) trim point were marked as outliers. Re-run with larger LeftTrimFraction or smaller qthreshold."
        )
      )
      return(0)
    }

    if (all(workingDataFrame$OutlierFlag[workingDataFrame$FSTNoCorr > HighTrimPoint])) {
      writeLines(
        error(
          "All loci with Fst above the upper (righthand) trim point were marked as outliers. Re-run with smaller RightTrimFraction or smaller qthreshold."
        )
      )
      return(0)
    }

    oldOutlierFlag <- workingDataFrame$OutlierFlag

    # writeLines(paste(as.character(count),' ',as.character(sum(putativeNeutralListTemp))))
  }

  if (count > 19) {
    writeLines(important("Loop iteration limit exceeded."))
  }

  numberLowFstOutliers <- sum(workingDataFrame$OutlierFlag[(workingDataFrame$FSTNoCorr < LowTrimPoint)])
  numberHighFstOutliers <- sum(workingDataFrame$OutlierFlag[(workingDataFrame$FSTNoCorr > HighTrimPoint)])

  FSTbar <- fstBarCalculator(workingDataFrame[putativeNeutralListTemp, ])

  # merge NA list back to working list, and sort by original order\t
  resultsDataFrame <- rbind(workingDataFrame, storedDataFrameNA)
  resultsDataFrame <- resultsDataFrame[order(resultsDataFrame$indexOrder), ]
  # return new dataframe
  list(
    FSTbar = FSTbar,
    FSTNoCorrbar = FstbarNoCorrTemp,
    dfInferred = dfInferredTemp,
    numberLowFstOutliers = numberLowFstOutliers,
    numberHighFstOutliers = numberHighFstOutliers,
    results = resultsDataFrame
  )
}

outputDFStarterNoCorr <- function(FstDataFrame,
                                  Hmin = 0.1) {
  # This will take a given dataframe with $LocusName, $FST,$He, $T1, $T2, etc.
  # and initialize $indexOrder,$GoodH,$OutlierFlag (to 0),
  # and $q (to 1).

  # Hmin is the smallest allowable He for which a locus should be included in
  # the initial calculations. By default this requires that a
  # locus have heterozygosity equal to 10% or more.

  len <- length(FstDataFrame$FSTNoCorr)
  indexOrder <- seq(1, len)
  GoodH <- ifelse(FstDataFrame$He < Hmin, "lowH", "goodH")
  OutlierFlag <- ifelse(is.na(FstDataFrame$FSTNoCorr), NA, FALSE)
  qvalues <- rep(NA, len)
  pvalues <- rep(NA, len)
  pvaluesRightTail <- rep(NA, len)
  cbind(
    FstDataFrame,
    indexOrder,
    GoodH,
    qvalues,
    pvalues,
    pvaluesRightTail,
    OutlierFlag
  )
}

pOutlierFinderChiSqNoCorr <- function(DataList,
                                      Fstbar,
                                      dfInferred,
                                      qthreshold = 0.05) {
  # Finds outliers based on chi-squared distribution Takes given values of
  # dfInferred and Fstbar, and returns a list of p-values and
  # q-values for all loci based on chi-square. Assumes that the DataList
  # input has a column called $FSTNoCorr and that empty columns
  # exist for $qvalues and $OutlierFlag

  # Divide DataList into 3 lists: DataListGood has $FST>0; DataListNeg has
  # cases where $FST <=0; and DataListNA has cases where $FST is
  # NA. DataListNeg is necessary to keep separate here because these cases
  # do not have meaningful results with the chi-square aprach;
  # however, they do carry information.

  DataListGood <- DataList[which(DataList$FSTNoCorr > 0), ]
  DataListNonPosFst <- DataList[which(DataList$FSTNoCorr <= 0), ]
  DataListNA <- DataList[which(is.na(DataList$FSTNoCorr)), ]

  pList <- pTwoSidedFromChiSq(DataListGood$FSTNoCorr * (dfInferred) / Fstbar, dfInferred)
  pListRightTail <- 1 - pchisq(DataListGood$FSTNoCorr * (dfInferred) /
    Fstbar, dfInferred)

  # Note: Change made 13 June 2014; q-values now only calcualted on
  # right-tail one-sided p-values
  qtemp <- qvalue::qvalue(pListRightTail,
    fdr.level = qthreshold,
    pi0.method = "bootstrap"
  )
  # Note: Using the bootstrap method here seems OK, but if this causes
  # problems remove the pi0.method='bootstrap' in the previous line
  # to revert to the default.

  DataListGood$pvalues <- pList
  DataListGood$pvaluesRightTail <- pListRightTail
  DataListGood$qvalues <- qtemp$qvalues
  DataListGood$OutlierFlag <- qtemp$significant
  rbind(DataListGood, DataListNonPosFst, DataListNA)
}

pTwoSidedFromChiSq <- function(x,
                               df) {
  # Takes a value x, finds the two-sided p-value for comparison to a
  # chi-square distribution with df degrees of freedom.
  pOneSided <- pchisq(x, df)
  ifelse(pOneSided > 0.5, (1 - pOneSided) * 2, pOneSided * 2)
}

Try the dartR.popgen package in your browser

Any scripts or data that you put into this service are public.

dartR.popgen documentation built on June 28, 2024, 1:06 a.m.