summix_local: summix_local
In hendriau/Summix: Summix2: A suite of methods to estimate, adjust, and leverage substructure in genetic summary data

Description

Estimates local substructure mixture proportions in genetic summary data. Optionally, it also performs a selection scan that identifies potential regions of selection along the given chromosome.

Usage

summix_local(
  data,
  reference,
  observed,
  goodness.of.fit = TRUE,
  type = "variants",
  algorithm = "fastcatch",
  minVariants = 0,
  maxVariants = 0,
  maxWindowSize = 0,
  minWindowSize = 0,
  windowOverlap = 200,
  maxStepSize = 1000,
  diffThreshold = 0.02,
  NSimRef = NULL,
  override_fit = FALSE,
  override_removeSmallAnc = FALSE,
  selection_scan = FALSE,
  position_col = "POS",
  nSimSE = 1000
)

Arguments

`data`	a data frame of the observed group and reference group allele frequencies for N genetic variants on a single chromosome. Must contain a column specifying the genetic variant positions.
`reference`	a character vector of the column names for K reference groups.
`observed`	a character value that is the column name for the observed group.
`goodness.of.fit`	an option to override the default scaled objective and return the raw loss from SLSQP instead. Default is TRUE.
`type`	user choice of how to define window size; options "variants" and "bp" are available where "variants" defines window size as the number of variants in a given window and "bp" defines window size as the number of base pairs in a given window. Default is "variants".
`algorithm`	user choice of algorithm to define local substructure blocks; options "fastcatch" and "windows" are available. "windows" uses a fixed window size in a sliding-window algorithm. "fastcatch" allows dynamic window sizes. The "fastcatch" algorithm is recommended, although it is computationally slower. Default is "fastcatch".
`minVariants`	Used if algorithm = "fastcatch" and type = "variants". A numeric value that specifies the minimum number of genetic variants allowed to define a given window.
`maxVariants`	Used if type = "variants". A numeric value that specifies the maximum number of genetic variants allowed to define a given window.
`maxWindowSize`	Used if type = "bp". A numeric value that defines the maximum allowed window size by the number of base pairs in a given window.
`minWindowSize`	Used if algorithm = "fastcatch" and type = "bp". A numeric value that specifies the minimum number of base pairs allowed to define a given window.
`windowOverlap`	Used if algorithm = "windows". A numeric value that defines the number of variants or the number of base pairs that overlap between the given sliding windows. Default is 200.
`maxStepSize`	a numeric value that defines the maximum gap in base pairs between two consecutive genetic variants within a given window. Default is 1000.
`diffThreshold`	Used if algorithm = "fastcatch". A numeric value that defines the percent difference threshold to mark the end of a local substructure block. Default is 0.02.
`NSimRef`	Used if selection_scan = TRUE. A numeric vector of the sample sizes for each of the K reference groups, in the same order as the reference parameter. This is used in a simulation framework that calculates the standard error within each local substructure block.
`override_fit`	default is FALSE. If set as TRUE, the user will override the auto-stop of summix_local() that occurs if the global goodness of fit value is greater than 1.5 (indicating a poor fit of the reference data to the observed data).
`override_removeSmallAnc`	default is FALSE. If set as TRUE, the user will override the automatic removal of reference ancestries with global proportions below 2%; this is not recommended.
`selection_scan`	user option to perform a selection scan on the given chromosome. Default is FALSE. If set as TRUE, a test statistic will be calculated for each local substructure block. Note: the user can expect extended computation time if this option is set as TRUE.
`position_col`	a character value that is the column name for the genetic variants positions. Default is "POS".
`nSimSE`	user choice of number of internal simulations to run to calculate standard error of estimates. Default is 1000.
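
The input requirements described above can be checked before calling summix_local(). The following is a minimal base-R sketch and is not part of Summix: the toy data frame, every column name other than "POS", and the helper check_local_input() are hypothetical. It confirms that the position, reference, and observed columns are present, and counts consecutive variants whose gap exceeds maxStepSize. The requirement that positions be sorted is an assumption of this sketch, not something the documentation states.

```r
# Hypothetical toy input: one row per variant on a single chromosome.
# Column names other than "POS" are invented for illustration.
toy <- data.frame(
  POS    = c(100, 600, 900, 2500, 2700),
  ref_A  = c(0.10, 0.25, 0.40, 0.05, 0.60),
  ref_B  = c(0.30, 0.20, 0.35, 0.15, 0.55),
  obs_AF = c(0.20, 0.22, 0.38, 0.10, 0.58)
)

# Hypothetical helper (not a Summix function): basic sanity checks on the
# data, reference, observed, position_col, and maxStepSize arguments.
check_local_input <- function(data, reference, observed,
                              position_col = "POS", maxStepSize = 1000) {
  needed <- c(position_col, reference, observed)
  missing_cols <- setdiff(needed, names(data))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  pos <- data[[position_col]]
  # Assumption of this sketch: positions are in increasing order.
  if (is.unsorted(pos)) {
    stop("Positions are not sorted in increasing order")
  }
  gaps <- diff(pos)  # base-pair distance between consecutive variants
  list(n_variants   = nrow(data),
       n_large_gaps = sum(gaps > maxStepSize))
}

chk <- check_local_input(toy,
                         reference = c("ref_A", "ref_B"),
                         observed  = "obs_AF")
print(chk)
```

A nonzero n_large_gaps only flags where consecutive variants are farther apart than maxStepSize; how summix_local() itself handles such gaps is not shown here.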

Value

data frame with a row for each local substructure block and the following columns:

goodness.of.fit: scaled objective reflecting the fit of the reference data. Values between 0.5 and 1.5 indicate a moderate fit and should be used with caution. Values greater than 1.5 indicate a poor fit, and users should not perform further analyses using Summix.

iterations: number of iterations for SLSQP algorithm

time: time in seconds of SLSQP algorithm

filtered: number of SNPs not used in estimation due to missing values

K columns of mixture proportions of reference groups input into the function

nSNPs: number of SNPs in the given local substructure block
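
As an illustration of the goodness.of.fit guidance above, the sketch below labels each block using the stated cutoffs. It uses base R only and does not call Summix: the data frame values are made up for illustration, and fit_label() is a hypothetical helper. How a value of exactly 0.5 is classed is an assumption of this sketch.

```r
# Made-up values for illustration only; real output comes from summix_local().
blocks <- data.frame(
  goodness.of.fit = c(0.31, 0.82, 1.50, 2.10),
  nSNPs           = c(180, 240, 155, 201)
)

# Hypothetical helper (not a Summix function) applying the documented cutoffs:
# greater than 1.5 is a poor fit, 0.5 to 1.5 is a moderate fit.
fit_label <- function(gof) {
  ifelse(gof > 1.5, "poor",
         ifelse(gof >= 0.5, "moderate", "below 0.5"))
}

blocks$fit <- fit_label(blocks$goodness.of.fit)

# Keep only blocks that are not flagged as poor fit.
usable <- blocks[blocks$fit != "poor", ]
print(blocks)
```

Blocks labelled "moderate" are kept here but, per the guidance above, should still be interpreted with caution.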

Author(s)

Hayley Wolff (Stoneman), hayley.wolff@cuanschutz.edu

Audrey Hendricks, audrey.hendricks@cuanschutz.edu

References

https://github.com/hendriau/Summix2

Examples

data(ancestryData)
results <- summix_local(data = ancestryData,
                        reference = c("reference_AF_afr",
                                      "reference_AF_eas",
                                      "reference_AF_eur",
                                      "reference_AF_iam",
                                      "reference_AF_sas"),
                        NSimRef = c(704, 787, 741, 47, 545),
                        observed = "gnomad_AF_afr",
                        goodness.of.fit = TRUE,
                        type = "variants",
                        algorithm = "fastcatch",
                        minVariants = 150,
                        maxVariants = 250,
                        maxStepSize = 1000,
                        diffThreshold = .02,
                        override_fit = FALSE,
                        override_removeSmallAnc = TRUE,
                        selection_scan = FALSE,
                        position_col = "POS")
print(results$results)
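
Because NSimRef must follow the same order as reference, one way to guard against misordering is to keep the sample sizes in a named vector and reorder it by the reference names. This is a base-R sketch only: the named vector n_by_group is a hypothetical bookkeeping aid and not a Summix argument, and the sample sizes are the ones used in the example above.

```r
reference <- c("reference_AF_afr", "reference_AF_eas", "reference_AF_eur",
               "reference_AF_iam", "reference_AF_sas")

# Sample sizes keyed by reference column name, deliberately listed out of
# order. The names exist only for this sketch.
n_by_group <- c(reference_AF_eur = 741, reference_AF_afr = 704,
                reference_AF_sas = 545, reference_AF_eas = 787,
                reference_AF_iam = 47)

# Every reference group needs exactly one sample size.
stopifnot(setequal(names(n_by_group), reference))

# Reorder to match `reference`, then drop the names.
NSimRef <- unname(n_by_group[reference])
print(NSimRef)
```

The resulting vector can then be passed as the NSimRef argument alongside the same reference vector.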

hendriau/Summix documentation built on Nov. 13, 2024, 6:53 a.m.