In hendriau/Summix: Summix2: A suite of methods to estimate, adjust, and leverage substructure in genetic summary data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

Summix2

Summix2 is a suite of methods that detect and leverage substructure in genetic summary data. This package builds on Summix, a method that estimates and adjusts for substructure in genetic summary that was developed by the Hendricks Research Team at the University of Colorado Denver.

Find more details about Summix in our manuscript published in the American Journal of Human Genetics.

For individual function specifics in Summix2:

summix --- fast forward to example

adjAF --- fast forward to example

summix_local --- fast forward to example

Package Installation

To install this package, start R (version “4.3”) and run the following commands:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("Summix")

summix {#summix}

The summix() function estimates mixture proportions of reference groups within genetic summary (allele frequency) data using sequential quadratic programming performed with the slsqp() function in the nloptr package.

summix() Input

Mandatory parameters are:

data: A data frame of the observed and reference group allele frequencies for N genetic variants.
reference: A character vector of the column names for K reference groups.
observed: A character value that is the column name for the observed group.

Optional parameters are:

pi.start: Numeric vector of length K containing the user's initial guess for the reference group proportions. If not specified, this defaults to 1/K where K is the number of reference groups.
goodness.of.fit: Default value is TRUE. If set as FALSE, the user will override the default goodness of fit measure and return the raw objective loss from slsqp().
override_removeSmallRef: Default value is FALSE. If set as TRUE, the user will override the automatic removal of reference groups with \<1% global proportions - this is not recommended.
network: Default value is FALSE. If set as TRUE, function will return a network diagram with nodes as estimated substructure proportions and edges as degree of similarity between the given node pair.
N_reference: numeric vector of the sample sizes for each of the K reference groups; must be specified if network = "TRUE".
reference_colors: A character vector of length K that specifies the color each reference group node in the network plot. If not specified, this defaults to K random colors.

summix() Output

A data frame with the following columns:

goodness.of.fit: Scaled objective loss from slsqp() reflecting the fit of the reference data. Values between 0.5-1.5 are considered moderate fit and should be used with caution. Values greater than 1.5 indicate poor fit, and users should not perform further analyses using Summix2.
iterations: The number of iterations for the SLSQP algorithm before best-fit reference group proportion estimates are found.
time: The time in seconds before best-fit reference group mixture proportion estimations are found by the SLSQP algorithm.
filtered: The number of genetic variants not used in the reference group mixture proportion estimation due to missing values.
K columns of mixture proportions of reference groups input into the function.

Additional Output if summix_network = TRUE:

Summix Network: A network diagram with nodes as estimated substructure proportions and edges as degree of similarity between the given node pair; exportable as a png.

adjAF {#adjaf}

The adjAF() function adjusts allele frequencies to match reference group substructure mixture proportions in a given target group or individual.

adjAF() Input

Mandatory parameters are:

data: A data frame containing the unadjusted allele frequency for the observed group and K reference group allele frequencies for N genetic variants.
reference: A character vector of the column names for K reference groups.
observed: A character value that is the column name for the observed group.
pi.target: A numeric vector of the mixture proportions for K reference groups in the target group or individual.
pi.observed: A numeric vector of the mixture proportions for K reference groups in the observed group.
N_reference: A numeric vector of the sample sizes for each of the K reference groups that is in the same order as the reference parameter.
N_observed: A numeric value of the sample size of the observed group.
adj_method: User choice of method for the allele frequency adjustment; options "average" and "leave_one_out" are available. Defaults to "average".
filter: Sets adjusted allele frequencies equal to 1 if > 1, to 0 if > -.005 and \< 0, and removes adjusted allele frequencies \< -.005. Default is TRUE.

adjAF() Output

A data frame with the following columns:

pi: A table of input reference groups, pi.observed, and pi.target.
observed.data: The name of the data column for the observed group from which the adjusted allele frequencies are estimated.
Nsnps: The number of SNPs for which adjusted AF is estimated.
adjusted.AF: A data frame of original data with an appended column of adjusted allele frequencies.
effective.sample.size: The sample size of individuals effectively represented by the adjusted allele frequencies.

summix_local {#summix_local}

The summix_local() function estimates local substructure mixture proportions in genetic summary data using the same slspq() functionality as summix(). summix_local() also performs a selection scan (optional) that identifies regions of selection along the given chromosome.

summix_local() Input

Mandatory parameters are:

data: A data frame of the observed group and reference group allele frequencies for N genetic variants on a single chromosome. Must contain a column specifying the genetic variant positions.
reference: A character vector of the column names for K reference groups.
observed: A character value that is the column name for the observed group.
position_col: A character value that is the column name for the genetic variants positions. Default is "POS".
maxStepSize: A numeric value that defines the maximum gap in base pairs between two consecutive genetic variants within a given window. Default is 1000.

Optional parameters are:

algorithm: User choice of algorithm to define local substructure blocks; options "fastcatch" and "windows" are available. "windows" uses a fixed window in a sliding windows algorithm. "fastcatch" allows dynamic window sizes. The "fastcatch" algorithm is recommended- though it is computationally slower. Default is "fastcatch".
type: User choice of how to define window size; options "variants" and "bp" are available where "variants" defines window size as the number of variants in a given window and "bp" defines window size as the number of base pairs in a given window. Default is "variants".
override_fit: Default is FALSE. If set as TRUE, the user will override the auto-stop of summix_local() that occurs if the global goodness of fit value is greater than 1.5 (indicating a poor fit of the reference data to the observed data).
override_removeSmallAnc: Default is FALSE. If set as TRUE, the user will override the automatic removal of reference ancestries with \<2% global proportions -- this is not recommended.
selection_scan: User option to perform a selection scan on the given chromosome. Default is FALSE. If set as TRUE, a test statistic will be calculated for each local substructure block. Note: the user can expect extended computation time if this option is set as TRUE.

Conditional parameters are:

If algorithm = "windows":

windowOverlap: A numeric value that defines the number of variants or the number of base pairs that overlap between the given sliding windows. Default is 200.

If algorithm = "fastcatch":

diffThreshold: A numeric value that defines the percent difference threshold to mark the end of a local substructure block. Default is 0.02.

If type = "variants":

maxVariants: A numeric value that specifies the maximum number of genetic variants allowed to define a given window.

If type = "bp":

maxWindowSize: A numeric value that defines the maximum allowed window size by the number of base pairs in a given window.

If algorithm = "fastcatch" and type = "variants":

minVariants: A numeric value that specifies the minimum number of genetic variants allowed to define a given window.

If algorithm = "fastcatch" and type = "bp":

minWindowSize: A numeric value that specifies the minimum number of base pairs allowed to define a given window.

If selection_scan = TRUE:

NSimRef: A numeric vector of the sample sizes for each of the K reference groups that is in the same order as the reference parameter. This is used in a simulation framework that calculates within local block substructure standard error.

summix_local() Output

A data frame with a row for each local substructure block and the following columns:

goodness.of.fit: Scaled objective loss from slsqp() reflecting the fit of the reference data. Values between 0.5-1.5 are considered moderate fit and should be used with caution. Values greater than 1.5 indicate poor fit, and users should not perform further analyses using Summix2.
iterations: The number of iterations for the SLSQP algorithm before best-fit reference group mixture proportion estimations are found.
time: The time in seconds before best-fit reference group mixture proportion estimations are found by the SLSQP algorithm.
filtered: The number of genetic variants not used in the reference group mixture proportion estimation due to missing values.
K columns of mixture proportions of reference group in the given local substructure block.
nSNPs: The number of genetic variants in the given local substructure block.

Additional Output if selection_scan = TRUE:

K columns of local substructure test statistics for each reference group in the given local substructure block.
K columns of p-values for each reference group in the given local substructure block. P-values calculated using the Student's t-distribution with degrees of freedom=(nSNPs in the block)-1.

Examples using toy data in the Summix package

For quick runs of all demos, we suggest using the data saved within the Summix library called ancestryData.

A quick demo of summix() {#a-quick-demo-of-summix}

The commands:

library(Summix)

# load the data
data("ancestryData")

# Estimate 5 reference group proportion values for the gnomAD African/African American group
# using a starting guess of .2 for each estimated proportion.
summix(data = ancestryData,
    reference=c("reference_AF_afr",
        "reference_AF_eas",
        "reference_AF_eur",
        "reference_AF_iam",
        "reference_AF_sas"),
    observed="gnomad_AF_afr",
    pi.start = c(.2, .2, .2, .2, .2),
    goodness.of.fit=TRUE)

Below is an example of creating a Summix Network plot.

Summix_output<- summix(data = ancestryData,
    reference=c("reference_AF_afr",
        "reference_AF_eas",
        "reference_AF_eur",
        "reference_AF_iam",
        "reference_AF_sas"),
    observed="gnomad_AF_afr",
    pi.start = c(.2, .2, .2, .2, .2),
    goodness.of.fit=TRUE,
    network = TRUE, 
    N_reference = c(704, 787, 741, 47, 545), reference_colors = c("#FDE725FF", "#5DC863FF", "#21908CFF", "#3B528BFF", "#440154FF"))

Summix_Network <- Summix_output[[2]]

Summix_Network

A quick demo of adjAF() {#a-quick-demo-of-adjaf}

The commands:

library(Summix)

# load the data
data("ancestryData")


adjusted_data<-adjAF(data = ancestryData,
     reference = c("reference_AF_afr", "reference_AF_eur"),
     observed = "gnomad_AF_afr",
     pi.target = c(1, 0),
     pi.observed = c(.85, .15),
     adj_method = 'average',
     N_reference = c(704,741),
     N_observed = 20744,
     filter = TRUE)
print(adjusted_data$adjusted.AF[1:5,])

A quick demo of summix_local() {#a-quick-demo-of-summix_local}

The commands:

library(Summix)

# load the data
data("ancestryData")

results <- summix_local(data = ancestryData,
                        reference = c("reference_AF_afr", 
                                      "reference_AF_eas", 
                                      "reference_AF_eur", 
                                      "reference_AF_iam", 
                                      "reference_AF_sas"),
                        NSimRef = c(704,787,741,47,545),
                        observed="gnomad_AF_afr",
                        goodness.of.fit = T,
                        type = "variants",
                        algorithm = "fastcatch",
                        minVariants = 150,
                        maxVariants = 250,
                        maxStepSize = 1000,
                        diffThreshold = .02,
                        override_fit = F,
                        override_removeSmallAnc = TRUE,
                        selection_scan = T,
                        position_col = "POS")

print(results$results)

Below is an example of plotting the reference group proportions estimated in each block using summix_local(); where asterisks indicate local substructure blocks that are at least nominally significant (p-value\<=.05).

suppressPackageStartupMessages(library(scales))
suppressPackageStartupMessages(library(viridis))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidyr))
suppressPackageStartupMessages(library(ggplot2))

makeSubstructurePlot <- function(testResults, reference,
                             title = "Substructure Plot", addMarks = F,
                             include_table = F, include_plot = T,
                             totalBlocks = NULL, nominal = F) {
  colors <- viridis(length(reference))
  reference <- sort(reference)
  testResults$Block <- as.numeric(rownames(testResults))
  #make long format for plotting
  plotData <- testResults %>% pivot_longer(all_of(reference))
  if(addMarks == F) {
    ggplot(plotData, aes(fill = name, x = (End_Pos + Start_Pos)/2,
                         y = round(value, 4))) +
      geom_bar(position = "fill", stat = "identity",
               width = plotData$End_Pos - plotData$Start_Pos) +
      scale_fill_manual(values = colors) +
      scale_x_continuous(labels = comma) + xlab("Position") +
      ylab("Proportion") + ggtitle(title)
  } else {
    allMarks <- data.frame(matrix(nrow = nrow(testResults), ncol = 3))
    colnames(allMarks) <- c("x", "y", "mark")
    pvals <- testResults[grep("p.", colnames(testResults))]
    pvals <- pvals[,order(colnames(pvals))]

    # for any block that exceeds 95th percentile of loss/1000SNPs set
    # pvals = 1 so they won't be significant
    toSet <- which(testResults$objective > quantile(testResults$objective, .95))
    pvals[toSet,] <- 1
    pvals2 <- apply(pvals, 1, min)

    allMarks$x <- (testResults$End_Pos+testResults$Start_Pos)/2
    allMarks$y <- 1.03
    allMarks$i <- 1:nrow(allMarks)
    if(nominal == F) {
      allMarks <- allMarks %>% slice(which(pvals2 < .05/totalBlocks/(length(reference)-1)))
    } else {
      allMarks <- allMarks %>% slice(which(pvals2 < .05))
    }


    if(nrow(allMarks) > 0) {
      for(i in 1:nrow(allMarks)) {
        allMarks[i,]$mark <- "*"
      }
    }
    plotData <- testResults %>%
      pivot_longer(all_of(reference))
    if(nominal == F) {
      p <- ggplot(plotData, aes(fill = name, x = (End_Pos + Start_Pos)/2,
                                y = round(value, 4))) +
        geom_bar(position = "fill", stat = "identity",
                 width = plotData$End_Pos - plotData$Start_Pos) +
        scale_fill_manual(values = colors) +
        scale_x_continuous(labels = comma) + xlab("Position") +
        ylab("Proportion") + ggtitle(title) + theme_bw() +
        annotate("text", x = allMarks$x, y = allMarks$y, label = allMarks$mark,
                 color = "black", size = 6)
    } else {
      p <- ggplot(plotData, aes(fill = name, x = (End_Pos + Start_Pos)/2,
                                y = round(value, 4))) +
        geom_bar(position = "fill", stat = "identity",
                 width = plotData$End_Pos - plotData$Start_Pos) +
        scale_fill_manual(values = colors) +
        scale_x_continuous(labels = comma) + xlab("Position") +
        ylab("Proportion") + ggtitle(title) + theme_bw() +
        annotate("text", x = allMarks$x, y = allMarks$y, label = allMarks$mark,
                 color = "grey50", size = 4)
    }


    if(nrow(allMarks) == 0 | include_table == F) {
      p
    } else {
      allMarks$`Start Position` <- testResults[allMarks$i,]$Start_Pos
      allMarks$`End Position` <- testResults[allMarks$i,]$End_Pos
      allMarks <- cbind(allMarks, round(pvals[allMarks$i,],8))
      allMarks$x <- NULL
      allMarks$y <- NULL
      allMarks$mark <-NULL
      allMarks$i <- NULL
      t <- tableGrob(allMarks, theme = ttheme_minimal(base_size = 9),
                     rows = rep("", nrow(allMarks)))
      if(include_plot == F) {
        grid.arrange(t)
      } else {
        grid.arrange(p, t, nrow = 2, heights = c(2.5, 1))
      }
    }
  }
}

# makeSubstructurePlot(results$results,
#                  reference = c("reference_AF_afr", "reference_AF_eas", "reference_AF_eur", "reference_AF_iam", "reference_AF_sas"),
#                  title = "Example Local Substructure",
#                  totalBlocks = dim(results$results)[1],
#                  addMarks = T,
#                  nominal = T)

hendriau/Summix documentation built on Nov. 13, 2024, 6:53 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

hendriau/Summix
Summix2: A suite of methods to estimate, adjust, and leverage substructure in genetic summary data

In hendriau/Summix: Summix2: A suite of methods to estimate, adjust, and leverage substructure in genetic summary data

Summix2

Package Installation

summix {#summix}

summix() Input

summix() Output

adjAF {#adjaf}

adjAF() Input

adjAF() Output

summix_local {#summix_local}

summix_local() Input

summix_local() Output

Examples using toy data in the Summix package

A quick demo of summix() {#a-quick-demo-of-summix}

A quick demo of adjAF() {#a-quick-demo-of-adjaf}

A quick demo of summix_local() {#a-quick-demo-of-summix_local}

R Package Documentation

Browse R Packages

We want your feedback!

hendriau/Summix Summix2: A suite of methods to estimate, adjust, and leverage substructure in genetic summary data

In hendriau/Summix: Summix2: A suite of methods to estimate, adjust, and leverage substructure in genetic summary data

Summix2

Package Installation

summix {#summix}

summix() Input

summix() Output

adjAF {#adjaf}

adjAF() Input

adjAF() Output

summix_local {#summix_local}

summix_local() Input

summix_local() Output

Examples using toy data in the Summix package

A quick demo of summix() {#a-quick-demo-of-summix}

A quick demo of adjAF() {#a-quick-demo-of-adjaf}

A quick demo of summix_local() {#a-quick-demo-of-summix_local}

R Package Documentation

Browse R Packages

We want your feedback!

hendriau/Summix
Summix2: A suite of methods to estimate, adjust, and leverage substructure in genetic summary data