R/multiple_marker_test.R
In qgg: Statistical Tools for Quantitative Genetic Analyses

Documented in gsea magma mapSets pops vegas

####################################################################################################################
#    Module 2: Marker set tests
####################################################################################################################
#'
#' Gene set enrichment analysis
#'
#' @description
#' The function gsea can perform several different gene set enrichment analyses. The general procedure is to obtain
#' single marker statistics (e.g. summary statistics), from which it is possible to compute and evaluate a test statistic
#' for a set of genetic markers that measures a joint degree of association between the marker set and the phenotype.
#' The marker set is defined by a genomic feature such as genes, biological pathways, gene interactions,
#' gene expression profiles etc.
#'
#' Currently, four types of gene set enrichment analyses can be conducted with gsea: sum-based, count-based,
#' score-based, and our own method, the covariance association test (CVAT). For details and comparisons of
#' the test statistics, consult doi:10.1534/genetics.116.189498.
#'
#' The sum test is based on the sum of all marker summary statistics located within the feature set. The single marker
#' summary statistics can be obtained from linear model analyses (from PLINK or using the qgg glma approximation),
#' or from single or multiple component REML analyses (GBLUP or GFBLUP) from the greml function. The sum test is powerful
#' if the genomic feature harbors many genetic markers that have small to moderate effects.
#'
#' The count-based method is based on counting the number of markers within a genomic feature that show association
#' (or have single marker p-value below a certain threshold) with the phenotype. Under the null hypothesis (that the
#' associated markers are picked at random from the total number of markers, thus, no enrichment of markers in any
#' genomic feature) it is assumed that the observed count statistic is a realization from a hypergeometric distribution.
#'
#' The score-based approach is based on the product between the scaled genotypes in a genomic feature and the residuals
#' from the linear mixed model (obtained from greml).
#'
#' The covariance association test (CVAT) is derived from the fit object from greml (GBLUP or GFBLUP), and measures
#' the covariance between the total genomic effects for all markers and the genomic effects of the markers within the
#' genomic feature.
#'
#' The null distributions of the sum-based, score-based and CVAT test statistics are unknown; therefore,
#' a circular permutation approach is used to obtain an empirical distribution of each test statistic.
#'
#' @param stat vector or matrix of single marker statistics (e.g. coefficients, t-statistics, p-values)
#' @param sets list of marker sets - names correspond to row names in stat
#' @param nperm number of permutations used for obtaining an empirical p-value
#' @param ncores number of cores used in the analysis
#' @param Glist list providing information about genotypes stored on disk
#' @param W matrix of centered and scaled genotypes (used if method = "cvat" or "score")
#' @param fit list object obtained from a linear mixed model fit using the greml function
#' @param g vector (or matrix) of genetic effects obtained from a linear mixed model fit (GBLUP or GFBLUP)
#' @param e vector (or matrix) of residual effects obtained from a linear mixed model fit (GBLUP or GFBLUP)
#' @param method set test method; one of "sum", "cvat", "hyperg" or "score"
#' @param threshold p-value threshold used if method = "hyperg" (default is 0.05)
#'
#' @return Returns a data frame or a list including
#' \item{stat}{marker set test statistics}
#' \item{m}{number of markers in the set}
#' \item{p}{enrichment p-value for marker set}
#' 
#' 
#' @examples
#'
#'
#'  # Simulate data
#'  W <- matrix(rnorm(1000000), ncol = 1000)
#'  colnames(W) <- as.character(1:ncol(W))
#'  rownames(W) <- as.character(1:nrow(W))
#'  y <- rowSums(W[, 1:10]) + rowSums(W[, 501:510]) + rnorm(nrow(W))
#'
#'  # Create model
#'  data <- data.frame(y = y, mu = 1)
#'  fm <- y ~ 0 + mu
#'  X <- model.matrix(fm, data = data)
#'
#'  # Single marker association analyses
#'  stat <- glma(y=y,X=X,W=W)
#'
#'  # Create marker sets
#'  f <- factor(rep(1:100,each=10), levels=1:100)
#'  sets <- split(as.character(1:1000),f=f)
#'
#'  # Set test based on sums
#'  b2 <- stat[,"stat"]**2
#'  names(b2) <- rownames(stat)
#'  mma <- gsea(stat = b2, sets = sets, method = "sum", nperm = 100)
#'  head(mma)
#'
#'  # Set test based on hyperG
#'  p <- stat[,"p"]
#'  names(p) <- rownames(stat)
#'  mma <- gsea(stat = p, sets = sets, method = "hyperg", threshold = 0.05)
#'  head(mma)
#'
#' \donttest{
#'  G <- grm(W=W)
#'  fit <- greml(y=y, X=X, GRM=list(G=G), theta=c(10,1))
#'
#'  # Set test based on cvat
#'  mma <- gsea(W=W,fit = fit, sets = sets, nperm = 1000, method="cvat")
#'  head(mma)
#'
#'  # Set test based on score
#'  mma <- gsea(W=W,fit = fit, sets = sets, nperm = 1000, method="score")
#'  head(mma)
#'
#' }
#' 
#' @author Peter Soerensen
#' 
#' @export
#' 
gsea <- function(stat = NULL, sets = NULL, Glist = NULL, W = NULL, fit = NULL, g = NULL, e = NULL, threshold = 0.05, method = "sum", nperm = 1000, ncores = 1) {
  if(is.data.frame(stat)) {
    colstat <- !colnames(stat)%in%c("rsids","chr","pos","ea","nea","eaf")
    if(any(!rownames(stat)==stat$rsids)) stop("Row names of stat do not match stat$rsids")
    if(any(colstat)) stat <- as.matrix(stat[,colstat])**2
    if(!any(colstat)) stat <- as.matrix(stat[,colstat])
  }
  if (method == "sum") {
    if (is.matrix(stat)) sets <- mapSets(sets = sets, rsids = rownames(stat), index = TRUE)
    if (is.vector(stat)) sets <- mapSets(sets = sets, rsids = names(stat), index = TRUE)
    nsets <- length(sets)
    msets <- sapply(sets, length)
    if (is.matrix(stat)) {
      p <- apply(stat, 2, function(x) {
        gsets(stat = x, sets = sets, ncores = ncores, np = nperm)
      })
      setstat <- apply(stat, 2, function(x) {
        sapply(sets, function(y) {
          sum(x[y])
        })
      })
      rownames(setstat) <- rownames(p) <- names(msets) <- names(sets)
      res <- list(m = msets, stat = setstat, p = p)
    }
    if (is.vector(stat)) {
      setstat <- sapply(sets, function(x) {
        sum(stat[x])
      })
      p <- gsets(stat = stat, sets = sets, ncores = ncores, np = nperm, method = method)
      res <- cbind(m = msets, stat = setstat, p = p)
      rownames(res) <- names(sets)
      res <- as.data.frame(res)
    }
    return(res)
  }
  if (method == "cvat") {
    if (!is.null(W)) res <- cvat(fit = fit, g = g, W = W, sets = sets, nperm = nperm)
    if (!is.null(Glist)) {
      sets <- mapSets(sets = sets, rsids = Glist$rsids, index = TRUE)
      nsets <- length(sets)
      msets <- sapply(sets, length)
      ids <- fit$ids
      Py <- fit$Py
      g <- as.vector(fit$g)
      Sg <- fit$theta[1]
      stat <- gstat(method = "cvat", Glist = Glist, g = g, Sg = Sg, Py = Py, ids = ids)
      setstat <- sapply(sets, function(x) {
        sum(stat[x])
      })
      p <- gsets(stat = stat, sets = sets, ncores = ncores, np = nperm, method = "sum")
      res <- cbind(m = msets, stat = setstat, p = p)
      rownames(res) <- names(sets)
    }
    return(res)
  }
  if (method == "score") {
    if (!is.null(W)) res <- scoretest(e = fit$e, W = W, sets = sets, nperm = nperm)
    if (!is.null(Glist)) {
      sets <- mapSets(sets = sets, rsids = Glist$rsids, index = TRUE)
      nsets <- length(sets)
      msets <- sapply(sets, length)
      ids <- fit$ids
      e <- fit$e
      stat <- gstat(method = "score", Glist = Glist, e = e, ids = ids)
      setstat <- sapply(sets, function(x) {
        sum(stat[x])
      })
      p <- gsets(stat = stat, sets = sets, ncores = ncores, np = nperm, method = "sum")
      res <- cbind(m = msets, stat = setstat, p = p)
      rownames(res) <- names(sets)
    }
    return(res)
  }
  if (method == "hyperg") {
    res <- hgtest(p = stat, sets = sets, threshold = threshold)
    return(as.data.frame(res))
  }
}

gsets <- function(stat = NULL, sets = NULL, ncores = 1, np = 1000, method = "sum") {
  m <- length(stat)
  nsets <- length(sets)
  msets <- sapply(sets, length)
  setstat <- sapply(sets, function(x) {
    sum(stat[x])
  })
  p <- .Call("_qgg_psets", msets = msets,
              setstat = setstat,
              stat = stat,
              np = np)
  p <- p/np 
  return(p)
}


#' Map Sets to rsids
#'
#' This function maps sets to rsids. If a `Glist` is provided, `rsids` are extracted from the `Glist`.
#' It returns a list of matched RSIDs for each set.
#'
#' @param sets A list of character vectors where each vector represents a set of items. If the names
#'   of the sets are not provided, they are named as "Set1", "Set2", etc.
#' @param rsids A character vector of RSIDs. If `Glist` is provided, this parameter is ignored.
#' @param Glist A list containing an element `rsids` which is a character vector of RSIDs.
#' @param index A logical. If `TRUE` (default), it returns indices of RSIDs; otherwise, it returns the RSID names.
#' 
#' @return A list where each element represents a set and contains matched RSIDs or their indices.
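#'
#' @examples
#' # Minimal sketch with made-up marker IDs and set names (illustrative only).
#' rsids <- paste0("rs", 1:10)
#' sets <- list(geneA = c("rs1", "rs2", "rs99"), geneB = c("rs5", "rs6"))
#' # IDs that are not found in rsids (here "rs99") are dropped from the set
#' mapSets(sets = sets, rsids = rsids, index = TRUE)   # positions within rsids
#' mapSets(sets = sets, rsids = rsids, index = FALSE)  # marker IDs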
#' 
#' @keywords internal
#' 
#' @export
#' 
mapSets <- function(sets = NULL, rsids = NULL, Glist = NULL, index = TRUE) {
  if (!is.null(Glist)) rsids <- unlist(Glist$rsids)
  nsets <- sapply(sets, length)
  if(is.null(names(sets))) names(sets) <- paste0("Set",1:length(sets))
  rs <- rep(names(sets), times = nsets)
  rsSets <- unlist(sets, use.names = FALSE)
  rsSets <- match(rsSets, rsids)
  inW <- !is.na(rsSets)
  rsSets <- rsSets[inW]
  if (!index) rsSets <- rsids[rsSets]
  rs <- rs[inW]
  rs <- factor(rs, levels = unique(rs))
  rsSets <- split(rsSets, f = rs)
  return(rsSets)
}

gstat <- function(method = NULL, Glist = NULL, g = NULL, Sg = NULL, Py = NULL, e = NULL, msize = 100, rsids = NULL,
                  impute = TRUE, scale = TRUE, ids = NULL, ncores = 1) {
  n <- Glist$n
  rws <- match(ids, Glist$ids)
  if (any(is.na(rws))) stop("Some ids in fit object not found in Glist")
  nr <- length(rws)
  nbytes <- ceiling(n / 4)
  cls <- 1:Glist$m
  if (!is.null(rsids)) cls <- match(rsids, Glist$rsids)
  nc <- length(cls)
  cls <- split(cls, ceiling(seq_along(cls) / msize))
  msets <- sapply(cls, length)
  nsets <- length(msets)
  setstat <- NULL
  fnRAW <- Glist$fnRAW
  for (j in 1:nsets) {
    nc <- length(cls[[j]])
    direction <- rep(1, nc)

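    # Check this again - should be getW using a bedfile
    # Read genotypes for the markers in block j only (cls is a list of blocks,
    # so the block's column indices are cls[[j]])
    W <- getW(Glist = Glist, rws = rws, cls = cls[[j]], scale = scale)
    # End check this again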

    if (method == "cvat") {
      s <- crossprod(W / nc, Py) * Sg
      Ws <- t(t(W) * as.vector(s))
      setstat <- c(setstat, colSums(g * Ws))
    }
    if (method == "score") {
      we2 <- as.vector((t(W) %*% e)**2)
      setstat <- c(setstat, we2)
    }
    message(paste("Finished block", j, "out of", nsets, "blocks"))
  }
  return(setstat)
}

settest <- function(stat = NULL, W = NULL, sets = NULL, nperm = NULL, method = "sum", threshold = 0.05) {
  if (method == "sum") setT <- sumtest(stat = stat, sets = sets, nperm = nperm)
  if (method == "hyperG") setT <- hgtest(p = stat, sets = sets, threshold = threshold)
  return(setT)
}

sumtest <- function(stat = NULL, sets = NULL, nperm = NULL, method = "sum") {
  if (method == "mean") {
    setT <- sapply(sets, function(x) {
      mean(stat[x])
    })
  }
  if (method == "sum") {
    setT <- sapply(sets, function(x) {
      sum(stat[x])
    })
  }
  if (method == "max") {
    setT <- sapply(sets, function(x) {
      max(stat[x])
    })
  }
  if (!is.null(nperm)) {
    p <- rep(0, length(sets))
    n <- length(stat)
    nset <- sapply(sets, length)
    rws <- 1:n
    names(rws) <- names(stat)
    sets <- lapply(sets, function(x) {
      rws[x]
    })
    for (i in 1:nperm) {
      rws <- sample(1:n, 1)
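      # Circular shift starting at position rws; seq_len() avoids the invalid
      # index sequence 1:0 when rws == 1
      o <- c(rws:n, seq_len(rws - 1))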
      pstat <- stat[o]
      if (method == "mean") {
        setTP <- sapply(sets, function(x) {
          mean(pstat[x])
        })
      }
      if (method == "sum") {
        setTP <- sapply(sets, function(x) {
          sum(pstat[x])
        })
      }
      if (method == "max") {
        setTP <- sapply(sets, function(x) {
          max(pstat[x])
        })
      }
      p <- p + as.numeric(setT > setTP)
    }
    p <- 1 - p / nperm
    setT <- data.frame(setT, nset, p)
  }
  return(setT)
}
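
# Illustrative sketch (not part of the package API): the circular permutation used in
# sumtest() rotates the vector of marker statistics by a random offset, which keeps the
# order of neighbouring statistics intact while shifting them relative to the set
# positions. The data, the set and the seed below are made up; uncomment to run.
#
#   set.seed(1)
#   stat <- rchisq(100, df = 1)                 # hypothetical single-marker statistics
#   set <- 11:20                                # hypothetical marker set (positions in stat)
#   obs <- sum(stat[set])                       # observed set statistic
#   perm <- replicate(200, {
#     k <- sample(length(stat), 1)              # random rotation offset
#     o <- c(k:length(stat), seq_len(k - 1))    # rotated order, wraps around the end
#     sum(stat[o][set])                         # statistic at the same positions
#   })
#   p <- 1 - mean(obs > perm)                   # same counting convention as sumtest()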

cvat <- function(fit = NULL, s = NULL, g = NULL, W = NULL, sets = NULL, nperm = 100) {
  if (!is.null(fit)) {
    s <- crossprod(W / ncol(W), fit$Py) * fit$theta[1]
  }
  Ws <- t(t(W) * as.vector(s))
  if (is.null(g)) g <- W %*% s
  cvs <- colSums(as.vector(g) * Ws)
  setT <- settest(stat = cvs, sets = sets, nperm = nperm, method = "sum")
  if (!is.null(names(sets))) rownames(setT) <- names(sets)
  return(setT)
}


scoretest <- function(e = NULL, W = NULL, sets = NULL, nperm = 100) {
  we2 <- as.vector((t(W) %*% e)**2)
  names(we2) <- colnames(W)
  setT <- settest(stat = we2, sets = sets, nperm = nperm, method = "sum")$p
  return(setT)
}
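
# Illustrative sketch (simulated data, base R only): the per-marker score statistic used in
# scoretest() is the squared cross-product between each scaled genotype column and the
# residuals. W and e below are hypothetical stand-ins for the genotype matrix and for the
# residuals of a fitted model; uncomment to run.
#
#   set.seed(1)
#   W <- scale(matrix(rnorm(50 * 4), nrow = 50))   # hypothetical centered and scaled genotypes
#   e <- rnorm(50)                                 # hypothetical residuals
#   we2 <- as.vector(crossprod(W, e)^2)            # same values as (t(W) %*% e)**2
#   sum(we2[1:2])                                  # sum statistic for a set of markers 1 and 2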

hgtest <- function(p = NULL, sets = NULL, threshold = 0.05) {
  population_size <- length(p)
  sample_size <- sapply(sets, length)
  n_successes_population <- sum(p < threshold)
  n_successes_sample <- sapply(sets, function(x) {
    sum(p[x] < threshold)
  })
  phyperg <- rep(1,length(sets))
  names(phyperg) <- names(sets)
  for (i in 1:length(sets)) {
    phyperg[i] <- 1.0-phyper(n_successes_sample[i]-1, n_successes_population,
                             population_size-n_successes_population,
                             sample_size[i])
  }
  ef <- (n_successes_sample/sample_size)/
    (n_successes_population/population_size)
  
  # Create data frame for table
  res <- data.frame(ng = sample_size,
                   nag = n_successes_sample,
                   ef=ef,
                   p = phyperg)
  res
}
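
# Illustrative sketch (made-up counts, not package output): the p-value in hgtest() is the
# upper tail P(X >= k) of a hypergeometric distribution, where N markers are tested, K of
# them fall below the threshold, and a set of n markers contains k such markers.
# Uncomment to run.
#
#   N <- 1000; K <- 50; n <- 10; k <- 3
#   p_upper <- 1 - phyper(k - 1, K, N - K, n)                 # form used in hgtest()
#   p_alt <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)   # equivalent upper-tail call
#   ef <- (k / n) / (K / N)                                   # enrichment factor, as in `ef`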


# hgtest <- function(p = NULL, sets = NULL, threshold = 0.05) {
#   N <- length(p)
#   Na <- sum(p < threshold)
#   Nna <- N - Na
#   Nf <- sapply(sets, length)
#   Naf <- sapply(sets, function(x) {
#     sum(p[x] < threshold)
#   })
#   Nnaf <- Nf - Naf
#   Nanf <- Na - Naf
#   Nnanf <- Nna - Nnaf
#   phyperg <- 1 - phyper(Naf - 1, Nf, N - Nf, Na)
#   phyperg
# }

#' Bayesian Multi-marker Analysis of Genomic Annotation (Bayesian MAGMA)
#'
#' This function analyzes feature sets using MAGMA or Bayesian methods for association testing. 
#' It supports joint or marginal testing, as well as Bayesian linear regression using different 
#' priors (`bayesC`, `bayesR`).
#'
#' @param stat A numeric vector or matrix of summary statistics, where rows represent features and columns represent phenotypes.
#' @param sets A list of feature sets (e.g., genes, SNPs) to be analyzed.
#' @param method A string specifying the method to use. Options are `"magma"`, `"blr"`, `"bayesC"`, or `"bayesR"`. Default is `"magma"`.
#' @param type A string specifying the type of analysis to perform. Options are `"joint"` (default) or `"marginal"`. Only used with `method = "magma"`.
#' @param test A string specifying the statistical test. Options are `"one-sided"` (default) or `"two-sided"`. Only used with `method = "magma"`.
#' @param pi A numeric value specifying the proportion of non-zero effects. Used for Bayesian methods. Default is `0.001`.
#' @param nit An integer specifying the number of iterations for Bayesian methods. Default is `5000`.
#' @param nburn An integer specifying the number of burn-in iterations for Bayesian methods. Default is `1000`.
#' 
#' @return 
#' A data frame or list with analysis results.
#'
#' @details
#' The function uses either the MAGMA approach for set-based testing or Bayesian linear regression 
#' to estimate effect sizes and probabilities of association for feature sets. For Bayesian methods, 
#' a spike-and-slab prior is applied.
#' 
#' The `stat` input must have row names corresponding to feature identifiers. The `sets` input must 
#' be a named list, where each element corresponds to a feature set.
#'
#' @export
#' 
magma <- function(stat = NULL, sets = NULL, 
                  method="magma", type = "joint", test = "one-sided",
                  pi=0.001, nit=5000, nburn=1000) {
  
  # Check if stat and sets are provided
  if (is.null(stat) || is.null(sets)) {
    stop("Both 'stat' and 'sets' must be provided.")
  }
  
  if(is.vector(stat)) stat <- as.matrix(stat)
  if(is.null(rownames(stat))) stop("Please provide names or rownames to stat object")
  y <- scale(stat, center=TRUE, scale=TRUE)

  sets <- mapSets(sets=sets,rsids=rownames(stat), index=FALSE)
  
    
  # Compute X for feature sets (sparse format)
  X <- designMatrix(sets = sets, rowids = rownames(y))

  m <- sapply(sets, length)
  
  if(method=="magma") {
    
    # Compute summary stat for feature sets
    stat_summary <- computeStat(X = X, y = y[rownames(X), ], scale = TRUE)
    
    bMAGMA <- solve(stat_summary$XX + diag(0.001, nrow(stat_summary$XX))) %*% stat_summary$Xy
    bMARG <- (1 / diag(stat_summary$XX)) * stat_summary$Xy
    sebMAGMA <- sqrt(diag(solve(stat_summary$XX + diag(0.001, nrow(stat_summary$XX)))))
    sebMARG <- sqrt((1 / diag(stat_summary$XX)))
    zMAGMA <- bMAGMA / sebMAGMA
    zMARG <- bMARG / sebMARG
    
    # Two-sided: P(|Z| >= |z|) is twice the upper-tail probability of |z|
    if(test=="two-sided") {
      pMAGMA <- 2 * pnorm(abs(zMAGMA), mean = 0, sd = 1, lower.tail = FALSE)
      pMARG <- 2 * pnorm(abs(zMARG), mean = 0, sd = 1, lower.tail = FALSE)
    }
    
    # One-sided
    if(test=="one-sided") {
      pMAGMA <- pnorm(zMAGMA, mean = 0, sd = 1, lower.tail = FALSE)
      pMARG <- pnorm(zMARG, mean = 0, sd = 1, lower.tail = FALSE)
    }
    if (type == "marginal") df <- data.frame(ID=names(sets), m = m, 
                                             b = bMARG, seb = sebMARG, z = zMARG, p = pMARG)
    if (type == "joint") df <- data.frame(ID=names(sets), m = m, 
                                          b = bMAGMA, seb = sebMAGMA, z = zMAGMA, p = pMAGMA)
    o <- order(df$p, decreasing=FALSE)
    df[,3:5] <- round(df[,3:5],4)
    rownames(df) <- NULL
    return(df[o,])
  }
  if(method%in%c("blr","bayesC","bayesR")) {
    
    if(method=="blr") method <- "bayesC" 
    
    # Compute summary stat for feature sets
    stat <- computeStat(X = X, y = y[rownames(X), ], scale = TRUE)

    if(ncol(y)==1) {
      # Fit BLR model
      fit <- blr(yy=stat$yy, XX=stat$XX, Xy=stat$Xy, n=stat$n,
                       method=method, pi=pi,
                       nit=nit, nburn=nburn)
      
      
      # Dataframe with BLR results
      df <- data.frame(ID=names(sets), 
                       m = m, 
                       b = fit$bm, 
                       PIP=fit$dm)
      o <- order(df$PIP, decreasing=TRUE)
      df[,3:4] <- round(df[,3:4],4)
      rownames(df) <- NULL
      return(df[o,])
    }  
    if(ncol(y)>1) {
      stat$XX <- rep(list(stat$XX),ncol(y))
      # Fit BLR model
      fit <- mtblr(yy=stat$yy, XX=stat$XX, Xy=stat$Xy, n=stat$n,
                       method=method, pi=pi,
                       nit=nit, nburn=nburn)
      
      o <- order(rowSums(fit$dm), decreasing=TRUE)
      
      # Dataframe with BLR results
      df <- list( feature=data.frame(ID=names(sets),
                        m = m)[o,],
                        b=round(fit$bm[o,], 4),
                        PIP=round(fit$dm[o,], 4))
      return(df)
    }  
    
  }
}
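
# Illustrative sketch (simulated data, base R only): the "joint" and "marginal" estimates in
# magma() differ in whether the feature sets are fitted together or one at a time. X and y
# below are hypothetical stand-ins for the set design matrix and the scaled statistics;
# computeStat() may scale them differently. Uncomment to run.
#
#   set.seed(1)
#   X <- matrix(rbinom(200 * 3, 1, 0.3), ncol = 3)    # hypothetical 0/1 set membership
#   y <- as.vector(scale(rnorm(200)))                 # hypothetical scaled statistics
#   XX <- crossprod(X)
#   Xy <- crossprod(X, y)
#   b_joint <- solve(XX + diag(0.001, ncol(X)), Xy)   # all sets fitted jointly (small ridge)
#   b_marg <- Xy / diag(XX)                           # each set fitted on its own
#   z_marg <- b_marg / sqrt(1 / diag(XX))
#   p_one <- pnorm(z_marg, lower.tail = FALSE)        # one-sided test for positive effects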

#' Perform VEGAS Gene-Based Association Analysis
#'
#' This function performs VEGAS (Versatile Gene-based Association Study) to analyze gene-level associations
#' using marker statistics and linkage disequilibrium (LD) structure from a reference panel.
#'
#' @param Glist A list containing genomic information, such as LD matrices or genotype data. Required.
#' @param sets A list of sets (e.g., genes with their associated markers) to analyze. Required.
#' @param stat A data frame containing marker-level statistics. Must include `rsids` (marker IDs) and `p` (p-values).
#' @param p A numeric matrix of p-values for markers across multiple studies. If provided, `stat` should be NULL.
#' @param threshold A numeric value specifying the lower bound for p-values to avoid numerical issues. Default is `1e-10`.
#' @param tol A numeric value specifying the tolerance for eigenvalues in LD matrices. Default is `1e-7`.
#' @param minsize An integer specifying the set size cutoff. In `p` mode, only sets with more than `minsize` markers are analyzed. Default is `2`.
#' @param verbose A logical value indicating whether to print progress messages. Default is `FALSE`.
#' 
#' @return 
#' A data frame with the results
#'
#' @details
#' The function uses marker-level statistics to compute gene-level association statistics,
#' accounting for LD structure among markers. The LD structure is retrieved from `Glist`, which
#' should include precomputed LD matrices or genotype data for the markers.
#'
#' Two modes are supported:
#' - **`stat` Mode**: Uses marker statistics (e.g., p-values) from a single study to compute gene-level statistics.
#' - **`p` Mode**: Uses marker p-values across multiple studies for meta-analysis of gene-level statistics.
#'
#' @export
#' 
vegas <- function(Glist=NULL, sets=NULL, stat=NULL, p=NULL, threshold=1e-10, tol=1e-7, minsize=2, verbose=FALSE) {
  
  if(is.null(Glist)) stop("Please provide Glist object")
  #if(is.null(stat)) stop("Please provide stat object")
  if(is.null(sets)) stop("Please provide sets object")
  
  if(!is.null(stat)) {
    if(verbose) message("Map stat to markers in Glist")
    sets <- mapSets(sets=sets, rsids=stat$rsids, index=FALSE)
    sets <- mapSets(sets=sets, rsids=unlist(Glist$rsids), index=FALSE)
    
    isets <- mapSets(sets=sets, rsids=stat$rsids, index=TRUE)
    
    # Compute marker statistics
    p <- as.numeric(stat$p)
    p[p<threshold] <- threshold
    chisq <- qchisq(p, df = 1, lower.tail = FALSE)
    
    chistat <- sapply(isets,function(x){sum(chisq[x])})
    chr <- sapply(isets,function(x){stat$chr[x][1]})
    # This is just a preliminary fix
    if(length(Glist$bedfiles)==1) chr <- rep(1,length(chr))
    m <- sapply(sets,function(x){length(x)})
    
    pg <- rep(1,length(sets))
    names(pg) <- names(chr) <- names(m) <- names(sets)
    for(i in 1:length(sets)) {
      if(length(sets[[i]])>1) {
        B <- getG(Glist=Glist, chr[i], rsids=sets[[i]], scale=TRUE)
        ev <- eigen(cor(B))$values
        ev[ev < tol] <- tol
        try(pg[i] <- pchisqsum(chistat[i], df = rep(1, length(ev)), a = ev, lower.tail = FALSE))
        if(verbose) message(paste("Finished processing gene" ,i))
      } 
    }
    # Name the argument: a positional TRUE would be matched to `mean`, not `lower.tail`
    zstat <- -qnorm(pg/2, lower.tail = TRUE)
    df <- data.frame(Gene=names(pg),Chr=chr,m=m,x=chistat,z=zstat,p=pg)
    colnames(df) <- c("EnsemblID", "Chr", "m", "X2","z","p")
    return(df)
  }
  if(!is.null(p)) {
    if(is.vector(p)) p <- as.matrix(p)
    rsids <- rownames(p)
    nstudy <- ncol(p)
    if(is.null(rsids)) stop("Please provide names/rownames for your p object")
    #p <- apply(p,2,as.numeric)
    p[p<threshold] <- threshold
    #rownames(p) <- rsids
    
    sets <- mapSets(sets=sets, rsids=rsids, index=FALSE)
    if(!is.null(Glist$rsidsLD)) sets <- mapSets(sets=sets, rsids=unlist(Glist$rsidsLD), index=FALSE)
    if(is.null(Glist$rsidsLD)) sets <- mapSets(sets=sets, rsids=unlist(Glist$rsids), index=FALSE)
    
    # Compute some relevant statistics
    msets <- sapply(sets,function(x){length(x)})
    sets <- sets[msets>minsize]
    msets <- sapply(sets,function(x){length(x)})
    chrSets <- mapSets(sets=sets, Glist=Glist, index=TRUE)
    chr <- unlist(Glist$chr)
    chr <- sapply(chrSets,function(x){as.numeric(unique(chr[x]))[1]})
    # This is just a preliminary fix
    if(length(Glist$bedfiles)==1) chr <- rep(1,length(chr))
      
    # set indices
    isets <- mapSets(sets=sets, rsids=rsids, index=TRUE)
    
    # Compute marker statistics
    chisq <- qchisq(p, df = 1, lower.tail = FALSE)
    
    # Compute gene statistics
    chistat <- sapply(isets,function(x){colSums(chisq[x,])})
    chistat <- t(chistat)
    
    pg <- matrix(1,ncol=nstudy, nrow=length(sets))
    rownames(pg) <- names(sets)
    colnames(pg) <- colnames(p)
    for(i in 1:length(sets)) {
      B <- getG(Glist=Glist, chr[i], rsids=sets[[i]], scale=TRUE)
      ev <- eigen(cor(B))$values
      ev[ev < tol] <- tol
      for (j in 1:nstudy) {
        try(pg[i,j] <- pchisqsum(chistat[i,j], df = rep(1, length(ev)), a = ev, lower.tail = FALSE))
      }
      if(verbose) message(paste("Finished processing gene" ,i))
    }
    # Name the argument: a positional TRUE would be matched to `mean`, not `lower.tail`
    zstat <- -qnorm(pg/2, lower.tail = TRUE)
    
    if(ncol(pg)==1) {
      res <- data.frame("EnsemblID"=rownames(pg),chr=chr,m=msets,X=chistat,Z=zstat,p=pg)
      return(res)
    }
    if(ncol(pg)>1) {
      res <- list( genes=data.frame(EnsemblID=rownames(pg),Chr=chr,m=msets),
                   X2=chistat,z=zstat,p=pg)
      return(res)
    }
  }
}
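
# Illustrative sketch (made-up p-values and LD matrix, base R only): the gene-level statistic
# in vegas() is the sum of per-marker 1-df chi-square statistics, referred to a weighted sum
# of chi-squares whose weights are the eigenvalues of the LD correlation matrix. Here the
# tail probability is approximated with a Satterthwaite-type scaled chi-square only;
# pchisqsum() below also tries a saddlepoint approximation. Uncomment to run.
#
#   p <- c(0.01, 0.20, 0.03)                          # hypothetical marker p-values
#   R <- matrix(c(1.0, 0.5, 0.2,
#                 0.5, 1.0, 0.3,
#                 0.2, 0.3, 1.0), nrow = 3)           # hypothetical LD correlation matrix
#   x <- sum(qchisq(p, df = 1, lower.tail = FALSE))   # gene-level statistic
#   ev <- eigen(R)$values                             # weights of the chi-square mixture
#   scale <- sum(ev^2) / sum(ev)                      # matches satterthwaite()$scale
#   df <- sum(ev)^2 / sum(ev^2)                       # matches satterthwaite()$df
#   p_gene <- pchisq(x / scale, df = df, lower.tail = FALSE)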

pchisqsum <- function (x, df, a, lower.tail = TRUE) {
  sat <- satterthwaite(a, df)
  guess <- pchisq(x / sat$scale, sat$df, lower.tail = lower.tail)
  for (i in seq(length = length(x))) {
    lambda <- rep(a, df)
    sad <- sapply(x, saddle, lambda = lambda)
    if (lower.tail) sad <- 1 - sad
    guess <- ifelse(is.na(sad), guess, sad)
  }
  return(guess)
}

satterthwaite <- function(a, df) {
  if (any(df > 1)) {
    a <- rep(a, df)
  }
  tr <- mean(a)
  tr2 <- mean(a^2) / (tr^2)
  list(scale = tr * tr2, df = length(a) / tr2)
}

saddle <- function(x, lambda) {
  d <- max(lambda)
  lambda <- lambda / d
  x <- x / d
  k0 <- function(zeta) {
    -sum(log(1 - 2 * zeta * lambda)) / 2
  }
  kprime0 <- function(zeta) {
    sapply(zeta, function(zz) sum(lambda / (1 - 2 * zz * lambda)))
  }
  kpprime0 <- function(zeta) {
    2 * sum(lambda^2 / (1 - 2 * zeta * lambda)^2)
  }
  if (any(lambda < 0)) {
    lmin <- max(1 / (2 * lambda[lambda < 0])) * 0.99999
  } else if (x > sum(lambda)) {
    lmin <- -0.01
  } else {
    lmin <- -length(lambda) / (2 * x)
  }
  lmax <- min(1 / (2 * lambda[lambda > 0])) * 0.99999
  hatzeta <- uniroot(function(zeta) kprime0(zeta) - x, lower = lmin,
                     upper = lmax, tol = 1e-08)$root
  w <- sign(hatzeta) * sqrt(2 * (hatzeta * x - k0(hatzeta)))
  v <- hatzeta * sqrt(kpprime0(hatzeta))
  if (abs(hatzeta) < 1e-04) {
    NA
  }  else {
    pnorm(w + log(v / w) / w, lower.tail = FALSE)
  }
}

#' Bayesian Polygenic Prioritisation Scoring (Bayesian POPS)
#'
#' This function performs Polygenic Prioritisation Scoring (POPS) using Bayesian regression (`bayesC` or `bayesR`) or ridge regression (`rr`). 
#' It maps features to sets, performs optional feature selection based on p-value thresholds, and calculates predictive scores for prioritisation.
#'
#' @param stat A numeric vector or matrix of summary statistics (e.g., phenotypic values or effect sizes), where rows represent features (e.g., SNPs) and columns represent traits. Required.
#' @param sets A list of feature sets (e.g., genes or SNP groups) to map to the rows of `stat`. Required.
#' @param validate An optional validation set. If provided, cross-validation results are returned instead of fitting the model.
#' @param threshold A numeric value specifying a p-value threshold for feature selection. If provided, only features with p-values below this threshold are included in the model.
#' @param method A string specifying the regression method. Options are `"bayesC"` (default), `"bayesR"`, or `"rr"` (ridge regression).
#' @param pi A numeric value specifying the proportion of non-zero effects for Bayesian methods. Default is `0.001`.
#' @param nit An integer specifying the number of iterations for Bayesian methods. Default is `5000`.
#' @param nburn An integer specifying the number of burn-in iterations for Bayesian methods. Default is `1000`.
#' @param updateB A logical value indicating whether to update marker effects in Bayesian methods. Default is `TRUE`.
#' @param updateE A logical value indicating whether to update residual variances in Bayesian methods. Default is `TRUE`.
#' @param updatePi A logical value indicating whether to update the proportion of non-zero effects in Bayesian methods. Default is `TRUE`.
#' @param updateG A logical value indicating whether to update the genomic variances in Bayesian methods. Default is `TRUE`.
#'
#' @return 
#' A matrix of predicted prioritisation scores (`ypred`) for each feature, ordered by their predictive values. 
#' If a validation set is provided, cross-validation results are returned instead.
#'
#' @export
pops <- function(stat = NULL, sets = NULL, validate=NULL, threshold=NULL,
                 method="bayesC", pi=0.001, nit=5000, nburn=1000,
                 updateB=TRUE, updateE=TRUE, updatePi=TRUE, updateG=TRUE) {
  
  if(!is.null(validate)) {
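    # Call cvpops() rather than returning the function object itself
    fit <- cvpops(stat = stat, sets = sets, validate = validate, threshold = threshold,
                  method = method, pi = pi, nit = nit, nburn = nburn)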
    return(fit)
  }
  # Check if stat and sets are provided
  if (is.null(stat) || is.null(sets)) {
    stop("Both 'stat' and 'sets' must be provided.")
  }
  
  if(is.vector(stat)) stat <- as.matrix(stat)
  if(is.null(rownames(stat))) stop("Please provide names or rownames to stat object")
  
  # Map sets to rownames in stat
  sets <- mapSets(sets=sets,rsids=rownames(stat), index=FALSE)
  
  # Center and scale y
  y <- scale(stat, center=TRUE, scale=TRUE)
  orig_stat <- stat
  
  # Compute X for feature sets (sparse format)
  X <- designMatrix(sets = sets, rowids = rownames(y))
  X <- X[rownames(X)%in%rownames(y),]
  y <- as.matrix(y[rownames(y)%in%rownames(X),])
  X <- X[rownames(y),]
  
  if(!is.null(threshold)) {
    fit <- magma(stat=orig_stat, sets=sets, type = "marginal", test = "one-sided",
                       method="magma")
    selected <- fit$p<threshold
    if(sum(selected)<2) stop("Number of selected features less than 2")
    cls <- fit$ID[selected]
    X <- X[,cls]
  }
  
  # Compute summary stat for feature sets
  stat <- computeStat(X = X, y = y, scale = TRUE)
  isNA <- is.na(stat$Xy)
  stat$Xy <- stat$Xy[!isNA]
  stat$XX <- stat$XX[!isNA,!isNA]
  
  if (method=="rr") {
    b <- solve(stat$XX + diag(0.001, nrow(stat$XX))) %*% stat$Xy
  }
  
  if (method%in%c("bayesC","bayesR")) {
    # Fit BLR model
    fit <- blr(yy=stat$yy, XX=stat$XX, Xy=stat$Xy, n=stat$n,
                     method=method, pi=pi,
                     nit=nit, nburn=nburn,
                     updateB=updateB, updateE=updateE, updatePi=updatePi)
    b <- fit$bm
  }
  
  ypred <- as.matrix(X%*%b)
  o <- order(rowSums(ypred),decreasing=TRUE)
  ypred <- as.matrix(ypred[o,])
  colnames(ypred) <- colnames(orig_stat)
  return(ypred)
}


cvpops <- function(stat = NULL, sets = NULL, validate=NULL, threshold=NULL,
                 method="bayesC", pi=0.001, nit=5000, nburn=1000) {
  
  # Check if stat and sets are provided
  if (is.null(stat) || is.null(sets)) {
    stop("Both 'stat' and 'sets' must be provided.")
  }
  
  if(is.vector(stat)) stat <- as.matrix(stat)
  if(is.null(rownames(stat))) stop("Please provide names or rownames to stat object")
  
  # Map sets to rownames in stat
  sets <- mapSets(sets=sets,rsids=rownames(stat), index=FALSE)
  
  # Center and scale y
  y <- scale(stat, center=TRUE, scale=TRUE)
  orig_stat <- stat
  
  # Compute X for feature sets (sparse format)
  X <- designMatrix(sets = sets, rowids = rownames(y))
  X <- X[rownames(X)%in%rownames(y),]
  y <- as.matrix(y[rownames(y)%in%rownames(X),])
  X <- X[rownames(y),]
  
  
  if (is.matrix(validate)) {
    cvnames <- colnames(validate)
    validate <- as.data.frame(validate, stringsAsFactors=FALSE)
    names(validate) <- cvnames
  }
  nv <- length(validate)
  cvnames <- names(validate)
  if(is.null(cvnames)) {
    cvnames <- paste0("CV",1:nv)
    names(validate) <- cvnames
  }
  
  validate <- lapply(validate, function(x){x[x%in%rownames(y)]})
  
  nv <- length(validate)
  pa <- NULL
  
  for(v in 1:nv) {
    
    ensg <- validate[[v]]
    train <- !rownames(y)%in%ensg
    
    Xt <- X
    
    
    if(!is.null(threshold)) {
      fit <- magma(stat=orig_stat[train,], sets=sets, type = "marginal", test = "one-sided",
                         method="magma")
      selected <- fit$p<threshold
      if(sum(selected)<2) stop("Number of selected features less than 2")
      cls <- fit$ID[selected]
      Xt <- X[,cls]
    }
    
    # Compute summary stat for feature sets
    #stat <- computeStat(X = X[train,], y = y[train, ], scale = TRUE)
    stat <- computeStat(X = Xt[train,], y = y[train, ], scale = TRUE)
    isNA <- is.na(stat$Xy)
    stat$Xy <- stat$Xy[!isNA]
    stat$XX <- stat$XX[!isNA,!isNA]
    
    if (method=="rr") {
      b <- solve(stat$XX + diag(0.001, nrow(stat$XX))) %*% stat$Xy
    }
    
    if (method%in%c("bayesC","bayesR")) {
      # Fit BLR model
      fit <- blr(yy=stat$yy, XX=stat$XX, Xy=stat$Xy, n=stat$n,
                       method=method, pi=pi,
                       nit=nit, nburn=nburn)
      b <- fit$bm
    }
    
    ypred <- Xt[,!isNA]%*%b
    yobs <- y[rownames(y)%in%ensg,]
    pa <- rbind(pa,acc(yobs=yobs, ypred=ypred[names(yobs),]))
  }
  return(pa)
}
qgg documentation built on April 12, 2025, 1:32 a.m.
Defines functions cvpops pops saddle satterthwaite vegas magma hgtest scoretest cvat sumtest settest gstat mapSets gsets gsea
