Home

/

GitHub

/

kevinblighe/RegParallel

/

RegParallel: Standard regression functions in R enabled for parallel...

RegParallel: Standard regression functions in R enabled for parallel...
In kevinblighe/RegParallel: Standard regression functions in R enabled for parallel processing over large data-frames

View source: R/RegParallel.R

RegParallel

R Documentation

Standard regression functions in R enabled for parallel processing over large data-frames

Description

In many analyses, a large amount of variables have to be tested independently against the trait/endpoint of interest, and also adjusted for covariates and confounding factors at the same time. The major bottleneck in these is the amount of time that it takes to complete these analyses. With RegParallel, a large number of tests can be performed simultaneously. On a 12-core system, 144 variables can be tested simultaneously, with 1000s of variables processed in a matter of seconds via 'nested' parallel processing. Works for logistic regression, linear regression, conditional logistic regression, Cox proportional hazards and survival models, and Bayesian logistic regression. Also caters for generalised linear models that utilise survey weights created by the 'survey' CRAN package and that utilise 'survey::svyglm'.

Usage

RegParallel(
  data,
  design = NULL,
  formula,
  FUN,
  FUNtype,
  variables,
  blocksize = 500,
  cores = 3,
  nestedParallel = FALSE,
  p.adjust = 'none',
  conflevel = 95,
  excludeTerms = NULL,
  excludeIntercept = TRUE)

Arguments

`data`	A data-frame that contains all model terms to be tested. Variables that have all zeros will, automatically, be removed. REQUIRED.
`design`	A survey design, created by survey::svydesign. DEFAULT = NULL. OPTIONAL.
`formula`	A valid formula. Excluding the '[*]' term, which is reserved for RegParallel and indicates the position in the formula for the variable of interest, must pass as.formula() check. REQUIRED.
`FUN`	Regression function. Must be of form, for example: function(formula, data) glm(formula = formula, family=binomial, data = data). REQUIRED.
`FUNtype`	Regression function type. Must be one of 'glm', 'lm', 'coxph', 'clogit', 'bayesglm', or 'svyglm'. REQUIRED.
`variables`	Vector of variable names in data to be tested independently. Each variable will take the place of '[*]' in the supplied formula. REQUIRED.
`blocksize`	Number of variables to test in each foreach loop. DEFAULT = 500. OPTIONAL.
`cores`	CPU cores / threads. DEFAULT = 3. OPTIONAL.
`nestedParallel`	In RegParallel, parallelisation initially occurs at the block level, ie., multiple blocks of models are processed in parallel. If nestedParallel is enabled, a second level of parallelisation occurs within each block in addition. Warning! - this doubles the usage of cores. DEFAULT = FALSE. OPTIONAL.
`p.adjust`	Method for adjusting p-values for false discovery rate. Must be one of 'holm', 'hochberg', 'hommel', 'bonferroni', 'BH', 'BY', 'fdr', 'none'. See ?p.adjust for further details. DEFAULT = 'none'. OPTIONAL
`conflevel`	Confidence level for calculating odds or hazard ratios. DEFAULT = 95. OPTIONAL.
`excludeTerms`	Remove these terms from the final output. These will simply be grepped out. DEFAULT = NULL. OPTIONAL.
`excludeIntercept`	Remove intercept terms from the final output. DEFAULT = TRUE. OPTIONAL.

Details

In many analyses, a large amount of variables have to be tested independently against the trait/endpoint of interest, and also adjusted for covariates and confounding factors at the same time. The major bottleneck in these is the amount of time that it takes to complete these analyses. With RegParallel, a large number of tests can be performed simultaneously. On a 12-core system, 144 variables can be tested simultaneously, with 1000s of variables processed in a matter of seconds via 'nested' parallel processing. Works for logistic regression, linear regression, conditional logistic regression, Cox proportional hazards and survival models, and Bayesian logistic regression.

Value

A data.table object.

Author(s)

Kevin Blighe <kevin@clinicalbioinformatics.co.uk>

Examples


  options(scipen=10)
  options(digits=6)

  col <- 20000
  row <- 20
  mat <- matrix(
    rexp(col*row, rate = .1),
    ncol = col)
  colnames(mat) <- paste0('gene', 1:ncol(mat))
  rownames(mat) <- paste0('sample', 1:nrow(mat))

  modelling <- data.frame(
    cell = rep(c('B', 'T'), nrow(mat) / 2),
    group = c(rep(c('treatment'), nrow(mat) / 2), rep(c('control'), nrow(mat) / 2)),
    dosage = t(data.frame(matrix(rexp(row, rate = 1), ncol = row))),
    mat,
    row.names = rownames(mat))

  data <- modelling[,1:2000]
  variables <- colnames(data)[4:ncol(data)]
  res1 <- RegParallel(
    data = data,
    formula = 'factor(group) ~ [*] + (cell:dosage) ^ 2',
    FUN = function(formula, data)
      glm(formula = formula,
        data = data,
        family = binomial(link = 'logit'),
        method = 'glm.fit'),
    FUNtype = 'glm',
    variables = variables,
    blocksize = 700,
    cores = 2,
    nestedParallel = TRUE,
    #p.adjust = "bonferroni",
    conflevel = 99,
    excludeTerms = NULL,
    excludeIntercept = TRUE
  )

  # spot checks
  m <- glm(factor(group) ~ gene265 + (cell:dosage) ^ 2, data=data, family=binomial)
  summary(m)$coefficients
  exp(cbind("Odds ratio" = coef(m), confint.default(m, level = 0.99)))
  res1[which(res1$Variable == 'gene265'),]

  m <- glm(factor(group) ~ gene1688 + (cell:dosage) ^ 2, data=data, family=binomial)
  summary(m)$coefficients
  exp(cbind("Odds ratio" = coef(m), confint.default(m, level = 0.99)))
  res1[which(res1$Variable == 'gene1688'),]


  ###


  data <- modelling[,1:500]
  variables <- colnames(data)[4:ncol(data)]
  res2 <- RegParallel(
    data = data,
    formula = '[*] ~ cell:dosage',
    FUN = function(formula, data)
      glm(formula = formula,
        data = data,
        family = gaussian,
        method = 'glm.fit'),
    FUNtype = 'glm',
    variables = variables,
    blocksize = 496,
    cores = 2,
    nestedParallel = TRUE,
    p.adjust = "none",
    conflevel = 90,
    excludeTerms = NULL,
    excludeIntercept = FALSE
  )

  # spot checks
  m <- glm(gene29 ~ cell:dosage, data=data, family=gaussian)
  summary(m)$coefficients
  exp(cbind("Odds ratio" = coef(m), confint.default(m, level = 0.90)))
  res2[which(res2$Variable == 'gene29'),]


  ###


  data <- modelling[,1:500]
  variables <- colnames(data)[4:ncol(data)]
  res3 <- RegParallel(
    data = data,
    formula = 'as.numeric([*]) ~ dosage ^ 3',
    FUN = function(formula, data)
      lm(formula = formula,
        data = data),
    FUNtype = 'lm',
    variables = variables,
    blocksize = 200,
    cores = 2,
    nestedParallel = FALSE,
    p.adjust = "holm",
    conflevel = 99.999,
    excludeTerms = NULL,
    excludeIntercept = FALSE
  )

  # spot checks
  m <- lm(as.numeric(gene454) ~ dosage ^ 3, data=data)
  summary(m)$coefficients
  exp(cbind("Odds ratio" = coef(m), confint.default(m, level = 0.99999)))
  res3[which(res3$Variable == 'gene454'),]


  ###


  require(survival)
  data <- modelling[,1:800]
  variables <- colnames(data)[4:ncol(data)]
  data$time <- c(100,200,400,300,200,250,600,1000,886,450,
    c(100,200,400,300,200,250,600,1000,886,450)*1.5)
  data$alive <- c(0,0,0,0,0,0,0,0,1,1,1,0,0,1,1,1,1,1,1,1)
  res4 <- RegParallel(
    data = data,
    formula = 'Surv(time, as.integer(alive)) ~ group * [*] + cell',
    FUN = function(formula, data)
      coxph(formula = formula,
        data = data,
        ties = 'breslow',
        singular.ok = TRUE),
    FUNtype = 'coxph',
    variables = variables,
    blocksize = 399,
    cores = 2,
    nestedParallel = FALSE,
    p.adjust = "hommel",
    conflevel = 97.5,
    excludeTerms = c('group', 'cell'),
    excludeIntercept = FALSE
  )

  # spot checks
  m <- coxph(formula = Surv(time, as.integer(factor(alive))) ~ group * gene12 + cell, data = data, ties = 'breslow', singular.ok = TRUE)
  summary(m)
  exp(cbind("Hazards ratio" = coef(m), confint.default(m, level = 0.975)))
  res4[which(res4$Variable == 'gene12'),]

  m <- coxph(formula = Surv(time, as.integer(factor(alive))) ~ group * gene267 + cell, data = data, ties = 'breslow', singular.ok = TRUE)
  summary(m)
  exp(cbind("Hazards ratio" = coef(m), confint.default(m, level = 0.975)))
  res4[which(res4$Variable == 'gene267'),]


  ###


  data <- modelling[,1:500]
  variables <- colnames(data)[4:ncol(data)]
  res5 <- RegParallel(
    data = data,
    formula = 'as.integer(factor(group)) ~ [*] * strata(cell) + dosage',
    FUN = function(formula, data)
      clogit(formula = formula,
        data = data,
        ties = 'breslow',
        singular.ok = TRUE),
    FUNtype = 'clogit',
    variables = variables,
    blocksize = 200,
    cores = 2,
    nestedParallel = FALSE,
    p.adjust = "fdr",
    conflevel = 50,
    excludeTerms = 'non-existent term',
    excludeIntercept = FALSE
  )

  # spot checks
  m <- clogit(formula = as.integer(factor(group)) ~ gene145 * strata(cell) + dosage, data = data, ties = 'breslow', singular.ok = TRUE)
  summary(m)
  exp(cbind("Hazards ratio" = coef(m), confint.default(m, level = 0.5)))
  res5[which(res5$Variable == 'gene145'),]

  m <- clogit(formula = as.integer(factor(group)) ~ gene34 * strata(cell) + dosage, data = data, ties = 'breslow', singular.ok = TRUE)
  summary(m)
  exp(cbind("Hazards ratio" = coef(m), confint.default(m, level = 0.5)))
  res5[which(res5$Variable == 'gene34'),]


  ###


  data <- modelling[,1:5000]
  variables <- colnames(data)[4:ncol(data)]
  res6 <- RegParallel(
    data = data,
    formula = 'as.numeric(factor(cell)) ~ [*]:dosage',
    FUN = function(formula, data)
      bayesglm(formula = formula,
        data = data,
        prior.mean = 2),
    FUNtype = 'bayesglm',
    variables = variables,
    blocksize = 500,
    cores = 2,
    nestedParallel = FALSE,
    p.adjust = "fdr",
    conflevel = 99,
    excludeTerms = NULL,
    excludeIntercept = FALSE
  )

  # spot checks
  m <- bayesglm(formula = as.numeric(factor(cell)) ~ gene1645:dosage, data = data, prior.mean = 2)
  summary(m)
  exp(cbind("Odds ratio" = coef(m), confint.default(m, level = 0.99)))
  res6[which(res6$Variable == 'gene1645'),]

  m <- bayesglm(formula = as.numeric(factor(cell)) ~ gene3664:dosage, data = data, prior.mean = 2)
  summary(m)
  exp(cbind("Odds ratio" = coef(m), confint.default(m, level = 0.99)))
  res6[which(res6$Variable == 'gene3664'),]

kevinblighe/RegParallel documentation built on Oct. 2, 2023, 2:55 p.m.

kevinblighe/RegParallel index

README.md

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

kevinblighe/RegParallel
Standard regression functions in R enabled for parallel processing over large data-frames

RegParallel: Standard regression functions in R enabled for parallel...
In kevinblighe/RegParallel: Standard regression functions in R enabled for parallel processing over large data-frames

Standard regression functions in R enabled for parallel processing over large data-frames

Description

Usage

Arguments

Details

Value

Author(s)

Examples

Related to RegParallel in kevinblighe/RegParallel...

R Package Documentation

Browse R Packages

We want your feedback!

kevinblighe/RegParallel Standard regression functions in R enabled for parallel processing over large data-frames

RegParallel: Standard regression functions in R enabled for parallel... In kevinblighe/RegParallel: Standard regression functions in R enabled for parallel processing over large data-frames

Standard regression functions in R enabled for parallel processing over large data-frames

Description

Usage

Arguments

Details

Value

Author(s)

Examples

Related to RegParallel in kevinblighe/RegParallel...

R Package Documentation

Browse R Packages

We want your feedback!

kevinblighe/RegParallel
Standard regression functions in R enabled for parallel processing over large data-frames

RegParallel: Standard regression functions in R enabled for parallel...
In kevinblighe/RegParallel: Standard regression functions in R enabled for parallel processing over large data-frames