phewas: Function to perform a PheWAS analysis
In PheWAS/PheWAS: Phenome Wide Association Studies (PheWAS)

phewas

R Documentation

Description

This function will perform a PheWAS analysis, optionally adjusting for other variables. It is parallelized using the base package parallel.

Usage

phewas(phenotypes, genotypes, data, covariates = c(NA), adjustments = list(NA), 
  outcomes, predictors, cores = 1, additive.genotypes = T, 
  significance.threshold, alpha = 0.05, unadjusted = F, return.models = F,
  min.records = 20, MASS.confint.level=NA, quick.confint.level,
  clean.phecode.predictors=F
  )

Arguments

`phenotypes`	The names of the outcome variables in data under study. These can be logical (for logistic regression) or continuous (for linear regression) columns. Can alternatively be a data frame of phenotypes, see Details for more information.
`genotypes`	The names of the prediction variables in data under study. These can be logical or continuous. Can alternatively be a data frame of genotypes, see Details for more information.
`data`	Data frame containing all variables for the analysis. Omit it when data frames are supplied for the other parameters; see Details for more information.
`covariates`	The names of the covariates to appear in every analysis. Can alternatively be a data frame of covariates, see Details for more information.
`adjustments`	A list containing special one-off adjustments for the analysis. `list(NA)` will yield no special adjustments. Use the `covariates` parameter to adjust the analyses under normal circumstances. See Details for more information.
`outcomes`	An alternative to `phenotypes`. It will be ignored if `phenotypes` exists.
`predictors`	An alternative to `genotypes`. It will be ignored if `genotypes` exists.
`cores`	The number of cores to use in the parallel socket cluster implementation. If `cores=1`, `lapply` will be used instead.
`additive.genotypes`	Are additive genotypes being supplied? If so, it will attempt to calculate allele frequencies and HWE values. Default is TRUE.
`significance.threshold`	A vector of desired significance thresholds to calculate. Can include "p-value", "bonferroni", "fdr", "simplem-genotype", "simplem-phenotype", "simplem-product". Note that simpleM-based methods can be time intensive. See Details for more information.
`alpha`	The base alpha for significance calculations.
`unadjusted`	Use Chi-Square and t-tests instead? This is a much simpler implementation. Defaults FALSE.
`return.models`	Return a list of the complete models, named by the string formula used to create them, as well as the results. Default is FALSE.
`min.records`	The minimum number of records required to perform a test. For logistic regression, there must be at least this number of both cases and controls; for linear regression, this is the minimum total number of records. Default is 20.
`MASS.confint.level`	Uses the `MASS` package and the `confint` function to calculate a confidence interval at the specified level. `confint` uses a profile likelihood method, which takes some time to compute. Output is stored in the `lower` and `upper` columns. Logistic models will report OR CIs and linear models will report beta CIs. Default is NA, which does not calculate confidence intervals.
`quick.confint.level`	Calculate a confidence interval based on `beta + or - qnorm * SE`. Output is stored in the `lower.q` and `upper.q` columns. For logistic models, the bounds are exponentiated to give OR confidence intervals.
`clean.phecode.predictors`	If phecodes are used as predictors, this option will enable a result post-processing step that will alter the snp/predictor column to contain only the phecode. It defaults to `FALSE` as this provides the best clarity. Clean phecodes likely indicate either continuous input or a boolean TRUE. The actual code is: sub("`([0-9.]+)`(TRUE)?","\\1",predictor)
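
To see what the substitution quoted for `clean.phecode.predictors` does, here is a minimal base R sketch. The predictor labels are made-up examples of how a backtick-quoted phecode column might appear in model output; they do not come from the package.

```r
# Hypothetical predictor labels: a logical phecode predictor, a continuous
# phecode predictor, and a non-phecode predictor.
predictor <- c("`250.2`TRUE", "`401.1`", "rs1234")

# The substitution quoted above: keep only the phecode, dropping the
# backticks and an optional trailing TRUE.
cleaned <- sub("`([0-9.]+)`(TRUE)?", "\\1", predictor)
print(cleaned)
```

Labels that do not match the pattern, such as the SNP name here, pass through unchanged.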

Details

The complete data frame can be passed in using the data parameter with name vectors in phenotypes, genotypes, covariates, and adjustments parameters. Alternatively, phenotypes, genotypes, covariates, and adjustments can each be data frames. They will be merged using the shared columns between phenotypes and genotypes, ideally being an ID column.
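
The merge on shared columns can be pictured with base R's `merge()`. This is only an illustration of the idea, not the package's internal code: the column names `id`, `T2D`, and `rs1234` are invented, and whether `phewas` keeps unmatched rows is not stated here (the sketch uses the `merge()` default of keeping only matching rows).

```r
# Hypothetical inputs: one phenotype table and one genotype table that
# share only the `id` column.
phenotypes <- data.frame(id = c(1, 2, 3, 4),
                         T2D = c(TRUE, FALSE, NA, TRUE))
genotypes <- data.frame(id = c(1, 2, 3, 5),
                        rs1234 = c(0, 1, 2, 1))

# Join on whatever columns the two data frames have in common.
shared <- intersect(names(phenotypes), names(genotypes))
merged <- merge(phenotypes, genotypes, by = shared)
print(merged)
```

If the two data frames accidentally share a non-ID column name, that column would also become part of the join key, which is why a single shared ID column is the ideal case.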

covariates are those variables that are included in every model, e.g., age and gender. adjustments are additional variables that can be used to compare models by adjusting for potentially confounding factors. Each element of the adjustments list defines one round of analyses: NA performs a round with no special adjustment, a single name adjusts for that one variable, and a vector of names adjusts for several variables at once. An adjustments parameter of list(NA, "BMI", c("BMI", "smoking")) would first run all models with no extra adjustment, then run them again adjusting additionally for BMI, and then perform a third round adjusting for both BMI and smoking status.
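
As a rough sketch of how covariates and adjustments combine, the following base R code builds one formula per adjustment set for a single outcome and predictor. The names `T2D` and `rs1234` are hypothetical, and the term order is an assumption; the formulas the package actually generates may look different.

```r
covariates <- c("age", "gender")
adjustments <- list(NA, "BMI", c("BMI", "smoking"))

# One formula per adjustment set; NA contributes no extra terms.
formulas <- lapply(adjustments, function(adj) {
  extra <- adj[!is.na(adj)]
  reformulate(c("rs1234", covariates, extra), response = "T2D")
})

# Readable labels, one per round of analyses.
labels <- vapply(formulas,
                 function(f) paste(deparse(f), collapse = " "),
                 character(1))
print(labels)
```

Every formula contains the covariates, while the adjustment terms vary from round to round, so results from different rounds can be compared for the same outcome and predictor.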

These results can be directly plotted using the phewasManhattan function, assuming that models are not returned. If they are, the results item of the returned list needs to be used.

Value

The following are the default columns included in the returned data frame. The attributes of the returned data frame contain additional information about the analysis. If a model did not have sufficient cases or controls for analysis or failed to converge, NAs will be reported and a note will be added in the note field.

`phenotype`	The outcome under study
`snp`	The predictor under study
`adjustment`	The one-off adjustment used
`beta`	The beta coefficient for the predictor
`SE`	The standard error for the beta coefficient
`lower.q`	The lower bound of the quick confidence interval, if requested
`upper.q`	The upper bound of the quick confidence interval, if requested
`lower`	The lower bound of the `confint` confidence interval, if requested
`upper`	The upper bound of the `confint` confidence interval, if requested
`OR`	For logistic regression, the odds ratio for the predictor
`p`	The p-value for the predictor
`type`	The type of regression model used
`n_total`	The total number of records in the analysis
`n_cases`	The number of cases in the analysis (logical outcome only)
`n_controls`	The number of controls in the analysis (logical outcome only)
`HWE_p`	The Hardy-Weinberg equilibrium p-value for the predictor, assuming 0,1,2 allele coding
`allele_freq`	The allele frequency in the predictor for the coded allele
`n_no_snp`	The number of records with a missing predictor
`note`	Additional warning or error information
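
The relationship between `beta`, `SE`, `OR`, and the quick confidence interval described for `quick.confint.level` can be reproduced in base R. This is a minimal sketch on simulated data, assuming a two-sided interval of the form `beta + or - qnorm * SE`. It is not the package's own code, and the variable names are invented.

```r
set.seed(1)
n <- 500
genotype <- rbinom(n, 2, 0.3)                          # additive 0/1/2 coding
case <- rbinom(n, 1, plogis(-1 + 0.4 * genotype)) == 1 # logical outcome

# Logistic regression of the outcome on the genotype.
fit <- glm(case ~ genotype, family = binomial)
est <- coef(summary(fit))["genotype", ]
beta <- est[["Estimate"]]
se <- est[["Std. Error"]]

# Wald-type interval on the log-odds scale, exponentiated to the OR scale.
level <- 0.95
z <- qnorm(1 - (1 - level) / 2)
quick <- exp(c(OR = beta, lower.q = beta - z * se, upper.q = beta + z * se))
print(quick)
```

This interval needs only the coefficient table, which is why it is much cheaper than the profile likelihood interval produced by `confint`.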

If there are any requested significance thresholds, boolean variables will be included reporting significance. If return.models=T, a list is returned. The named item results contains the above data frame. The named item models contains a list of the models generated in the analysis. To distinguish models, the list is named by the full formula used in generation.
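
The boolean significance columns can be pictured with a small base R sketch for the "bonferroni" and "fdr" thresholds. The p-values are made up, and both the column names and the way tests are counted (non-missing p-values only) are assumptions rather than the package's documented behavior.

```r
alpha <- 0.05
p <- c(0.00001, 0.004, 0.03, 0.2, NA)  # NA stands for a model that was not fit

# Count only the tests that produced a p-value.
tested <- sum(!is.na(p))

results <- data.frame(
  p = p,
  bonferroni = p < alpha / tested,                  # Bonferroni-corrected cutoff
  fdr = p.adjust(p, method = "fdr") < alpha         # Benjamini-Hochberg FDR
)
print(results)
```

Rows without a p-value stay NA in both columns rather than being reported as non-significant.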

Author(s)

Robert Carroll

Examples


# Load the package
library(PheWAS)
# Generate some example data
ex <- generateExample(hit = "335")
# Extract the two parts from the returned list
id.icd9.count <- ex$id.icd9.count
genotypes <- ex$genotypes
# Create the PheWAS code table: translates the ICD9 codes, adds exclusions,
# and reshapes to a wide format
phenotypes <- createPhewasTable(id.icd9.count)
# Run the PheWAS on four cores
results <- phewas(phenotypes, genotypes, cores = 4)

PheWAS/PheWAS documentation built on July 3, 2023, 3:40 p.m.