knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Download data and prepare for the analysis

Data is downloaded from GEO using getGEO function in GEOquery library. Expression matrix with probeset IDs, age of the samples and covarietes to be included in the analysis are extracted from geo object. Please note that this tutorial is just to demonstrate the functionality of this package and is not a proper gene expression analysis tutorial. Thus we skip many potential QC steps and probeset -> gene ID mapping.

library(GEOquery)
geo <- getGEO('GSE30272',destdir = '~/temp/')[[1]]
pd <- Biobase::pData(geo)
expmat <- Biobase::exprs(geo)
ages <- setNames(as.numeric(pd$`age:ch1`), pd$geo_accession)
covs <- list(array = as.factor(setNames(pd$`array batch:ch1`, pd$geo_accession)),
bbs = as.factor(setNames(pd$`brain bank source:ch1`, pd$geo_accession)),
sex = as.factor(setNames(pd$`Sex:ch1`, pd$geo_accession)),
race = as.factor(setNames(pd$`race:ch1`, pd$geo_accession)))
ages <- ages[ages >= 20]
expmat <- expmat[, names(ages)]
covs <- lapply(covs, function(x)x[names(ages)])

Calculating the age-related changes in expression level and heterogeneity

We can simply use calc.het function in our package to have all relevant results:

library(hetAge)
resx <- calc.het(exprmat = expmat[sample(rownames(expmat),1000),], # expression matrix
  age = ages, # age vector
  age_in = 'years',  # scale of the age vector
  age_to = 'pw-0.25', # scale of ages to be used
  batch_corr = 'LR+QN',  # batch correction method
  modex = 'linear', # model for the change in expression level
  tr_log2 = F, # log2 transform input matrix
  sc_features = T, # scale genes / probesets
  covariates = covs, # covariates
  het_change_met = 'spearman', # method to measure change in heterogeneity
  padj_met = 'fdr') # multiple test correction method

Result object

The resulting object is a list with several fields.

The list of samples used in the analysis:

head(resx$sampleID)

The expression matrix inputted in the first place, only including the samples used in the analysis.

resx$input_expr[1:5,1:5]

The ages in the original format, including only the samples used in the analysis

head(resx$input_age)

The ages after re-formetted into desired format

Here we wanted the ages to be in fourth root scale.

head(resx$usedAge)
library(tidyverse)
data.frame(age = resx$input_age, age0.25 = resx$usedAge) %>%
  ggplot(aes(x = age, y = age0.25)) +
  geom_point() +
  geom_smooth( method = 'loess') +
  theme_bw()

Expression matrix after correcting for confounding factors

These are the expression values for the probesets after accounting for the given covariates.

resx$LR_res$correctedExp[1:5,1:5]

Covariate coefficients for each feature

head(resx$LR_res$cov_coef)

P values for the covariate coefficients for each feature based on linear regression

head(resx$LR_res$cov_p)

The expression level matrix used for modeling

This is the matrix used to model age-related expression change - the reason it is different from corrected matrix is because we may further scale the expression level for the features ('sc_features' parameter.)

resx$usedMat[1:5,1:5]

Residuals from the model between expression level and age

Absolute values of this matrix can be considered as the level of heterogeneity for each gene, each sample.

resx$residMat[1:5,1:5]

Data frame with the summaries of age-related changes in expression level and heterogeneity

colnames(resx$feature_result)
head(resx$feature_result) 

Session Info

options(width = 100)
sessioninfo::session_info()


mdonertas/hetAge documentation built on Jan. 2, 2020, 12:53 a.m.