In montilab/BS831: Boston University - Genomics Data Mining

Here, we compare application of linear and logistic regression to perform differential analysis. Linear regression is in general preferable to logistic regression since the former has a closed formed solution, while the latter does not, and must rely on an iterative numerical method to estimate the model's coefficients. Thus, in general there is no reason to use logistic regression when linear regression can be used.

Below, we show a simple application of the two methods to the analysis of a subset of the TCGA Breast Cancer (BrCa) dataset. We compare the methods both in terms of execution time and model estimates.

knitr::opts_chunk$set(message=FALSE, warning=FALSE)

## require() or library() statements
require(BS831)
require(Biobase)
require(ggplot2)

Loading the Data

data(renamedBreastDB)
brca <- renamedBreastDB
table(brca$diseaseState)

Linear vs. Logistic Regression

Now, we use a linear model and a logistic regression model and compare the results of the two. We define two simple functions to encapsulate the relevant steps.

A multi-gene `lm` wrapper

multi.lm <- function(eset,pheno) {
    X <- pData(eset)[,pheno]
    data.frame(t(apply(exprs(eset),1,function(Y) {
        out <- summary(lm( Y ~ X))
        out$coefficients[2,c("t value","Pr(>|t|)")]
    })),check.names=FALSE)
}

A multi-gene logistic function wrapper

Logistic regression is an instance of generalized linear model (glm) using a binomial distribution and a 'logit' link function.

multi.logistic <- function(eset,pheno) {
    Y <- (0:1)[pData(eset)[,pheno]]
    data.frame(t(apply(exprs(eset),1,function(X) {
        out <- summary(glm(Y ~ X, family=binomial(link='logit')))
        out$coefficients[2,c("z value","Pr(>|z|)")]
    })),check.names=FALSE)
}

We next apply the wrapper functions to our BrCa dataset. We first compare their computation time.

Computation Time

## run linear regression
Tlin <- system.time( resultsLin <- multi.lm(brca,pheno="diseaseState") )
head(resultsLin) # show few results

## run logistic regression
Tlog <- system.time( resultsLog <- multi.logistic(brca,pheno="diseaseState") )
head(resultsLog) # show few results

## how much faster is linear compared to logistic?
Tlog["sys.self"]/Tlin["sys.self"]

Linear regression appears to be r round(Tlog["sys.self"]/Tlin["sys.self"],2) times faster than logistic regression. Also, notice that for some of the genes, the iterative estimation procedure did not converge (with the message "glm.fit: algorithm did not converge"), and some genes perfectly separate the phenotypic classes (with the message "glm.fit: fitted probabilities numerically 0 or 1 occurred"), which leads to inflated coefficient estimates.

Estimates

We now compare the scores (t value and z value), and the p-values.

DF <- data.frame(resultsLin,resultsLog)
colnames(DF) <- c("lm.t","lm.pval","logistic.z","logistic.pval")

ggplot(DF,aes(x=lm.t,y=logistic.z)) +
    geom_point() +
    geom_smooth()

## identify the genes with biggest p-value differences
idx <- DF$logistic.pval-DF$lm.pval > .05
sum(idx)
DF[idx,]

## plot the p-values and highlight the biggest differences
ggplot(DF,aes(x=lm.pval,y=logistic.pval)) +
    geom_point() +
    geom_point(data=DF[idx,],aes(x=lm.pval,y=logistic.pval,col="red")) +
    labs(col="difference>0.05",label=NULL)