library("Rcpp")
library(glmnet)
library(glm2)
library(dplyr)
library(ggplot2)
#Rcpp::sourceCpp('~/Desktop/Courses/625Bigdata/Logistic_package/src/LogRegCpp.cpp')
#Rcpp::sourceCpp('~/Desktop/Courses/625Bigdata/LogisticReg/src/LogRegCpp.cpp')
Logreg<-function(X,y,maxit = 5000){
  n<-dim(X)[1]
  X<-cbind(rep(1,n),X)        # add an intercept column
  p<-dim(X)[2]
  ### Fit with the Rcpp routine, starting from a zero coefficient vector
  result <- LogRegcpp(X,rep(0,p),y,maxit = maxit)
  result$loss <- result$loss[result$loss != 0]      # drop unused (zero) loss entries left after early stopping
  result$prediction <- result$P > 0.5               # classify by thresholding the fitted probability
  result$accuracy <- mean(result$prediction == y)   # training accuracy

  return(result)
}

My_predict<-function(fit,newx){
  n<-dim(newx)[1]
  X<-cbind(rep(1,n),newx)       # add an intercept column, as in Logreg
  result <- X%*%fit$x           # linear predictor
  return(as.numeric(result>0))  # positive linear predictor <=> probability > 0.5
}

Generating binomial data

The model is a Gaussian mixture model:

There are two groups. For a sample $j$ with label $i$ ($i$ can be 0 or 1), the distribution is $$X_j^{(i)} \sim N(\mu_i,\sigma^2 I).$$ The following is a toy example with 20,000 samples and 100 features, which we use as the training set. Among them, 10,000 samples are labeled 0 and the rest are labeled 1. The test set follows the same design as the training set.

sigma<-4
set.seed(123)
n<-1e4
p<-1e2
mu1<-rnorm(p)
mu2<-rnorm(p)
### byrow = TRUE plus recycling of mu1/mu2 makes column j of X1/X2 have mean mu1[j]/mu2[j]
X1<-matrix(mu1+rnorm(n*p,0,sigma),n,p,byrow = TRUE)
X2<-matrix(mu2+rnorm(n*p,0,sigma),n,p,byrow = TRUE)
### Train data
X<-rbind(X1,X2)
y<-rep(c(1,0),each=n)
### Test data
test_x<-rbind( matrix(mu1+rnorm(n*p,0,sigma),n,p,byrow = TRUE), 
               matrix(mu2+rnorm(n*p,0,sigma),n,p,byrow = TRUE)  )
test_y<-rep(c(1,0),each=n)
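
As a quick, purely illustrative sanity check (not part of the original analysis), we can confirm the dimensions and the label balance:

### Illustrative check of the simulated data
dim(X)        # 20000 x 100
table(y)      # 10000 samples in each class
dim(test_x)   # 20000 x 100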

How to use the Logreg function

In most cases, we only need to supply the $n \times p$ data matrix $X_{n,p}$ and the binary response $Y_n$ (whose values are 0 or 1).

t0<-proc.time()
fit<-Logreg(X,y)
t1<-proc.time()-t0

We can check the convergence of the algorithm by plotting the loss function. The function monitors convergence itself and stops as soon as it has converged; the default maximum number of iterations is 5000. In practice, if the plot shows that the results have not converged, we can increase the maxit argument until they do, which is important for good prediction. (As for the optimization detail, we use an Adam-like algorithm, which is quite stable and converges quickly. Unlike plain SGD, it is less sensitive to its tuning parameters, and the defaults handle most situations.)
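
The C++ source of LogRegcpp is not reproduced in this vignette, so the following R sketch only illustrates the general idea of an Adam-style update for the logistic loss; the function name adam_logreg_sketch and the hyperparameter values (alpha, beta1, beta2, eps) are textbook defaults chosen for illustration, not necessarily what the package uses.

### Illustrative Adam-style gradient descent for the logistic loss (R sketch only;
### hyperparameters are standard Adam defaults, not the package's actual settings).
adam_logreg_sketch <- function(X, y, maxit = 500,
                               alpha = 0.1, beta1 = 0.9, beta2 = 0.999, eps = 1e-8) {
  Xd <- cbind(1, X)                  # add intercept column
  w  <- rep(0, ncol(Xd))             # coefficients
  m  <- rep(0, ncol(Xd))             # first-moment estimate
  v  <- rep(0, ncol(Xd))             # second-moment estimate
  loss <- numeric(maxit)
  for (t in seq_len(maxit)) {
    prob <- plogis(drop(Xd %*% w))                    # fitted probabilities
    grad <- drop(crossprod(Xd, prob - y)) / nrow(Xd)  # gradient of the mean -loglikelihood
    m <- beta1 * m + (1 - beta1) * grad
    v <- beta2 * v + (1 - beta2) * grad^2
    w <- w - alpha * (m / (1 - beta1^t)) / (sqrt(v / (1 - beta2^t)) + eps)
    loss[t] <- -mean(y * log(prob + 1e-12) + (1 - y) * log(1 - prob + 1e-12))
  }
  list(x = w, loss = loss)
}
# e.g. sk <- adam_logreg_sketch(X, y, maxit = 500); plot(sk$loss, type = "l")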

### Skip the first 50 iterations so the steep initial drop does not dominate the plot
plot(51:length(fit$loss),fit$loss[-(1:50)],main = "Convergence",xlab = "iteration",ylab = "-loglikelihood")
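
If the curve had not flattened out by the last iteration, we could simply refit with a larger maxit; the value below is only illustrative:

### Only needed when the loss plot suggests the fit has not converged yet
# fit <- Logreg(X, y, maxit = 2e4)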

### Fitted probabilities for 100 samples from each class
plot(fit$P[c(1:100,1e4+(1:100))],main="Probability (Prediction)",xlab = "sample number",ylab = "probability of group 1")

How to predict with my package

We first need the trained model (fit) returned by the Logreg function above. Then only two inputs (fit and the test matrix) are needed to make a new prediction. The training accuracy is stored in fit$accuracy and the fitted coefficient estimates are stored in fit$x.

my_prediction<-My_predict(fit,newx = test_x)
train_acc_mine<-fit$accuracy
test_acc_mine<-mean(my_prediction == test_y)
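
Because My_predict thresholds the linear predictor at 0, it is equivalent to thresholding the fitted probability at 0.5. A quick, purely illustrative check (assuming fit$x stores the coefficients with the intercept first, as above):

### Thresholding the probability at 0.5 gives the same labels as My_predict
prob_test <- plogis(drop(cbind(1, test_x) %*% fit$x))
all((prob_test > 0.5) == my_prediction)   # should be TRUE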

cv.glmnet

cv.glmnet is a mature and very capable function. It uses a lasso or ridge (elastic-net) penalty to improve the fit and avoid overfitting, and cross-validation is used to choose the penalty strength automatically. The trade-off is that it can take considerably more time.

t0<-proc.time()
fit0<-cv.glmnet(X,y,family = "binomial")
t3<-proc.time()-t0
result1<- predict(fit0, newx = X,  type = "class")
result2<- predict(fit0, newx = test_x,  type = "class")
train_acc_cv<-mean(result1==y)
test_acc_cv<-mean(result2==test_y)
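
By default cv.glmnet fits the lasso path (alpha = 1) and selects the penalty by cross-validation. The selected value and the corresponding coefficients can be inspected as follows (shown commented out, since they are not part of the comparison):

### Penalty chosen by cross-validation and the coefficients it keeps
# fit0$lambda.min                # lambda minimizing the cross-validated error
# coef(fit0, s = "lambda.min")   # sparse coefficient vector at that lambda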

glm

t0<-proc.time()
dat<-as.data.frame(cbind(y,X))
fit0<-glm(y~.,data=dat,family=binomial(link=logit))
t4<-proc.time()-t0

### The first n training rows are class 1 and the next n are class 0; since the
### classes are balanced, averaging the two group-wise accuracies gives the
### overall training accuracy.
train_acc_glm<- (mean(fit0$fitted.values[1:n]>0.5) + mean(fit0$fitted.values[n+(1:n)]<0.5))/2

### Linear predictor on the test set (intercept column added by hand):
### rows 1:n are true class 1 (predictor should be positive) and
### rows (n+1):(2*n) are true class 0 (predictor should be negative).
pred_glm <- cbind(1,test_x) %*% fit0$coefficients
test_acc_glm<- mean(pred_glm*rep(c(1,-1),each=n)>0)
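
An equivalent, perhaps more transparent way to obtain the same test accuracy is to convert the linear predictor into probabilities and compare the thresholded labels with test_y directly (purely illustrative, shown commented out):

### Equivalent, more direct computation of the glm test accuracy
# prob_glm <- plogis(drop(pred_glm))    # fitted probabilities on the test set
# mean((prob_glm > 0.5) == test_y)      # essentially the same value as test_acc_glm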

Comparison between the three methods

class <- rep(c("train","test"),3)
accuracy<-c(train_acc_mine,test_acc_mine,train_acc_cv,test_acc_cv,train_acc_glm,test_acc_glm)
method<- rep(c("Mine","cv.glmnet","glm"),each=2)
dat<-data.frame(class,accuracy,method)

ggplot(dat, aes(x = class, y = accuracy, fill = method)) +
  geom_col(position = "dodge") +
  labs(x = "class", y = "Accuracy", fill = "method") +
  coord_cartesian(ylim = c(0.90, 0.97)) +
  ggtitle("Comparison (three models)") +
  theme(plot.title = element_text(hjust = 0.5))

A<-matrix(c(t1[3],t3[3],t4[3]),1,3)
colnames(A)<-c("My LogReg","cv.glmnet","glm")
rownames(A)<-"time (s)"
A

With the lasso penalty and cross-validation, cv.glmnet does not overfit; its accuracy on the test data is even slightly higher than on the training data. However, it is much more time-consuming than the other two methods.

My logistic regression does not overfit either. We may attribute this to the optimization algorithm, which is stable and tends to converge to the global minimum rather than a local one. In addition, because the algorithm is Adam-like, it needs fewer iterations, which keeps the running time short.


