In gbonte/gbcode: Code from the handbook "Statistical foundations of machine learning"

This notebook implements a Machine Learning playground to assess by simulation a number of theoretical concepts. The focus will be on regression. Given the playground nature and the need to make visible the results we will limit to consider learning tasks with small number of inputs ($1 \le n \le 5$) and univariate output.

rm(list=ls())
library(mvtnorm) ## multivariate normal density and random generator
library(scatterplot3d)
library(rlang)

cols=rep(c("red","green","blue","magenta","black","yellow"),3)

Monte Carlo simulation

The entire playground relies on Monte Carlo simulation. The basic assumption is that the number $S$ of Monte Carlo trials is sufficiently large to make extremely accurate all the Monte Carlo estimations of probabilistic quantities. In what follows we will not distinguish a Monte Carlo estimation from the real value.

S= 10000 ## number of Monte Carlo trials

For instance if we want to compute $E[g({\bf x})]$ where $g=|x|$ and ${\bf x} \sim {\mathcal N}(\mu=0,\sigma^2=2)$ we will assume that

mean(abs(rnorm(S,0,sd=sqrt(2))))

returns the correct value.

Data generating process

This section will show how to define a data generating process to create training and test sets.

Main parameters

The first quantities to be defined are the size of the training set and dimension of the input space.

N= 100 ## size of training set
n= 3 ## input dimension

Input distribution

Here we define the input multivariate distribution and provide a function which returns the input training set. In order to make general the input distribution we consider a mixture of $m$ gaussians

M=3 ## number of mixture components

Inputs<-function(N,n,m,centers=NULL,sds=NULL){
  ## m : number of mixture components

  X=NULL
  W=NULL ## keep strace of which component was sampled

  w=runif(m)
  w=w/sum(w)  

  if (is.null(centers))
    centers=array(rnorm(m*n),c(m,n))
  if (is.null(sds))
    sds=array(runif(m*n),c(m,n))


  for (i in 1:N){
    whichm=sample(1:m,1,prob=w)
    W=c(W,whichm)
    if (n>1)
      X=rbind(X,rmvnorm(1,mean=centers[whichm,],sigma=diag(sds[m,])))
    else
      X=c(X,rnorm(1,mean=centers[whichm],sd=sds[m]))

  }

  return(list(X=X,W=W,centers=centers,sds=sds))
}

Conditional expectation function

In a regression task we have to define the conditional expectation $E_[{\bf y}|x]=f(x)$ The R function below implements the function $f(x)$.

condexp<-function(X){
  n=NCOL(X)
  if (n==1){
    f=(X[,1]+2)
    #f=sin(2*pi*(X[,1])) ## put here your own function
  }
  if (n>1){
    f=((X[,1]+X[,2])) ## put here your own function
    f=(X[,1]^2+X[,2]^2+X[,1]*X[,2])
    f=(X[,1]+X[,2])
  }
  return(f)
}

Data set generation

In this section we show how to generate the training set $D_N$ of $N$ samples.

Input training generation

Inptr=Inputs(N,n,M)
X=array(Inptr$X,c(N,n))

Regression target generation

Target following ${\bf y}=f(x)+{\bf w}$

sdw=0.25  ## conditional variance (or noise variance)

Y=condexp(X)+rnorm(N,sd=sdw)

Dr=data.frame(cbind(Y,X))
colnames(Dr)<-c("y",paste("x",1:n,sep=""))

Data set visualization

You will find below some possible visualizations of the training set

Input data visualization

#ggplot(data = D, aes(x=x1,y=y))+geom_point()
if (n==2){
  plot(Dr$x1,Dr$x2,xlab="x1",ylab="x2", main="Input distribution")

}

Input/output data visualization

pairs(Dr)

if (n==2)
  with(Dr, {
    scatterplot3d(x = x1,
                  y = x2, 
                  z = y,
                  type = "h",
                  main="3-D Scatterplot")
  })

#library(plotly)
#fig<-plot_ly(data=Dr, x=~x1,y=~x2, z=~y,type="scatter3d", mode="markers")
#fig

Regression learner

In this section we define the learner, i.e. the estimator that given a training set and a test query point returns the prediction. Note that the entire learning process should be described in this function, notably the feature selection, parameter identification, model selection and prediction.

We will code some regression learners with different complexities

Constant regression learner

It is one of the simplest you could imagine. It just returns as prediction the average of the output

lslearn0<-function(D,Xts){

  Nts=NROW(Xts)

  Ytr<-D$y
  Yhat=numeric(Nts)+mean(Y)
  return(Yhat)
}

Least-squares regression learners

The first regressor uses only the first variable $x_1$

lslearn1<-function(D,Xts){
  lambda=0.001
  Ntr=NROW(D)
  Nts=NROW(Xts)
  n=NCOL(D)-1
  Xtr<-D[,2:NCOL(D)]
  if (n>1){
    Xtr<-cbind(numeric(Ntr)+1, Xtr[,1] )
    Xts<-cbind(numeric(Nts)+1,Xts[,1])
  } else {
    Xtr<-cbind(numeric(Ntr)+1, Xtr )
    Xts<-cbind(numeric(Nts)+1,Xts)

  }
  Ytr<-D$y
  betahat<-solve(t(Xtr)%*%Xtr+lambda*diag(NCOL(Xtr)))%*%t(Xtr)%*%Ytr
  Yhat=Xts%*%betahat
  return(Yhat)
}

The second regressor uses all original variables:

lslearn2<-function(D,Xts){
  Ntr=NROW(D)
  Nts=NROW(Xts)
  Xtr<-as.matrix(D[,2:NCOL(D)])
  Xtr<-cbind(numeric(Ntr)+1,Xtr)
  Xts<-cbind(numeric(Nts)+1,Xts)
  Ytr<-D$y
  betahat<-solve(t(Xtr)%*%Xtr)%*%t(Xtr)%*%Ytr
  Yhat=Xts%*%betahat
  return(Yhat)
}

The third regressor uses all original variables and add as feature some transformations of the original variables (rasise them to the 2nd and 3rd power)

lslearn3<-function(D,Xts){
  Ntr=NROW(D)
  Nts=NROW(Xts)
  Xtr<-as.matrix(D[,2:NCOL(D)])
  Xtr<-cbind(numeric(Ntr)+1,Xtr,Xtr^2,Xtr^3)
  Xts<-cbind(numeric(Nts)+1,Xts,Xts^2,Xts^3)
  Ytr<-D$y
  betahat<-solve(t(Xtr)%*%Xtr)%*%t(Xtr)%*%Ytr
  Yhat=Xts%*%betahat
  return(Yhat)
}

Testset i.i.d. generation

The code below shows how to draw i.i.d. test samples from the same distribution as the training set. Note that the

the input distribution has the same parameters (centers and std devs of the mixture components) as the training distribution.
the conditional expectation function is the same as in the training set
the conditional variance is the same

So are the data different? and why?

Inpts=Inputs(S,n,M,Inptr$centers,Inptr$sds)
Xts=array(Inpts$X,c(NROW(Inpts$X),n))
Yts=condexp(Xts)+rnorm(S,sd=sdw)

Generalization assessment of a learner

For a set of learners, we compute here functional risk, expected generalization error and we check the bias/variance decomposition of the mean-squared-error. Note that we may compute those quantities only because because we are in a simulated setting and we have a complete control over the data generation process.

Functional risk

The functional risk for a given training set $D_N$ and a hypothesis $h(x,\alpha_N)$ $$ R(\alpha_N)=E_{{\bf x}, {\bf y}} [({\bf y}-h({\bf x},\alpha_N))]$$ where $h(x,\alpha_N)$ is the hypothesis returned by a learner trained on the dataset $D_N$

for (learner in c("lslearn0","lslearn1","lslearn2","lslearn3")){
  Yhat<-exec(learner,Dr,Xts)
  RISK= mean((Yts-Yhat)^2)
  cat("Learner=",learner, ":\n Functional Risk=",RISK, "\n")
}

MISE

The MISE is $G_N=E_{{\bf D}_N}[R(\alpha_N)]$

Bias, variance and MSE

Bias: $B(x)=E_{{\bf D}N}[E[{\bf y} |x] -h(x,\alpha_N)]=f(x)-E{{\bf D}_N}[h(x,\alpha_N)]$

Variance: $V(x)=E_{{\bf D}N} [( h(x,\alpha_N)-E{{\bf D}_N}[h(x,\alpha_N)] )^2]$

MSE: $\text{MSE}(x)=B^2(x)+V(x)+\sigma_{{\bf w}}^2$

MISE: $\text{MISE}=E_{{\bf x}}[\text{MSE}({\bf x})]=E_{{\bf x}}[B^2({\bf x})+V({\bf x})+\sigma_{{\bf w}}^2]= E_{{\bf x}}[B^2({\bf x})]+E_{{\bf x}}[V({\bf x})]+\sigma_{{\bf w}}^2$

The following code computes MISE, bias, variance and verifies the identity above.

for (learner in c("lslearn0","lslearn1","lslearn2","lslearn3")){
  EEN=NULL
  YhatN<-NULL
  EN<-NULL

  ## generation test set input
  Inpts=Inputs(S,n,M)
  Xts=array(Inpts$X,c(S,n))


  ## Monte Carlo trials to compute the expectation over DN
  for (s in 1:200){

    ## generation i.i.d. training set (fixed N)
    Inp=Inputs(N,n,M,Inpts$centers,Inpts$sds)
    Xtr=array(Inp$X,c(N,n))   
    Ytr=condexp(Xtr)+rnorm(N,sd=sdw)
    Dr=data.frame(cbind(Ytr,Xtr))   
    colnames(Dr)<-c("y",paste("x",1:n,sep=""))

    ## generation i.i.d. test set
    fts=condexp(Xts)
    Yts=fts+rnorm(NROW(Xts),sd=sdw)  


    Yhat<-exec(learner,Dr,Xts)  ## prediction in test set

    YhatN<-cbind(YhatN,Yhat)
    EN<-cbind(EN,Yhat-fts)
    EEN=cbind(EEN,Yts-Yhat)

  }

  MSEx=apply(EEN^2,1,mean) ## MSE(x): MSE for test inputs
  MISE=mean(MSEx)

  Bx=apply(EN^2,1,mean) ## B(x): squared bias for test inputs
  B2= mean(Bx) ## Expectation over x of B(x)

  V2x=apply(YhatN,1,var) ## VAR(x): variance for test inputs
  V2= mean(V2x) ## Expectation over x of VAR(x)


  cat("\n Assessment MISE learner=",learner, ":\n MISE=",MISE,"\n Bias^2=",B2,";Variance=",V2,
      "\n Bias^2+Variance+Noise=", B2+V2+sdw^2,"\n")
}

Which model is the best one? Why in your opinion?

For the student

This is just a starting point for you to start playing with data. Now you could:

take a single dataset (this is what happens in reality) and compute the cross-validation error and check if you obtain a similar ranking,
design and assess other learners (e.g. local learners, regularized least-squares),
change the target function in condexp,
try different values of $N$, $n$, $\sigma_w^2$ and discuss the impact on final generalization,
compare a number of linear least-squares with different $\lambda$ parameters,
change the test distribution wrt training data distribution and assess the impact,
add a feature selection step and see the impact on bias and variance,
implement a gradient-based version of least-squares and check the impact on variance.

gbonte/gbcode documentation built on Aug. 30, 2024, 1:11 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

gbonte/gbcode
Code from the handbook "Statistical foundations of machine learning"

In gbonte/gbcode: Code from the handbook "Statistical foundations of machine learning"

Monte Carlo simulation

Data generating process

Main parameters

Input distribution

Conditional expectation function

Data set generation

Input training generation

Regression target generation

Data set visualization

Input data visualization

Input/output data visualization

Regression learner

Constant regression learner

Least-squares regression learners

Testset i.i.d. generation

Generalization assessment of a learner

Functional risk

MISE

Bias, variance and MSE

For the student

R Package Documentation

Browse R Packages

We want your feedback!

gbonte/gbcode Code from the handbook "Statistical foundations of machine learning"

In gbonte/gbcode: Code from the handbook "Statistical foundations of machine learning"

Monte Carlo simulation

Data generating process

Main parameters

Input distribution

Conditional expectation function

Data set generation

Input training generation

Regression target generation

Data set visualization

Input data visualization

Input/output data visualization

Regression learner

Constant regression learner

Least-squares regression learners

Testset i.i.d. generation

Generalization assessment of a learner

Functional risk

MISE

Bias, variance and MSE

For the student

R Package Documentation

Browse R Packages

We want your feedback!

gbonte/gbcode
Code from the handbook "Statistical foundations of machine learning"