An R package that compares regression models in a model-independent way. The package does not require a model object (e.g. the result of lm or other similar functions). Instead, our functions ask for:
- y: observed variable
- y_pred: predicted variable of a full model
- y_sub: predicted variable of a subset model
- p: number of explanatory variables of a full model (including the intercept)
- k: number of explanatory variables of a subset model (including the intercept)
Because this package does not need a model object, you have more flexibility and control over your models.
You can install regscoreR from GitHub with:
# install.packages("devtools")
devtools::install_github("UBC-MDS/regscoreR")
AIC stands for Akaike’s Information Criterion. It estimates the quality of a model relative to the other candidate models. The lower the AIC score, the better the model, so the model with the lowest AIC among the candidates is chosen.
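AIC = n*log(residual sum of squares/n) + 2p
where:
- n: number of observations
- p: number of explanatory variables of the model (including the intercept)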
aic(y, y_pred, p)
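Parameters:
* y: array-like of shape = (n_samples) or (n_samples, n_outputs)
  * True target variable(s)
* y_pred: array-like of shape = (n_samples) or (n_samples, n_outputs)
  * Fitted target variable(s) obtained from your regression model
* p: int
  * Number of predictive variable(s) used in the model

Return:
* aic_score: numeric
  * AIC score of the model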
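The AIC formula above can be reproduced in a few lines of base R. This is a minimal sketch for a single response vector, not the package's source code: `aic_by_hand` is a hypothetical name used only for illustration, and `log` is assumed to be the natural logarithm.

```r
# Hypothetical helper (not part of regscoreR): AIC computed from the formula.
aic_by_hand <- function(y, y_pred, p) {
  n   <- length(y)              # number of observations
  rss <- sum((y - y_pred)^2)    # residual sum of squares
  n * log(rss / n) + 2 * p      # natural log, as assumed above
}

y      <- c(1, 2, 3, 4)
y_pred <- c(5, 6, 7, 8)
aic_by_hand(y, y_pred, p = 3)   # evaluates 4 * log(64 / 4) + 2 * 3
```

With these inputs every residual is -4, so the residual sum of squares is 64 and the score comes out at about 17.09.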
BIC stands for Bayesian Information Criterion. Like AIC, it estimates the quality of a model. When fitting models, it is possible to improve the fit by adding more parameters, but doing so may result in overfitting. Both AIC and BIC address this problem by adding a penalty term for the number of parameters in the model. The penalty is larger in BIC than in AIC whenever log(n) > 2, that is, for n of 8 or more.
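BIC = n*log(residual sum of squares/n) + p*log(n)
where:
- n: number of observations
- p: number of explanatory variables of the model (including the intercept)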
bic(y, y_pred, p)
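Parameters:
* y: array-like of shape = (n_samples) or (n_samples, n_outputs)
  * True target variable(s)
* y_pred: array-like of shape = (n_samples) or (n_samples, n_outputs)
  * Fitted target variable(s) obtained from your regression model
* p: int
  * Number of predictive variable(s) used in the model

Return:
* bic_score: numeric
  * BIC score of the model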
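To make the difference between the two penalty terms concrete, the sketch below computes BIC from the formula above and tabulates both penalties at two sample sizes. It is an illustration only: `bic_by_hand` and `penalties` are hypothetical names, and the natural logarithm is assumed.

```r
# Hypothetical helper (not part of regscoreR): BIC computed from the formula.
bic_by_hand <- function(y, y_pred, p) {
  n   <- length(y)              # number of observations
  rss <- sum((y - y_pred)^2)    # residual sum of squares
  n * log(rss / n) + p * log(n)
}

# Penalty terms for p = 3 parameters at a small and a larger sample size.
p <- 3
penalties <- data.frame(
  n   = c(4, 100),
  aic = 2 * p,                  # AIC penalty does not depend on n
  bic = p * log(c(4, 100))      # BIC penalty grows with log(n)
)
print(penalties)
```

At n = 4 the BIC penalty (about 4.16) is smaller than the AIC penalty (6), while at n = 100 it is larger (about 13.82), because log(n) only exceeds 2 once n reaches 8.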
Mallows's C_p is named for Colin Lingwood Mallows. It is used to assess the fit of a regression model and to find the best model among those that use a subset of the predictive variables available for predicting some outcome.
C_p = (SSE_k/MSE) - n + 2k
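where:
- SSE_k: residual sum of squares of the subset model with k parameters (including the intercept)
- MSE: mean squared error of the full model with p parameters (including the intercept), where p > k
- n: number of observations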
mallow(y, y_pred, y_sub, p, k)
Parameters:
* y: array-like of shape = (n_samples) or (n_samples, n_outputs)
  * True target variable(s)
* y_pred: array-like of shape = (n_samples) or (n_samples, n_outputs)
  * Fitted target variable(s) obtained from your regression model
* y_sub: array-like of shape = (n_samples) or (n_samples, n_outputs)
  * Fitted target variable(s) obtained from your subset regression model
* p: int
  * Number of predictive variable(s) used in the model
* k: int
  * Number of predictive variable(s) used in the subset model

Return:
* Mallows's C_p score of the subset model
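The C_p formula can also be checked by hand. The sketch below is not the package's implementation: `cp_by_hand` is a hypothetical name, and it assumes that MSE is the full model's residual sum of squares divided by n - p, which requires n > p.

```r
# Hypothetical helper (not part of regscoreR): Mallows's C_p by hand.
cp_by_hand <- function(y, y_pred, y_sub, p, k) {
  n     <- length(y)
  sse_k <- sum((y - y_sub)^2)               # SSE of the subset model
  mse   <- sum((y - y_pred)^2) / (n - p)    # assumed: full-model SSE / (n - p)
  sse_k / mse - n + 2 * k
}

cp_by_hand(
  y      = c(1, 2, 3, 4),
  y_pred = c(5, 6, 7, 8),
  y_sub  = c(1, 2, 3, 5),
  p = 3, k = 2
)                                           # evaluates 1 / 64 - 4 + 2 * 2
```

Here the subset model misses only the last observation by 1, so SSE_k is 1, the full-model MSE is 64 / (4 - 3) = 64, and the result is 0.015625.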
> library(regscoreR)
> y <- c(1,2,3,4)
> y_pred <- c(5,6,7,8)
> p <- 3
>
> aic(y, y_pred, p)
[1] 17.09035
>
>
> bic(y, y_pred, p)
[1] 15.24924
>
>
> y_sub <- c(1,2,3,5)
> k <- 2L
> p <- 3L
> mallow(y, y_pred, y_sub, p, k)
[1] 0.015625
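The session above uses hand-typed vectors. In practice the inputs usually come from fitted models, so here is a hedged sketch of how y, y_pred, y_sub, p and k could be extracted from two lm fits on the built-in mtcars data. The choice of predictors is arbitrary and only for illustration. The scores are computed directly from the formulas in this README so that the block runs without regscoreR installed, and they are compared against base R's extractAIC(), which uses the same n*log(RSS/n) form for lm fits.

```r
# Fit a full model and a subset model on the built-in mtcars data.
full <- lm(mpg ~ wt + hp + qsec, data = mtcars)
sub  <- lm(mpg ~ wt, data = mtcars)

# Extract the plain vectors and counts that the functions ask for.
y      <- mtcars$mpg
y_pred <- unname(fitted(full))
y_sub  <- unname(fitted(sub))
p      <- length(coef(full))    # 4 coefficients, including the intercept
k      <- length(coef(sub))     # 2 coefficients, including the intercept
n      <- length(y)

# Scores from the formulas in this README.
rss_full <- sum((y - y_pred)^2)
rss_sub  <- sum((y - y_sub)^2)
aic_full <- n * log(rss_full / n) + 2 * p
bic_full <- n * log(rss_full / n) + p * log(n)
mse_full <- rss_full / (n - p)  # assumed definition of MSE
cp_sub   <- rss_sub / mse_full - n + 2 * k

# Cross-check against base R; each comparison should print TRUE.
print(all.equal(aic_full, extractAIC(full)[2]))
print(all.equal(bic_full, extractAIC(full, k = log(n))[2]))
print(all.equal(cp_sub, extractAIC(sub, scale = mse_full)[2]))

# With regscoreR installed, the same inputs would be passed as:
# aic(y, y_pred, p); bic(y, y_pred, p); mallow(y, y_pred, y_sub, p, k)
```

Note that stats::AIC() and stats::BIC() are based on the full log-likelihood and therefore differ from these scores by a constant for a given dataset, although differences between models agree.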
This is an open source project. Please follow the guidelines below for contribution.
- Open an issue for any feedback and suggestions.
- For contributing to the project, please refer to Contributing for details.