In manymodelr: Build and Tune Several Models

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

In this vignette, we take a look at how we can simplify many machine learning tasks using manymodelr.

Installation

install.packages("manymodelr")

Once the package has been successfully installed, we can then proceed by loading the package and exploring some of the key functions.

Loading the package

library(manymodelr)
data("yields", package="manymodelr")

Modeling

First, a word of caution. The examples shown in this section are meant to simply show what the functions do and not what the best model is. For a specific use case, please perform the necessary model checks, post-hoc analyses, and/or choose predictor variables and model types as appropriate based on domain knowledge.

With this in mind, let us look at how we can perform modeling tasks using manymodelr.

multi_model_1

This is one of the core functions of the package. multi_model_1 aims to allow model fitting, prediction, and reporting with a single function. The multi part of the function's name reflects the fact that we can fit several model types with one function. An example follows next.

For purposes of this report, we create a simple dataset to use.

set.seed(520)
train_set<-createDataPartition(yields$normal,p=0.6,list=FALSE)
valid_set<-yields[-train_set,]
train_set<-yields[train_set,]
ctrl<-trainControl(method="cv",number=5)
m<-multi_model_1(train_set,"normal",".",c("knn","rpart"), 
                 "Accuracy",ctrl,new_data =valid_set)

The above returns a list containing metrics, predictions, and a model summary. These can be extracted as shown below.

m$metric

head(m$predictions)

multi_model_2

This is similar to multi_model_1 with one difference: it does not use metrics such as RMSE, accuracy and the like. This function is useful if one would like to fit and predict "simpler models" like generalized linear models or linear models. Let's take a look:

# fit a linear model and get predictions
lin_model <- multi_model_2(mtcars[1:16,],mtcars[17:32,],"mpg","wt","lm")

lin_model[c("predicted", "mpg")]

From the above, we see that wt alone may not be a great predictor for mpg. We can fit a multi-linear model with other predictors. Let's say disp and drat are important too, then we add those to the model.

multi_lin <- multi_model_2(mtcars[1:16, ], mtcars[17:32,],"mpg", "wt + disp + drat","lm")

multi_lin[,c("predicted", "mpg")]

fit_model

This function allows us to fit any kind of model without necessarily returning predictions.

lm_model <- fit_model(mtcars,"mpg","wt","lm")
lm_model

fit_models

This is similar to fit_model with the ability to fit many models with many predictors at once. A simple linear model for instance:

models<-fit_models(df=yields,yname=c("height", "weight"),xname="yield",
                   modeltype="glm")

One can then use these models as one may wish. To add residuals from these models for example:

res_residuals <- lapply(models[[1]], add_model_residuals,yields)
res_predictions <- lapply(models[[1]], add_model_predictions, yields, yields)
# Get height predictions for the model height ~ yield 
head(res_predictions[[1]])

If one would like to drop non-numeric columns from the analysis, one can set drop_non_numeric to TRUE as follows. The same can be done for fit_model above:

fit_models(df=yields,yname=c("height","weight"),
           xname=".",modeltype=c("lm","glm"), drop_non_numeric = TRUE)

Extraction of Model Information

To extract information about a given model, we can use extract_model_info as follows.

extract_model_info(lm_model, "r2")

To extract the adjusted R squared:

extract_model_info(lm_model, "adj_r2")

For the p value:

extract_model_info(lm_model, "p_value")

To extract multiple attributes:

extract_model_info(lm_model,c("p_value","response","call","predictors"))

This is not restricted to linear models but will work for most model types. See help(extract_model_info) to see currently supported model types.

Correlations

get_var_corr

As can probably(hopefully) be guessed from the name, this provides a convenient way to get variable correlations. It enables one to get correlation between one variable and all other variables in the data set.

Previously, one would set get_all to TRUE if they wanted to get correlations between all variables. This argument has been dropped in favor of simply supplying an optional other_vars vector if one does not want to get all correlations.

Sample usage:

# getall correlations

# default pearson

head( corrs <- get_var_corr(mtcars,comparison_var="mpg") )

Previously, one would also set drop_columns to TRUE if they wanted to drop factor columns. Now, a user simply provides a character vector specifying which column types(classes) should be dropped. It defaults to c("character","factor").

# purely demonstrative
get_var_corr(yields,"height",other_vars="weight",
             drop_columns=c("factor","character"),method="spearman",
             exact=FALSE)

Similarly, get_var_corr_ (note the underscore at the end) provides a convenient way to get combination-wise correlations.

head(get_var_corr_(yields),6)

To use only a subset of the data, we can use provide a list of columns to subset_cols. By default, the first value(vector) in the list is mapped to comparison_var and the other to other_Var. The list is therefore of length 2.

head(get_var_corr_(mtcars,subset_cols=list(c("mpg","vs"),c("disp","wt")),
                   method="spearman",exact=FALSE))

plot_corr

Obtaining correlations would mostly likely benefit from some form of visualization. plot_corr aims to achieve just that. There are currently two plot styles, squares and circles. circles has a shape argument that can allow for more flexibility. It should be noted that the correlation matrix supplied to this function is an object produced by get_var_corr_.

To modify the plot a bit, we can choose to switch the x and y values as shown below.

plot_corr(mtcars,show_which = "corr",
          round_which = "correlation",decimals = 2,x="other_var",  y="comparison_var",plot_style = "squares"
          ,width = 1.1,custom_cols = c("green","blue","red"),colour_by = "correlation")

To show significance of the results instead of the correlations themselves, we can set show_which to "signif" as shown below. By default, significance is set to 0.05. You can override this by supplying a different signif_cutoff.

# color by p value
# change custom colors by supplying custom_cols
# significance is default 
set.seed(233)
plot_corr(mtcars, x="other_var", y="comparison_var",plot_style = "circles",show_which = "signif", colour_by = "p.value", sample(colours(),3))

To explore more options, please take a look at the documentation.

Extra Functions

agg_by_group

As can be guessed from the name, this function provides an easy way to manipulate grouped data. We can for instance find the number of observations in the yields data set. The formula takes the form x~y where y is the grouping variable(in this case normal). One can supply a formula as shown next.

head(agg_by_group(yields,.~normal,length))

head(agg_by_group(mtcars,cyl~hp+vs,sum))

rowdiff

This is useful when trying to find differences between rows. The direction argument specifies how the subtractions are made while the exclude argument is used to specify classes that should be removed before calculations are made. Using direction="reverse" performs a subtraction akin to x-(x-1) where x is the row number.

head(rowdiff(yields,exclude = "factor",direction = "reverse"))

na_replace

This allows the user to conveniently replace missing values. Current options are ffill which replaces with the next non-missing value, samples that samples the data and does replacement, value that allows one to fill NAs with a specific value. Other common mathematical methods like min, max,get_mode, sd, etc are no longer supported. They are now available with more flexibility in standalone mde

head(na_replace(airquality, how="value", value="Missing"),8)

na_replace_grouped

This provides a convenient way to replace values by group.

test_df <- data.frame(A=c(NA,1,2,3), B=c(1,5,6,NA),groups=c("A","A","B","B"))
# Replace NAs by group
# replace with the next non NA by group.
na_replace_grouped(df=test_df,group_by_cols = "groups",how="ffill")

The use of mean,sd,etc is no longer supported. Use mde instead which is focused on missingness.

Exploring Further

The vignette has been short and therefore is non exhaustive. The best way to explore this and any package or language is to practise. For more examples, please use ?function_name and see a few implementations of the given function.

Reporting Issues

If you would like to contribute, report issues or improve any of these functions, please raise a pull request at (manymodelr)