ensemble_randForest: Bagged Ensemble Random Forests Model.


Description

This function creates a bagged random forests model on a given dataset. It effectively acts as a wrapper for the random forest package. The random forests model is a very versatile classification and prediction model that combines many decision trees, each grown on a bootstrap sample of the data with a random subset of features considered at each split.
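The idea of bagging several random forests can be sketched as follows. This is a minimal illustration for numeric prediction only, not the package's actual implementation: it assumes the wrapped package is randomForest from CRAN, and the helper name bag_forests and its arguments are hypothetical.

```r
# Hypothetical sketch: fit n random forests, each on a random subset of rows,
# and average their numeric predictions. Assumes the randomForest package is
# installed; bag_forests() is an illustrative name, not part of this package.
bag_forests <- function(x, y, newdata, n = 10, r = nrow(x), ntree = 500) {
  per_model <- lapply(seq_len(n), function(i) {
    # draw r rows without replacement for this bag
    rows <- sample(nrow(x), size = r, replace = FALSE)
    fit <- randomForest::randomForest(x = x[rows, , drop = FALSE],
                                      y = y[rows], ntree = ntree)
    as.numeric(predict(fit, newdata = newdata))
  })
  # one column per model; the ensemble prediction is the row-wise mean
  rowMeans(do.call(cbind, per_model))
}
```

For a classification response the averaging step would need to be replaced, for example by a majority vote across models.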

Usage

ensemble_randForest(y_index, train, valid_size = NULL, test = NULL,
  type = c("prediction", "classification"), ntree = 500,
  importance = FALSE, n = 10, r = NULL, r_replace = FALSE, c = NULL,
  c_replace = FALSE, seed = TRUE)

Arguments

y_index

A column index representing the response variable of the model.

train

A dataset for the random forest model to be trained on. The column order and names of the training set should match those of the test set exactly.

valid_size

A natural number indicating the number of observations to be randomly sampled from the training data for model validation.

test

A dataset for the random forest model to predict for. The column order and names of the test set should match those of the training set exactly.

type

The type of response variable, either 'prediction' or 'classification'.

n

A natural number indicating the number of random forest models to be built.

r

The number of rows to be bagged. Note that r < nrow(train).

r_replace

A logical object allowing resampling when bagging rows. Default is FALSE.

c

The number of columns to be bagged. Note that c < ncol(train).

c_replace

A logical object allowing resampling when bagging columns. Default is FALSE.

seed

Logical, indicating whether a random seed should be set.

file_name

A character object indicating the file name when saving the data frame. The default is NULL. The name must include the .csv suffix.

directory

A character object specifying the directory where the data frame is to be saved as a .csv file.
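To make the row and column bagging arguments concrete, here is a minimal base R sketch of how the indices for each of the n models might be drawn. It is an illustration under stated assumptions, not the package's code: the helper draw_bags is hypothetical, and it assumes that c counts predictor columns only and that the response column is always retained.

```r
# Hypothetical helper (not part of the package): draw the row and column
# indices used by each of n bagged models.
# Assumption: c counts predictor columns only; the response is always kept.
draw_bags <- function(n_rows, n_cols, y_index, n = 10,
                      r = NULL, r_replace = FALSE,
                      c = NULL, c_replace = FALSE) {
  predictors <- setdiff(seq_len(n_cols), y_index)
  lapply(seq_len(n), function(i) {
    rows <- if (is.null(r)) {
      seq_len(n_rows)                                # no row bagging
    } else {
      sample(n_rows, size = r, replace = r_replace)  # bag r rows
    }
    cols <- if (is.null(c)) {
      predictors                                     # no column bagging
    } else {
      predictors[sample.int(length(predictors), size = c, replace = c_replace)]
    }
    # base::c is called explicitly because the argument c shadows the name
    list(rows = rows, cols = base::c(y_index, cols))
  })
}

# Example: 3 bags of 80 rows and 2 predictor columns from a 150 x 5 dataset
set.seed(1)
bags <- draw_bags(n_rows = 150, n_cols = 5, y_index = 1, n = 3, r = 80, c = 2)
```

Each element of bags could then be used to subset the training data, for example train[bags[[1]]$rows, bags[[1]]$cols].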

Value

Outputs a list of information related to the ensemble random forest model. The first object of the list is a data frame of the response observations, the corresponding predictions and the error associated with each prediction. The second object of the list is a data frame of model performance metrics. The third object of the list is a vector of predictions / classifications for the specified test set.
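As a rough illustration of the first two list elements for a numeric response, the sketch below builds an observed / predicted / error data frame and summarises it. The column names, the sign convention of the error, and the choice of RMSE and MAE as metrics are assumptions made for this example; the source does not state which metrics the function reports.

```r
# Illustrative values only; not output from ensemble_randForest
observed  <- c(5.1, 4.9, 6.3, 5.8)
predicted <- c(5.0, 5.2, 6.0, 5.9)

# First element (assumed layout): observations, predictions and errors
results <- data.frame(observed = observed,
                      predicted = predicted,
                      error = observed - predicted)

# Second element (assumed metrics): root mean squared and mean absolute error
metrics <- data.frame(rmse = sqrt(mean(results$error^2)),
                      mae  = mean(abs(results$error)))
```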

See Also

ensemble_mars, ensemble_mlr

Examples

# Example 1
# Numeric prediction example with Iris Data
data <- iris[sample(1:150, size = 150, replace = FALSE),]
# subset training data
train_data <- data[1:150,-5]
# bag rows only
ensemble_randForest(y_index = 1, train = train_data, valid_size = 50, n = 10, r = 80, type = "prediction")
# bag columns only
ensemble_randForest(y_index = 1, train = train_data, valid_size = 50, n = 10, c = 2, type = "prediction")
# bag rows and columns
ensemble_randForest(y_index = 1, train = train_data, valid_size = 50, n = 10, r = 80, c = 2, type = "prediction")

# Example 2
# Binomial classification example with iris data
data <- iris[sample(1:150, size = 150, replace = FALSE),]
# Dummy encode the Species
data <- derive_variables(dataset = data, type = "dummy", integer = TRUE, return_dataset = TRUE)
# Convert the response variable into a binary factor with two classes
data$Species_setosa <- as.factor(data$Species_setosa)
# Extract the test data (rows 101 to 150, predictors only)
test <- data[101:150,c(1,2,3,4,6,7)]
# move Species_setosa to the front of the data frame
data <- data[,c(5,1,2,3,4,6,7)]
# fit a model using the default row and column settings
ensemble_randForest(y_index = 1, train = data, valid_size = 50, n = 10, type = "classification")
# fit a model with row bagging
ensemble_randForest(y_index = 1, train = data, valid_size = 50, n = 20, r = 80, type = "classification")
# fit a model and predict for the test set
ensemble_randForest(y_index = 1, train = data, valid_size = NULL, test = test, n = 10, type = "classification")

# Example 3
# Validation set with multinomial class prediction example using iris
# Note: this example calls glmnet directly, so the glmnet package must be loaded
library(glmnet)
data <- iris
data <- data[sample(1:150, size = 150, replace = FALSE),c(5,1,2,3,4)]
# plots (Species has three classes, so the multinomial family is used)
ense <- glmnet(y = as.vector(data[,1]), x = as.matrix(data[,-1]), alpha = 1, family = "multinomial")
plot(ense)
ense <- cv.glmnet(y = as.vector(data[,1]), nfolds = 10, x = as.matrix(data[,-1]), alpha = 1, family = "multinomial")
plot(ense)
# raw calculations: train on rows 1 to 100, predict classes for rows 101 to 150
predict(object = cv.glmnet(y = as.vector(data[1:100,1]), x = as.matrix(data[1:100,-1]), alpha = 1, family = "multinomial"), newx = as.matrix(data[101:150,-1]), type = "class")

# Example 4
# Test set with numeric prediction using titanic
descriptive_statistics(dataset = titanic)
str(titanic)
data <- titanic[,c(6,2,3,7,8,10)]
data <- data[order(data$Age),]
data$Age
train <- data[1:714, ]
valid_size <- 50
test <- data[715:891, ]
ensemble_glmnet(y_index = 1, train = train, valid_size = valid_size, test = test, n = 10, r = 600, c = 4)
ensemble_glmnet(y_index = 1, train = train, valid_size = valid_size, n = 10)

# Example 5
# Poisson prediction with iris data
counts <- rpois(n = 150, lambda = 3)
data <- iris[sample(1:150, 150, FALSE), ]
data <- cbind(counts, data)
train <- data[1:100,]
# test <- data[101:150,]
test <- data[101:150, -1]
ensemble_glmnet(y_index = 1, train = train, test = test, valid_size = 50, family = "poisson", type = "response")

oislen/BuenaVista documentation built on May 16, 2019, 8:12 p.m.