neuralnetwork: Train a Neural Network


View source: R/interface.R

Description

Define and train a Multilayer Neural Network for regression or classification.

Usage

neuralnetwork(X, Y, hidden.layers, regression = FALSE,
  standardize = TRUE, loss.type = "log", huber.delta = 1,
  activ.functions = "tanh", step.H = 5, step.k = 100,
  optim.type = "sgd", learn.rates = 1e-04, L1 = 0, L2 = 0,
  sgd.momentum = 0.9, rmsprop.decay = 0.9, adam.beta1 = 0.9,
  adam.beta2 = 0.999, n.epochs = 100, batch.size = 32,
  drop.last = TRUE, val.prop = 0.1, verbose = TRUE)

Arguments

X

matrix with explanatory variables

Y

matrix with dependent variables. For classification this should be a one-column matrix containing the classes; the classes will be one-hot encoded.
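As an illustration of what the one-hot encoding looks like (a base-R sketch, not the package's internal code):

```r
# Each class becomes one indicator column; every row has exactly one 1.
y <- c("setosa", "versicolor", "setosa", "virginica")
one_hot <- model.matrix(~ y - 1)          # "- 1" drops the intercept column
colnames(one_hot) <- levels(factor(y))
one_hot                                   # 4 x 3 indicator matrix
```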

hidden.layers

vector specifying the number of nodes in each layer. The number of hidden layers in the network is implicitly defined by the length of this vector. Set hidden.layers to NA for a network with no hidden layers

regression

logical indicating regression or classification

standardize

logical indicating if X and Y should be standardized before training the network. Recommended to leave at TRUE for faster convergence.
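Standardization centers each column at zero mean and scales it to unit standard deviation. A base-R sketch of the effect (the package performs this internally):

```r
# Standardize the columns of a matrix with scale()
set.seed(1)
X  <- matrix(rnorm(20, mean = 5, sd = 2), ncol = 2)
Xs <- scale(X)

round(colMeans(Xs), 10)   # column means are (numerically) zero
apply(Xs, 2, sd)          # column standard deviations are exactly one
```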

loss.type

which loss function should be used. Options are "log", "quadratic", "absolute", "huber" and "pseudo-huber"

huber.delta

used only in case of loss functions "huber" and "pseudo-huber". This parameter controls the cut-off point between quadratic and absolute loss.
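To make the role of the cut-off concrete, the Huber loss is quadratic for errors within delta of zero and linear (absolute) beyond it. An illustrative R sketch, not the package's internal implementation:

```r
# Huber loss: quadratic for |error| <= delta, linear beyond it
huber_loss <- function(error, delta = 1) {
  ifelse(abs(error) <= delta,
         0.5 * error^2,                       # quadratic region
         delta * (abs(error) - 0.5 * delta))  # absolute (linear) region
}

huber_loss(0.5)   # quadratic region: 0.5 * 0.5^2 = 0.125
huber_loss(3)     # linear region: 1 * (3 - 0.5) = 2.5
```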

activ.functions

character vector of activation functions to be used in each hidden layer. Possible options are 'tanh', 'sigmoid', 'relu', 'linear', 'ramp' and 'step'. Should be either the size of the number of hidden layers or equal to one. If a single activation type is specified, this type will be broadcast across the hidden layers.

step.H

number of steps of the step activation function. Only applicable if activ.functions includes 'step'

step.k

parameter controlling the smoothness of the step activation function. Larger values lead to a less smooth step function. Only applicable if activ.functions includes 'step'.

optim.type

type of optimizer to use for updating the parameters. Options are 'sgd', 'rmsprop' and 'adam'. SGD is implemented with momentum.

learn.rates

the size of the steps to make in gradient descent. If set too large, the optimization might not converge to optimal values. If set too small, convergence will be slow. Should be either the size of the number of hidden layers plus one or equal to one. If a single learn rate is specified, this learn rate will be broadcast across the layers.

L1

L1 regularization. Non-negative number. Set to zero for no regularization.

L2

L2 regularization. Non-negative number. Set to zero for no regularization.

sgd.momentum

numeric value specifying how much momentum should be used. Set to zero for no momentum, otherwise a value between zero and one.

rmsprop.decay

level of decay in the rms term. Controls the strength of the exponential decay of the squared gradients in the term that scales the gradient before the parameter update. Common values are 0.9, 0.99 and 0.999
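The decay parameter governs an exponential moving average of the squared gradients. A minimal sketch of that accumulator (illustrative only; the package's optimizer does this internally):

```r
# RMSprop accumulator: exponential moving average of squared gradients
decay <- 0.9
v <- 0
for (g in c(1, 1, 1)) {
  v <- decay * v + (1 - decay) * g^2   # v slowly approaches mean(g^2)
}
v   # 0.271 after three unit gradients; higher decay = slower adaptation
```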

adam.beta1

level of decay in the first moment estimate (the mean). The recommended value is 0.9

adam.beta2

level of decay in the second moment estimate (the uncentered variance). The recommended value is 0.999

n.epochs

the number of epochs to train. This parameter largely determines the training time (one epoch is a single iteration through the training data).

batch.size

the number of observations to use in each batch. Batch learning is computationally faster than stochastic gradient descent. However, large batches might not result in optimal learning; see Efficient Backprop by LeCun et al. for details.

drop.last

logical. Only applicable if the size of the training set is not perfectly divisible by the batch size. Determines if the last chosen observations should be discarded (in the current epoch) or should constitute a smaller batch. Note that a smaller batch leads to a noisier approximation of the gradient.
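The effect on the number of batches per epoch can be seen with simple arithmetic (illustrative sketch; the package handles batching internally):

```r
# With 100 observations and batch size 32, 100 %% 32 = 4 observations remain
n_obs      <- 100
batch_size <- 32

floor(n_obs / batch_size)    # drop.last = TRUE: 3 batches, leftover discarded
ceiling(n_obs / batch_size)  # drop.last = FALSE: 4 batches, last one smaller
```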

val.prop

proportion of training data to use for tracking the loss on a validation set during training. Useful for assessing the training process and identifying possible overfitting. Set to zero for only tracking the loss on the training data.

verbose

logical indicating if additional information should be printed

Details

A generic function for training Neural Networks for classification and regression problems. Various types of activation and cost functions are supported, as well as L1 and L2 regularization. Supported optimizers include SGD (with or without momentum), RMSprop and Adam.

Value

An ANN object. Use function plot(&lt;object&gt;) to assess the loss on training and (optionally) validation data during the training process. Use function predict(&lt;object&gt;, &lt;newdata&gt;) for prediction.

References

LeCun, Yann A., et al. "Efficient backprop." Neural networks: Tricks of the trade. Springer Berlin Heidelberg, 2012. 9-48.

Examples

# Example on iris dataset:

# Plot full data
plot(iris, pch = as.numeric(iris$Species))

# Prepare test and train sets
random_draw <- sample(1:nrow(iris), size = 100)
X_train     <- iris[random_draw, 1:4]
Y_train     <- iris[random_draw, 5]
X_test      <- iris[setdiff(1:nrow(iris), random_draw), 1:4]
Y_test      <- iris[setdiff(1:nrow(iris), random_draw), 5]

# Train neural network on classification task
NN <- neuralnetwork(X = X_train, Y = Y_train, hidden.layers = c(5, 5),
                    optim.type = 'adam', learn.rates = 0.01, val.prop = 0)

# Plot the loss during training
plot(NN)

# Make predictions
Y_pred <- predict(NN, newdata = X_test)

# Plot predictions
plot(X_test, pch = as.numeric(Y_test), col = (Y_test == Y_pred$predictions) + 2)

bflammers/ANN2 documentation built on Oct. 27, 2018, 12:17 a.m.