ps_randomForest: ps_randomForest


ps_randomForest    R Documentation

ps_randomForest

Description

Implements a random forest analysis of source data, and predicts sources of unknowns if requested

Usage

ps_randomForest(
  doc = "ps_randomForest",
  data,
  GroupVar,
  Groups = "All",
  AnalyticVars,
  sourceID = " ",
  Ntrees = 500,
  NvarUsed = NA,
  Seed = 11111,
  digitsImportance = 1,
  plotErrorRate = TRUE,
  plotImportance = TRUE,
  predictSources = FALSE,
  predictData = NA,
  unknownID = " ",
  plotSourceProbs = TRUE,
  folder = " "
)

Arguments

doc

A documentation string that is added to the usage component of the returned list; the default value is the function name

data

A data frame with the data used to grow trees (source data if predictions are made)

GroupVar

The name of the variable defining groups; grouping is required

Groups

A vector of codes for the groups to be used; "All" (the default) to use all groups

AnalyticVars

A character vector with the names of the analytic variables

sourceID

If not " " (the default), the name of the variable with sample ID for source data

Ntrees

The number of trees grown; the default value of 500 is also the default for the randomForest() function

NvarUsed

If not NA (the default), the number of variables considered as candidates at each split when growing a tree; if NA, randomForest() uses its default value (the square root of the number of candidate variables)

Seed

If not NA, a random number generator seed to produce reproducible results; default value is 11111

digitsImportance

The number of significant digits for the importance measure, default is 1

plotErrorRate

Logical, whether to show the error rate plot, default is TRUE

plotImportance

Logical, whether to show the plot of variable importance, default is TRUE

predictSources

Logical; if TRUE, predict sources for the data in predictData; default is FALSE

predictData

A data frame or matrix with data used to predict sources for observations; must contain all variables in AnalyticVars; default is NA

unknownID

If not " " (the default), the name of the variable with the sample ID for artifact data

plotSourceProbs

Logical, if TRUE (the default) and predictSources=TRUE, show box plots of source probabilities

folder

The path to the folder in which data frames will be saved; default is " "
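As a small illustration of the NvarUsed default described above: for classification, randomForest() sets its mtry argument to the floor of the square root of the number of predictors. The sketch below assumes that NvarUsed is passed through to mtry (the argument list does not state this explicitly) and uses the built-in iris data rather than the package data.

```r
library(randomForest)

# Five analytic variables, as in the Examples section
analyticVars <- c("Rb", "Sr", "Y", "Zr", "Nb")

# Default number of candidate variables per split for classification
# (assumption: this is the value used when NvarUsed = NA)
defaultMtry <- floor(sqrt(length(analyticVars)))
print(defaultMtry)

# Check the same rule on a fitted forest: iris has 4 predictors
set.seed(11111)
fit <- randomForest(Species ~ ., data = iris, ntree = 50)
print(fit$mtry)
```

With five analytic variables the default is 2, so NvarUsed=3 in the examples asks for one more candidate variable per split than the default.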

Details

The function implements a random forest analysis using the R function randomForest(). If predictSources and plotSourceProbs are both TRUE, the function creates two box plots. The first plot shows, for each source, the probabilities of assignment to that source for the observations assigned to that source (all of these probabilities should be large). The second plot shows, for each source, the probabilities of assignment to that source for the observations not assigned to that source (for each source, there is one such probability for each of these observations); these probabilities should be relatively small, and some should be zero. See the vignette for more details and examples of these plots.
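The two box plots described above can be approximated directly with randomForest(). This is a minimal sketch, not the code used inside ps_randomForest(): it uses the built-in iris data, treats a random hold-out set as the "unknowns", and the object names (sources, unknowns, long) are hypothetical.

```r
library(randomForest)

set.seed(11111)  # plays the same role as the Seed argument

# Illustrative split: 30 held-out rows stand in for artifacts of unknown source
holdout  <- sample(nrow(iris), 30)
sources  <- iris[-holdout, ]
unknowns <- iris[holdout, ]

# Grow the forest on the "source" data
fit <- randomForest(Species ~ ., data = sources, ntree = 500, mtry = 3)

# One row per unknown, one column per source; each row sums to 1
probs <- unclass(predict(fit, newdata = unknowns, type = "prob"))

# Assigned source = column with the largest probability (first one on ties)
assigned <- colnames(probs)[max.col(probs, ties.method = "first")]

# Long format: one row per (observation, source) pair, in column-major order
long <- data.frame(
  source = rep(colnames(probs), each = nrow(probs)),
  prob   = as.vector(probs)
)
long$isAssigned <- long$source == rep(assigned, times = ncol(probs))

# First plot: probabilities for the assigned source (should be large)
boxplot(prob ~ source, data = long[long$isAssigned, ],
        ylim = c(0, 1), main = "Assigned source")

# Second plot: probabilities for the other sources (should be small)
boxplot(prob ~ source, data = long[!long$isAssigned, ],
        ylim = c(0, 1), main = "Sources not assigned")
```

Each unknown contributes one probability to the first plot and one probability per remaining source to the second.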

Value

The function returns a list with the following components:

  • usage: A string with the contents of the argument doc, the date run, and the version of R used

  • dataUsed: The contents of the argument data restricted to the groups used

  • sourcesNA: A data frame with the rows of the argument data that have missing values; NA if there are no missing values

  • analyticVars: A vector with the value of the argument AnalyticVars

  • params: A list with the values of the grouping, logical, and numeric arguments

  • formulaRf: The formula used in the analysis (the variables specified in the argument AnalyticVars separated by + signs)

  • forest: A summary of the random forest call, estimated error rate, and confusion matrix

  • importance: A data frame with information on the importance of each variable in AnalyticVars

  • confusion: A data frame with the estimate of the confusion matrix

  • predictedData: A data frame with the artifact data used for predictions; if there are missing data, the data after imputation of the missing values

  • predictedNA: A data frame with the observations for which missing data were imputed; NA if there are no missing data

  • predictedSources: A data frame with prediction information, sample ID (if requested), and values of AnalyticVars

  • predictedTotals: A vector with the predicted totals for each group (source)

  • impError: The estimated OOB (out of bag) error for imputed predictor data; NA if no imputed data

  • location: The value of the parameter folder

Examples

data(ObsidianSources)
analyticVars <- c("Rb", "Sr", "Y", "Zr", "Nb")
save_randomForest <- ps_randomForest(data = ObsidianSources, GroupVar = "Code", Groups = "All",
  sourceID = "ID", AnalyticVars = analyticVars, NvarUsed = 3, plotSourceProbs = FALSE)
#
# predicted sources for artifacts
data(ObsidianSources)
data(ObsidianArtifacts)
analyticVars <- c("Rb", "Sr", "Y", "Zr", "Nb")
save_randomForest <- ps_randomForest(data = ObsidianSources, GroupVar = "Code", Groups = "All",
  AnalyticVars = analyticVars, sourceID = "ID", NvarUsed = 3, plotErrorRate = FALSE,
  plotImportance = FALSE, predictSources = TRUE, predictData = ObsidianArtifacts, unknownID = "ID",
  plotSourceProbs = TRUE)


benmarwick/karon documentation built on July 29, 2023, 10:11 a.m.