ps_randomForest: ps_randomForest
In benmarwick/karon: Compositional Data Analysis of Archaeological Artefacts

ps_randomForest

R Documentation

ps_randomForest

Description

Implements a random forest analysis of source data, and predicts sources of unknowns if requested

Usage

ps_randomForest(
  doc = "ps_randomForest",
  data,
  GroupVar,
  Groups = "All",
  AnalyticVars,
  sourceID = " ",
  Ntrees = 500,
  NvarUsed = NA,
  Seed = 11111,
  digitsImportance = 1,
  plotErrorRate = TRUE,
  plotImportance = TRUE,
  predictSources = FALSE,
  predictData = NA,
  unknownID = " ",
  plotSourceProbs = TRUE,
  folder = " "
)

Arguments

`doc`	Documentation for the function use added to model usage, default value is the function name
`data`	A data frame with the data used to grow trees (source data if predictions are made)
`GroupVar`	The name of variable defining groups, grouping is required
`Groups`	A vector of codes for groups to be used, 'All' if use all groups
`AnalyticVars`	A vector with names (character-valued) of the analytic variables
`sourceID`	If not " " (the default), the name of the variable with sample ID for source data
`Ntrees`	The number of trees grown, default value of 500 is that for the randomForest function
`NvarUsed`	If not NA (the default), the number of variables to use in each random forest call to rpart; if NA, rpart uses the default value for randomForest() (the square root of the number of candidate variables)
`Seed`	If not NA, a random number generator seed to produce reproducible results; default value is 11111
`digitsImportance`	The number of significant digits for the importance measure, default is 1
`plotErrorRate`	Logical, whether to show the error rate plot, default is TRUE
`plotImportance`	Logical, whether to show the plot of variable importance, default is TRUE
`predictSources`	Logical; if T, predict sources for the data in predictData; default is FALSE
`predictData`	A data frame or matrix with data used to predict sources for observations, must contain all variables in AnalyticVars_
`unknownID`	if not " " (the default), the name of the variable with the sample ID for artifact data
`plotSourceProbs`	Logical, if TRUE (the default) and predictSources=TRUE, show box plots of source probabilities
`folder`	The path to the folder in which data frames will be saved; default is " "

Details

The function implements a random forest analysis using the R function randomForest(). If predictSources and plotSourceProbs are TRUE, the function creates two box plots. The first plot shows, for each source, the set of probabilities of assignment to that source for the observations assigned to that source (all of these probabilities should be large). The second plot shows, for each source, the set of probabilities of assignment to that source for the observations not assigned to that source (for each source, there is one such probability for observation); these probabilities should be relatively small, and some should be zero. See the vignette for more details and examples of these plots.

Value

The function returns a list with the following components:

usage: A string with the contents of the argument doc, the date run, the version of R used
dataUsed: The contents of the argument data restricted to the groups used
sourcesNA: A data frame with data from the data frame data with missing values, NÁ if no missing values
analyticVars: A vector with the value of the argument AnalyticVars
params: A list with the values of the grouping, logical, and numeric arguments
formulaRf: The formula used in the analysis (the variables specified in the argument AnalyticVars separated by + signs)
forest: A summary of the random forest call, estimated error rate, and confusion matrix
importance: A data frame with information on the importance of each variable in AnalyticVars
confusion: A data frame with the estimate of the confusion matrix
predictedData: A data frame with the artifact data used for predictions; if there is missing data, after imputation of the missing data
predictedNA: A data frame with the observations for which missing data were imputed; NA if there are no missing data
predictedSources: A data frame with prediction information, sample ID (if requested), and values of AnalyticVars
predictedTotals: A vector with the predicted totals for each group (source)
impError: The estimated OOB (out of bag) error for imputed predictor data; NA if no imputed data
location: The value of the parameter folder

Examples

data(ObsidianSources)
analyticVars<-c("Rb","Sr","Y","Zr","Nb")
save_randomForest <- ps_randomForest(data=ObsidianSources, GroupVar="Code",Groups="All",
  sourceID="ID", AnalyticVars=analyticVars, NvarUsed=3, plotSourceProbs=FALSE)
#
# predicted sources for artifacts
data(ObsidianSources)
data(ObsidianArtifacts)
analyticVars<-c("Rb","Sr","Y","Zr","Nb")
save_randomForest <- ps_randomForest(data=ObsidianSources, GroupVar="Code",Groups="All",
AnalyticVars=analyticVars, sourceID="ID", NvarUsed=3, plotErrorRate=FALSE,
plotImportance=FALSE, predictSources=TRUE, predictData=ObsidianArtifacts, unknownID="ID",
 plotSourceProbs=TRUE)

benmarwick/karon documentation built on July 29, 2023, 10:11 a.m.