analyzeUJ: Analyze User Journey
In LoneWolf6/UJ-Analysis: Creation and Analysis of User Journey Data


This function applies multiple machine learning techniques for regression or classification. It can select the features that contribute to prediction performance. Every technique is evaluated by RMSE/MAE for regression or by ROC/AUC for classification; these measures are returned together with the predictions and effects of each model. Note that if the data include categorical variables, features with too few levels are excluded automatically. You can check which features were actually used in the resulting list.
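The exclusion of categorical features with too few levels can be pictured with a minimal base R sketch. The helper `drop_sparse_factors()` and its `min_levels` threshold of 2 are assumptions made for illustration; the source does not state the rule that analyzeUJ() applies internally.

```r
# Hypothetical helper, not part of the package: drop categorical columns
# that have fewer than `min_levels` distinct non-missing values.
drop_sparse_factors <- function(df, min_levels = 2) {
  is_cat <- vapply(df, function(x) is.factor(x) || is.character(x), logical(1))
  n_lev <- vapply(df, function(x) length(unique(x[!is.na(x)])), integer(1))
  keep <- !is_cat | n_lev >= min_levels  # numeric columns are always kept
  df[, keep, drop = FALSE]
}

toy <- data.frame(
  age = c(31, 45, 27, 52),
  country = c("DE", "DE", "DE", "DE"),        # a single level only
  device = c("phone", "web", "phone", "web"),
  stringsAsFactors = TRUE
)

kept <- names(drop_sparse_factors(toy))
print(kept)
```

Here `country` is dropped because it has a single level, while `age` and `device` remain.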

analyzeUJ(input, target, type, firstFeats, lastFeats, sumFeats, method, PCA, PCAOnly, comps, task, interval, crossValidation, split, folds, optPara, missing, imp, perc, percEx, exFeat, ROC, proba, scale, holdout, holdoutSize)

`input`	an object of type data.frame. Preferably the output of reshapeData() without missing values.
`target`	a mandatory character string defining the dependent variable.
`type`	a mandatory character string defining the type of analysis. This depends on the setting of the study to be analyzed. If exactly one target value is available for each individual, set type to 'aggregate'. In this case, the target variable in the input should be repeated for all observations belonging to the user. All observations of each individual are then aggregated (default: mean for numeric variables and mode for categorical variables) and utilized for prediction. If target values exist for each individual and each point in time, set type to 'cont'. Default is 'cont'.
`firstFeats`	an optional character string defining the variables that should not be aggregated by mean/mode if type='aggregate'. Here, the first value is utilized and the remaining values are ignored.
`lastFeats`	an optional character string defining the variables that should not be aggregated by mean/mode if type='aggregate'. Here, the last value is utilized and the remaining values are ignored.
`sumFeats`	an optional character string defining the variables that should not be aggregated by mean/mode if type='aggregate'. Here, the sum of the corresponding feature is utilized.
`method`	an optional character string defining the methods to be run in this analysis. Can be 'all', 'svm', 'lm', 'log', 'lasso', 'ridge', or 'treeBoost'. Default is 'all'. all: all possible methods are executed. svm: support vector machine according to the svm function. Depending on the specified task, the type is C-classification or eps-regression with default parameters. lm: linear regression according to the lm function. log: logistic regression according to the glm function. lasso: lasso regression according to the glmnet function. The optimal lambda is found by cross-validation with a predefined lambda sequence of 10^seq(10,-2,length=1000). ridge: ridge regression according to the glmnet function. The optimal lambda is found by cross-validation with the same predefined lambda sequence of 10^seq(10,-2,length=1000). treeBoost: extreme gradient boosting trees according to the xgboost function with a maximum depth of 3 and a maximum of 1000 boosting iterations; all other parameters are left at their defaults. Multiple parameters are tuned if optPara is True (for more information, see optPara).
`PCA`	an optional logical value True or False. Defines whether Principal Component Analysis (PCA) is executed. PCA is applied to numeric predictors only.
`PCAOnly`	an optional logical value True or False. Defines if only principal components are utilized as features (True) for the predictive models or if the non-numeric features are also included (False). Default is False.
`comps`	an optional non-negative numeric value defining the number of components to utilize. If not specified, components are added until 99 percent of the variance is explained.
`task`	an optional character string defining the task of analysis. Can be either 'classification' or 'regression' and should be set according to the goal of the analysis. Classification currently only works for binary classification. Default is 'regression'.
`interval`	an optional non-negative numeric value defining the interval for the analysis in days. By default, all points in time are considered.
`crossValidation`	an optional logical value True or False. Defines if cross-validation is executed. Default is True.
`split`	an optional non-negative numeric value defining the split into training and test data if crossValidation is False. Specify the percentage of the data to be used for training. The rest will be utilized as test data. Default is 70.
`folds`	an optional non-negative numeric value defining the folds for k-fold cross validation. Default is 10.
`optPara`	an optional logical value True or False. Defines if some specific machine learning techniques will be tuned before application. Can increase run time tremendously. Default is False. svm: grid-based search for the cost (10^seq(-5,5,0.1)) and gamma (2^(-3:3)) parameters with a radial basis kernel. treeBoost: grid-based search for eta, max_depth, gamma, colsample_bytree, and min_child_weight.
`missing`	an optional value defining whether and how missing values are imputed. Can be False, "median/mode", or "knn". If median/mode, categorical variables are imputed by mode and numeric variables by median according to the impute function. If knn, missing values are imputed by the k-Nearest Neighbour method according to the kNN function with default values. The knn method can increase computation time depending on the number of observations and features. Default is False. However, if missing values are still detected, median/mode is applied. After the imputation procedure, complete cases are used for analysis.
`imp`	an optional logical value True or False. If missing values exist, this parameter defines if imputation should be executed before the split into training and test set or after the split on each corresponding training and test set. Default is True and imputation will be executed before the split (also affects holdout set if specified).
`perc`	an optional non-negative numeric value. If missing is not False or missing values are detected, this value defines the percentage of missing values above which features are excluded from the analysis. Default is none.
`percEx`	an optional character string defining the variables that should not be deleted even if their amount of missing values is higher than the provided percentage (perc). Default is none.
`exFeat`	an optional character string defining the features not to be included in the analysis. Each feature has to be specified by the name of its column.
`ROC`	an optional logical value True or False. If True, ROC curve will be included in output if task is classification and only two labels exist. Default is True.
`proba`	an optional numeric value between 0 and 1. If task is classification, this is the threshold for classifying the observations. The confusion matrix is based on this threshold. Default is .5.
`scale`	an optional logical value True or False. Defines if numeric features are scaled. Default is True.
`holdout`	an optional logical value True or False. Defines if a holdout set should be set aside. Default is False.
`holdoutSize`	a numeric value for the percentage of users in the holdout set, i.e., 25 for 25%. Default is 20.
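The per-user aggregation described for type='aggregate' can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the toy data frame `journey`, the helpers `mode_value()` and `aggregate_user()`, and the choice to treat `plan` like a lastFeats entry and `revenue` like a sumFeats entry are all hypothetical.

```r
# Most frequent value; ties resolve to the value that appears first.
mode_value <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Toy long-format data: several observations per user id.
journey <- data.frame(
  id      = c(1, 1, 1, 2, 2),
  clicks  = c(2, 4, 6, 1, 3),
  device  = c("web", "phone", "phone", "web", "web"),
  plan    = c("free", "free", "paid", "paid", "paid"),
  revenue = c(0, 5, 10, 20, 0),
  stringsAsFactors = FALSE
)

# Collapse all observations of one user into a single row.
aggregate_user <- function(d) {
  data.frame(
    id      = d$id[1],
    clicks  = mean(d$clicks),        # default for numeric features
    device  = mode_value(d$device),  # default for categorical features
    plan    = d$plan[nrow(d)],       # handled like a lastFeats entry
    revenue = sum(d$revenue),        # handled like a sumFeats entry
    stringsAsFactors = FALSE
  )
}

agg <- do.call(rbind, lapply(split(journey, journey$id), aggregate_user))
print(agg)
```

A firstFeats-style feature would use `d$plan[1]` in place of `d$plan[nrow(d)]`.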

Function returns a list containing the following components:

`input`	A data frame consisting of the utilized input data for analysis
`fold`	A list consisting of the indices for each fold of k-fold cross-validation
`TargetObservations`	A numeric vector containing the target observations
`TargetPredictions`	A numeric vector containing the target predictions for each executed method
`performance`	Regression: A data frame consisting of the cross-validated prediction error, indicated by mean absolute error and root mean square error, for all specified methods and an additional "mean model", which utilizes the mean of the training data as prediction. These performance measures are included for each fold and averaged across folds. Classification: A list consisting of confusion matrices (for each fold and summed up), values for plotting the receiver operating characteristic (ROC) curves, and area under the curve (AUC) values (for each fold and averaged) according to the performance function (if ROC is True) for each executed method.
`results`	A list containing the trained methods based on all data
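To make the regression part of `performance` more concrete, the following base R sketch computes MAE and RMSE per fold and averaged across folds for a linear model and a "mean model" baseline. It assumes simulated data and a simple random fold assignment; the package itself lists createFolds under See Also, and the names `per_fold` and `averaged` are hypothetical rather than the structure analyzeUJ() returns.

```r
set.seed(1)
n <- 100
x <- rnorm(n)
d <- data.frame(x = x, y = 2 * x + rnorm(n, sd = 0.5))  # simulated data

k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))  # random fold assignment

mae  <- function(obs, pred) mean(abs(obs - pred))
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

per_fold <- do.call(rbind, lapply(seq_len(k), function(i) {
  train <- d[fold_id != i, ]
  test  <- d[fold_id == i, ]
  fit <- lm(y ~ x, data = train)
  pred_lm   <- predict(fit, newdata = test)
  pred_mean <- rep(mean(train$y), nrow(test))  # "mean model" baseline
  data.frame(
    fold   = i,
    method = c("lm", "mean"),
    MAE    = c(mae(test$y, pred_lm), mae(test$y, pred_mean)),
    RMSE   = c(rmse(test$y, pred_lm), rmse(test$y, pred_mean))
  )
}))

# Average the fold-wise errors for each method.
averaged <- aggregate(cbind(MAE, RMSE) ~ method, data = per_fold, FUN = mean)
print(averaged)
```

Comparing each method against the mean model shows how much of the error reduction comes from the predictors rather than from the overall level of the target.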

createFolds, impute, glm, glmnet, svm, kNN

# create data frame with mandatory columns
data = data.frame('id'=rep(c(1:5), each=600),
                  'type'=rep(c('Var1', 'Var2', 'Var3'), times=1000),
                  'value'=rep(c(1:5), times=600),
                  'date'=rep(seq(as.Date("2000/1/1"), by = "day", length.out=60), each=50))

# use function to create the rectangular version of the user journey
dat = reshapeData(data, parallel=TRUE, cores=2, na.rm=FALSE)

# use function to analyze data
res = analyzeUJ(dat, target="Var1", missing="median/mode", task="regression")