An R package for implementing the "Trichotomous Classification Tree (TriCTree)" algorithm.
Reference: Zhu Y and Fang J (2016). Logistic regression-based trichotomous classification tree and its application in medical diagnosis. Med Decis Making, 36(8):973-89. doi: 10.1177/0272989X15618658.
License: MIT
The latest version of the TriCTree package is available on GitHub at taotliu/TriCTree. Installation requires the devtools package. If devtools is not installed in your R environment, install it first with install.packages("devtools"). Then run the following code to install the TriCTree package.
install.packages("devtools")
library(devtools)
devtools::install_github("taotliu/TriCTree")
library(TriCTree)
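To avoid reinstalling on every session, the installation can be guarded by a check for each package. The following is a minimal sketch; the helper names needs_install and install_trictree are hypothetical and are not part of TriCTree or devtools.

```r
# Hypothetical helper: TRUE when a package is not yet installed.
needs_install <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

# Hypothetical wrapper: installs devtools and TriCTree only when missing.
# Nothing is downloaded until the function is called.
install_trictree <- function() {
  if (needs_install("devtools")) install.packages("devtools")
  if (needs_install("TriCTree")) devtools::install_github("taotliu/TriCTree")
  invisible(TRUE)
}
```

Call install_trictree() once, then load the package with library(TriCTree) as usual.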
The following R code example demonstrates the use of the TriCTree package.
The simulate_normal() function generates a dataset of 20 variables with a specified mean vector (mean) and a common correlation (p) between each pair of variables. By default the dataset has 800 rows; this can be changed by setting the number of datasets n, with 20 observations in each dataset.
By default, the mean vector for observations labeled 0 is 0 for every variable, and the mean vector for observations labeled 1 is (0, 0, 0, 0, 0.4, 0.4, 0.4, 0.4, 0.8, 0.8, 0.8, 0.8, 1.2, 1.2, 1.2, 1.2, 1.6, 1.6, 1.6, 1.6).
> dat = simulate_normal(p = 0.8, mean = rep(0, 20), n = 20)
The output dat is an 800x21 matrix whose first column gives the label of each observation (0 or 1). It contains 400 observations labeled 0 and 400 labeled 1.
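To make the data layout concrete, here is a base-R sketch that produces a matrix of the same shape. It is not the package's implementation of simulate_normal(): the function name sketch_simulate_normal, its arguments, the unit variances, and the row order (all label-0 rows first) are assumptions made for illustration.

```r
# Sketch only: NOT the TriCTree implementation of simulate_normal().
# Draws n_per_class rows per label from a multivariate normal with
# unit variances and a common pairwise correlation p.
sketch_simulate_normal <- function(n_per_class = 400, p = 0.8,
                                   mean0 = rep(0, 20),
                                   mean1 = rep(c(0, 0.4, 0.8, 1.2, 1.6), each = 4)) {
  k <- length(mean0)
  sigma <- matrix(p, k, k)       # equicorrelation matrix
  diag(sigma) <- 1
  r <- chol(sigma)               # upper triangular, sigma = t(r) %*% r
  draw <- function(mu) {
    z <- matrix(rnorm(n_per_class * k), n_per_class, k)
    sweep(z %*% r, 2, mu, "+")   # add the mean vector column by column
  }
  # First column is the label, followed by the 20 variables.
  rbind(cbind(0, draw(mean0)), cbind(1, draw(mean1)))
}

set.seed(1)
toy <- sketch_simulate_normal()
dim(toy)         # 800 rows, 21 columns
table(toy[, 1])  # 400 rows per label
```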
The simulate_nonnormal() function generates data from a distribution with specified skewness (skewvec) and kurtosis (kurtvec). The default correlation (0.5) can be changed. The output is a matrix.
> dat = simulate_nonnormal(p=0.5, skewvec=rep(3.5,20), kurtvec=rep(20,20))
The output dat is an 800x21 matrix whose first column gives the label of each observation (0 or 1). It contains 400 observations labeled 0 and 400 labeled 1.
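One way to check the shape of simulated variables is to compute their sample skewness and kurtosis. The helpers below are a base-R sketch with hypothetical names; they use the plain moment estimators, and whether kurtvec refers to kurtosis or excess kurtosis (kurtosis minus 3) is not stated here, so treat that as an open assumption.

```r
# Sample skewness: third central moment over the 3/2 power of the second.
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Sample kurtosis (not excess): fourth central moment over the squared second.
sample_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

x <- c(-2, -1, 0, 1, 2)
sample_skewness(x)  # 0, because x is symmetric around its mean
sample_kurtosis(x)  # 1.7
```

Applied to one label at a time, for example apply(dat[dat[, 1] == 0, -1], 2, sample_skewness), these give one value per variable.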
We use the LRTCT method to build a classification tree. By default, the minimum number of observations separated in each iteration is 5, and the algorithm terminates when fewer than 20 observations remain suspended. The input formula selects the variables that can be used as classification criteria.
> data = simulate_normal(n=40, p = 0.8)
> data1 = data.frame(data[2*(1:400), ])
> param = TriCTree_tripart(X1~.,data1)
param is a list of type lrtct that contains the classification results:
> param[[1]]
[1] "X20" "X4" "X5" "X18"
These are the names of the variables selected as classification criteria.
> param[[2]]
[1]
Call: glm(formula = y ~ ., family = binomial, data = candidate_data)
Coefficients:
(Intercept) X20 X4
-4.314 5.215 -4.447
where param[[2]] is a list of the fitted models (binomial glm fits, as the Call shows). For a detailed explanation, use summary().
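Because the Call above shows a standard glm() fit with family = binomial, the usual glm accessors should apply to each element of param[[2]]. The sketch below fits a comparable model on toy data to show those accessors; the data and the variable names are made up for illustration and do not come from the package.

```r
# Toy data: a 0/1 label and two predictors whose means shift with the label.
set.seed(2)
n <- 200
y <- rep(c(0, 1), each = n / 2)
toy <- data.frame(y = y,
                  X2 = rnorm(n, mean = 0.5 * y),
                  X3 = rnorm(n, mean = 1.0 * y))

# Logistic regression on all remaining columns, as in the Call shown above.
fit <- glm(y ~ ., family = binomial, data = toy)

coef(fit)                  # intercept and one coefficient per predictor
summary(fit)$coefficients  # estimates, standard errors, z values, p-values
head(fitted(fit))          # fitted probabilities that y equals 1
```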
> param[[3]]
[,1] [,2]
[1,] -0.79020859 2.0126159
[2,] 0.03459975 0.9695618
[3,] 0.07811542 0.9090882
[4,] 0.20952837 0.9057503
Each row gives the lower bound (column 1) and upper bound (column 2) of the corresponding selection criterion. For a detailed explanation, use summary().
The summary() function displays the classification results and explains the meaning of each parameter.
> summary(param)
Classification variable 1 is X20
if X20 < -0.7902086 ,classification=0
if X20 > 2.012616 ,classification=1
else it enters the next iteration
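The three-way rule printed by summary() can be written as a small function. This is a sketch of the rule as it reads in the output above, not code taken from the package; the name tri_classify is hypothetical, and NA is used here to mark observations that enter the next iteration.

```r
# Three-way split on a single criterion:
#   below the lower bound -> class 0
#   above the upper bound -> class 1
#   otherwise             -> NA (suspended, passed to the next iteration)
tri_classify <- function(x, lower, upper) {
  out <- rep(NA_integer_, length(x))
  out[x < lower] <- 0L
  out[x > upper] <- 1L
  out
}

# Bounds copied from the first row of param[[3]] above.
tri_classify(c(-1.5, 0.3, 2.5), lower = -0.7902086, upper = 2.012616)
# 0 NA 1
```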
This section demonstrates how to carry out the classification.
The TriCTree_predict() function predicts the classification of the observations in a test dataset.
The model argument is the output of the assert() function, which is used to classify the data still suspended in the last layer, and p is the prior probability of type == 0 (0.5 by default).
> dat = simulate_normal(p = 0.8, mean = rep(0, 20), n = 20)
> data1 = data.frame(dat[2*(1:400), ])
> data2 = data.frame(dat[2*(1:400)-1,])
> param = TriCTree_tripart(X1~.,data1)
> model = assert(data1)
> result = TriCTree_predict(model,data2,param)
result is an array containing the predicted classification of each observation in the test dataset.
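The example above uses the even-numbered rows for training and the odd-numbered rows for testing. The sketch below shows the same indexing on a toy data frame so the split can be checked in isolation; the data frame and its column names are hypothetical.

```r
# Toy stand-in for an 800-row dataset: 400 rows per label.
toy <- data.frame(id = 1:800, label = rep(c(0, 1), each = 400))

even_rows <- seq(2, nrow(toy), by = 2)  # same rows as 2*(1:400)
odd_rows  <- seq(1, nrow(toy), by = 2)  # same rows as 2*(1:400)-1

train_set <- toy[even_rows, ]
test_set  <- toy[odd_rows, ]

nrow(train_set)                               # 400
nrow(test_set)                                # 400
length(intersect(train_set$id, test_set$id))  # 0, the halves do not overlap
table(train_set$label)                        # 200 rows per label
```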
Some observations in the test dataset may remain unclassified after the tree has been applied. For these, an assertion of the class is made. Doing so is not recommended in practice, because such observations need closer scrutiny. However, the classification can still be completed when necessary.
> dat = simulate_normal(p = 0.8, mean = rep(0, 20), n = 20)
> model=assert(dat)
The output of the assert() function is a generalized linear model used for complementary classification.
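As an illustration of how a generalized linear model can assign labels to leftover observations, the sketch below fits a binomial glm on toy data and thresholds the predicted probabilities. It is not the package's assert() or TriCTree_predict() logic: the toy data, the variable names, and the 0.5 cutoff are assumptions, and how the package uses the prior p is not shown here.

```r
# Toy training data: one predictor whose mean shifts with the label.
set.seed(3)
n <- 200
y <- rep(c(0, 1), each = n / 2)
train <- data.frame(y = y, X2 = rnorm(n, mean = y))

fit <- glm(y ~ X2, family = binomial, data = train)

# Hypothetical leftover observations that the tree did not classify.
leftover <- data.frame(X2 = c(-3, 0.5, 4))

# Predicted probability of label 1, then a hard label at an assumed 0.5 cutoff.
prob1 <- predict(fit, newdata = leftover, type = "response")
label <- as.integer(prob1 > 0.5)
```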
Tao Liu, PhD tliu@stat.brown.edu