To carry out a search partition analysis (SPAN)

Description

To carry out a search partition analysis (SPAN)

Usage

spanr(formula, weight = NA, data = NULL, cc = FALSE, makepos = TRUE,
  beta = NA, size = c(2, 2, 1), gamma = NA)

Arguments

formula

A formula of the standard form y ~ x + u + v + w + ..., giving the outcome y and predictor covariates x, u, v, w, .... Operators other than + should not be used. A survival object is allowed for y, for example Surv(time, death) ~ x + u + v + w + ..., in which case optimization is with respect to log-rank chi-square survival differences.

data

A data frame with the variables in the formula.

weight

A frequency weight attached to each row of data. Default, NA, indicates unit weight to each data row.

cc

Indicates complete case analysis. If TRUE, a row of data is deleted if any one attribute is missing. Otherwise, a case is deleted only if an attribute is missing in a Boolean combination as it is evaluated during a search. Default is FALSE.

makepos

If TRUE and an attribute x is found to be negative, the direction of x is reversed. The attribute is reversed if mean of y|x=1 < mean of y|x=0. When y is a survival object, it is reversed if rate|x=1 < rate|x=0, where rate = cases/person-time. Default is TRUE.
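The reversal rule for a non-survival outcome can be sketched as follows. This is only an illustration of the rule as described; `make_positive` is a hypothetical helper, not a function of the package, and it assumes x is coded 0/1 with no missing values.

```r
# Hypothetical helper (not part of the package): reverse a 0/1 attribute
# when the mean of y is lower in the x = 1 group than in the x = 0 group.
make_positive <- function(x, y) {
  if (mean(y[x == 1]) < mean(y[x == 0])) 1 - x else x
}

x <- c(1, 1, 0, 0)
y <- c(2, 3, 8, 9)

# mean(y | x = 1) = 2.5 is below mean(y | x = 0) = 8.5, so x is reversed
x_pos <- make_positive(x, y)
```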

beta

Parameter controlling the degree of complexity penalising. Zero gives no complexity penalising. NA (default) or a negative value sets beta automatically to 0.03 times the initial gradient of the complexity hull.

size

Defines the upper allowable size parameters of the disjunctive normal form used in the initial iteration of a search. It is a vector of length q defining p_1, p_2, ..., p_q. The default c(2,2,1) defines p_1=2, p_2=2, and p_3=1.

gamma

Parameter controlling the balance of observations in A and its complement !A. The default, NA, corresponds to no balancing. Balancing multiplies either the MSE reduction or the log-rank chi-square by (P_A(1-P_A))^γ, where P_A is the proportion of data in A, to form a new optimization criterion.
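As a rough illustration of the balancing factor described for gamma, the sketch below applies (P_A(1-P_A))^γ to an already computed criterion value. `balance_criterion` and its arguments are hypothetical names chosen for this example; they are not part of the package.

```r
# Hypothetical helper (not part of the package): multiply a criterion value
# (MSE reduction or log-rank chi-square) by (P_A * (1 - P_A))^gamma.
balance_criterion <- function(criterion, in_A, gamma = NA) {
  if (is.na(gamma)) return(criterion)   # NA: no balancing
  p_A <- mean(in_A)                     # proportion of data in A
  criterion * (p_A * (1 - p_A))^gamma
}

even   <- rep(c(TRUE, FALSE), times = c(5, 5))   # P_A = 0.5
uneven <- rep(c(TRUE, FALSE), times = c(1, 9))   # P_A = 0.1

b_even   <- balance_criterion(4, even,   gamma = 1)   # 4 * 0.25
b_uneven <- balance_criterion(4, uneven, gamma = 1)   # 4 * 0.09
```

With the same raw criterion, the evenly split partition keeps a larger share of its value than the uneven one, which is how the factor favours balanced partitions.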

Details

A function to search for an optimal Boolean combination partition. Optimization is with respect to the reduction in mean square error (MSE) of y obtained by splitting the data into the partition (A,!A) or, if y is a survival object, with respect to the log-rank chi-square for survival differences between A and !A. The Boolean expression for A is output in disjunctive normal form, A = g_1 | g_2 | g_3 | ..., and the Boolean expression for the complement !A is also output in disjunctive normal form, !A = h_1 | h_2 | h_3 | .... Each element of the disjunctive forms, g_i of A or h_i of !A, represents a subgroup. Subgroups are returned as data frames.
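To make the notation concrete, the sketch below builds A = g_1 | g_2 from two subgroups by hand and computes a reduction in MSE for the split (A,!A). The data are made up, `mse_reduction` is a hypothetical helper, and the exact definition of MSE reduction used internally by spanr is an assumption here (between-group sum of squares divided by n).

```r
x1 <- c(1, 1, 0, 0, 1, 0)
x2 <- c(1, 0, 1, 0, 1, 0)
x3 <- c(0, 0, 1, 0, 0, 0)
y  <- c(11, 10, 11, 10, 11, 10)

g1 <- x1 & x2            # subgroup g_1: x1 and x2
g2 <- as.logical(x3)     # subgroup g_2: x3
A  <- g1 | g2            # A in disjunctive normal form

# Hypothetical helper: drop in mean squared error when y is summarised by
# separate means in A and !A instead of a single overall mean.
mse_reduction <- function(y, in_A) {
  total_ss  <- sum((y - mean(y))^2)
  within_ss <- sum((y[in_A] - mean(y[in_A]))^2) +
               sum((y[!in_A] - mean(y[!in_A]))^2)
  (total_ss - within_ss) / length(y)
}

red <- mse_reduction(y, A)
```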

If the variables x, u, v, w, ... of the formula are not coded binary, a pre-analysis is done to establish an optimal cut of each variable. This is done, again with respect to reduction in MSE, or log-rank chi-square for a survival formula, over values of the variable. If the variable is numeric, a dichotomy is made by above/below a cut, the possible cuts being the unique values of the variable if there are 20 or fewer, otherwise 20 equally spaced intervals. If the variable is a factor, the dichotomy is made according to each value of the factor.
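The numeric pre-analysis might look roughly like the sketch below. It is not the package's code: `candidate_cuts` and `best_cut` are hypothetical helpers, "20 equally spaced intervals" is interpreted as 20 equally spaced cut points between the minimum and maximum, and "above" is taken to mean strictly greater than the cut.

```r
# Hypothetical helper: candidate cut points for a numeric variable.
candidate_cuts <- function(x) {
  u <- sort(unique(x))
  if (length(u) <= 20) u else seq(min(x), max(x), length.out = 20)
}

# Hypothetical helper: choose the cut with the largest reduction in MSE.
best_cut <- function(x, y) {
  cuts <- candidate_cuts(x)
  total_ss <- sum((y - mean(y))^2)
  score <- sapply(cuts, function(cut) {
    above <- x > cut
    if (all(above) || !any(above)) return(NA_real_)   # one side is empty
    within_ss <- sum((y[above] - mean(y[above]))^2) +
                 sum((y[!above] - mean(y[!above]))^2)
    (total_ss - within_ss) / length(y)
  })
  cuts[which.max(score)]   # which.max ignores NA scores
}

x <- c(1, 2, 3, 4, 5, 6)
y <- c(10, 10, 10, 11, 11, 11)
cut <- best_cut(x, y)      # y jumps between x = 3 and x = 4
x_binary <- as.numeric(x > cut)
```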

Value

Object spanr with attributes:

A Data frame of same length as input data that is a binary indicator of belonging to A.

g Data frame of same length as input data, columns indicating belonging to the subgroups of A

h Data frame of same length as input data, columns indicating belonging to the subgroups of !A

Author(s)

Roger Marshall <rj.marshall@auckland.ac.nz>, The University of Auckland, New Zealand

Examples

## 1. Simulate Bernoulli binary predictors x1, x2, ..., x10 and an outcome y.
## Where (x1 & x2 & x3) | (x1 & x4) | (x1 & x9) holds, y ~ N(mean 11, sd 0.5);
## otherwise y ~ N(mean 10, sd 0.5).
x <- matrix(data = rbinom(10000, 1, 0.5), nrow = 1000, ncol = 10)
colnames(x) <- paste0("x", 1:10)
P <- ifelse((x[, 1] & x[, 2] & x[, 3]) | (x[, 1] & x[, 4]) | (x[, 1] & x[, 9]), 1, 0)
y <- ifelse(P, rnorm(1000, 11, 0.5), rnorm(1000, 10, 0.5))
d <- data.frame(cbind(y, x))
sp <- spanr(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10,
            data = d, size = c(1, 2, 2), beta = NA)
## 2. Survival analysis of pbc data
library(survival)
data(pbc)
sp <- with(pbc, spanr(formula = Surv(time, status == 2) ~ trt + age + sex + ascites
               + hepato + spiders + edema + bili + chol + albumin
               + copper + ast + trig + platelet + protime + stage,
               beta = NA, cc = TRUE, gamma = 1))
test <- cbind(pbc, sp$A)
## Kaplan-Meier curves of A versus !A
x <- survfit(Surv(test$time, test$status == 2) ~ test$A)
plot(x, col = c(1, 2))
