GUCfit: Predicting the Underlying Causes of Death

View source: R/GUCfit.R

Description

Fit a redistribution model to predict the Underlying Causes (UCs) from Garbage Codes (GCs). NHIRC-Usable

Usage

## S3 method for class 'formula'
GUCfit(
  formula,
  data,
  gc_to_uc,
  nm_id = "id",
  method = c("NB", "MLR"),
  prop_valid = 0.2,
  ...
)

GUCfit(formula, ...)

## S3 method for class 'GUCfit'
print(x)

## S3 method for class 'GUCfit'
summary(x)

Arguments

formula

an object of class "formula": a symbolic description of the model to be fitted. This should contain an outcome variable (as the underlying causes) and predictor variables if any. Predictor variables put inside multi() are recognized as multiple causes. See Details §1 also.

data

a data.frame or a list (with equal-length vectors) containing all the variables.

gc_to_uc

a named matrix specifying the a priori constraints for GC-UC mapping. The row names are used as the Garbage Code (GC) levels and the column names are used as the Underlying Cause (UC) levels. This is a required argument. See Details §2 also.

nm_id

variable name of the identity key for the individual record.

method

one of the following redistribution models. See also Details §3.

  • "NB": Naive Bayes Classifier (default).

  • "MLR": Multinomial Logistic Regression implemented by nnet::multinom().

prop_valid

proportion of the data used for validation (defaults to 0.2). See also Details §4.

...

the following optional arguments are passed to redistribution methods.

  • alp = 0.1 (default): The smoothing parameter in the "NB" method, i.e. the additional count added to every stratum of the conditional probabilities.

  • maxit = 100 (default): Maximum number of iterations in "MLR" method. Additional arguments for nnet can also be specified here. See nnet::nnet() for details.
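
As a rough illustration of what the alp smoothing parameter does, the sketch below applies additive smoothing to a toy contingency table in base R. The data, the variable names (x, uc, p_x_given_uc) and the column-wise normalization are assumptions made for this example; the package's internal computation may differ.

```r
alp <- 0.1  # smoothing parameter, same value as the documented default

# Toy training data (hypothetical labels): one predictor and the underlying cause
x  <- factor(c("a", "a", "b", "a"), levels = c("a", "b"))
uc <- factor(c("UC_1", "UC_1", "UC_1", "UC_2"), levels = c("UC_1", "UC_2"))

counts   <- table(x, uc)   # raw counts per (x, uc) stratum
smoothed <- counts + alp   # add alp to every stratum, including empty ones

# Conditional probabilities P(x | uc): each column sums to 1 (assumed normalization)
p_x_given_uc <- prop.table(smoothed, margin = 2)
p_x_given_uc
```

Without smoothing, the empty stratum (x = "b", uc = "UC_2") would get probability zero and rule out UC_2 for every record with x = "b"; a small positive alp avoids that.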

Details

§1. Specify the model formula

The formula argument takes the general model form used in lm() or glm(). It should look like GUC ~ x1 + x2 + multi(MC1, MC2), where GUC is the name of the outcome variable (the underlying causes of death, UCs). The right-hand side of ~ contains the names of the predictors used by the model. Here, x1 and x2 are ordinary predictor variables, as in common regression models, and multi(...) specifies the multiple causes of death (here, MC1 and MC2). multi(...) is handled differently by each method. In "NB", the variables in multi(...) are treated as item-sets when calculating the conditional probabilities; see the reference paper for details. In "MLR", multi(...) is transformed into a set of binary variables, each indicating whether a given cause-of-death item is present. The formula may contain only one multi(...) term, and only factor or character variables are accepted.
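
The binary-indicator transformation described for "MLR" can be pictured with a small base R sketch. This is not the package's internal code: the toy data frame, the cause codes and the object names (d, codes, ind) are hypothetical, and the real encoding may differ in detail.

```r
# Toy records with two multiple-cause columns (hypothetical codes)
d <- data.frame(
  MC1 = c("I10", "E11", "I10"),
  MC2 = c("E11", NA,    "J18"),
  stringsAsFactors = FALSE
)

# All distinct cause codes that appear in any multiple-cause column
codes <- sort(unique(na.omit(unlist(d))))

# One 0/1 column per code: is the code mentioned anywhere on the record?
ind <- sapply(codes, function(code) {
  as.integer(rowSums(d == code, na.rm = TRUE) > 0)
})
ind
```

Each row of ind corresponds to one record and each column to one cause code, regardless of which MC column the code was recorded in.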

§2. Specify the GC-UC mapping constraints

The GC-UC mapping constraints gc_to_uc should be a named matrix. Both row names and column names are required: the row names define the GC categories and the column names define the UC categories. The entries of this matrix $A$ should be binary: $A_{ij} = 1$ permits redistributing the $i$-th GC category to the $j$-th UC category, and $A_{ij} = 0$ forbids it.
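
A minimal sketch of building such a constraint matrix in base R follows. The category labels are hypothetical, and the final check (every GC has at least one permitted UC) is an assumption about what a sensible mapping looks like rather than a documented requirement.

```r
# Hypothetical category labels, for illustration only
gc_levels <- c("GC_a", "GC_b")
uc_levels <- c("UC_1", "UC_2", "UC_3")

# Start from all zeros (nothing permitted), then open specific GC -> UC routes
A <- matrix(0, nrow = length(gc_levels), ncol = length(uc_levels),
            dimnames = list(gc_levels, uc_levels))
A["GC_a", c("UC_1", "UC_2")] <- 1
A["GC_b", "UC_3"] <- 1

# Sanity checks before passing A as gc_to_uc
stopifnot(all(A %in% c(0, 1)))                           # binary entries
stopifnot(!is.null(rownames(A)), !is.null(colnames(A)))  # named matrix
stopifnot(all(rowSums(A) > 0))  # assumed: each GC has at least one target UC
```

A matrix built this way would be passed as the gc_to_uc argument; a matrix of all ones, as in the Examples section, places no restriction on the mapping.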

§3. Redistributing GCs to UCs

Records whose outcome variable is a UC (as defined in gc_to_uc) are used to train the "NB" or "MLR" model. The trained model is then used to predict the UCs for records whose outcome is a GC. In general, "NB" is recommended because it handles missing data and a large number of UC categories better (it is more accurate and efficient in those settings). However, "MLR" can perform better with more complete data and a small number of UC categories. We recommend using the validation procedure to compare the two methods before full implementation.
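
The partition of records into a training set and a set to be redistributed can be sketched in base R as below. The toy data and object names (d, train, to_redistribute) are hypothetical; the sketch only shows how the dimnames of gc_to_uc identify UC and GC outcomes, not how GUCfit() does it internally.

```r
# Hypothetical mapping: one GC category, two UC categories
gc_to_uc <- matrix(1, nrow = 1, ncol = 2,
                   dimnames = list("GC_a", c("UC_1", "UC_2")))

# Toy records: the outcome column mixes UCs and GCs
d <- data.frame(
  id  = 1:5,
  GUC = c("UC_1", "GC_a", "UC_2", "UC_1", "GC_a"),
  stringsAsFactors = FALSE
)

# Column names of gc_to_uc are the UC levels; row names are the GC levels
is_uc <- d$GUC %in% colnames(gc_to_uc)
is_gc <- d$GUC %in% rownames(gc_to_uc)

train           <- d[is_uc, ]  # records used to train the model
to_redistribute <- d[is_gc, ]  # records whose UC is to be predicted
```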

§4. Validation and the error measures

When the validation proportion (prop_valid = 0.2 by default) is greater than zero, a random subset of the records with UCs, of that proportion, is reserved for validation. Binary and cross-entropy error measures are used to evaluate model performance. Call summary() on the returned GUCfit object to see the average errors in the training and validation partitions.
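
For intuition, the two error measures can be computed by hand on a toy example. The definitions used here (binary error as the misclassification rate of the most probable UC, cross entropy as the mean negative log-probability assigned to the true UC) are common conventions and are assumed for this sketch; the package's exact definitions may differ. The probabilities and labels are made up.

```r
# Hypothetical predicted probabilities (rows: records, columns: UC levels)
p <- matrix(c(0.7, 0.2, 0.1,
              0.1, 0.6, 0.3),
            nrow = 2, byrow = TRUE,
            dimnames = list(NULL, c("UC_1", "UC_2", "UC_3")))
truth <- c("UC_1", "UC_3")  # true underlying causes of the two records

# Binary error: share of records whose most probable UC is not the true UC
pred         <- colnames(p)[max.col(p, ties.method = "first")]
binary_error <- mean(pred != truth)

# Cross entropy: mean negative log-probability given to the true UC
p_true        <- p[cbind(seq_along(truth), match(truth, colnames(p)))]
cross_entropy <- -mean(log(p_true))
```

The binary error counts only whether the top prediction is right, while the cross entropy also penalizes low probability on the true cause, so the two measures can rank models differently.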

Value

A GUCfit object containing the following components.

Methods (by class)

References

Ng, T. C., Lo, W. C., Ku, C. C., Lu, T. H., & Lin, H. H. (2020). Improving the use of mortality data in public health: A comparison of garbage code redistribution models. American Journal of Public Health, 110(2), 222-229.

See Also

multideath for the demo dataset, nnet::multinom for the underlying MLR method.

Examples

## Not run: 
# load demo dataset
data("multideath")

# create a full gc_to_uc matrix
gucs <- sort(unique(multideath$GUC))
gc_to_uc = matrix(1, 10, 97, dimnames = list(gucs[98:107], gucs[1:97]))

# predictors have to be factors or characters
d <- multideath
d$x1 <- factor(d$x1)
d$x2 <- factor(d$x2)
d$x3 <- factor(d$x3)

# fit a NB model
fit1 <- GUCfit(
  formula = GUC ~ age + x1 + x2 + x3 + multi(MC1, MC2, MC3),
  data = d, gc_to_uc = gc_to_uc,
  nm_id = "id", method = "NB", prop_valid = 0.2)

# summarizing the results
summary(fit1)

## End(Not run)

dachuwu/TBDtoolbox documentation built on Dec. 27, 2021, 8:11 p.m.