Fit a redistribution model to predict the Underlying Causes (UCs) from Garbage Codes (GCs). NHIRC-Usable
formula |
an object of class "formula": a symbolic description of the model to be fitted.
This should contain an outcome variable (as the underlying causes) and predictor variables if any.
Predictor variables put inside |
data |
a data.frame or a list (with equal-length vectors) containing all the variables. |
gc_to_uc |
a named matrix specifying the a priori constraints for GC-UC mapping. The row names are used as the Garbage Code (GC) levels and the column names are used as the Underlying Cause (UC) levels. This is a required argument. See Details §2 also. |
nm_id |
variable name of the identity key for the individual record. |
method |
one of the following redistribution models: "NB" or "MLR". See also Details §3. |
prop_valid |
proportion of data used for validation (defaults to 0.2). See also Details §4. |
... |
the following optional arguments are passed to redistribution methods.
|
The formula argument takes the general model form used by lm() or glm(). It should look like GUC ~ x1 + x2 + multi(MC1, MC2), where GUC is the name of the outcome variable (the underlying causes of death, UCs). The right-hand side of ~ contains the names of the predictors used by the model. Here, x1 and x2 represent ordinary predictor variables, as in common regression models, and multi(...) specifies the multiple causes of death (here, MC1 and MC2).
multi(...) is treated differently by each method. In "NB", the variables in multi(...) are treated as item-sets used to calculate the conditional probabilities; see the reference paper for details. In "MLR", multi(...) is transformed into a set of binary variables, each indicating whether a given cause-of-death item is present. The formula should contain only one multi(...) term. Also, only factor or character variables are accepted.
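To make the "MLR" treatment of multi(...) more concrete, the following is a minimal base-R sketch of turning multiple-cause columns into one binary indicator per cause item. The toy codes and the names d, mc_cols, items, and indicators are illustrative assumptions; the package's internal encoding may differ.

```r
# Toy records with up to two multiple causes of death (hypothetical codes).
d <- data.frame(
  MC1 = c("I10", "E11", "I10"),
  MC2 = c("E11", NA, "J18"),
  stringsAsFactors = FALSE
)

# Columns that would appear inside multi(...).
mc_cols <- c("MC1", "MC2")

# All distinct cause items observed in any of those columns.
x <- unlist(d[mc_cols], use.names = FALSE)
items <- sort(unique(x[!is.na(x)]))

# One 0/1 column per item: 1 if the item appears anywhere in the record.
indicators <- sapply(items, function(item) {
  as.integer(rowSums(d[mc_cols] == item, na.rm = TRUE) > 0)
})
indicators
```

Each row of indicators corresponds to one record, and each column to one cause item, so the position of an item within MC1 or MC2 no longer matters.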
The GC-UC mapping constraints gc_to_uc should be a named matrix. Row names and column names are both required: the row names define the GC categories and the column names define the UC categories. The entries of this matrix ($A$) should be binary: $A_{ij} = 1$ permits redistribution of the $i$-th GC category to the $j$-th UC category, and $A_{ij} = 0$ forbids it.
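As an illustration of the matrix described above, here is a small sketch that builds a constrained gc_to_uc matrix by hand. The level names (GC_X, UC_A, and so on) are hypothetical, and the final check that every GC has at least one permitted UC is an added assumption rather than a documented requirement.

```r
ucs <- c("UC_A", "UC_B", "UC_C")  # hypothetical UC levels (column names)
gcs <- c("GC_X", "GC_Y")          # hypothetical GC levels (row names)

# Start from "nothing permitted", then switch on the allowed GC -> UC pairs.
gc_to_uc <- matrix(0, nrow = length(gcs), ncol = length(ucs),
                   dimnames = list(gcs, ucs))
gc_to_uc["GC_X", c("UC_A", "UC_B")] <- 1
gc_to_uc["GC_Y", "UC_C"] <- 1

# Checks implied by the description: binary entries and both sets of names.
stopifnot(all(gc_to_uc %in% c(0, 1)),
          !is.null(rownames(gc_to_uc)),
          !is.null(colnames(gc_to_uc)))

# Assumed extra check: every GC can be redistributed to at least one UC.
stopifnot(all(rowSums(gc_to_uc) > 0))
gc_to_uc
```

A matrix filled entirely with 1 (as in the Examples) places no restriction on the redistribution.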
Records whose outcome variable holds a UC (as defined in gc_to_uc) are used to train the "NB" or "MLR" model. The trained model is then used to predict the UCs for records whose outcome is a GC. Generally, "NB" is recommended because it handles missing data and a large number of UC categories better (it is more accurate and more efficient in that setting). However, "MLR" can perform better with more complete data and a small number of UC categories. We recommend using the validation procedure to compare the two methods before full implementation.
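The train/predict split described above can be sketched in base R as follows. The data and level names are hypothetical. The last step, which zeroes out forbidden UCs and renormalizes, is only an assumption about how the gc_to_uc constraints could be enforced on predicted probabilities; it is not taken from the package source.

```r
# Hypothetical constraint matrix: GC_X may go to UC_A or UC_B, but not UC_C.
gc_to_uc <- matrix(c(1, 1, 0), nrow = 1,
                   dimnames = list("GC_X", c("UC_A", "UC_B", "UC_C")))

# Toy records: the outcome GUC holds either a UC or a GC.
d <- data.frame(id = 1:5,
                GUC = c("UC_A", "GC_X", "UC_B", "UC_C", "GC_X"),
                stringsAsFactors = FALSE)

# Records with a UC outcome are used for training (and validation);
# records with a GC outcome are the ones to be redistributed.
train      <- d[d$GUC %in% colnames(gc_to_uc), ]
to_predict <- d[d$GUC %in% rownames(gc_to_uc), ]

# Assumed constraint step: mask unconstrained probabilities for one GC_X
# record with its row of gc_to_uc, then renormalize to sum to 1.
p <- c(UC_A = 0.5, UC_B = 0.3, UC_C = 0.2)
allowed <- gc_to_uc["GC_X", names(p)]
p_constrained <- p * allowed / sum(p * allowed)
p_constrained
```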
When the validation proportion (prop_valid = 0.2 by default) is greater than zero, a random proportion of the records with UCs is reserved for validation. Binary and cross-entropy error measures are used to evaluate model performance. Use summary() on the returned GUCfit object to see the average errors in the training and validation partitions.
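For intuition about the two error measures, the sketch below computes them from a small matrix of predicted UC probabilities. The definitions used here (binary error as the share of records whose most probable UC differs from the true UC, and cross-entropy as the mean negative log probability assigned to the true UC) are common conventions and are assumed; the exact formulas used by the package are given in the reference paper and may differ.

```r
# Hypothetical predicted probabilities for three validation records.
pred <- matrix(c(0.70, 0.20, 0.10,
                 0.10, 0.60, 0.30,
                 0.50, 0.25, 0.25),
               nrow = 3, byrow = TRUE,
               dimnames = list(NULL, c("UC_A", "UC_B", "UC_C")))
truth <- c("UC_A", "UC_C", "UC_A")  # true UCs of the same records

# Probability assigned to the true UC of each record.
p_true <- pred[cbind(seq_along(truth), match(truth, colnames(pred)))]

# Cross-entropy error (assumed definition): mean negative log-likelihood.
cross_entropy <- -mean(log(p_true))

# Binary error (assumed definition): share of records whose top UC is wrong.
top <- colnames(pred)[max.col(pred, ties.method = "first")]
binary_error <- mean(top != truth)

c(binary = binary_error, cross_entropy = cross_entropy)
```

Lower values are better for both measures; comparing them across "NB" and "MLR" fits on the same validation partition is one way to follow the recommendation above.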
A GUCfit object containing the following components.
formula: The formula, same as the input.
pred_GUC: The predicted UC probabilities for each GC record. A data.frame in which the rows identify the individual records and the columns identify the UC categories. The key nm_id is preserved to identify individual predictions.
dat_info: A data.frame summarizing the number of records used for training, validation, and prediction.
error_info: A data.frame summarizing the error measures.
fit: The fitted model.
gcs: A character vector listing the GC levels, same as the row names of gc_to_uc.
ucs: A character vector listing the UC levels, same as the column names of gc_to_uc.
method: The modeling method.
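As a sketch of how the pred_GUC component might be used downstream, the code below works on a hand-made data.frame that mimics the described layout: one key column plus one probability column per UC. The column name id and the UC names are assumptions for illustration; the real object should be inspected before relying on this layout.

```r
# Mock-up of a pred_GUC-like table (hypothetical key "id" and UC levels).
pred_GUC <- data.frame(id = c(11, 12),
                       UC_A = c(0.6, 0.1),
                       UC_B = c(0.4, 0.9))

# Separate the key from the probability columns.
uc_cols <- setdiff(names(pred_GUC), "id")
probs <- as.matrix(pred_GUC[uc_cols])

# Most probable UC for each GC record.
best <- data.frame(id = pred_GUC$id,
                   UC = uc_cols[max.col(probs, ties.method = "first")],
                   stringsAsFactors = FALSE)

# Expected number of redistributed deaths per UC (sum of probabilities).
expected <- colSums(probs)

best
expected
```

Summing probabilities keeps fractional assignments, which is usually preferable to hard assignment when the goal is corrected cause-of-death counts.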
GUCfit: Print the basics (GC/UC levels, redistribution method) of GUCfit
GUCfit: Print the details (variable importance, errors) of GUCfit
Ng, T. C., Lo, W. C., Ku, C. C., Lu, T. H., & Lin, H. H. (2020). Improving the use of mortality data in public health: A comparison of garbage code redistribution models. American Journal of Public Health, 110(2), 222-229.
multideath for the demo dataset, nnet::multinom for the underlying MLR method.
## Not run:
# load demo dataset
data("multideath")
# create a full gc_to_uc matrix
gucs <- sort(unique(multideath$GUC))
gc_to_uc <- matrix(1, 10, 97, dimnames = list(gucs[98:107], gucs[1:97]))
# predictors have to be factors or characters
d <- multideath
d$x1 <- factor(d$x1)
d$x2 <- factor(d$x2)
d$x3 <- factor(d$x3)
# fit a NB model
fit1 <- GUCfit(
formula = GUC ~ age + x1 + x2 + x3 + multi(MC1, MC2, MC3),
data = d, gc_to_uc = gc_to_uc,
nm_id = "id", method = "NB", prop_valid = 0.2)
# summarizing the results
summary(fit1)
## End(Not run)