va_gcv: va.gcv: General Cross Validation for Selecting Optimal Size to...
In iqss-research/VA-package: Verbal Autopsies


This function determines the optimal size of the symptom subsets. The training set is randomly split in half: one half is treated as the hospital sample and the other half as the community sample. va.gcv then searches for the subset size nsymp that minimizes the prediction error in the community sample. The estimation is done using constrained quadratic optimization.
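The split-half idea described above can be sketched in base R. This is only an illustration of the random split, not the package's internal code; the data frame `train`, its column names, and the object names `pseudo_hospital` and `pseudo_community` are hypothetical.

```r
# Illustrative only: split a labelled training set into two random halves.
set.seed(1)  # reproducible split

# Hypothetical training data: three binary symptoms and a known cause of death.
train <- data.frame(
  fever     = rbinom(200, 1, 0.4),
  coughing  = rbinom(200, 1, 0.3),
  chestpain = rbinom(200, 1, 0.2),
  death     = sample(1:3, 200, replace = TRUE)
)

n   <- nrow(train)
idx <- sample(n, size = floor(n / 2))  # rows assigned to the "hospital" half

pseudo_hospital  <- train[idx, ]   # cause of death is used for estimation
pseudo_community <- train[-idx, ]  # cause of death is held out for scoring

# Observed cause-specific mortality fractions in the held-out half; estimates
# for each candidate nsymp would be compared against these.
observed_csmf <- prop.table(table(pseudo_community$death))
```

Because the held-out half still carries the true cause of death, the prediction error for each candidate subset size can be measured directly.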

va.gcv(formula, data, nsymp.vec, n.subset=300, prob.wt=1, boot.se=FALSE, nboot=1, printit=FALSE, print.reg.size=TRUE)

`formula`	a formula object. The left side of the formula is the collection of symptoms. The right side is the cause of death. For example, if there are 5 symptoms, named `fever`, `coughing`, `chestpain`, `dizziness`, `shortbreath`, and the cause of death variable is `death`, then the formula can be written as: `formula=cbind(fever, coughing, chestpain, dizziness, shortbreath)~death` or, for short, as: `formula=cbind(fever, ... ,shortbreath)~death`. Note that the short form requires the symptom variables to be located in a consecutive block in the data, starting with `fever` and ending with `shortbreath`. Note also that the current version requires the variable on the right-hand side of the formula, `death` in this example, to be present in the `community` sample. If it is unknown in the `community` sample, the user needs to create such a variable with arbitrary numerical values.
`data`	a list of two datasets. The first is the hospital data, which contains the known cause of death for each individual and a collection of symptoms from verbal autopsy studies. The second is the community data, where typically only the symptoms are available. The known cause of death may be available outside the hospital if it is a validation study, but it will not be used during estimation. Variable names must be exactly the same in the two datasets.
`nsymp.vec`	a vector of positive integers containing the candidate values of `nsymp` that can be used by `va()`. For a total of `J` causes of death and a total of `ns` symptoms in the sample, `nsymp.vec` can be set to the vector `a:b`, where `a` is the smallest integer such that $2^a>J$ and `b` is typically set to `floor(0.75*ns)`. If the sample size is small, `b` can be set to a smaller value to prevent the function from exiting due to data sparsity. No default value is set.
`n.subset`	a positive integer specifying the total number of symptom subsets drawn, and thus the number of estimations performed. The default is `300`.
`prob.wt`	either `0`, `1`, or a vector of weights that determines how likely each symptom is to be selected for a subset. When `prob.wt` is a user-supplied vector, it must be a vector of probabilities that sum to 1, and its length must equal the total number of symptoms. When `prob.wt=1`, binomial weights are used, which are proportional to the inverse of the variance of each reported binary symptom variable. When `prob.wt=0`, all symptoms are equally likely to be selected. The default is `1`.
`boot.se`	a logical value. If `TRUE`, bootstrap standard errors of the CSMF are estimated. This typically takes a lot of computing time, so setting `boot.se=FALSE` in `va.gcv` is strongly recommended. The default is `FALSE`.
`nboot`	a positive integer. If `boot.se=TRUE`, it specifies the number of bootstrap samples taken to estimate the standard errors of the CSMF. The default is `1`.
`printit`	a logical value. If `TRUE`, the progress of the estimation procedure is printed on the screen.
`print.reg.size`	a logical value. If `TRUE`, the size of the regression matrix is printed at each step of subsampling. This helps the user choose the number of symptoms to subsample. It is recommended to print the size of the regression matrix for different values of `nsymp` using a small `n.subset`.
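The arguments above can be assembled along the following lines. This is a minimal base R sketch using simulated data; the object names (`hospital`, `community`, `va_data`, `va_formula`, `nsymp_vec`, `prob_wt`) are hypothetical, and the inverse-variance weights only illustrate a user-supplied `prob.wt` vector, not necessarily the exact weights the package computes when `prob.wt=1`. The final call is left commented out because it requires the VA package to be installed.

```r
set.seed(42)
symptoms <- c("fever", "coughing", "chestpain", "dizziness", "shortbreath")

# Hypothetical helper: simulate n individuals with binary symptom reports.
make_sample <- function(n) {
  as.data.frame(matrix(rbinom(n * length(symptoms), 1, 0.3), nrow = n,
                       dimnames = list(NULL, symptoms)))
}

hospital <- make_sample(150)
hospital$death <- sample(rep_len(1:3, 150))  # known causes of death (3 causes)

community <- make_sample(100)
community$death <- 0  # arbitrary placeholder: the variable must exist

va_data <- list(hospital, community)  # hospital first, community second

# Formula: symptoms on the left, cause of death on the right.
va_formula <- as.formula(
  paste0("cbind(", paste(symptoms, collapse = ", "), ") ~ death")
)

# nsymp.vec = a:b, with a the smallest integer such that 2^a > J.
J  <- length(unique(hospital$death))
ns <- length(symptoms)
a  <- 1
while (2^a <= J) a <- a + 1
b  <- floor(0.75 * ns)
nsymp_vec <- a:b  # only meaningful when a <= b

# A user-supplied prob.wt: one probability per symptom, summing to 1.
# Assumption: weights proportional to 1 / (p * (1 - p)).
p       <- colMeans(hospital[, symptoms])
w       <- 1 / (p * (1 - p))
prob_wt <- w / sum(w)

# With the VA package installed, the call would look like:
# fit <- va.gcv(va_formula, data = va_data, nsymp.vec = nsymp_vec,
#               n.subset = 50, prob.wt = prob_wt, print.reg.size = TRUE)
```

Starting with a small `n.subset` and `print.reg.size=TRUE` follows the recommendation above for checking regression matrix sizes before a full run.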

For details, please refer to "Verbal Autopsy Methods with Multiple Causes of Death" (King and Lu, 2008), and http://gking.harvard.edu/va

va.gcv outputs two objects. best.symp returns the best nsymp, that is, the one that minimizes the mean square error between the estimated and the observed cause-specific mortality fractions (CSMF). mse returns a vector of the mean square errors associated with each subset size (as specified in nsymp.vec).
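The selection rule described above can be illustrated with made-up numbers. In this sketch the estimated CSMF values are invented, and averaging the squared differences over causes is an assumption about how the error is aggregated.

```r
# Candidate subset sizes and the observed CSMF in the held-out half.
nsymp_vec     <- 2:4
observed_csmf <- c(A = 0.5, B = 0.3, C = 0.2)

# Hypothetical estimated CSMFs, one row per candidate nsymp.
estimated_csmf <- rbind(
  c(0.40, 0.35, 0.25),
  c(0.48, 0.31, 0.21),
  c(0.55, 0.30, 0.15)
)

# Mean square error of each candidate against the observed fractions.
mse <- apply(estimated_csmf, 1, function(est) mean((est - observed_csmf)^2))

# The best nsymp is the candidate with the smallest error.
best_nsymp <- nsymp_vec[which.min(mse)]
```

Here the second candidate has the smallest error, so `best_nsymp` is 3; the `mse` vector has one entry per element of `nsymp_vec`.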

King, Gary and Ying Lu. (2008) “Verbal Autopsy Methods with Multiple Causes of Death”, Statistical Science, 14(1). Also available at http://gking.harvard.edu/va

iqss-research/VA-package documentation built on Dec. 20, 2021, 7:58 p.m.