va_valid: va.valid: a method selecting potentially biased symptoms
In iqss-research/VA-package: Verbal Autopsies


This function selects the symptoms that are potentially reported with bias by comparing the fitted marginal distribution of the symptoms with the observed symptom distribution in the community sample. To control the false discovery rate, a Bonferroni adjustment to the significance values can be applied sequentially to the models.

va.validate(formula, data=list(hospital=NA, community=NA),
            nsymp=10, n.subset=300, nboot=1, boot.se=FALSE, method="quadOpt",
            fix=NA, bound=NA, prob.wt=1,
            printit=TRUE, print.reg.size=TRUE,
            clean.method="ttest",
            min.Symp=10, confidence=0.95, FDR=TRUE)

`formula`	A formula object. The left side of the formula is the collection of symptoms. The right side is the cause of death. For example, if there are 5 symptoms, named `fever`, `coughing`, `chestpain`, `dizziness`, `shortbreath`, and the cause of death variable is `death`, then the formula can be written as: `formula=cbind(fever, coughing, chestpain, dizziness, shortbreath)~death` or for short as: `formula=cbind(fever, ... ,shortbreath)~death`. Note that the short form requires the symptom variables to be located in a consecutive block in the data, starting with `fever` and ending with `shortbreath`. Note also that the current version requires the variable on the right-hand side of the formula, `death` in this example, to be present in the `community` sample. If it is unknown in the `community` sample, the user needs to create such a variable with arbitrary numerical values.
`data`	A list of two datasets. The first is the hospital data, which contains the known cause of death for each individual and a collection of symptoms from verbal autopsy studies. The second is the community data, where typically only the symptoms are available. The known cause of death can be available outside the hospital if it is a validation study, but it will not be used during estimation. Variable names must be exactly the same in the two datasets.
`nsymp`	A positive integer specifying the size of the subsets of symptoms drawn from the total set for estimating cause specific mortality fractions at each iteration. The optimal value of `nsymp` can be found by calling `va_gcv`, which uses a general cross-validation method to find the subset size that minimizes the prediction errors based on the training data (typically, the hospital data). For more details, refer to King and Lu (2008).
`n.subset`	A positive integer specifying the total number of subsets, and thus estimations, drawn from all symptoms. The default is `300`.
`nboot`	a positive integer. If `boot.se=TRUE`, it specifies the number of bootstrapping samples taken to estimate the standard errors of CSMF. The default is `1`.
`boot.se`	A logical value. If `TRUE`, bootstrap standard errors of the CSMF are estimated. This typically takes a lot of computing time. It is highly recommended to set `boot.se=FALSE` in `va_gcv`. The default is `FALSE`.
`method`	A string specifying the computational procedure used to estimate the cause specific mortality fractions. When `method="quadOpt"`, the CSMF is estimated via constrained quadratic programming. A subroutine (`solve.QP`) from the `quadprog` package is called to perform the constrained quadratic optimization task. When `method="constrainLS"`, the CSMF is estimated via constrained least squares. The default method is `quadOpt`, as it is faster and more stable.
`fix`	A vector of strings that specifies whether a subset of the cause specific mortality fractions is set to predetermined values (based on, e.g., information obtained from other sources). Suppose we would like to fix "d1" at 5% and "d2" at 15%; then `fix=c("d1=0.05", "d2=0.15")`. The default is `NA`, meaning no such constraint is imposed.
`bound`	A vector of strings that specifies lower and upper bounds for a subset of the cause specific mortality fractions (based on, e.g., information obtained from other sources). Suppose we would like "d3" to be estimated between 5% and 10%, and "d4" between 1% and 2%; then `bound=c("0.05<d3<0.1", "0.01<d4<0.02")`. The default is `NA`, meaning no such constraint is imposed.
`prob.wt`	A positive integer or a vector of weights that determines how likely a symptom is to be selected for a subset. When `prob.wt` is a user-supplied vector, it needs to be a vector of probabilities that sum to 1, and its length needs to equal the total number of symptoms. When `prob.wt=1`, binomial weights are used, which are proportional to the inverse of the variance of each reported binary symptom variable. When `prob.wt=0`, all symptoms are equally likely to be selected. The default is `1`.
`printit`	Logical value. If `TRUE`, the progress of the estimation procedure will be printed on the screen.
`print.reg.size`	Logical value. If `TRUE`, the size of the regression matrix is printed at each step of subsampling. This provides helpful information for the user when choosing the number of symptoms to subsample. It is recommended to print the size of the regression matrix for different values of `nsymp` with a small value of `n.subset`.
`clean.method`	A string specifying which test to use to detect poorly fit symptoms. The default is `"ttest"`. The other option is `"ztest"`. For details, see King and Lu (2008b).
`min.Symp`	An integer value. When the number of available symptoms is less than `min.Symp`, the automatic procedure for selecting the next poorly fit symptom stops. The default is 10. If `min.Symp` is less than `nsymp`, `va.validate` coerces `min.Symp` to be `nsymp`.
`confidence`	A number between 0 and 1. It specifies the confidence level (or the significance level `(1-confidence)/2`) at which the user decides whether to remove a symptom from the estimation. The default is 0.95.
`FDR`	Logical value. If `TRUE`, a Bonferroni adjustment for multiple testing is applied sequentially to a collection of nested models as more symptoms are removed. For details, see King and Lu (2008b).
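The arguments above can be assembled as in the following minimal sketch. The data are simulated purely for illustration, and all object names (`hospital`, `community`, `va_formula`, `wt`, `fix_csmf`, `bound_csmf`) are hypothetical. The inverse-variance weights are one plausible way to build a user-supplied `prob.wt` vector and are not taken from the package source. The final call only runs if `va.validate` is already available in the session, and the argument values chosen there are assumptions, not recommendations.

```r
set.seed(1)
symptoms <- c("fever", "coughing", "chestpain", "dizziness", "shortbreath")

# Simulate n rows of binary symptom indicators (illustration only).
make_symptoms <- function(n) {
  as.data.frame(matrix(rbinom(n * length(symptoms), 1, 0.4),
                       nrow = n, dimnames = list(NULL, symptoms)))
}

hospital <- make_symptoms(200)
hospital$death <- sample(c("d1", "d2", "d3", "d4"), 200, replace = TRUE)

community <- make_symptoms(300)
# Cause of death is unknown here, so fill in an arbitrary numeric placeholder.
community$death <- 0

# Long form of the formula: symptoms on the left, cause of death on the right.
va_formula <- cbind(fever, coughing, chestpain, dizziness, shortbreath) ~ death

# A user-supplied prob.wt vector: probabilities that sum to 1, one per symptom.
# Here they are proportional to the inverse binomial variance (an assumption).
p  <- colMeans(community[symptoms])
wt <- (1 / (p * (1 - p))) / sum(1 / (p * (1 - p)))

# Constraint strings in the formats described for `fix` and `bound`.
fix_csmf   <- c("d1=0.05", "d2=0.15")
bound_csmf <- c("0.05<d3<0.1", "0.01<d4<0.02")

# Run only when the package providing va.validate has been attached.
if (exists("va.validate", mode = "function")) {
  fit <- va.validate(va_formula,
                     data = list(hospital = hospital, community = community),
                     nsymp = 3, n.subset = 50, prob.wt = wt,
                     fix = fix_csmf, min.Symp = 3)
}
```

Both datasets carry identical variable names, as `data` requires, and `community$death` exists even though its values are placeholders.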

For details, please refer to "Designing Verbal Autopsy Analyses: A Report to the World Health Organization" (King and Lu, 2008b) and http://gking.harvard.edu/va
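The general idea of comparing fitted and observed marginal symptom proportions with a Bonferroni adjustment can be sketched as follows. This is not the package's implementation: the input numbers are made up, the test is a plain two-sided z-test using the binomial standard error, and the names `observed`, `fitted_ps`, and `flagged` are hypothetical. Only `pnorm` and `p.adjust` from base R are used.

```r
# Hypothetical inputs: observed symptom proportions in a community sample of
# size n, and fitted marginal proportions for the same symptoms.
n         <- 500
observed  <- c(fever = 0.42, coughing = 0.31, chestpain = 0.18, dizziness = 0.25)
fitted_ps <- c(fever = 0.40, coughing = 0.30, chestpain = 0.30, dizziness = 0.26)

# Two-sided z-test per symptom, with the standard error taken at the fitted value.
se    <- sqrt(fitted_ps * (1 - fitted_ps) / n)
z     <- (observed - fitted_ps) / se
p_raw <- 2 * pnorm(-abs(z))

# Bonferroni adjustment across all symptoms tested.
confidence <- 0.95
p_adj      <- p.adjust(p_raw, method = "bonferroni")

# Symptoms whose adjusted p-value falls below 1 - confidence are flagged
# as poorly fit and would be candidates for removal.
flagged <- names(observed)[p_adj < 1 - confidence]
flagged
```

In this toy input only `chestpain` shows a large gap between the observed and fitted proportions, so it is the only symptom flagged.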

`va.validate` outputs the following objects.
`cod.list`	A list of cause of death estimates based on a set of nested models.
`Ps.list`	A list of fitted marginal symptom distributions based on the same set of nested models.
`delete.list`	A list of collections of removed symptoms based on the nested models.
`FDR.delete.list`	A list of symptoms that should be removed based on the Bonferroni adjustment in order to attain a global confidence level of `confidence`.
When `boot.se` is `TRUE`, `va.validate` also returns a set of objects that summarize the variance of the predicted marginal symptom distribution (`Ps.var`), the variance of the residuals (`e.var`), and the results based on all the bootstrapped samples (`cod.list.boot` and `Ps.list.boot`).

King, Gary and Ying Lu. (2008) "Verbal Autopsy Methods with Multiple Causes of Death", Statistical Science, 14(1). Also available at http://gking.harvard.edu/va
King, Gary and Ying Lu. (2008b) "Designing Verbal Autopsy Analyses: A Report to WHO".

iqss-research/VA-package documentation built on Dec. 20, 2021, 7:58 p.m.