insilico.train: Modified InSilicoVA methods with training data

View source: R/insilico_train.r

insilico.train    R Documentation

Modified InSilicoVA methods with training data

Description

This function implements the InSilicoVA model for input data that are not in the InterVA4 format, using training data with known causes of death.

Usage

insilico.train(
  data,
  train,
  cause,
  causes.table = NULL,
  thre = 0.95,
  type = c("quantile", "fixed", "empirical")[1],
  isNumeric = FALSE,
  updateCondProb = TRUE,
  keepProbbase.level = TRUE,
  CondProb = NULL,
  CondProbNum = NULL,
  datacheck = TRUE,
  datacheck.missing = TRUE,
  warning.write = FALSE,
  external.sep = TRUE,
  Nsim = 4000,
  thin = 10,
  burnin = 2000,
  auto.length = TRUE,
  conv.csmf = 0.02,
  jump.scale = 0.1,
  levels.prior = NULL,
  levels.strength = NULL,
  trunc.min = 1e-04,
  trunc.max = 0.9999,
  subpop = NULL,
  java_option = "-Xmx1g",
  seed = 1,
  phy.code = NULL,
  phy.cat = NULL,
  phy.unknown = NULL,
  phy.external = NULL,
  phy.debias = NULL,
  exclude.impossible.cause = TRUE,
  impossible.combination = NULL,
  indiv.CI = NULL,
  CondProbTable = NULL,
  ...
)

Arguments

data

The original data to be used. The suggested input is similar to that of InterVA4, with the first column containing death IDs, followed by 245 symptoms. The only difference is that InSilicoVA takes three levels: “present”, “absent”, and “missing (no data)”. As in the InterVA software, “present” symptoms take the value “Y” and “absent” symptoms take the value “NA” or “”. Missing symptoms, e.g., questions not asked or answered in the original interview, corrupted data, etc., should be coded as “.” to distinguish them from the “absent” category. The order of the columns does not matter as long as the column names are correct, and the data may include extra unused columns beyond the standard InterVA4 input, but the first column must be the death ID. For example input data, see RandomVA1 and RandomVA2.
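
As a minimal sketch of the three-level coding described above, the following base R snippet builds a toy input and counts each level per symptom. The symptom names `fever` and `cough` are hypothetical placeholders; real input would use the InterVA4 column names.

```r
# Toy input; symptom names are hypothetical, not the InterVA4 names.
raw <- data.frame(
  ID    = c("d1", "d2", "d3"),
  fever = c("Y", "", "."),
  cough = c("", "Y", "Y"),
  stringsAsFactors = FALSE
)

# Count the three levels in each symptom column:
# "Y" = present, "" or NA = absent, "." = missing (no data).
level_counts <- sapply(raw[, -1], function(x) {
  c(present = sum(x %in% "Y"),
    absent  = sum(is.na(x) | x %in% ""),
    missing = sum(x %in% "."))
})
print(level_counts)
```

A quick tabulation like this can help confirm that “.” and “” have not been conflated before the data are passed on.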

train

Training data. It should be in the same format as the testing data and contain one additional column (see cause below) specifying the known cause of death. The first column is also assumed to be the death ID.

cause

The name of the column in train that contains the cause of death.
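
The relationship between data, train, and cause can be sketched with toy data as follows. All column names and cause labels here are hypothetical, and the final call is left commented out because it needs the InSilicoVA package and Java; it only uses arguments listed in the Usage section.

```r
# Toy labeled data; symptom names and cause labels are hypothetical.
labeled <- data.frame(
  ID    = paste0("d", 1:6),
  fever = c("Y", "", "Y", ".", "", "Y"),
  cough = c("", "Y", "Y", "", ".", "Y"),
  cause = c("Malaria", "Pneumonia", "Pneumonia",
            "Malaria", "Malaria", "Pneumonia"),
  stringsAsFactors = FALSE
)

# Training data keep the cause column; testing data drop it.
train <- labeled[1:4, ]
test  <- labeled[5:6, setdiff(names(labeled), "cause")]

# Both inputs start with the death ID and share the same symptom columns.
stopifnot(names(train)[1] == "ID", names(test)[1] == "ID")
stopifnot(setequal(setdiff(names(train), c("ID", "cause")),
                   setdiff(names(test), "ID")))

# Not run (requires the InSilicoVA package and Java):
# fit <- InSilicoVA::insilico.train(data = test, train = train, cause = "cause")
```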

causes.table

The list of causes of death used in training data.

thre

A numerical value between 0 and 1 specifying the maximum missing rate allowed for a symptom to be considered in the model. The default is 0.95, meaning that a symptom with more than 95% missing values in the training data will be removed.
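
The effect of thre can be illustrated with a small base R sketch. This is not the package's internal code; it assumes that “.” marks a missing value and that a symptom is dropped only when its missing rate is strictly greater than thre. The symptom names are hypothetical.

```r
# Toy training symptoms; "." marks missing. Names are hypothetical.
train_sym <- data.frame(
  fever = c("Y", "", ".", "Y"),
  rash  = c(".", ".", ".", "."),
  cough = c("", ".", "Y", ""),
  stringsAsFactors = FALSE
)
thre <- 0.95

# Fraction of deaths with a missing value, per symptom.
missing_rate <- colMeans(train_sym == ".")

# Keep symptoms whose missing rate does not exceed the threshold
# (assumption: the comparison is strict, as the description suggests).
kept <- names(missing_rate)[missing_rate <= thre]
print(missing_rate)
print(kept)
```

Here `rash` is entirely missing and would be dropped, while `fever` and `cough` are kept.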

type

Three ways of learning conditional probabilities are provided: “empirical”, “quantile”, or “fixed”. Since InSilicoVA works with ranked conditional probabilities P(S|C), “quantile” means the rankings of P(S|C) are obtained by matching quantiles of the distribution of the default InterVA P(S|C), and “fixed” means P(S|C) are matched to the closest values in the default InterVA P(S|C) table. Empirically, both types of rankings produce similar results. “empirical”, on the other hand, means no ranking is calculated and the empirical conditional probabilities are used directly. If type is “empirical”, updateCondProb is forced to FALSE.
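
To make the distinction concrete, the sketch below computes an empirical P(S|C) for one symptom and then snaps it to the closest value in a reference table, in the spirit of “fixed”. It is an illustration only, not the package's implementation: the reference values in `ref_levels` are placeholders rather than the InterVA table, and the “quantile” option is not shown.

```r
# Toy numeric training data for one symptom: 1 = present, 0 = absent.
# Cause labels "A" and "B" are hypothetical.
sym   <- c(1, 1, 0, 1, 0, 0)
cause <- c("A", "A", "A", "B", "B", "B")

# "empirical": P(S | C) is the share of deaths from each cause
# in which the symptom is present.
p_emp <- tapply(sym, cause, mean)

# "fixed"-style matching: snap each empirical value to the closest
# reference value. These values are placeholders, not the InterVA table.
ref_levels <- c(0.01, 0.1, 0.5, 0.8, 1)
p_fixed <- sapply(p_emp, function(p) {
  ref_levels[which.min(abs(ref_levels - p))]
})

print(p_emp)
print(p_fixed)
```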

isNumeric

Indicator of whether the input is already in numeric form. If the input is coded numerically, with 1 for “present”, 0 for “absent”, and -1 for “missing”, this indicator can be set to TRUE to avoid conversion to the standard InterVA format.
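
For reference, the numeric coding can be produced from the character coding with a few lines of base R. The helper `to_numeric` and the symptom names are hypothetical and are not part of the package.

```r
# Toy symptom columns in the character coding; names are hypothetical.
sym_chr <- data.frame(
  fever = c("Y", "", ".", NA),
  cough = c("", "Y", "Y", "."),
  stringsAsFactors = FALSE
)

# Map one column: "Y" -> 1, "." -> -1, everything else ("" or NA) -> 0.
to_numeric <- function(x) {
  out <- rep(0, length(x))
  out[x %in% "Y"] <- 1
  out[x %in% "."] <- -1
  out
}

sym_num <- as.data.frame(lapply(sym_chr, to_numeric))
print(sym_num)
```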

updateCondProb

Logical indicator. If FALSE, the InSilicoVA model is fit without re-estimating conditional probabilities.

keepProbbase.level

see insilico for more detail.

CondProb

see insilico for more detail.

CondProbNum

see insilico for more detail.

datacheck

Not Implemented.

datacheck.missing

Not Implemented.

warning.write

Not Implemented.

external.sep

Not Implemented.

Nsim

see insilico for more detail.

thin

see insilico for more detail.

burnin

see insilico for more detail.

auto.length

see insilico for more detail.

conv.csmf

see insilico for more detail.

jump.scale

see insilico for more detail.

levels.prior

see insilico for more detail.

levels.strength

see insilico for more detail.

trunc.min

see insilico for more detail.

trunc.max

see insilico for more detail.

subpop

see insilico for more detail.

java_option

see insilico for more detail.

seed

see insilico for more detail.

phy.code

see insilico for more detail.

phy.cat

see insilico for more detail.

phy.unknown

see insilico for more detail.

phy.external

see insilico for more detail.

phy.debias

see insilico for more detail.

exclude.impossible.cause

Logical indicator of whether to exclude impossible causes.

impossible.combination

A matrix of two columns: the first contains symptom names and the second contains cause names. Each row pairs a symptom with a cause that is impossible when that symptom is present.
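
A matrix of this shape might be built as follows. The symptom and cause names are hypothetical and would need to match the column names of the data and the cause labels in the training set.

```r
# Hypothetical symptom and cause names, for illustration only.
# Column 1: symptom; column 2: cause that is impossible when it is present.
impossible <- matrix(
  c("male",    "Maternal",
    "neonate", "Prostate cancer"),
  ncol = 2, byrow = TRUE
)
print(impossible)
```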

indiv.CI

see insilico for more detail.

CondProbTable

A data frame of two columns: one with the alphabetic levels used in the CondProb argument and one with the numerical value corresponding to each level. Used only when conditional probabilities are provided instead of training data.
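
The sketch below shows what such a level-to-value table could look like and how it maps a letter-coded matrix to numbers. The levels, values, column names, and the symptoms-by-causes orientation of the matrix are all assumptions made for illustration; they are not the InterVA table or the package's internal lookup.

```r
# Hypothetical levels and values; not the InterVA table.
CondProbTable <- data.frame(
  level = c("A", "B", "C"),
  value = c(0.8, 0.5, 0.1),
  stringsAsFactors = FALSE
)

# Hypothetical letter-coded conditional probabilities (symptoms x causes).
CondProb <- matrix(
  c("A", "C", "B", "B"), nrow = 2,
  dimnames = list(c("fever", "cough"), c("Malaria", "Pneumonia"))
)

# Translate each letter to its numerical value by table lookup.
CondProbNum <- matrix(
  CondProbTable$value[match(CondProb, CondProbTable$level)],
  nrow = nrow(CondProb), dimnames = dimnames(CondProb)
)
print(CondProbNum)
```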

...

not used

Details

Please see insilico for more details about choosing chain length and operating system differences. This function implements InSilicoVA with customized input format and training data.

For more details on the model specification, see the paper at https://arxiv.org/abs/1411.3042.

Value

An insilico object.


InSilicoVA documentation built on Sept. 29, 2022, 9:06 a.m.