fit_density: Fit (multivariate) conditional density


Description

Fit (multivariate) conditional density

Usage

fit_density(X, Y, input_data, bin_method = c("equal.mass", "equal.len",
  "dhist"), nbins = 5, max_n_cat = 20, pool = FALSE,
  max_n_bin = NA_integer_, parfit = FALSE,
  bin_estimator = speedglmR6$new(), intrvls = NULL, weights = NULL,
  verbose = getOption("condensier.verbose"))

Arguments

X

A vector containing the names of predictor variables to use for modeling.

Y

A character string giving the name of the column that represents the response variable(s) in the model.

input_data

Input dataset, can be a data.frame or a data.table.

bin_method

The method for choosing bins when discretizing and fitting the conditional continuous summary exposure variable sA. The default method is "equal.mass", which results in a data-adaptive selection of the bins based on equal mass (equal number of observations), i.e., each bin is defined so that it contains approximately the same number of observations; the maximum number of observations in each bin is controlled by the parameter max_n_bin. Method "equal.len" partitions the range of sA into nbins intervals of equal length. Method "dhist" uses a mix of the above two approaches; see Denby and Mallows, "Variations on the Histogram" (2009), for more detail.

nbins

Sets the number of bins when discretizing a continuous outcome variable under bin_method = "equal.len". If set to NA, the total number of equal-length intervals (bins) is determined by the nearest integer of nobs/max_n_bin, where nobs is the total number of observations in the input data. The default value is 5.

max_n_cat

The maximum number of unique levels for a categorical outcome. If the outcome has more levels, it is automatically considered continuous.

pool

When fitting a model for a binned (discretized) continuous outcome, set to TRUE to fit a pooled regression that pools the bin indicators across all bins into a single outcome regression.

max_n_bin

Max number of observations per single bin for a continuous outcome (applies directly when bin_method="equal.mass" and indirectly when bin_method="equal.len" but nbins = NA). Default is NA_integer_.

parfit

Default is FALSE. Set to TRUE to use the foreach package (functions foreach and %dopar%) to perform the logistic regression fits and predictions for discretized continuous outcomes in parallel. This functionality requires registering a parallel backend prior to calling fit_density, e.g., by loading the doParallel R package and running registerDoParallel(cores = ncores) for integer ncores parallel jobs. For an example, see the test in "./tests/RUnit/RUnit_tests_04_netcont_sA_tests.R".

bin_estimator

The estimator to use for fitting the binary bin-indicator outcomes. Defaults to speedglmR6, which estimates via the speedglm package; another option is glmR6.

intrvls

A named list of intervals, one for each outcome variable. Each list item is a vector of numeric cutoffs that define the bins. By default the bins are defined automatically (based on the selected binning method); when this argument is provided, it overrides the automatic bin selection. This option should be used only when fine-tuned control over the bin definitions is desired.

weights

The name of the column in the input data containing the optional weights. Alternatively, the weights may be specified directly as a numeric vector.

verbose

Set to TRUE to print status messages and information to the console. This can be turned on by default with options(condensier.verbose = TRUE).

Value

An R6 object of class SummariesModel containing the conditional density fit(s).

Details

**********************************************************************

Modeling P(sA|sW) for summary measures (sA,sW)

**********************************************************************

Non-parametric estimation of the common unit-level multivariate joint conditional probability model P_g0(sA|sW), for unit-level summary measures (sA,sW) generated from the observed exposures and baseline covariates (A,W)=(A_i,W_i : i=1,...,N) (their joint density given by g_0(A|W)Q(W)), is performed by first constructing the dataset of N summary measures, (sA_i,sW_i : i=1,...,N), and then fitting the usual i.i.d. MLE for the common density P_g0(sA|sW) based on the pooled N sample of these summary measures.

Note that sA can be multivariate and any of its components sA[j] can be either binary, categorical or continuous. The joint probability model P(sA|sW) = P(sA[1],...,sA[k]|sW) can be factorized as P(sA[1]|sW) * P(sA[2]|sW, sA[1]) * ... * P(sA[k]|sW, sA[1],...,sA[k-1]), where each of these conditional probability models is fit separately, depending on the type of the outcome variable sA[j].

If sA[j] is binary, the conditional probability P(sA[j]|sW,sA[1],...,sA[j-1]) is fit with a logistic regression model. When sA[j] is continuous (or categorical), its range is first partitioned into K bins with the corresponding K bin indicators (B_1,...,B_K); each bin indicator is then used as the outcome in a separate logistic regression model with predictors given by sW, sA[1],...,sA[j-1]. Thus, the joint probability P(sA|sW) is defined by such a tree of binary logistic regressions.

For simplicity, we now suppose sA is continuous and univariate and we describe here an algorithm for fitting P_{g_0}(sA | sW) (the algorithm for fitting P_{g^*}(sA^* | sW^*) is equivalent, except that exposure A is replaced with exposure A^* generated under f_gstar1 or f_gstar2 and the predictors sW from the regression formula hform.g0 are replaced with predictors sW^* specified by the regression formula hform.gstar).

  1. Generate a dataset of N observed continuous summary measures (sA_i:i=1,...,N) from observed ((A_i,W_i):i=1,...,N).

  2. Divide the range of sA values into intervals S=(i_1,...,i_M,i_{M+1}) so that any observed data point sa_i belongs to one interval in S; namely, for each possible value sa of sA there is k in 1,...,M such that i_k < sa <= i_{k+1}. Let the mapping B(sa) in 1,...,M denote the unique interval in S for sa, such that i_{B(sa)} < sa <= i_{B(sa)+1}. Let bw_{B(sa)} := i_{B(sa)+1} - i_{B(sa)} be the length (bandwidth) of the interval (i_{B(sa)}, i_{B(sa)+1}]. Also define the binary indicators b_1,...,b_M, where b_j := I(B(sa)=j) for all j <= B(sa) and b_j := NA for all j > B(sa); that is, we set b_j to missing once the indicator I(B(sa)=j) jumps from 0 to 1. Now let sA denote the random variable for the observed summary measure for one unit, and denote by (B_1,...,B_M) the corresponding random indicators for sA, defined as B_j := I(B(sA)=j) for all j <= B(sA) and B_j := NA for all j > B(sA).

  3. For each j=1,...,M, fit a logistic regression model for the conditional probability P(B_j = 1 | B_{j-1}=0, sW), i.e., the conditional probability that B_j jumps from 0 to 1 at bin j, given that B_{j-1}=0. Each of these logistic regression models is fit only among the observations that are still at risk of having B_j=1, i.e., those with B_{j-1}=0.

  4. Normalize the above conditional probability of B_j jumping from 0 to 1 by the corresponding interval length (bandwidth) bw_j to obtain the discrete conditional hazard h_j(sW) := P(B_j = 1 | B_{j-1}=0, sW) / bw_j, for each j. For the summary measure sA, the conditional hazard h_j(sW) is equal to P(sA in (i_j, i_{j+1}] | sA > i_j, sW), i.e., the probability that sA falls in the interval (i_j, i_{j+1}], conditional on sW and on the fact that sA does not belong to any interval before j.

  5. Finally, for any given data point (sa,sw), evaluate the discretized conditional density P(sA=sa|sW=sw) by first evaluating the interval number k=B(sa) in 1,...,M for sa and then computing \prod_{j=1,...,k-1}(1 - h_j(sW)) * h_k(sW), which is equivalent to the joint conditional probability that sa belongs to the interval (i_k, i_{k+1}] and does not belong to any of the intervals 1 to k-1, conditional on sW.

The evaluation above utilizes a discretization of the fact that any continuous density f_X of a random variable X can be written as f_X(x) = S_X(x) * h_X(x), where S_X(x) := P(X > x) is the survival function for X and h_X(x) := P(X = x | X >= x) is the (discretized) hazard function for X; as well as the fact that the discretized survival function S_X(x) can be written as a product over the hazards for s < x: S_X(x) = \prod_{s<x}(1 - h_X(s)).
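To make steps 2-5 above concrete, here is a minimal sketch in R. These are illustrative helper functions written for this description, not part of the condensier API: given fixed cutoffs and the per-bin conditional probabilities, they construct the bin indicators of step 2 and evaluate the discretized density of step 5.

```r
## Illustrative helpers (not part of the condensier API), assuming known
## cutoffs i_1 < ... < i_{M+1} and per-bin conditional probabilities.

## Step 2: bin indicators (b_1,...,b_M) for a single value sa; indicators
## after the first 1 are set to NA (the unit is no longer "at risk").
bin_indicators <- function(sa, cutoffs) {
  M <- length(cutoffs) - 1
  k <- findInterval(sa, cutoffs, rightmost.closed = TRUE)  # k = B(sa)
  c(rep(0L, k - 1), 1L, rep(NA_integer_, M - k))
}

## Steps 4-5: discretized density at sa. Here p[j] = P(B_j = 1 | B_{j-1} = 0)
## is the unnormalized bin probability; dividing by the bandwidth bw_k
## (the step-4 normalization) turns the bin mass
## prod_{j<k}(1 - p_j) * p_k into a density value.
density_from_bin_probs <- function(sa, cutoffs, p) {
  k  <- findInterval(sa, cutoffs, rightmost.closed = TRUE)
  bw <- diff(cutoffs)
  prod(1 - p[seq_len(k - 1)]) * p[k] / bw[k]
}

## Example: 3 bins of width 1; the value 1.5 falls in bin 2.
bin_indicators(1.5, c(0, 1, 2, 3))                        # 0 1 NA
density_from_bin_probs(1.5, c(0, 1, 2, 3), c(0.2, 0.5, 1))  # (1 - 0.2) * 0.5 = 0.4
```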

Three methods for defining bin (interval) cutoffs for a continuous one-dimensional summary measure sA[j]

**********************************************************************

There are 3 alternative methods for defining the bin cutoffs S=(i_1,...,i_M,i_{M+1}) for a continuous summary measure sA. The choice of method, along with other discretization parameters (e.g., the total number of bins), is controlled via the tmlenet_options() function. See ?tmlenet_options, argument bin_method, for additional details.

Approach 1 (equal.len): equal length, default.

*********************

The bins are defined by splitting the range of the observed sA values (sa_1,...,sa_N) into equal-length intervals. This is the default discretization method, set by passing the argument bin_method="equal.len" to the tmlenet_options() function prior to calling tmlenet(). The intervals are defined by splitting the range of (sa_1,...,sa_N) into nbins equal-length intervals, where nbins is another argument of tmlenet_options(). When nbins=NA (the default setting), the actual value of nbins is computed at run time as the integer value (floor) of n/max_n_bin, for n the total observed sample size and max_n_bin another argument of tmlenet_options(), with default value 1,000.
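As a sketch of this rule (illustrative only, not the package internals), equal-length cutoffs and the nbins=NA fallback can be written as:

```r
## Equal-length cutoffs over the observed range (illustrative sketch,
## not the package internals): nbins intervals need nbins + 1 cutoffs.
equal_len_cutoffs <- function(sa, nbins) {
  seq(min(sa), max(sa), length.out = nbins + 1)
}

## Fallback when nbins = NA: floor(n / max_n_bin) intervals.
nbins_default <- function(n, max_n_bin = 1000) {
  max(1L, floor(n / max_n_bin))
}

equal_len_cutoffs(c(0, 10), nbins = 5)  # 0 2 4 6 8 10
nbins_default(5500)                     # 5
```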

Approach 2 (equal.mass): data-adaptive equal mass intervals.

*********************

The intervals are defined by splitting the range of sA into data-adaptive intervals of unequal length, ensuring that each interval contains around max_n_bin observations from (sa_j : j=1,...,N). This interval definition approach is selected by passing the argument bin_method="equal.mass" to tmlenet_options() prior to calling tmlenet(). The method ensures that an approximately equal number of observations belongs to each interval, where that number is controlled by setting max_n_bin; the default setting is max_n_bin=1000 observations per interval.
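A minimal sketch of equal-mass binning (illustrative, not the package internals) uses empirical quantiles, which by construction place roughly the same number of observations in each interval:

```r
## Equal-mass cutoffs via empirical quantiles (illustrative sketch):
## the probs grid splits the data into nbins groups of ~equal size.
equal_mass_cutoffs <- function(sa, nbins) {
  unname(quantile(sa, probs = seq(0, 1, length.out = nbins + 1)))
}

## 100 observations, 4 bins of ~25 observations each:
equal_mass_cutoffs(1:100, nbins = 4)
```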

Approach 3 (dhist): combination of 1 & 2.

*********************

The data-adaptive approach dhist is a mix of Approaches 1 & 2 (see Denby and Mallows, "Variations on the Histogram" (2009)). This interval definition method is selected by passing the argument bin_method="dhist" to tmlenet_options() prior to calling tmlenet().

Examples

library("condensier")
library("simcausal")
D <- DAG.empty()
D <-
D + node("W1", distr = "rbern", prob = 0.5) +
  node("W2", distr = "rbern", prob = 0.3) +
  node("W3", distr = "rbern", prob = 0.3) +
  node("sA.mu", distr = "rconst", const = (0.98 * W1 + 0.58 * W2 + 0.33 * W3)) +
  node("sA", distr = "rnorm", mean = sA.mu, sd = 1)
D <- set.DAG(D, n.test = 10)
datO <- sim(D, n = 10000, rndseed = 12345)

## Fit conditional density using equal mass bins (same number of obs per bin):
dens_fit <- fit_density(
    X = c("W1", "W2", "W3"),
    Y = "sA",
    input_data = datO,
    nbins = 20,
    bin_method = "equal.mass",
    bin_estimator = speedglmR6$new())

## Wrapper function to predict the conditional probability (likelihood)
## for new observations:
newdata <- datO[1:5, c("W1", "W2", "W3", "sA"), with = FALSE]
preds <- predict_probability(dens_fit, newdata)

## Wrapper function to sample the values from the conditional density fit:
sampledY <- sample_value(dens_fit, newdata)

## Fit conditional density using equal length bins
## (divides the range of support of the outcome to define each bin):
dens_fit <- fit_density(
    X = c("W1", "W2", "W3"),
    Y = "sA",
    input_data = datO,
    nbins = 20,
    bin_method = "equal.len",
    bin_estimator = speedglmR6$new())

## Wrapper function to predict the conditional probability (likelihood)
## for new observations:
newdata <- datO[1:5, c("W1", "W2", "W3", "sA"), with = FALSE]
preds <- predict_probability(dens_fit, newdata)

## Wrapper function to sample the values from the conditional density fit:
sampledY <- sample_value(dens_fit, newdata)


## Fit conditional density using custom bin definitions (argument intrvls):
dens_fit <- fit_density(
    X = c("W1", "W2", "W3"),
    Y = "sA",
    input_data = datO,
    bin_estimator = speedglmR6$new(),
    nbins = 5,
    intrvls = list(sA = seq(-4,4, by = 0.1)))


## Fit conditional density using custom bin definitions and
## pool all bin indicators into a single long-format dataset.
## The pooling results in a single regression that is fit for all bin hazards,
## with a bin indicator added as an additional covariate.
dens_fit <- fit_density(
    X = c("W1", "W2", "W3"),
    Y = "sA",
    input_data = datO,
    bin_estimator = speedglmR6$new(),
    intrvls = list(sA = seq(-4,4, by = 0.1)),
    nbins = 5,
    pool = TRUE)

osofr/condensier documentation built on May 8, 2019, 11:14 p.m.