Fit (multivariate) conditional density
Usage

fit_density(X, Y, input_data, bin_method = c("equal.mass", "equal.len",
  "dhist"), nbins = 5, max_n_cat = 20, pool = FALSE,
  max_n_bin = NA_integer_, parfit = FALSE,
  bin_estimator = speedglmR6$new(), intrvls = NULL, weights = NULL,
  verbose = getOption("condensier.verbose"))
Arguments

X
    A vector of names of the predictor variables to use for modeling.
Y
    A character string naming the column(s) that represent the response variable(s) in the model.
input_data
    Input dataset; can be a data.frame or a data.table.
bin_method
    The method for choosing bins when discretizing and fitting the conditional density of a continuous outcome variable; one of "equal.mass" (the default), "equal.len" or "dhist". See Details.
nbins
    The number of bins to use when discretizing a continuous outcome variable under bin_method = "equal.len".
max_n_cat
    Maximum number of unique levels for a categorical outcome. If the outcome has more levels, it is automatically treated as continuous.
pool
    Set to TRUE to pool the bin indicators across all bins into a single long-format dataset and fit one pooled regression for all bin hazards, with the bin indicator added as an additional covariate.
max_n_bin
    Maximum number of observations per single bin for a continuous outcome (applies directly when bin_method = "equal.mass", and indirectly when bin_method = "equal.len" with nbins = NA).
parfit
    Default is FALSE. Set to TRUE to perform the bin regression fits in parallel.
bin_estimator
    The estimator to use for fitting the binary bin-indicator outcomes (defaults to speedglmR6$new(), logistic regression via the speedglm package).
intrvls
    A named list of intervals, one for each outcome variable. Each list item is a numeric vector of cutoffs that define the bins. By default the bins are defined automatically (based on the selected bin_method); when this argument is supplied it overrides the automatic bin selection. This option should only be used when fine-tuned control over the bin definitions is desired.
weights
    The name of the column in the input data that contains the optional weights; alternatively, a numeric vector of weights.
verbose
    Set to TRUE to print additional status messages. The default is controlled by the global option condensier.verbose (see getOption("condensier.verbose")).
Value

An R6 object of class SummariesModel containing the conditional density fit(s).
Details
**********************************************************************
As described in the following section, the first step is to construct an estimator P_{g_N}(sA | sW) of the common (in i) conditional density P_{g_0}(sA | sW) for the common (in i) unit-level summary measures (sA, sW). The same fitting algorithm is applied to construct an estimator P_{g^*_N}(sA^* | sW^*) of the common (in i) conditional density P_{g^*}(sA^* | sW^*) for the common (in i) unit-level summary measures (sA^*, sW^*) implied by the user-supplied stochastic intervention f_gstar1 or f_gstar2 and the observed distribution of W. These two density estimators form the basis of the IPTW estimator, which is evaluated at the observed N data points O_i = (sW_i, sA_i, Y_i), i = 1,...,N, and is given by

    ψ^{IPTW}_n = \sum_{i=1,...,N} Y_i \frac{P_{g^*_N}(sA^* = sA_i | sW = sW_i)}{P_{g_N}(sA = sA_i | sW = sW_i)}.
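As a rough illustration, this estimator can be assembled from two separate fit_density() fits using the predict_probability() wrapper shown in the Examples. The object names dens_fit_g0 and dens_fit_gstar and the dataset obs_data (with outcome column Y) below are hypothetical placeholders, not part of the package API:

## Sketch: IPTW estimate from two conditional density fits.
## dens_fit_g0:    fit of P_{g_N}(sA | sW) under the observed exposures.
## dens_fit_gstar: fit of P_{g*_N}(sA* | sW*) under the intervened exposures.
p_g0     <- predict_probability(dens_fit_g0, obs_data)
p_gstar  <- predict_probability(dens_fit_gstar, obs_data)
psi_iptw <- sum(obs_data$Y * p_gstar / p_g0)  # divide by N for the sample-mean version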
Modeling P(sA|sW) for summary measures (sA,sW)
**********************************************************************
Non-parametric estimation of the common unit-level multivariate joint conditional probability model P_{g_0}(sA|sW), for unit-level summary measures (sA,sW) generated from the observed exposures and baseline covariates (A,W) = (A_i,W_i : i=1,...,N) (with joint density g_0(A|W)Q(W)), is performed by first constructing the dataset of N summary measures, (sA_i,sW_i : i=1,...,N), and then fitting the usual i.i.d. MLE for the common density P_{g_0}(sA|sW) based on the pooled sample of N summary measures. Note that sA can be multivariate and any of its components sA[j] can be binary, categorical or continuous.
The joint probability model P(sA|sW) = P(sA[1],...,sA[k]|sW) can be factorized as P(sA[1]|sW) * P(sA[2]|sW, sA[1]) * ... * P(sA[k]|sW, sA[1],...,sA[k-1]), where each of these conditional probability models is fit separately, depending on the type of the outcome variable sA[j].
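For intuition, this factorization can be emulated with separate univariate fit_density() calls, adding earlier components of sA as predictors for later ones; the joint conditional likelihood is then the product of the component likelihoods. A sketch for two components, assuming hypothetical columns sA1 and sA2 in a dataset datO:

## Sketch: P(sA1, sA2 | W1, W2) = P(sA1 | W1, W2) * P(sA2 | W1, W2, sA1).
fit1 <- fit_density(X = c("W1", "W2"), Y = "sA1", input_data = datO)
fit2 <- fit_density(X = c("W1", "W2", "sA1"), Y = "sA2", input_data = datO)
p_joint <- predict_probability(fit1, datO) * predict_probability(fit2, datO)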
If sA[j] is binary, the conditional probability P(sA[j]|sW,sA[1],...,sA[j-1]) is evaluated via a logistic regression model. When sA[j] is continuous (or categorical), its range is first partitioned into K bins with the corresponding K bin indicators (B_1,...,B_K); each bin indicator B_m is then used as an outcome in a separate logistic regression model with predictors given by sW, sA[1],...,sA[j-1]. Thus, the joint probability P(sA|sW) is defined by such a tree of binary logistic regressions.
For simplicity, suppose now that sA is continuous and univariate; we describe here an algorithm for fitting P_{g_0}(sA | sW). (The algorithm for fitting P_{g^*}(sA^* | sW^*) is equivalent, except that the exposure A is replaced with the exposure A^* generated under f_gstar1 or f_gstar2, and the predictors sW from the regression formula hform.g0 are replaced with the predictors sW^* specified by the regression formula hform.gstar.)
Generate a dataset of N observed continuous summary measures (sA_i : i=1,...,N) from the observed ((A_i,W_i) : i=1,...,N).
Divide the range of sA values into intervals S = (i_1,...,i_M,i_{M+1}) so that any observed data point sa_i belongs to exactly one interval in S; namely, for each possible value sa of sA there is k \in {1,...,M} such that i_k < sa <= i_{k+1}. Let the mapping B(sa) \in {1,...,M} denote the unique interval in S for sa, such that i_{B(sa)} < sa <= i_{B(sa)+1}, and let bw_{B(sa)} := i_{B(sa)+1} - i_{B(sa)} be the length (bandwidth) of the interval (i_{B(sa)}, i_{B(sa)+1}). Also define the binary indicators b_1,...,b_M, where b_j := I(B(sa) = j) for all j <= B(sa) and b_j := NA for all j > B(sa). That is, we set b_j to missing once the indicator I(B(sa) = j) jumps from 0 to 1.
Now let sA denote the random variable for the observed summary measure for one unit, and let (B_1,...,B_M) denote the corresponding random indicators for sA, defined as B_j := I(B(sA) = j) for all j <= B(sA) and B_j := NA for all j > B(sA).
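A minimal base-R sketch of this indicator construction (the helper name bin_indicators is ours, not part of the package):

## Build b_1,...,b_M for one value sa, given cutoffs S = (i_1,...,i_{M+1}).
## Assumes i_1 < sa <= i_{M+1}.
bin_indicators <- function(sa, cutoffs) {
  M <- length(cutoffs) - 1L
  k <- findInterval(sa, cutoffs, left.open = TRUE)  # k = B(sa)
  c(rep(0L, k - 1L), 1L, rep(NA_integer_, M - k))   # b_j = I(B(sa)=j), NA past bin k
}
bin_indicators(sa = 0.7, cutoffs = c(0, 0.5, 1, 1.5, 2))  # returns 0 1 NA NA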
For each j = 1,...,M, fit the logistic regression model for the conditional probability P(B_j = 1 | B_{j-1} = 0, sW); that is, at each j this is the conditional probability of B_j jumping from 0 to 1 at bin j, given that B_{j-1} = 0. Each of these logistic regression models is fit only among the observations that are still at risk of having B_j = 1 (those with B_{j-1} = 0). Normalize each such conditional probability by the corresponding interval length (bandwidth) bw_j to obtain the discrete conditional hazard h_j(sW) := P(B_j = 1 | B_{j-1} = 0, sW) / bw_j, for each j.
For the summary measure sA, the above conditional hazard h_j(sW) is equal to P(sA \in (i_j, i_{j+1}) | sA >= i_j, sW), i.e., the probability that sA falls in the interval (i_j, i_{j+1}), conditional on sW and on the fact that sA does not belong to any of the intervals preceding j.
Finally, for any given data point (sa, sw), evaluate the discretized conditional density P(sA = sa | sW = sw) by first evaluating the interval number k = B(sa) \in {1,...,M} for sa and then computing \prod_{j=1,...,k-1}(1 - h_j(sW)) * h_k(sW), which is equivalent to the joint conditional probability that sa belongs to the interval (i_k, i_{k+1}) and does not belong to any of the intervals 1 to k-1, conditional on sW.
The evaluation above utilizes a discretization of the fact that any continuous density f_X of a random variable X can be written as f_X(x) = S_X(x) * h_X(x), where S_X(x) := P(X > x) is the survival function of X and h_X(x) := P(X = x | X >= x) is the (discretized) hazard function of X, as well as the fact that the discretized survival function S_X(x) can be written as a product of the complements of the hazards for s < x: S_X(x) = \prod_{s < x}(1 - h_X(s)).
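Putting the steps together, the following self-contained base-R sketch fits one logistic regression per bin with glm() (standing in for bin_estimator) and evaluates the discretized density; it reuses the bin_indicators() helper above and is illustrative only, not the package's internal implementation. The survival factors use the un-normalized bin probabilities, with the bandwidth division applied once, to the final bin:

## Toy data: one binary covariate W, continuous outcome sA ~ N(W, 1).
set.seed(1)
W  <- rbinom(500, 1, 0.5)
sA <- rnorm(500, mean = W)
## Equal-mass cutoffs (10 bins); open the left end so min(sA) falls in bin 1.
cutoffs <- unname(quantile(sA, probs = seq(0, 1, length.out = 11)))
cutoffs[1] <- cutoffs[1] - 1e-8
M  <- length(cutoffs) - 1L
bw <- diff(cutoffs)
B  <- t(vapply(sA, bin_indicators, integer(M), cutoffs = cutoffs))
## One logistic fit per bin, among observations still at risk (B_j not NA).
## (The last bin's fit is degenerate: all remaining observations fall in it.)
fits <- lapply(seq_len(M), function(j) {
  d <- data.frame(y = B[, j], W = W)
  glm(y ~ W, family = binomial(), data = d[!is.na(d$y), ])
})
## Discretized density at (sa, w): prod_{j<k}(1 - h_j(w)) * h_k(w) / bw_k.
predict_density <- function(sa, w) {
  k <- findInterval(sa, cutoffs, left.open = TRUE)
  h <- vapply(fits[seq_len(k)], function(f)
    unname(predict(f, newdata = data.frame(W = w), type = "response")),
    numeric(1))
  prod(1 - h[seq_len(k - 1)]) * h[k] / bw[k]
}
predict_density(sa = 0.3, w = 1)  # close to dnorm(0.3, mean = 1)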
Three methods for defining bin (interval) cutoffs for a continuous one-dimensional summary measure sA[j]
**********************************************************************
There are 3 alternative methods for defining the bin cutoffs S = (i_1,...,i_M,i_{M+1}) for a continuous summary measure sA. The choice of method, along with the other discretization parameters (e.g., the total number of bins), is controlled by the bin_method, nbins and max_n_bin arguments described above.
Approach 1 (equal.len): equal length intervals.
*********************
The bins are defined by splitting the range of the observed sA values (sa_1,...,sa_N) into nbins equal length intervals. This discretization method is selected by passing bin_method = "equal.len" to fit_density(). When nbins = NA, the actual number of bins is computed at run time as the integer floor of n/max_n_bin, where n is the total observed sample size and max_n_bin is the maximum number of observations allowed per bin (see the max_n_bin argument above).
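For example, the equal-length cutoffs amount to the following (a sketch using the simulated datO from the Examples below, not the package internals):

nbins   <- 5
cutoffs <- seq(min(datO$sA), max(datO$sA), length.out = nbins + 1)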
Approach 2 (equal.mass): data-adaptive equal-mass intervals (the default).
*********************
The intervals are defined by splitting the range of sA into non-equal length, data-adaptive intervals that ensure that each interval contains around max_n_bin observations from (sa_j : j=1,...,N). This interval definition approach is selected by passing bin_method = "equal.mass" to fit_density(). The method ensures that an approximately equal number of observations belongs to each interval, where that number of observations per interval is controlled by the max_n_bin argument.
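Equal-mass cutoffs correspond to empirical quantiles of the outcome (again a sketch using the simulated datO from the Examples; the max_n_bin value is an arbitrary illustration):

max_n_bin <- 1000
nbins     <- ceiling(nrow(datO) / max_n_bin)  # ~max_n_bin observations per bin
cutoffs   <- quantile(datO$sA, probs = seq(0, 1, length.out = nbins + 1))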
Approach 3 (dhist): combination of 1 & 2.
*********************
The data-adaptive approach dhist is a mix of Approaches 1 and 2; see Denby and Mallows, "Variations on the Histogram" (2009). This interval definition method is selected by passing bin_method = "dhist" to fit_density().
Examples

library("simcausal")
library("condensier")
D <- DAG.empty()
D <-
D + node("W1", distr = "rbern", prob = 0.5) +
node("W2", distr = "rbern", prob = 0.3) +
node("W3", distr = "rbern", prob = 0.3) +
node("sA.mu", distr = "rconst", const = (0.98 * W1 + 0.58 * W2 + 0.33 * W3)) +
node("sA", distr = "rnorm", mean = sA.mu, sd = 1)
D <- set.DAG(D, n.test = 10)
datO <- sim(D, n = 10000, rndseed = 12345)
## Fit conditional density using equal mass bins (same number of obs per bin):
dens_fit <- fit_density(
X = c("W1", "W2", "W3"),
Y = "sA",
input_data = datO,
nbins = 20,
bin_method = "equal.mass",
bin_estimator = speedglmR6$new())
## Wrapper function to predict the conditional probability (likelihood)
## for new observations:
newdata <- datO[1:5, c("W1", "W2", "W3", "sA")]
preds <- predict_probability(dens_fit, newdata)
## Wrapper function to sample the values from the conditional density fit:
sampledY <- sample_value(dens_fit, newdata)
## Fit conditional density using equal length bins
## (divides the range of support of the outcome to define each bin):
dens_fit <- fit_density(
X = c("W1", "W2", "W3"),
Y = "sA",
input_data = datO,
nbins = 20,
bin_method = "equal.len",
bin_estimator = speedglmR6$new())
## Wrapper function to predict the conditional probability (likelihood)
## for new observations:
newdata <- datO[1:5, c("W1", "W2", "W3", "sA")]
preds <- predict_probability(dens_fit, newdata)
## Wrapper function to sample the values from the conditional density fit:
sampledY <- sample_value(dens_fit, newdata)
## Fit conditional density using custom bin definitions (argument intrvls):
dens_fit <- fit_density(
X = c("W1", "W2", "W3"),
Y = "sA",
input_data = datO,
bin_estimator = speedglmR6$new(),
nbins = 5,
intrvls = list(sA = seq(-4,4, by = 0.1)))
## Fit conditional density using custom bin definitions and
## pool all bin indicators into a single long-format dataset.
## The pooling results in a single regression that is fit for all bin hazards,
## with a bin indicator added as an additional covariate.
dens_fit <- fit_density(
X = c("W1", "W2", "W3"),
Y = "sA",
input_data = datO,
bin_estimator = speedglmR6$new(),
intrvls = list(sA = seq(-4,4, by = 0.1)),
nbins = 5,
pool = TRUE)