Inddist: Independence Empirical Distribution
In PSinference: Inference for Released Plug-in Sampling Single Synthetic Dataset

View source: R/Inddist.R

Inddist

R Documentation

Independence Empirical Distribution

Description

This function calculates the empirical distribution of the pivotal random variable that can be used to perform inferential procedures and test the independence of two subsets of variables based on the released Single Synthetic data generated under Plug-in Sampling, assuming that the original dataset is normally distributed.

Usage

Inddist(part, nsample, pvariates, iterations)

Arguments

`part`	Number of variables in the first subset.
`nsample`	Sample size.
`pvariates`	Number of variables.
`iterations`	Number of iterations for simulating values from the distribution and finding the quantiles. Default is `10000`.

Details

We define

T_3^\star = \frac{|\boldsymbol{S}^{\star}|} {|\boldsymbol{S}^{\star}_{11}||\boldsymbol{S}^{\star}_{22}|}

where \boldsymbol{S}^\star = \sum_{i=1}^n (v_i - \bar{v})(v_i - \bar{v})^{\top}, v_i is the ith observation of the synthetic dataset, considering \boldsymbol{S}^\star partitioned as

\boldsymbol{S}^{\star}=\left[\begin{array}{lll} \boldsymbol{S}^{\star}_{11}& \boldsymbol{S}^{\star}_{12}\\ \boldsymbol{S}^{\star}_{21} & \boldsymbol{S}^{\star}_{22} \end{array}\right].

Under the assumption that \boldsymbol{\Sigma}_{12} = \boldsymbol{0}, its distribution is stochastic equivalent to

\frac{|\boldsymbol{\Omega}|}{|\boldsymbol{\Omega}_{11}||\boldsymbol{\Omega}_{22}|}

where \boldsymbol{\Omega} \sim \mathcal{W}_p(n-1, \frac{\boldsymbol{W}}{n-1}), \boldsymbol{W} \sim \mathcal{W}_p(n-1, \mathbf{I}_p) and \boldsymbol{\Omega} partitioned in the same way as \boldsymbol{S}^{\star}. To test \mathcal{H}_0: \boldsymbol{\Sigma}_{12} = \boldsymbol{0}, compute the value of T_{3}^\star, \widetilde{T_{3}^\star}, with the observed values and reject the null hypothesis if \widetilde{T_{3}^\star}<t^\star_{3,\alpha} for \alpha-significance level, where t^\star_{3,\gamma} is the \gammath percentile of T_3^\star.

Value

a vector of length iterations that recorded the empirical distribution's values.

References

Klein, M., Moura, R. and Sinha, B. (2021). Multivariate Normal Inference based on Singly Imputed Synthetic Data under Plug-in Sampling. Sankhya B 83, 273–287.

Examples

#generate original data with two independent subsets of variables
library(MASS)
n_sample = 100
p = 4
mu <- c(1,2,3,4)
Sigma = matrix(c(1,   0.5,   0,     0,
                 0.5,   2,   0,     0,
                 0,     0,   3,   0.2,
                 0,     0,   0.2,   4), nr = 4, nc = 4, byrow = TRUE)
df = mvrnorm(n_sample, mu = mu, Sigma = Sigma)
# generate synthetic data
df_s = simSynthData(df)

#Decompose Sstar in 4 parts
part = 2

Sstar = cov(df_s)
Sstar_11 = partition(Sstar,nrows = part, ncol = part)[[1]]
Sstar_12 = partition(Sstar,nrows = part, ncol = part)[[2]]
Sstar_21 = partition(Sstar,nrows = part, ncol = part)[[3]]
Sstar_22 = partition(Sstar,nrows = part, ncol = part)[[4]]

#Compute observed T3_star
T3_obs = det(Sstar)/(det(Sstar_11)*det(Sstar_22))

alpha = 0.05

# colect the quantile from the distribution assuming independence between the two subsets
T3 <- Inddist(part = part, nsample = n_sample, pvariates = p, iterations = 10000)
q5 <- quantile(T3, alpha)

T3_obs < q5 #False means that we don't have statistical evidences to reject independence
print(T3_obs)
print(q5)
# Note that the value of the observed T3_obs is close to one as expected

PSinference documentation built on April 4, 2025, 2:08 a.m.