transformData: Transform data methods

transformData R Documentation

Transform data methods

Description

Implements several data transformation methods: optimal scaling for ordinal or nominal data, and transformations that help relax the assumption of normality (Gaussianity) for continuous data.

Usage

transformData(x, method = "npn", ...)

Arguments

x

A matrix or data.frame (n x p). Rows correspond to subjects, and columns to graph nodes.

method

Data transformation method. It can be one of the following:

  1. "npn" (default), applies the nonparanormal (npn), or semiparametric Gaussian copula, model (Liu et al., 2009), estimating the Gaussian copula by marginally transforming the variables using smooth ECDF functions. The npn-transformed data correspond to the latent underlying multivariate normal distribution and preserve the conditional independence structure of the original variables.

  2. "spearman", computes a trigonometric transformation of the Spearman rho correlation to estimate the latent Gaussian correlation parameters of a nonparanormal distribution (Harris & Drton, 2013), and generates a data matrix whose sample covariance matrix is exactly equal to the estimated one.

  3. "kendall", computes a trigonometric transformation of the Kendall tau correlation to estimate the latent Gaussian correlation parameters of a nonparanormal distribution (Harris & Drton, 2013), and generates a data matrix whose sample covariance matrix is exactly equal to the estimated one.

  4. "polychoric", computes the polychoric correlation matrix and generates a data matrix whose sample covariance matrix is exactly equal to the estimated one. The polychoric correlation (Olsson, 1979) is a measure of association between two ordinal variables. It assumes that each pair of ordinal scores is generated by two latent random variables with a bivariate normal distribution. Tetrachoric (two binary variables) and biserial (an ordinal and a numeric variable) correlations are special cases.

  5. "lineals", performs optimal scaling to obtain transformations that linearize each bivariate regression between pairs of variables. The correlation matrix computed on the transformed data can then be used in subsequent structural equation models (de Leeuw, 1988).

  6. "mca", performs optimal scaling of categorical data by Multiple Correspondence Analysis (MCA, a.k.a. homogeneity analysis), maximizing the first eigenvalue of the transformed correlation matrix. The estimates of the corresponding structural parameters are consistent if the underlying latent space of the observed variables is unidimensional.
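The "spearman" and "kendall" descriptions above can be illustrated with a minimal base R sketch. It assumes the usual nonparanormal bridge formulas, 2*sin(pi/6 * rho) for Spearman and sin(pi/2 * tau) for Kendall, and uses a hypothetical helper, exact_cov_data(), to build a data matrix whose sample covariance equals a target matrix. The toy data and all object names are illustrative only, and this is not necessarily how transformData() is implemented internally.

```r
set.seed(1)
n <- 300

# Toy non-Gaussian data: monotone transforms of correlated normal variables
z1 <- rnorm(n)
z2 <- 0.6 * z1 + 0.8 * rnorm(n)
z3 <- rnorm(n)
x <- cbind(a = exp(z1), b = z2^3, c = z3)

# Trigonometric transforms of the rank correlations (assumed bridge formulas)
R_spearman <- 2 * sin(pi / 6 * cor(x, method = "spearman"))
R_kendall  <- sin(pi / 2 * cor(x, method = "kendall"))

# Hypothetical helper: simulate n rows whose sample covariance is exactly R.
# R must be positive definite, otherwise chol() fails.
exact_cov_data <- function(n, R) {
  p <- ncol(R)
  z <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)
  # Whiten: the sample covariance of w is the identity matrix
  w <- z %*% solve(chol(cov(z)))
  # Colour with the Cholesky factor of the target matrix
  y <- w %*% chol(R)
  colnames(y) <- colnames(R)
  y
}

y <- exact_cov_data(n, R_spearman)
max(abs(cov(y) - R_spearman))   # numerically zero
```

A transformed rank-correlation matrix is not guaranteed to be positive definite; in that case a projection step such as Matrix::nearPD() would be needed before the Cholesky factorization.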

...

Currently ignored.

Details

The nonparanormal transformation is computationally very efficient and requires only one ECDF pass over the data matrix. The polychoric correlation matrix is computed with the lavCor() function of the lavaan package. Optimal scaling (lineals and mca) is performed with the lineals() and corAspect() functions of the aspect package (Mair and De Leeuw, 2008). Note that SEM fitting of the generated (fake) data must be done with a covariance-based method and bootstrap SE, i.e., with SEMrun(..., algo="ricf", n_rep=1000).
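As a rough illustration of the single ECDF pass, the sketch below maps each column to normal scores through its rescaled ranks. The helper npn_sketch() is hypothetical, and the rank/(n + 1) shrinkage is one common choice assumed here; the estimator used by transformData() may differ (for example, a truncated ECDF).

```r
# Hypothetical helper (not part of SEMgraph): rank-based nonparanormal sketch
npn_sketch <- function(x) {
  x <- as.matrix(x)
  n <- nrow(x)
  # One ECDF pass per column: ranks / (n + 1) stay strictly inside (0, 1)
  u <- apply(x, 2, rank) / (n + 1)
  # Map to normal scores, then rescale each column to unit standard deviation
  z <- qnorm(u)
  z <- sweep(z, 2, apply(z, 2, sd), "/")
  dimnames(z) <- dimnames(x)
  z
}

set.seed(1)
x <- cbind(a = rexp(200), b = rlnorm(200))   # skewed toy data
z <- npn_sketch(x)
apply(z, 2, sd)                               # unit SD by construction
```

Because the transformation is monotone within each column, the rank order of the observations is unchanged; only the marginal distributions are reshaped.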

Value

A list of 2 objects is returned:

  1. "data", the matrix (n x p) of n observations and p transformed variables, or the matrix (n x p) of simulated observations based on the selected correlation matrix.

  2. "catscores", the category weights for "lineals" or "mca" methods or NULL otherwise.
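To make the notion of category weights concrete, here is a minimal base R sketch of first-dimension MCA computed from the SVD of the centered indicator matrix. The helper mca_sketch() and the toy data are hypothetical; this is not the algorithm of the aspect package, and the layout of its output need not match the "catscores" element returned by transformData().

```r
# Hypothetical helper: first-dimension MCA (homogeneity analysis) via SVD
mca_sketch <- function(df) {
  df[] <- lapply(df, factor)
  # Indicator matrix: one 0/1 column per category of each variable
  G <- do.call(cbind, lapply(df, function(f) {
    sapply(levels(f), function(l) as.numeric(f == l))
  }))
  block <- rep(names(df), sapply(df, nlevels))
  d <- colSums(G)                                # category frequencies
  Gc <- scale(G, center = TRUE, scale = FALSE)   # removes the trivial solution
  A <- sweep(Gc, 2, sqrt(d), "/")
  v <- svd(A, nu = 0, nv = 1)$v[, 1]
  w <- v / sqrt(d)                               # one weight per category
  names(w) <- colnames(G)
  # Replace each category by its weight, one quantified column per variable
  X <- sapply(names(df), function(j) {
    drop(G[, block == j, drop = FALSE] %*% w[block == j])
  })
  list(data = scale(X), catscores = split(w, block))
}

set.seed(1)
n <- 500
z <- rnorm(n)
lat <- cbind(v1 = z + rnorm(n), v2 = z + rnorm(n), v3 = z + rnorm(n))
df <- as.data.frame(apply(lat, 2, cut, breaks = 4, labels = FALSE))

res <- mca_sketch(df)
res$catscores
# First eigenvalue: optimally scaled data versus plain integer codes
eigen(cor(res$data), only.values = TRUE)$values[1]
eigen(cor(df), only.values = TRUE)$values[1]
```

The first eigenvalue obtained from the quantified data should be at least as large as the one obtained from the integer codes, since integer coding is just one admissible set of category weights.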

Author(s)

Mario Grassi <mario.grassi@unipv.it>

References

Liu H, Lafferty J, and Wasserman L (2009). The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs. Journal of Machine Learning Research, 10(80), 2295-2328.

Harris N, and Drton M (2013). PC Algorithm for Nonparanormal Graphical Models. Journal of Machine Learning Research, 14(69), 3365-3383.

Olsson U (1979). Maximum likelihood estimation of the polychoric correlation coefficient. Psychometrika, 44(4), 443-460.

Mair P, and De Leeuw J (2008). Scaling variables by optimizing correlational and non-correlational aspects in R. Journal of Statistical Software, 32(9), 1-23.

de Leeuw J (1988). Multivariate analysis with linearizable regressions. Psychometrika, 53, 437-454.

Examples


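library(SEMgraph)  # provides transformData(), SEMrun(), gplot() and alsData
library(igraph)    # provides V()

#... with continuous ALS data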
graph<- alsData$graph
data<- alsData$exprs; dim(data)
X<- data[, colnames(data) %in% V(graph)$name]; dim(X)

npn.data<- transformData(X, method="npn")
sem0.npn<- SEMrun(graph, npn.data$data, algo="cggm")

mvnS.data<- transformData(X, method="spearman")
sem0.mvnS<- SEMrun(graph, mvnS.data$data, algo="cggm")

mvnK.data<- transformData(X, method="kendall")
sem0.mvnK<- SEMrun(graph, mvnK.data$data, algo="cggm")

#...with ordinal (K=4 categories) ALS data
Xord <- data.frame(X)
Xord <- as.data.frame(lapply(Xord, cut, 4, labels = FALSE))
colnames(Xord) <- sub("X", "", colnames(Xord))

mvnP.data<- transformData(Xord, method="polychoric")
sem0.mvnP<- SEMrun(graph, mvnP.data$data, algo="cggm")

#...with nominal (K=4 categories) ALS data
mca.data<- transformData(Xord, method="mca")
sem0.mca<- SEMrun(graph, mca.data$data, algo="cggm")
mca.data$catscores
gplot(sem0.mca$graph, l="fdp", main="ALS mca")

# plot colored graphs
#par(mfrow=c(2,2), mar=rep(1,4))
#gplot(sem0.npn$graph, l="fdp", main="ALS npn")
#gplot(sem0.mvnS$graph, l="fdp", main="ALS mvnS")
#gplot(sem0.mvnK$graph, l="fdp", main="ALS mvnK")
#gplot(sem0.mvnP$graph, l="fdp", main="ALS mvnP")


SEMgraph documentation built on Sept. 11, 2024, 8:36 p.m.