d.highdim: Artificial data with 50 factors and 1191 cases

d.highdim    R Documentation

Artificial data with 50 factors and 1191 cases


These crisp-set data are simulated from a presupposed data generating structure (i.e., a causal chain). They feature 20% noise and massive fragmentation (limited diversity). d.highdim is used to illustrate CNA's capacity to analyze high-dimensional data.




The data frame contains 50 factors (columns), V1 to V50, and 1191 rows (cases). It was simulated from the following data generating structure:

(v2*V10 + V18*V16*v15 <-> V13)*(V2*v14 + V3*v12 + V13*V19 <-> V11)
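The notation can be read as follows: an uppercase V_i stands for the factor taking the value 1, a lowercase v_i for the value 0, "*" for conjunction, "+" for disjunction, and "<->" for equivalence. The following base R sketch spells the structure out as a row-wise compatibility check. The function name is_compatible and the two toy cases are hypothetical and serve only as illustration; they are not part of cna.

```r
# Row-wise check of compatibility with the data generating structure.
# Uppercase V_i is read as V_i == 1, lowercase v_i as V_i == 0.
is_compatible <- function(d) {
  # First equivalence: v2*V10 + V18*V16*v15 <-> V13
  lhs13 <- (d$V2 == 0 & d$V10 == 1) |
           (d$V18 == 1 & d$V16 == 1 & d$V15 == 0)
  # Second equivalence: V2*v14 + V3*v12 + V13*V19 <-> V11
  lhs11 <- (d$V2 == 1 & d$V14 == 0) |
           (d$V3 == 1 & d$V12 == 0) |
           (d$V13 == 1 & d$V19 == 1)
  # A case is compatible if both equivalences hold
  (lhs13 == (d$V13 == 1)) & (lhs11 == (d$V11 == 1))
}

# Two hypothetical cases over the 11 factors that occur in the structure;
# they differ only in V13
toy <- data.frame(
  V2  = c(0, 0), V3  = c(0, 0), V10 = c(1, 1), V11 = c(0, 0),
  V12 = c(1, 1), V13 = c(1, 0), V14 = c(1, 1), V15 = c(1, 1),
  V16 = c(0, 0), V18 = c(0, 0), V19 = c(0, 0)
)
is_compatible(toy)
```

In the first toy case v2*V10 holds and V13 is 1, so the first equivalence is satisfied; in the second case v2*V10 holds while V13 is 0, which violates it.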

20% of the cases in d.highdim are incompatible with that structure, meaning they are affected by noise or measurement error. The fragmentation is massive: a total of about 281 trillion (2^48) configurations over the set {V1,...,V50} are compatible with that structure, and the 1191 cases in d.highdim can cover only a minute fraction of them.
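The count of compatible configurations follows from the structure: the values of the two outcomes V13 and V11 are fixed by the remaining factors, so only 48 of the 50 factors vary freely. A short base R sketch of that arithmetic, with variable names chosen here purely for illustration:

```r
n_factors  <- 50
n_outcomes <- 2     # V13 and V11 are determined by the two equivalences
n_cases    <- 1191

# Each of the 48 free factors can take the value 0 or 1
n_compatible <- 2^(n_factors - n_outcomes)
format(n_compatible, big.mark = ",", scientific = FALSE)

# Upper bound on the share of compatible configurations that the data can
# cover, reached only if every case were compatible and distinct
n_cases / n_compatible
```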


d.highdim has been generated with the following code:

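library(cna)  # provides selectCases(), ct2df() and some()
# dplyr must be installed for dplyr::setdiff()

# 5000 random cases over 50 binary factors (no seed is set, so reruns differ)
m0 <- matrix(0, 5000, 50)
dat1 <- as.data.frame(apply(m0, c(1,2), function(x) sample(c(0,1), 1)))
# Keep the cases that are compatible with the target structure
target <- "(v2*V10 + V18*V16*v15 <-> V13)*(V2*v14 + V3*v12 + V13*V19 <-> V11)"
dat2 <- ct2df(selectCases(target, dat1))
# The remaining cases are incompatible with the target and serve as noise
incomp.data <- dplyr::setdiff(dat1, dat2)

# Replace 20% of the compatible cases with randomly drawn incompatible ones
no.replace <- round(nrow(dat2)*0.2)
a <- dat2[sample(nrow(dat2), nrow(dat2)-no.replace, replace = FALSE),]
b <- some(incomp.data, no.replace)
d.highdim <- rbind(a, b)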
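For readers without cna or dplyr at hand, the same noise-injection scheme can be approximated in base R. This is a minimal sketch under stated assumptions, not the code that produced d.highdim: logical indexing stands in for selectCases() and dplyr::setdiff(), sample() without replacement stands in for some(), and the seed and the name toy.highdim are hypothetical.

```r
set.seed(1)  # hypothetical seed, only to make this sketch repeatable

# 5000 random crisp-set cases over 50 factors; as.data.frame() names the
# columns V1 to V50 by default
dat1 <- as.data.frame(matrix(sample(0:1, 5000 * 50, replace = TRUE), ncol = 50))

# Row-wise compatibility with the target structure
ok <- with(dat1, {
  lhs13 <- (V2 == 0 & V10 == 1) | (V18 == 1 & V16 == 1 & V15 == 0)
  lhs11 <- (V2 == 1 & V14 == 0) | (V3 == 1 & V12 == 0) | (V13 == 1 & V19 == 1)
  (lhs13 == (V13 == 1)) & (lhs11 == (V11 == 1))
})
dat2 <- dat1[ok, ]            # compatible cases
incomp.data <- dat1[!ok, ]    # incompatible cases, used as noise

# Swap 20% of the compatible cases for incompatible ones
no.replace <- round(nrow(dat2) * 0.2)
a <- dat2[sample(nrow(dat2), nrow(dat2) - no.replace), ]
b <- incomp.data[sample(nrow(incomp.data), no.replace), ]
toy.highdim <- rbind(a, b)

# Share of noisy cases in the result
no.replace / nrow(toy.highdim)
```

Because the number of compatible cases depends on the random draw, the resulting row count will generally differ from the 1191 cases of d.highdim.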

cna documentation built on Aug. 11, 2023, 1:09 a.m.