CDF | R Documentation |
The Comparison Data Forest (CDF; Goretzko & Ruscio, 2024) approach combines random forest with the comparison data (CD) approach.
CDF(
response,
num.trees = 500,
mtry = 13,
nfact.max = 10,
N.pop = 10000,
N.Samples = 500,
cor.type = "pearson",
use = "pairwise.complete.obs",
vis = TRUE,
plot = TRUE
)
response |
A required matrix or data frame of empirical item responses (rows are observations, columns are items). |
num.trees |
The number of trees in the random forest. (default = 500) See details. |
mtry |
The number of features randomly sampled as split candidates in each tree. (default = 13) See details. |
nfact.max |
The maximum number of factors considered by the CD approach. (default = 10) |
N.pop |
Size of each finite population used in the simulation. (default = 10,000) |
N.Samples |
Number of samples drawn from each population. (default = 500) |
cor.type |
A character string indicating which correlation coefficient (or covariance) is to be computed: one of "pearson" (default), "kendall", or "spearman". See also cor. |
use |
An optional character string giving a method for computing covariances in the presence of missing values. This must be one of "everything", "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs" (default). See also cor. |
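The cor.type and use arguments map directly onto base R's cor(); a minimal illustration with a toy matrix (the matrix x and its missing value are made up for demonstration and are not part of the package):

```r
## Toy data with one missing value (illustrative only; not package code)
set.seed(1)
x <- matrix(rnorm(100), nrow = 20, ncol = 5)
x[3, 2] <- NA

## Pairwise-complete Pearson correlations -- the CDF() defaults
R <- cor(x, method = "pearson", use = "pairwise.complete.obs")
dim(R)  # a 5 x 5 correlation matrix with no NA entries
```

With use = "pairwise.complete.obs", each correlation is computed from all rows that are complete for that pair of columns, so a single NA does not discard the whole row for every pair.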
vis |
If TRUE (default), the factor retention results are printed; if FALSE, they are not. |
plot |
If TRUE (default), the CDF plot is displayed; if FALSE, it is not. See also plot.CDF. |
The Comparison Data Forest (CDF; Goretzko & Ruscio, 2024) approach combines random forest with the comparison data (CD) approach. Its basic steps are to simulate data with different numbers of factors using the method of Ruscio & Roche (2012), extract features from these data to train a random forest model, and then use the trained model to predict the number of factors in the empirical data. The algorithm consists of the following steps:
1. **Simulating Data:**
For each value of nfact from 1 to nfact_{max}, generate population data using the GenData function. Each population is based on nfact factors and consists of N_{pop} observations. For each generated population, repeat the following N_{rep} times; for the j-th repetition:
a. Draw a sample N_{sam} from the population that matches the size of the empirical data;
b. Compute a feature set \mathbf{fea}_{nfact,j} from each N_{sam}.
Combine all generated feature sets \mathbf{fea}_{nfact,j} into a data frame \mathbf{data}_{train,nfact}, and combine all \mathbf{data}_{train,nfact} into the final training dataset \mathbf{data}_{train}.
2. **Training the RF:**
Train a random forest model RF using the combined training dataset \mathbf{data}_{train}.
3. **Predicting for the Empirical Data:**
Compute the feature set \mathbf{fea}_{emp} for the empirical data, then use the trained random forest model RF to predict its number of factors:
nfact_{emp} = RF(\mathbf{fea}_{emp})
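The three steps above can be sketched in dependency-free R. This is a simplified stand-in, not the package implementation: a naive factor-model simulator replaces GenData, the eigenvalues of the correlation matrix replace the 181-feature extractor, and a nearest-centroid rule replaces the random forest classifier.

```r
## Step 0 (stand-ins): simulate factor data and extract simple features
simulate_population <- function(nfact, n_items = 12, N_pop = 1000) {
  ## Items are split evenly across factors with loadings of 0.7
  L <- matrix(0, n_items, nfact)
  idx <- split(seq_len(n_items), rep(seq_len(nfact), length.out = n_items))
  for (f in seq_len(nfact)) L[idx[[f]], f] <- 0.7
  scores <- matrix(rnorm(N_pop * nfact), N_pop, nfact)
  errors <- matrix(rnorm(N_pop * n_items, sd = sqrt(1 - 0.7^2)), N_pop, n_items)
  scores %*% t(L) + errors
}
features <- function(x) eigen(cor(x), symmetric = TRUE, only.values = TRUE)$values

set.seed(42)
nfact_max <- 3; n_rep <- 20; n_emp <- 200
train_X <- NULL; train_y <- NULL
for (nfact in seq_len(nfact_max)) {           # step 1: build the training set
  pop <- simulate_population(nfact)
  for (j in seq_len(n_rep)) {
    samp <- pop[sample(nrow(pop), n_emp), ]
    train_X <- rbind(train_X, features(samp))
    train_y <- c(train_y, nfact)
  }
}

## Step 2 (stand-in): one feature centroid per candidate number of factors
centroids <- t(sapply(seq_len(nfact_max),
                      function(k) colMeans(train_X[train_y == k, , drop = FALSE])))

## Step 3: predict the factor count of "empirical" data from its features
predict_nfact <- function(fea)
  which.min(rowSums((centroids - rep(fea, each = nrow(centroids)))^2))
emp <- simulate_population(2)[seq_len(n_emp), ]
predict_nfact(features(emp))
```

Under these stand-ins the nearest-centroid rule merely plays the role of RF; the package itself trains a random forest on the full 181-feature set.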
According to Goretzko & Ruscio (2024) and Breiman (2001), the recommended number of trees in the random forest, num.trees, is 500. The random forest in CDF performs a classification task, so the recommended value of mtry (the number of features sampled as split candidates at each node) is \sqrt{q}, where q is the number of features; with q = 181 this gives mtry = \lfloor\sqrt{181}\rfloor = 13.
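The \sqrt{q} heuristic behind the default can be checked directly in R (q = 181 is the size of the feature set described below):

```r
q <- 181                 # number of features extracted from each sample
mtry <- floor(sqrt(q))   # sqrt(q) heuristic for classification forests
mtry                     # 13
```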
Since the CDF approach requires extensive data simulation and computation, making it much more time-consuming than the CD approach, C++ code is used to speed up the process.
An object of class CDF
is a list
containing the following components:
nfact |
The number of factors to be retained. |
RF |
The trained random forest model. |
probability |
A 1 × nfact.max matrix of probabilities for factor numbers 1 through nfact.max, where the value in the f-th column is the probability that the number of factors underlying the response data is f. |
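Given such a probability matrix, the retained number of factors is the column with the highest probability. A minimal illustration with a made-up probability vector (these values are hypothetical, not real CDF output):

```r
## Hypothetical probabilities for nfact = 1..10 (illustrative values only)
probability <- matrix(c(0.02, 0.05, 0.10, 0.55, 0.15,
                        0.06, 0.03, 0.02, 0.01, 0.01), nrow = 1)
nfact <- which.max(probability[1, ])
nfact  # 4
```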
features |
A 1 × 181 matrix containing all the features used to determine the number of factors. See also extractor.feature.FF. |
Haijiang Qin <Haijiang133@outlook.com>
Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32. https://doi.org/10.1023/A:1010933404324
Goretzko, D., & Ruscio, J. (2024). The comparison data forest: A new comparison data approach to determine the number of factors in exploratory factor analysis. Behavior Research Methods, 56(3), 1838-1851. https://doi.org/10.3758/s13428-023-02122-4
Ruscio, J., & Roche, B. (2012). Determining the number of factors to retain in an exploratory factor analysis using comparison data of known factorial structure. Psychological Assessment, 24, 282–292. http://dx.doi.org/10.1037/a0025697.
GenData
library(EFAfactors)
set.seed(123)
## Take the data.bfi dataset as an example.
data(data.bfi)
response <- as.matrix(data.bfi[, 1:25]) ## loading data
response <- na.omit(response) ## Remove samples with NA/missing values
## Transform the scores of reverse-scored items to normal scoring
response[, c(1, 9, 10, 11, 12, 22, 25)] <- 6 - response[, c(1, 9, 10, 11, 12, 22, 25)] + 1
## Run CDF function with default parameters.
CDF.obj <- CDF(response)
print(CDF.obj)
## CDF plot
plot(CDF.obj)
## Get the nfact results.
nfact <- CDF.obj$nfact
print(nfact)
## Limit the maximum number of factors to 8, with populations set to 5000.
CDF.obj <- CDF(response, nfact.max=8, N.pop = 5000)
print(CDF.obj)
## CDF plot
plot(CDF.obj)
## Get the nfact results.
nfact <- CDF.obj$nfact
print(nfact)