GenData: Simulating Data Following John Ruscio's RGenData
In EFAfactors: Determining the Number of Factors in Exploratory Factor Analysis

GenData

R Documentation

Simulating Data Following John Ruscio's RGenData

Description

This function simulates data with nfact factors based on empirical data. It implements the data-simulation step used by the CD and CDF functions. Compared with GenDataPopulation, it uses C++ code to simulate the data faster.

Usage

GenData(
  response,
  nfact = 1,
  N.pop = 10000,
  Max.Trials = 5,
  lr = 1,
  cor.type = "pearson",
  use = "pairwise.complete.obs",
  isSort = FALSE
)

Arguments

`response`	A required `N` × `I` matrix or data.frame consisting of the responses of `N` individuals to `I` items.
`nfact`	The number of factors to extract in factor analysis. (default = 1)
`N.pop`	Size of the finite population to simulate. (default = 10,000)
`Max.Trials`	The maximum number of consecutive trials without obtaining a lower RMSR. (default = 5)
`lr`	The learning rate for updating the correlation matrix during iteration. (default = 1)
`cor.type`	A character string indicating which correlation coefficient (or covariance) is to be computed. One of "pearson" (default), "kendall", or "spearman". See cor.
`use`	An optional character string specifying a method for computing covariances in the presence of missing values. This must be one of the strings "everything", "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs" (default). See cor.
`isSort`	Logical, determines whether the simulated data needs to be sorted in descending order. (default = FALSE)
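
Because cor.type and use are described by reference to cor, a short base-R illustration may help show what these two settings control. The toy matrix x is made up for illustration and is not part of the package. The mapping of cor.type to the method argument of cor is an assumption based on the argument descriptions above.

```r
set.seed(42)
x <- matrix(sample(1:5, 60, replace = TRUE), ncol = 3)  # toy responses: 20 persons, 3 items
x[2, 1] <- NA                                           # one missing response in item 1

# `method` plays the role of cor.type; `use` plays the role of use
r.pairwise <- cor(x, method = "pearson",  use = "pairwise.complete.obs")
r.complete <- cor(x, method = "spearman", use = "complete.obs")
r.all      <- cor(x, method = "pearson",  use = "everything")

# With "everything", correlations involving the item with a missing value are NA
is.na(r.all[1, 2])
# With "pairwise.complete.obs", each pair of items uses all rows observed on both
is.na(r.pairwise[1, 2])
```

With the default "pairwise.complete.obs", different item pairs may be based on different numbers of respondents, which is worth keeping in mind when the empirical data contain many missing values.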

Details

The core idea of GenData is to take the correlation matrix of the empirical data as a target and to iteratively adjust simulated data with nfact factors until its correlation matrix approximates that target. Any value in the simulated data must come from the empirical data. The specific steps of GenData are as follows:

(1)

Use the correlation matrix of the empirical data \mathbf{Y}_{emp} as the target, \mathbf{R}_{targ}.

(2)

Simulate scores for N.pop examinees on nfact factors using a multivariate standard normal distribution:

\mathbf{S}_{(N.pop \times nfact)} \sim \mathcal{N}(0, 1)

Simulate noise for N.pop examinees on I items:

\mathbf{U}_{(N.pop \times I)} \sim \mathcal{N}(0, 1)

(3)

Initialize \mathbf{R}_{temp} = \mathbf{R}_{targ}, and set the minimum Root Mean Square Residual RMSR_{min} = \text{Inf}. Start the iteration process.

(4)

Extract nfact factors from \mathbf{R}_{temp}, and obtain the factor loadings matrix \mathbf{L}_{shar}. Ensure that the first element of \mathbf{L}_{shar} is positive to standardize the direction.

(5)

Calculate the unique factor matrix \mathbf{L}_{uniq, (I \times 1)}:

L_{uniq,i} = \sqrt{1 - \sum_{j=1}^{nfact} L_{shar, i, j}^2}

(6)

Calculate the simulated data \mathbf{Y}_{sim}:

Y_{sim, i, j} = \mathbf{S}_{i} \mathbf{L}_{shar, j}^T + U_{i, j} L_{uniq, j}

where i indexes examinees and j indexes items.

(7)

Compute the correlation matrix of the simulated data, \mathbf{R}_{simu}.

(8)

Calculate the residual correlation matrix \mathbf{R}_{resi} between the target matrix \mathbf{R}_{targ} and the simulated data's correlation matrix \mathbf{R}_{simu}:

\mathbf{R}_{resi} = \mathbf{R}_{targ} - \mathbf{R}_{simu}

(9)

Calculate the current RMSR:

RMSR_{cur} = \sqrt{\frac{\sum_{i < j} \mathbf{R}_{resi, i, j}^2}{0.5 \times (I^2 - I)}}

(10)

If RMSR_{cur} < RMSR_{min}, update \mathbf{R}_{temp} = \mathbf{R}_{temp} + lr \times \mathbf{R}_{resi}, RMSR_{min} = RMSR_{cur}, set \mathbf{R}_{min, resi} = \mathbf{R}_{resi}, and reset the count of consecutive trials without improvement cou = 0. If RMSR_{cur} \geq RMSR_{min}, update \mathbf{R}_{temp} = \mathbf{R}_{temp} + 0.5 \times cou \times lr \times \mathbf{R}_{min, resi} and increment cou = cou + 1.

(11)

Repeat steps (4) through (10) until cou \geq Max.Trials.

The iteration is implemented in C++ for speed.
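
The steps above can be sketched in plain R to make the control flow concrete. This is a minimal sketch under stated assumptions, not the package's C++ implementation. The function name gen_data_sketch is hypothetical. Factors are extracted with a simple eigendecomposition, because the extraction method is not specified above. Simulated scores are mapped by rank onto resampled empirical values, which is one assumed way to ensure that every simulated value comes from the empirical data. Returning the data from the best iteration is also an assumption. The sketch expects complete data (no missing values).

```r
# Illustrative sketch only; names and several details are assumptions.
gen_data_sketch <- function(response, nfact = 1, N.pop = 1000,
                            Max.Trials = 5, lr = 1) {
  response <- as.matrix(response)
  I <- ncol(response)
  R.targ <- cor(response)                              # step (1): target matrix
  S <- matrix(rnorm(N.pop * nfact), N.pop, nfact)      # step (2): factor scores
  U <- matrix(rnorm(N.pop * I), N.pop, I)              #           unique noise
  # Assumption: sorted bootstrap pool per item, used for rank matching below
  pool <- apply(response, 2,
                function(x) sort(sample(x, N.pop, replace = TRUE)))
  R.temp <- R.targ; RMSR.min <- Inf; cou <- 0          # step (3)
  R.min.resi <- NULL; best <- NULL
  while (cou < Max.Trials) {                           # step (11)
    # step (4): assumed extraction via eigendecomposition of R.temp
    e <- eigen(R.temp, symmetric = TRUE)
    L.shar <- e$vectors[, 1:nfact, drop = FALSE] %*%
      diag(sqrt(pmax(e$values[1:nfact], 0)), nfact)
    if (L.shar[1, 1] < 0) L.shar <- -L.shar            # fix the direction
    # step (5): unique loadings, clipped at zero for numerical safety
    L.uniq <- sqrt(pmax(1 - rowSums(L.shar^2), 0))
    # step (6): shared part plus unique part, item by item
    Y <- S %*% t(L.shar) + sweep(U, 2, L.uniq, "*")
    # Assumption: replace each score by the empirical value of the same rank
    for (j in seq_len(I)) {
      Y[, j] <- pool[rank(Y[, j], ties.method = "first"), j]
    }
    R.simu <- cor(Y)                                   # step (7)
    R.resi <- R.targ - R.simu                          # step (8)
    RMSR.cur <- sqrt(sum(R.resi[upper.tri(R.resi)]^2) /
                       (0.5 * (I^2 - I)))              # step (9)
    if (RMSR.cur < RMSR.min) {                         # step (10): improvement
      R.temp <- R.temp + lr * R.resi
      RMSR.min <- RMSR.cur
      R.min.resi <- R.resi
      best <- Y
      cou <- 0
    } else {                                           # step (10): no improvement
      R.temp <- R.temp + 0.5 * cou * lr * R.min.resi
      cou <- cou + 1
    }
  }
  list(data = best, rmsr = RMSR.min)
}

# Toy one-factor data (made up for illustration)
set.seed(1)
f <- rnorm(300)
toy <- sapply(1:6, function(i) round(0.7 * f + rnorm(300, sd = 0.7)))
out <- gen_data_sketch(toy, nfact = 1, N.pop = 2000)
dim(out$data)
all(out$data[, 1] %in% toy[, 1])
```

The loop stops once Max.Trials consecutive iterations fail to lower the RMSR, so a larger Max.Trials trades running time for a potentially closer match to the target correlation matrix.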

Value

An N.pop × I matrix containing the simulated data.

References

Ruscio, J., & Roche, B. (2012). Determining the number of factors to retain in an exploratory factor analysis using comparison data of known factorial structure. Psychological Assessment, 24, 282–292. http://dx.doi.org/10.1037/a0025697.

Examples

library(EFAfactors)
set.seed(123)

##Take the data.bfi dataset as an example.
data(data.bfi)

response <- as.matrix(data.bfi[, 1:25]) ## loading data
response <- na.omit(response) ## Remove samples with NA/missing values

## Reverse-score the negatively keyed items so that all items are scored in the same direction
response[, c(1, 9, 10, 11, 12, 22, 25)] <- 6 - response[, c(1, 9, 10, 11, 12, 22, 25)] + 1

data.simulated <- GenData(response, nfact = 1, N.pop = 10000)
head(data.simulated)
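
To judge how closely simulated data reproduce the empirical correlations, the RMSR defined in step (9) of the Details can be computed directly. The helper rmsr below is hypothetical and not part of EFAfactors. It is demonstrated on two made-up correlation matrices so that it runs without the package.

```r
# Hypothetical helper: root mean square residual between two correlation matrices
rmsr <- function(R1, R2) {
  resi <- R1 - R2
  # mean over the I * (I - 1) / 2 off-diagonal pairs with i < j
  sqrt(mean(resi[upper.tri(resi)]^2))
}

# Made-up 2 x 2 correlation matrices for illustration
R.a <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)
R.b <- matrix(c(1, 0.3, 0.3, 1), nrow = 2)
rmsr(R.a, R.b)
```

With the objects from the example above, the same helper could be applied as rmsr(cor(response), cor(data.simulated)).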

EFAfactors documentation built on June 10, 2025, 9:11 a.m.