knitr::opts_chunk$set(
    echo = TRUE,
    message = FALSE,
    warning = FALSE
)
options(knitr.kable.NA = '')

# necessary to render the tutorial correctly
library(learnr) 
library(htmltools)
# tidyverse
library(dplyr)
library(ggplot2)
# non-tidyverse
library(correlation)
library(GPArotation)
library(knitr)
library(parameters)
library(psych)

source("./www/discovr_helpers.R")

# knitr::write_bib(c('forcats'), file = 'packages.bib')


# Read data files needed for the tutorial

raq_tib <- discovr::raq
raq_items_tib <- raq_tib |> dplyr::select(-id)

raq_poly <- psych::polychoric(raq_items_tib)
raq_cor <- raq_poly$rho
raq_fa <- raq_items_tib |> psych::fa(nfactors = 4, scores = "tenBerge", cor = "poly")
# Create bib file for R packages
here::here("inst/tutorials/discovr_18/packages.bib") |>
  knitr::write_bib(c('here', 'tidyverse', 'dplyr', 'readr', 'forcats', 'tibble', 'knitr', 'correlation', 'GPArotation', 'Hmisc', 'parameters', 'psych'), file = _)

discovr: Exploratory Factor Analysis (EFA)

Overview

[Image: discovr package hex sticker, female space pirate with gun; the gunsmoke forms the letter R.]

**Usage:** This tutorial accompanies [Discovering Statistics Using R and RStudio](https://www.discovr.rocks/) [@field_discovering_2023] by [Andy Field](https://en.wikipedia.org/wiki/Andy_Field_(academic)). It contains material from the book so there are some copyright considerations, but I offer it under a [Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License](http://creativecommons.org/licenses/by-nc-nd/4.0/). Tl;dr: you can use this tutorial for teaching and non-profit activities but please don't meddle with it or claim it as your own work.

r cat_space(fill = blu) Welcome to the discovr space pirate academy

Hi, welcome to discovr space pirate academy. Well done on embarking on this brave mission to planet r rproj()s, which is a bit like Mars, but less red and more hostile. That's right, more hostile than a planet without water. Fear not though: the fact you are here means that you can master r rproj(), and before you know it you'll be as brilliant as our pirate leader Mae Jemstone (she's the badass with the gun). I am the space cat, and I will pop up to offer you tips along your journey.

On your way you will face many challenges, but follow Mae's system to keep yourself on track:

It's not just me that's here to help though, you will meet other characters along the way:

Also, use hints and solutions to guide you through the exercises (Figure 1).

Each codebox has a hints or solution button that activates a popup window containing code and text to guide you through each exercise.
Figure 1: In a code exercise click the hints button to guide you through the exercise.

Bye for now and good luck - you'll be amazing!

Workflow

Packages

This tutorial uses the following packages: correlation [@R-correlation], GPArotation [@R-GPArotation], knitr [@R-knitr], parameters [@R-parameters] and psych [@R-psych].

It also uses these tidyverse packages [@R-tidyverse; @tidyverse2019]: dplyr [@R-dplyr] and readr [@R-readr].

Coding style

There are (broadly) two styles of coding:

  1. Explicit: Using this style you declare the package when using a function: package::function(). For example, if I want to use the mutate() function from the package dplyr, I will type dplyr::mutate(). If you adopt an explicit style, you don't need to load packages at the start of your Quarto document (although see below for some exceptions).

  2. Concise: Using this style you load all of the packages at the start of your Quarto document using library(package_name), and then refer to functions without their package. For example, if I want to use the mutate() function from the package dplyr, I will use library(dplyr) in my first code chunk and type the function as mutate() when I use it subsequently.

Coding style is a personal choice. The Google r rproj() style guide and tidyverse style guide recommend an explicit style, and I use it in teaching materials for two reasons: (1) it helps you to remember which functions come from which packages, and (2) it prevents clashes resulting from using functions from different packages that have the same name. However, even with this style it makes sense to load tidyverse because the dplyr and ggplot2 packages contain functions that are often used within other functions, and in these cases explicit code is difficult to read. Also, no-one wants to write ggplot2:: before every function from ggplot2.
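For example, here's the same operation written in both styles (a minimal sketch using the built-in [mtcars]{.alt} data rather than anything from this tutorial):

# Explicit style: declare the package each time; no library() call needed
mtcars |> dplyr::mutate(kpl = 0.425 * mpg)

# Concise style: load the package once, then use bare function names
library(dplyr)
mtcars |> mutate(kpl = 0.425 * mpg)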

You can use either style in this tutorial because all packages are pre-loaded. If working outside of the tutorial, load the tidyverse package (and any others if you're using a concise style) at the beginning of your Quarto document:

library(tidyverse)

Data

To work outside of this tutorial you need to download the following data file:

  * raq.csv

Set up an r rstudio() project in the way that I recommend in this tutorial, and save the data files to the folder within your project called [data]{.alt}. Place this code in the first code chunk in your Quarto document:

raq_tib <- here::here("data/raq.csv") |>
  readr::read_csv()

This code reads the data into an object called raq_tib.

r bmu() An r rproj()-induced anxiety example [(2)]{.alt}

Factor analysis is used frequently to develop questionnaires. A lot of students become very stressed about r rproj(). Imagine that I wanted to design a questionnaire to measure a trait that I termed 'r rproj() anxiety'. I devised a questionnaire to measure various aspects of students' anxiety towards learning r rproj() and statistics: the RAQ (r rproj() Anxiety Questionnaire). I generated questions based on interviews with anxious and non-anxious students and came up with 23 possible questions to include. Each question was a statement followed by a five-point Likert scale: strongly disagree = 1, disagree = 2, neither agree nor disagree = 3, agree = 4 and strongly agree = 5. I wanted to know whether anxiety about r rproj() could be broken down into specific forms of anxiety. In other words, what latent variables contribute to anxiety about learning or using r rproj()?

The questions in the RAQ are as follows (each has responses on the five-point scale from strongly disagree to strongly agree):

  1. raq_01: Statistics make me cry
  2. raq_02: My friends will think I'm stupid for not being able to cope with r rproj()
  3. raq_03: Standard deviations excite me
  4. raq_04: I dream that Pearson is attacking me with correlation coefficients
  5. raq_05: I don't understand statistics
  6. raq_06: I have little experience of computers
  7. raq_07: All computers hate me
  8. raq_08: I have never been good at mathematics
  9. raq_09: My friends are better at statistics than me
  10. raq_10: Computers are useful only for playing games
  11. raq_11: I did badly at mathematics at school
  12. raq_12: People try to tell you that r rproj() makes statistics easier to understand but it doesn't
  13. raq_13: I worry that I will cause irreparable damage because of my incompetence with computers
  14. raq_14: Computers have minds of their own and deliberately go wrong whenever I use them
  15. raq_15: Computers are out to get me
  16. raq_16: I weep openly at the mention of central tendency
  17. raq_17: I slip into a coma whenever I see an equation
  18. raq_18: R always crashes when I try to use it
  19. raq_19: Everybody looks at me when I use r rproj()
  20. raq_20: I can't sleep for thoughts of eigenvectors
  21. raq_21: I wake up under my duvet thinking that I am trapped under a normal distribution
  22. raq_22: My friends are better at r rproj() than I am
  23. raq_23: If I am good at statistics people will think I am a nerd

With a little help from a few lecturer friends I collected 2571 completed questionnaires. The data are in [raq_tib]{.alt}.

r alien() Alien coding challenge

View the data in [raq_tib]{.alt}.


raq_tib

Note that there are 24 variables: the participant id and the 23 RAQ items (raq_01 to raq_23).

r bmu() Pre-analysis [(2)]{.alt}

As we've seen, the data has a variable id containing the participant ID. We won't want this variable in the analyses we do. We can use select() to remove it each time we don't need it, but it will save a lot of repetitious code if we store a version of the data that contains only the scores from the RAQ items.

r alien() Alien coding challenge

Use the code box to create an object called [raq_items_tib]{.alt} that contains only the questionnaire item variables. Inspect this new object and note that the [id]{.alt} variable has gone.


raq_items_tib <- raq_tib |> 
  dplyr::select(-id)
raq_items_tib

We will use this new object, [raq_items_tib]{.alt}, for the remainder of the tutorial.

r bmu() Inspect the correlation matrix [(2)]{.alt}

The first thing to do is to look at the correlations between RAQ items.

r alien() Alien coding challenge

Using what you've learnt in previous tutorials, produce a correlation matrix for the 23 RAQ items.

`r cat_space()` **Tip: Correlation memory jog**

* We can use `correlation::correlation()` to get the correlations, and `summary()` to present them as a matrix/grid.
* You need to de-select the [id]{.alt} variable before computing the correlations; we can do this by using the [raq_items_tib]{.alt} version of the data that we just created.
* You can round values by piping into `kable()` in the usual way.

# feed the tibble with only the RAQ items into the correlation() function
correlation::correlation(raq_items_tib)
# pipe into summary() to get a condensed table of correlations:
correlation::correlation(raq_items_tib) |>
  summary() |> 
  knitr::kable(digits = 2)

That was great practice, but because our items use Likert-style scales it is more appropriate to look at something called the [polychoric correlation]{.alt} rather than the Pearson correlation coefficient. We can obtain polychoric correlations by placing the tibble containing our data into the polychoric() function from the psych package. For example,

raq_poly <- psych::polychoric(raq_items_tib)

This code takes the data from the RAQ (raq_items_tib), puts it through the polychoric() function to get the correlation coefficients and stores them in an object called raq_poly. This object contains information other than the correlation coefficients. The matrix of correlation coefficients is stored in a variable called rho, so we can access it using raq_poly$rho. However, we need to use this matrix at various points so it's helpful to store it as its own object.

raq_cor <- raq_poly$rho

This code stores the polychoric correlation matrix as an object called raq_cor. We view these correlations by executing the name of the object, but it's helpful to round them to 2 decimal places using the kable() function that we have met before.

r alien() Alien coding challenge

Create an object called [raq_cor]{.alt} that contains the polychoric correlations for the 23 RAQ items. View it using kable() to round values to 2 decimal places.


# create the polychoric correlation object
raq_poly <- psych::polychoric(raq_items_tib)
# save the correlations from the object you just created
raq_cor <- raq_poly$rho
# view the correlations rounded to 2dp:
knitr::kable(raq_cor, digits = 2)
# put it all together
raq_poly <- psych::polychoric(raq_items_tib)
raq_cor <- raq_poly$rho
knitr::kable(raq_cor, digits = 2)

Staring at all of those numbers can make your eyes hurt. We can instead produce a heat map of correlations using the cor.plot() function from the psych package. This function plots a grid of the correlations and colours each cell by the strength and direction of the correlation. We can get a sense of the relationships in the data by looking for very lightly shaded cells (correlations close to zero) and very darkly shaded cells (correlations close to 1 or -1). Given that correlation matrices are symmetrical, the information below the diagonal is identical to the information above it. It is, therefore, helpful to use [upper = FALSE]{.alt} to hide the information above the diagonal.

r robot() Code example

To use the cor.plot() function we feed the correlation matrix we just created into it.

psych::cor.plot(raq_cor, upper = FALSE)

r alien() Alien coding challenge

Use the code box to visualise the correlation matrix using cor.plot().


psych::cor.plot(raq_cor, upper = FALSE)

We can visually scan the correlation matrix and look for correlations that fall between about -0.3 and 0.3. You're looking for any variables that have lots of correlations that are close to 0 (no relationship). There's nothing magic about the value 0.3; it's a fairly arbitrary and inclusive heuristic to allow you to narrow your focus! Also note correlations greater than (roughly) ±0.9 because these might indicate collinear or singular items.
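If you'd rather not rely on eyeballing alone, here's an optional sketch (my own addition, using base r rproj() functions) that counts, for each item, how many correlations exceed ±0.3 and flags any pairs beyond ±0.9:

# work on a copy so that raq_cor itself is unchanged
r_check <- raq_cor
diag(r_check) <- NA # ignore each item's correlation with itself

# for each item, how many correlations are larger than +/- 0.3?
sort(rowSums(abs(r_check) > 0.3, na.rm = TRUE))

# any correlations beyond +/- 0.9 (possible collinearity or singularity)?
which(abs(r_check) > 0.9, arr.ind = TRUE)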

All questions in the RAQ correlate reasonably well with all others and none of the correlation coefficients are excessively large; therefore, we won't eliminate any questions based on the size of the correlations.

r bmu() Bartlett's test and the KMO test [(2)]{.alt}

Another way to inspect the correlation matrix is with Bartlett's test of sphericity, which tests whether the correlation matrix is significantly different from an identity matrix (i.e. it tests whether the correlation coefficients are all 0). It tests a pretty extreme scenario and, because significance depends on sample size (and in factor analysis sample sizes tend to be very large), the test will nearly always be significant. So it's not a particularly useful test, but it's worth doing because in the unlikely event that it is non-significant you likely have a big problem on your hands.

r robot() Code example

We can implement Bartlett's test using the psych::cortest.bartlett() function. To use it, we pipe the correlation matrix ([raq_cor]{.alt}) into it and specify the sample size using, for these data, [n = 2571]{.alt}. For other data sets change the value of 2571 to be the sample size.

psych::cortest.bartlett(raq_cor, n = 2571)

r alien() Alien coding challenge

Use the code box to get Bartlett's test.


psych::cortest.bartlett(raq_cor, n = 2571)
bartlett <- psych::cortest.bartlett(raq_cor, n = 2571)

As expected, given the huge sample size, Bartlett's test is highly significant, $\chi^2$(r bartlett$df) = r sprintf("%.2f", bartlett$chisq), p = r sprintf("%.2f", bartlett$p.value). This significant value only really tells us that we don't have a massive problem, which is nice to know, I suppose.

You might also want to check the sampling adequacy (spoiler: with 2571 participants I think we're good). The Kaiser–Meyer–Olkin measure of sampling adequacy (KMO) varies between 0 and 1, with a value of 0 indicating that factor analysis is likely to be inappropriate. A value close to 1 indicates that patterns of correlations are relatively compact and so factor analysis should yield distinct and reliable factors. Kaiser and Rice (1974) provided appealing guidelines, especially if you like the letter M:

  * 0.5 to 0.6: miserable
  * 0.6 to 0.7: mediocre
  * 0.7 to 0.8: middling
  * 0.8 to 0.9: meritorious
  * 0.9 and above: marvellous

Values smaller than 0.5 should lead you either to collect more data or to rethink which variables to include.

r robot() Code example

We can implement the KMO test using the psych::KMO() function. To use it, we again feed the correlation matrix ([raq_cor]{.alt}) into it. For example:

psych::KMO(raq_cor)

r alien() Alien coding challenge

Use the code box to get the KMO test.


psych::KMO(raq_cor)
kmo <- psych::KMO(raq_cor)

The KMO statistic ([Overall MSA]{.alt}) is r sprintf("%.2f", kmo$MSA), which is well above the minimum criterion of 0.5 and falls into the range of marvellous. The KMO values for individual variables range from r sprintf("%.2f", min(kmo$MSAi)) to r sprintf("%.2f", max(kmo$MSAi)). All values are, therefore, well above 0.5, which is good news.

`r info()` **Information** If you find variables with KMO values below 0.5 then consider excluding them from the analysis (or run the analysis with and without that variable and note the difference). Removal of a variable affects the KMO statistics, so if you plan to remove a variable be sure to re-examine the KMO for the correlation matrix that has the variable removed.
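None of the RAQ items fall below 0.5, but if you ever need to act on this advice, a sketch like the following (using the [kmo]{.alt} object and its per-item values in [kmo$MSAi]{.alt}, as above) will identify offending items and re-examine the KMO without them:

# items with individual KMO (MSA) values below 0.5, if any
low_items <- names(which(kmo$MSAi < 0.5))
low_items

# re-examine the KMO with those items removed
keep <- setdiff(colnames(raq_cor), low_items)
psych::KMO(raq_cor[keep, keep])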

r bmu() Parallel analysis [(2)]{.alt}

To determine how many factors we should extract, we can use the psych::fa.parallel() function to run a parallel analysis. This function has a lot of arguments; the ones most likely to be of use to you are [fa]{.alt} (whether to base the eigenvalues on factor analysis, "fa", principal components, "pc", or "both"), [cor]{.alt} (the type of correlation to use, for example "poly" for polychoric), [n.obs]{.alt} (the sample size, which you must supply if you feed in a correlation matrix rather than raw data) and [n.iter]{.alt} (the number of simulated data sets to compare against).

r robot() Code example

You can feed your tibble containing the item data from the questionnaire into the fa.parallel() function. All of the default settings are fine for our purposes except that we need to use polychoric correlations. We do this by including the [cor = "poly"]{.alt} argument.

psych::fa.parallel(raq_items_tib, cor = "poly")

Alternatively, given that we have already computed the polychoric correlations and stored them in [raq_cor]{.alt}, it's faster to apply the function directly to this correlation matrix and specify the sample size with [n.obs = 2571]{.alt}.

psych::fa.parallel(raq_cor, n.obs = 2571)

r alien() Alien coding challenge

Use the code box to run a parallel analysis with the default settings but specifying [fa = "fa"]{.alt}. How many factors should be extracted?


# Choose one of
# Using the raw data
psych::fa.parallel(raq_items_tib, fa = "fa", cor = "poly")

# Using the correlation matrix (faster)
psych::fa.parallel(raq_cor, n.obs = 2571, fa = "fa")

In a parallel analysis each eigenvalue (which represents the size of the factor) is compared against an eigenvalue for the corresponding factor in many randomly generated data sets that have the same characteristics as the data being analysed. In doing so, each eigenvalue is compared to an eigenvalue from a data set that has no underlying factors. This is a bit like asking whether our observed factor is bigger than a non-existent factor. Factors that are bigger than their 'random' counterparts are retained. In the scree plot, each factor is plotted on the x-axis with its corresponding eigenvalue on the y-axis. The observed eigenvalues are shown as blue triangles connected by a blue line to form the characteristic 'scree' shape. The red dotted line shows the corresponding line from the randomly generated data sets. Given that we want to keep factors with eigenvalues larger than we'd expect in data that has no underlying factors, we keep factors whose triangles (eigenvalues) fall above the red line. In this case, the first four factors have eigenvalues larger than their counterparts from random data. The output also contains a handy message confirming that we should extract 4 factors.
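The function also returns its results, so you can store them and query the suggested number of factors directly rather than reading it from the text output (a sketch; [nfact]{.alt} and [fa.values]{.alt} are components of the object that fa.parallel() returns):

pa <- psych::fa.parallel(raq_cor, n.obs = 2571, fa = "fa")
pa$nfact # suggested number of factors
pa$fa.values # the observed eigenvalues plotted as blue triangles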

r alien() Alien coding challenge

Repeat the parallel analysis but this time using principal components to compute the eigenvalues.


# Choose one of
# Using the raw data
psych::fa.parallel(raq_items_tib, fa = "pc", cor = "poly")

# Using the correlation matrix (faster)
psych::fa.parallel(raq_cor, n.obs = 2571, fa = "pc")
quiz(caption = "Parallel analysis quiz",
     question("Based on the parallel analysis that used principal components to compute the eigenvalues, how many factors should be extracted?",
    answer("4", correct = TRUE, message = "Yes, fortunately this analysis agrees with the parallel analysis based on eigenvalues from factor analysis."),
    answer("5", message = "Look at how many components have eigenvalues above the red line."),
    answer("3", message = "Look at how many components have eigenvalues above the red line."),
    answer("2", message = "Look at how many components have eigenvalues above the red line."),
    random_answer_order = TRUE,
    allow_retry = TRUE
    )
)

r bmu() Factor analysis [(2)]{.alt}

We can now run the factor analysis, extracting 4 factors, using the psych::fa() function. This function has this general format (I'm including only the arguments you're most likely to use):

my_fa_object <- psych::fa(r,
                          nfactors = 1,
                          fm = "minres",
                          rotate = "oblimin",
                          scores = "regression",
                          max.iter = 50,
                          use = "pairwise",
                          cor = "cor"
                          )

In which you replace [my_fa_object]{.alt} with the name you want to give your factor analysis object.

`r cat_space()` **Tip: factor rotation**

To do factor rotation we need to have the `GPArotation` package loaded. Make sure that you have it installed by executing the following in the console:

install.packages("GPArotation")

Once installed, make sure that your Quarto document includes (if you follow my advice you'll do this in the code chunk where you set up your Quarto document):

library(GPArotation)

r robot() Code example

As with parallel analysis we can feed the raw data into the function (remembering to set [cor = "poly"]{.alt} so that the analysis is based on the polychoric correlations), or we can feed in the correlation matrix and sample size ([n.obs = 2571]{.alt}).

For the RAQ data, we could use the following code, which leaves a lot of the default settings alone, to create an object [raq_fa]{.alt}:

raq_fa <- psych::fa(raq_items_tib,
                    nfactors = 4,
                    scores = "tenBerge",
                    cor = "poly"
                    )

Using the correlation matrix instead of the raw data we'd get:

raq_fa <- psych::fa(raq_cor,
                    n.obs = 2571,
                    nfactors = 4,
                    scores = "tenBerge"
                    ) 

r alien() Alien coding challenge

Use the code box to create the object [raq_fa]{.alt} using the code above and then inspect it.


# Use either (but not both)
# From the raw data
raq_fa <- psych::fa(raq_items_tib, nfactors = 4, scores = "tenBerge", cor = "poly")
raq_fa
# From the correlation matrix
raq_fa <- psych::fa(raq_cor, n.obs = 2571, nfactors = 4, scores = "tenBerge") 
raq_fa
raq_fa <- psych::fa(raq_cor, n.obs = 2571, nfactors = 4, scores = "tenBerge") 

There's a lot of output. First, we are shown the pattern matrix, but we'll return to this. Note that the factors are labelled [MR1]{.alt}, [MR2]{.alt}, [MR3]{.alt} and [MR4]{.alt} (MR because the default estimation method is minimum residual). We are also given some information about how much variance each factor accounts for.

raq_fa$Vaccounted |>
  knitr::kable(digits = 2)
va <- raq_fa$Vaccounted

prop_var <- sprintf("%.2f", va[2, ])
perc_var <- sprintf("%.0f", 100*va[2, ])
cum_var <- sprintf("%.2f", va[3, ])
cum_perc_var <- sprintf("%.0f", 100*va[3, ])
pe <- sprintf("%.2f", va[4, ])
pe_perc <- sprintf("%.0f", 100*va[4, ])

We see, for example, from [Proportion Var]{.alt} that [MR1]{.alt} accounts for r prop_var[1] of the overall variance (r perc_var[1]%), [MR2]{.alt} accounts for r prop_var[2] of the variance (r perc_var[2]%) and so on. The [Cumulative Var]{.alt} is the proportion of variance explained cumulatively by the factors. So, cumulatively, [MR1]{.alt} accounts for r prop_var[1] of the overall variance (r perc_var[1]%) and [MR1]{.alt} and [MR2]{.alt} together account for r prop_var[1] + r prop_var[2] = r cum_var[2] of the variance (r cum_perc_var[2]%). Importantly, we can see that all four factors in combination explain r cum_var[4] of the overall variance (r cum_perc_var[4]%).

The [Proportion Explained]{.alt} is the proportion of the explained variance that is attributable to each factor. So, of the r cum_perc_var[4]% of variance accounted for, r pe[1] (r pe_perc[1]%) is attributable to [MR1]{.alt}, r pe[2] (r pe_perc[2]%) to [MR2]{.alt}, r pe[3] (r pe_perc[3]%) to [MR4]{.alt} and r pe[4] (r pe_perc[4]%) to [MR3]{.alt}.

The correlations between factors are also displayed:

raq_fa$Phi |>
  knitr::kable(digits = 2)

These are all non-zero, indicating that the factors are correlated (and that oblique rotation was appropriate). It also tells us the degree to which factors are correlated. All of the factors positively, and fairly strongly, correlate with each other. In other words, the latent constructs represented by the factors are related.

There are several fit indices that tell us how well the model fits the data.

`r info()` **Fit indices**

Good fit is (probably) indicated by:

* A combination of TLI > 0.96 and SRMR (RMSR in the output) < 0.06
* A combination of RMSEA < 0.05 and SRMR < 0.09

The TLI is `r sprintf("%.2f", raq_fa$TLI)`, which is greater than 0.96, and RMSR is `r sprintf("%.2f", raq_fa$rms)`, which is smaller than both 0.09 and 0.06. Furthermore, RMSEA is `r sprintf("%.3f", raq_fa$RMSEA[1])`, which is less than 0.05. With the caveat that universal cut-offs need to be taken with a pinch of salt, it's reasonable to conclude that the model has excellent fit.

To interpret the factor analysis we look at the factor loadings for each question on each factor to see which items load most heavily onto which factors. The factor loadings are shown in the pattern matrix that forms part of the main output, but this is quite difficult to interpret in its raw form. To make life easier we can use the parameters::model_parameters() function, which provides an easy way to sort the items by their factor loadings and suppress factor loadings below a certain value (which can help you to see the wood for the trees). In general it takes the form:

parameters::model_parameters(my_fa_object, sort = TRUE, threshold = "max")

In which [my_fa_object]{.alt} is the factor analysis object containing the factor loadings. Setting [sort = TRUE]{.alt} sorts the items by their factor loadings, and [threshold]{.alt} controls which loadings are displayed. By default the threshold is the maximum loading (i.e. for each item only its largest loading is shown). To see all of the factor loadings set [threshold = NULL]{.alt}; to see only loadings above a particular value (between 0 and 1) set [threshold]{.alt} to that value. For example, [threshold = 0.2]{.alt} displays only factor loadings above 0.2.
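For example, to compare the default with the full set of loadings (using the [raq_fa]{.alt} object we created earlier):

# default: show only each item's largest loading
parameters::model_parameters(raq_fa, sort = TRUE, threshold = "max")

# show every loading
parameters::model_parameters(raq_fa, sort = TRUE, threshold = NULL)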

r alien() Alien coding challenge

Use the code box to sort the factor loadings and display only those greater than 0.2. Round to 2 decimal places using kable(). (Remember that earlier we saved the factor analysis object as [raq_fa]{.alt}.)


parameters::model_parameters(raq_fa, sort = TRUE, threshold = 0.2) |> 
  knitr::kable(digits = 2)

This makes it much easier to see the patterns in the questions that load onto the same factors. The questions that load highly on [MR1]{.alt} seem to be items that relate to Fear of computers:

  * raq_05: I don't understand statistics
  * raq_06: I have little experience of computers
  * raq_07: All computers hate me
  * raq_10: Computers are useful only for playing games
  * raq_13: I worry that I will cause irreparable damage because of my incompetence with computers
  * raq_14: Computers have minds of their own and deliberately go wrong whenever I use them
  * raq_15: Computers are out to get me
  * raq_18: R always crashes when I try to use it

Note that item 5 also loads highly onto [MR4]{.alt}.

The questions that load highly on [MR2]{.alt} seem to be items that relate to Fear of peer/social evaluation:

  * raq_02: My friends will think I'm stupid for not being able to cope with r rproj()
  * raq_09: My friends are better at statistics than me
  * raq_19: Everybody looks at me when I use r rproj()
  * raq_22: My friends are better at r rproj() than I am
  * raq_23: If I am good at statistics people will think I am a nerd

The questions that load highly on [MR4]{.alt} seem to be items that relate to Fear of statistics:

  * raq_01: Statistics make me cry
  * raq_03: Standard deviations excite me
  * raq_04: I dream that Pearson is attacking me with correlation coefficients
  * raq_12: People try to tell you that r rproj() makes statistics easier to understand but it doesn't
  * raq_16: I weep openly at the mention of central tendency
  * raq_20: I can't sleep for thoughts of eigenvectors
  * raq_21: I wake up under my duvet thinking that I am trapped under a normal distribution

The questions that load highly on [MR3]{.alt} seem to be items that relate to Fear of mathematics:

  * raq_08: I have never been good at mathematics
  * raq_11: I did badly at mathematics at school
  * raq_17: I slip into a coma whenever I see an equation

This analysis seems to reveal that the questionnaire is composed of four subscales: fear of statistics, fear of computers, fear of maths and fear of negative peer evaluation. There are two possibilities here. The first is that the RAQ failed to measure what it set out to (namely, r rproj() anxiety) but instead measures related constructs. The second is that these four constructs are sub-components of r rproj() anxiety; however, the factor analysis does not indicate which of these possibilities is true.

r bmu() Reliability analysis [(2)]{.alt}

r bmu() McDonald's $\omega_t$ and $\omega_h$ [(2)]{.alt}

If the items are all scored in the same direction then we can select the variables on a particular subscale and pipe them into the omega() function from the psych package. This function has the following general format:

my_omg <- psych::omega(
    my_tibble,
    nfactors = 1,
    fm = "minres",
    key = c(1, 1, -1, 1, 1),
    rotate = "oblimin",
    poly = FALSE
  )

In which [my_omg]{.alt} is the name you want to give to the object that stores the results, and [my_tibble]{.alt} is the name of your data.

r robot() Code example

To compute omega you need to reconstruct the original factor analysis, so many of the arguments will be familiar and you set them to however they were set for your factor analysis. For example, we extracted 4 factors so we'd use [nfactors = 4]{.alt}; we used the minres method with oblimin rotation so we'd set [fm = "minres"]{.alt} and [rotate = "oblimin"]{.alt}; and because we used polychoric correlations we need to include [poly = TRUE]{.alt}.

The key argument allows us to reverse item scoring on the fly. We supply the key argument with a vector of 1s and −1s that is the same length as the number of variables being fed into omega(). For the RAQ we have 23 items, and one of them (raq_03) is reverse scored. Assuming we enter the items into omega() in order, we'd use a vector of 23 1s and assign the third one a minus sign:

key = c(1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
`r cat_space()` **Tip: using `rep()`**

The `rep()` function, which we've met before, can help us to specify the key argument. It takes the form:

rep(x, times)

On the RAQ, after item 3 all items are scored normally, meaning that we have 20 repetitions of 1. We can specify these with `rep(1, 20)` within our vector:

key = c(1, 1, -1, rep(1, 20))

r robot() Code example

Putting this all together we can get omega for the RAQ by executing

raq_omg <- psych::omega(raq_items_tib,
                        nfactors = 4,
                        fm = "minres",
                        key = c(1, 1, -1, rep(1, 20)),
                        poly = TRUE
                        )

This code puts the raw data (in [raq_items_tib]{.alt}) into the omega() function, replicates the arguments used for the initial factor analysis, reverse scores raq_03, and tells the function to compute polychoric correlations. The results are stored in an object that I've called [raq_omg]{.alt}. To view this object execute its name.

r alien() Alien coding challenge

Use the code box to run a reliability analysis on the RAQ.


raq_omg <- psych::omega(raq_items_tib,
                        nfactors = 4,
                        fm = "minres",
                        key = c(1, 1, -1, rep(1, 20)),
                        poly = TRUE
                        )
raq_omg 
raq_omg <- psych::omega(raq_items_tib,
                        nfactors = 4,
                        fm = "minres",
                        key = c(1, 1, -1, rep(1, 20)),
                        poly = TRUE, plot = F
                        )

The first thing to look at is the table of factor loadings (reproduced below). Note that raq_03 is helpfully labelled as [raq_03-]{.alt} (note the minus sign suffix) to indicate that it is reverse scored. The column labelled [g]{.alt} shows the loading of each item on the general factor. This column should be inspected before anything else because both $\omega_h$ and $\omega_t$ are based on a model with a common factor. If any items load poorly onto this general factor then a common factor model isn't appropriate. Also, remember that $\omega_h$ uses the communalities for each item with only the general factor, so it is based on these loadings. All items here have loadings with the general factor far enough away from zero that we might consider a general factor model to be appropriate. The columns labelled F1 to F4 show the factor loadings for each item on the four factors that we asked to be extracted. Note that the values don't match those from the main factor analysis (because this model also includes a general factor) but the patterns of item loadings to factors do.

raq_omg$schmid$sl |>
  as.data.frame() |> 
  dplyr::mutate(
    across(`F1*`:`F4*`, \(x) ifelse(abs(x) > 0.2, x, NA))
  ) |> 
  knitr::kable(digits = 2)

At the top of the text output are the reliability estimates for the general factor, which include Cronbach's $\alpha$ = r sprintf("%.2f", raq_omg$alpha), $\omega_h$ = r sprintf("%.2f", raq_omg$omega_h) and $\omega_t$ = r sprintf("%.2f", raq_omg$omega.tot). If you're into $\alpha$ then there's a section for you later on. $\omega_h$ is a gauge of the extent to which items reflect a single construct and this value (r sprintf("%.2f", raq_omg$omega_h)) suggests that to some extent they do, but there is still a lot of variance in the general factor that is unexplained. In fact, further down we're told that only about half (r sprintf("%.2f", raq_omg$ECV)) of the variance in the general factor is explained by the items. $\omega_t$ is the total reliability and r sprintf("%.2f", raq_omg$omega.tot) is a high value suggesting the scores are reliable (i.e., the scale is reliable in this sample).

Next we get two lots of model fit statistics. The first set are for the model that has four factors and repeat the information from the original EFA. The chi-square statistic is significant, $\chi^2$ = r sprintf("%.2f", raq_omg$schmid$STATISTIC), p < 0.001, which is a bad thing but unsurprising given the sample size; the RMSR = r sprintf("%.2f", raq_omg$schmid$rms) and the RMSEA = r sprintf("%.2f", raq_omg$schmid$RMSEA[1]) 90% CI [r sprintf("%.2f", raq_omg$schmid$RMSEA[2]), r sprintf("%.2f", raq_omg$schmid$RMSEA[3])]. Next, we get the same information but for a model that contains only the general factor (and not the four sub-factors). The fit gets worse, as shown by (1) a larger and more significant chi-square, $\chi^2$ = r sprintf("%.2f", raq_omg$gstats$STATISTIC), p < 0.001; (2) a larger RMSR = r sprintf("%.2f", raq_omg$gstats$rms); and (3) a larger RMSEA = r sprintf("%.2f", raq_omg$gstats$RMSEA[1]) 90% CI [r sprintf("%.2f", raq_omg$gstats$RMSEA[2]), r sprintf("%.2f", raq_omg$gstats$RMSEA[3])]. This tells us that the model that characterises the RAQ in terms of four factors is a better fit to the data than a model that characterises it as a single factor.

Finally, under Total, General and Subset omega for each subset we get the $\omega_t$ (labelled omega total) and $\omega_h$ (labelled Omega general) for the general factor (column [g]{.alt}) and for the subfactors. The values for the general factor repeat the information at the start of the output. The values for the sub-factors are particularly relevant for $\omega_t$ because it represents the total reliability of the scores, so these values tell us the total reliability of the scores from the underlying sub-scales. For anxiety related to computers (F1) we have $\omega_t$ = r sprintf("%.2f", raq_omg$omega.group$total[2]), for anxiety around peer or social evaluation (F2) $\omega_t$ = r sprintf("%.2f", raq_omg$omega.group$total[3]), for anxiety around statistics (F3) $\omega_t$ = r sprintf("%.2f", raq_omg$omega.group$total[4]), and for anxiety around maths (F4) $\omega_t$ = r sprintf("%.2f", raq_omg$omega.group$total[5]). Basically scores from each subscale are reliable but somewhat less so for anxiety around statistics.
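If you want this subscale table on its own, you can extract it from the omega object and tidy it with kable() (a sketch using the [omega.group]{.alt} component that the inline values above come from):

raq_omg$omega.group |> 
  knitr::kable(digits = 2)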

r bmu() Cronbach's $\alpha$ [(2)]{.alt}

There are lots of reasons not to use Cronbach's alpha (see the book for details). If you really must compute alpha then you need to compute it on the individual subscales. As a quick reminder, we had four subscales:

  * Fear of computers (raq_06, raq_07, raq_10, raq_13, raq_14, raq_15 and raq_18)
  * Fear of peer/social evaluation (raq_02, raq_09, raq_19, raq_22 and raq_23)
  * Fear of mathematics (raq_08, raq_11 and raq_17)
  * Fear of statistics (raq_01, raq_03, raq_04, raq_05, raq_12, raq_16, raq_20 and raq_21)

We pipe the variables for each subscale into the alpha() function from psych to get the conventional statistics that people who use alpha like to have.

r robot() Code example

For the fear of computers subscale we could execute (I'm excluding raq_05 because it makes more sense conceptually on the fear of statistics factor):

raq_tib |> 
  dplyr::select(raq_06, raq_07, raq_10, raq_13, raq_14, raq_15, raq_18) |> 
  psych::alpha()

r alien() Alien coding challenge

Use the code box to run a reliability analysis on the fear of computers subscale:


raq_tib |> 
  dplyr::select(raq_06, raq_07, raq_10, raq_13, raq_14, raq_15, raq_18) |> 
  psych::alpha()
r1 <- raq_tib |> 
  dplyr::select(raq_06, raq_07, raq_10, raq_13, raq_14, raq_15, raq_18) |> 
  psych::alpha()

get_alpha <- function(x, digits = 2){
  dp <- paste0("%.", digits, "f")
  a <- x$total$raw_alpha
  se <- x$total$ase

  ci_up <- sprintf(dp, a + 1.96*se)
  ci_lo <- sprintf(dp, a - 1.96*se)


  paste0("$\\alpha$", " = ", sprintf(dp, a), " [", ci_lo, ", ", ci_up, "]")  
}

First, and perhaps most important, the value of Alpha at the very top is Cronbach's $\alpha$, and we are given its 95% confidence interval below: we're looking for values in the range of .7 to .8 (or thereabouts). In this case, r get_alpha(r1), which probably indicates good reliability.

Next, we get a table giving the statistics for the scale if we deleted each item in turn. The values in the column labelled [raw_alpha]{.alt} are the values of the overall $\alpha$ if that item is not included in the calculation. Basically, each value shows what the overall $\alpha$ would be if that particular item were deleted. The overall $\alpha$ is r sprintf("%.2f", r1$total$raw_alpha), and so any value greater than this indicates that the overall $\alpha$ would increase (i.e. reliability would improve) if the item were removed. None of the items here would improve reliability if they were deleted.
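If you want to work with that table directly, for example to round it, you can extract it from the object we stored earlier (a sketch using the [alpha.drop]{.alt} component of the psych::alpha() output):

r1$alpha.drop |> 
  knitr::kable(digits = 2)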

The table labelled [item statistics]{.alt} shows, in the column labelled [raw.r]{.alt}, the correlations between each item and the total score from the scale, sometimes called item-total correlations. There's a problem with this statistic in that the item is included in the scale total, which inflates the overall correlation. Ideally we want these correlations to be computed without the item in question, and these are the values in [r.drop]{.alt}. In a reliable scale all items should correlate with the total. So, we're looking for items that don't correlate with the overall score from the sub-scale: if any of these values of [r.drop]{.alt} are less than about .3 then we've got problems, because it means that a particular item does not correlate very well with the subscale overall. (The .3 is an arbitrary value that represents a reasonable-sized relationship, but there's nothing magic about it, so use your own judgement.) For the fear of computers subscale, all items have corrected item-total correlations above 0.3, which is encouraging.

The final table tells us what percentage of people gave each response to each of the items. This is useful to make sure that everyone in your sample is not giving the same response: an item where everyone (or almost everyone) gives the same response will almost certainly have poor reliability statistics. For this subscale few people responded with a 1 on any of the items, suggesting that no-one is really feeling the love for computers, or that the items are doing a poor job of eliciting those extreme responses.
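Incidentally, you can get the same frequencies without running the full reliability analysis by piping the items into psych::response.frequencies() (a quick sketch):

raq_tib |> 
  dplyr::select(raq_06, raq_07, raq_10, raq_13, raq_14, raq_15, raq_18) |> 
  psych::response.frequencies() |> 
  knitr::kable(digits = 2)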

r alien() Alien coding challenge

Use the code box to run a reliability analysis on the Fear of peer/social evaluation subscale:


raq_tib |> 
  dplyr::select(raq_02, raq_09, raq_19, raq_22, raq_23) |> 
  psych::alpha()
r2 <- raq_tib |> 
  dplyr::select(raq_02, raq_09, raq_19, raq_22, raq_23) |> 
  psych::alpha()

Again we have good overall reliability (r get_alpha(r2)), no items improve this value if dropped, item correlations with the scale total are all really good and again we have issues with the items not eliciting extremely low responses (for all items only 0.02, or 2%, of responders responded with a 1).

r alien() Alien coding challenge

Use the code box to run a reliability analysis on the Fear of mathematics subscale:


raq_tib |> 
  dplyr::select(raq_08, raq_11, raq_17) |> 
  psych::alpha()
r3 <- raq_tib |> 
  dplyr::select(raq_08, raq_11, raq_17) |> 
  psych::alpha()

Overall reliability is still fairly high (r get_alpha(r3)), no items improve this value if dropped, item correlations with the scale total are all really good, and again we have issues with the items not eliciting extremely low responses (again, for each item only 2% of responders responded with a 1).

The fear of statistics subscale has a reverse scored item: Standard deviations excite me (raq_03), so we need to indicate that this item is reverse scored. As with the omega() function, we need to tell the alpha() function which items are reversed, this time by using the [keys]{.alt} argument. Just to confuse you, note that this argument is [keys]{.alt} (with an s) whereas for omega() it was [key]{.alt} (without an s). We supply this argument with a vector of 1s and -1s as we did before. If we selected the items for the fear of statistics subscale in order by using

raq_tib |> 
  dplyr::select(raq_01, raq_03, raq_04, raq_05, raq_12, raq_16, raq_20, raq_21)

Then we'd need to indicate that the second item in the list of variables is reverse scored, which we'd do using:

psych::alpha(keys = c(1, -1, 1, 1, 1, 1, 1, 1))

Note that the -1 is in the same position (second) as [raq_03]{.alt} is in the variable list.

r robot() Code example

For the fear of statistics subscale we would execute:

raq_tib |> 
  dplyr::select(raq_01, raq_03, raq_04, raq_05, raq_12, raq_16, raq_20, raq_21) |> 
  psych::alpha(keys = c(1, -1, 1, 1, 1, 1, 1, 1))

r alien() Alien coding challenge

Use the code box to run a reliability analysis on the fear of statistics subscale:


raq_tib |> 
  dplyr::select(raq_01, raq_03, raq_04, raq_05, raq_12, raq_16, raq_20, raq_21) |> 
  psych::alpha(keys = c(1, -1, 1, 1, 1, 1, 1, 1))
r4 <- raq_tib |> 
  dplyr::select(raq_01, raq_03, raq_04, raq_05, raq_12, raq_16, raq_20, raq_21) |> 
  psych::alpha(keys = c(1, -1, 1, 1, 1, 1, 1, 1))

Note in the output that [raq_03]{.alt} is helpfully labelled as [raq_03-]{.alt} to indicate that it is reverse scored. Again we have acceptable, but not staggering, overall reliability (r get_alpha(r4)), no item improves this value if it is dropped, item correlations with the scale total are not too bad (they range from 0.27 to 0.49) and again we have issues with the items not eliciting extreme responses.

[Image: discovr package hex sticker, female space pirate with gun; the gunsmoke forms the letter R.]

**A message from Mae Jemstone:** Mae was burnt out. She thought the tutorial writing was over. 'Just one term, then you can rest' they had said. Not for the first time, they had lied. She raised her weary body and stared at her screen. It made her feel anxious. She couldn't do it anymore; she hadn't the strength. 'Who will believe me?' she thought. People depend on me, I'm not allowed to show weakness. But how can I find the will? She remembered the box, the box from the Senza, and very slowly her fingers started to type.

Resources {data-progressive=FALSE}

Statistics

r rproj()

Acknowledgement

I'm extremely grateful to Allison Horst for her very informative blog post on styling learnr tutorials with CSS and also for sending me a CSS template file and allowing me to adapt it. Without Allison, these tutorials would look a lot worse (but she can't be blamed for my colour scheme).

References


