calc_collin_diag: Calculate Collinearity Diagnostics

View source: R/calc_collin_diag.R

calc_collin_diag    R Documentation

Calculate Collinearity Diagnostics

Description

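This function computes collinearity diagnostics for a set of variables: variance inflation factors (VIF), tolerance, R-squared values, eigenvalues, and condition indices, along with summary measures (mean VIF, condition number, and the determinant of the correlation matrix). Its output is similar to the Stata collinearity diagnostics described on Phil Ender's page, which is linked in the Examples.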

Usage

calc_collin_diag(
  data,
  ...,
  method = "pearson",
  use = "complete.obs",
  method_for_eigen = "corr",
  show_inv_cor_mat = FALSE
)

Arguments

data

A data frame containing the variables to analyze.

...

Variables to include in the analysis, specified without quotes.

method

The method for calculating the correlation matrix. Default is "pearson".

use

How to handle missing values when calculating correlations. Default is "complete.obs".
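The names and defaults of method and use match those of stats::cor(), so the sketch below assumes, without confirmation from the source, that they are passed through to it. It shows how two common use settings differ on a small made-up data frame; d and its values are hypothetical.

```r
# Toy data with one missing value in x and one in z (illustrative only)
d <- data.frame(
  x = c(1, 2, 3, 4, NA),
  y = c(2, 1, 4, 3, 5),
  z = c(5, 3, NA, 1, 2)
)

# "complete.obs": drop every row that has any missing value (rows 3 and 5 here)
c_complete <- cor(d, method = "pearson", use = "complete.obs")

# "pairwise.complete.obs": each pair of variables uses its own complete rows
c_pairwise <- cor(d, method = "pearson", use = "pairwise.complete.obs")

c_complete["x", "y"]  # based on rows 1, 2, and 4
c_pairwise["x", "y"]  # based on rows 1 to 4
```

With listwise deletion every correlation rests on the same rows, which keeps the correlation matrix internally consistent before it is inverted or decomposed.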

method_for_eigen

Specifies the method for calculating eigenvalues and condition indices. Options are "corr" for the correlation matrix or "sscp" for the scaled sum of squares and cross-product matrix. Default is "corr".
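As a rough illustration of the two options, the sketch below computes condition indices in base R, using the usual definition: the square root of the largest eigenvalue divided by each eigenvalue. The "sscp" variant shown here (intercept column added, columns scaled to unit length, then cross-products taken) is an assumption about what the scaled SSCP matrix means, not a copy of the package's code. The helper cond_index and the use of mtcars are illustrative.

```r
# Built-in data so the sketch is self-contained
X <- as.matrix(mtcars[, c("disp", "hp", "wt")])

# Condition indices: sqrt(largest eigenvalue / each eigenvalue)
cond_index <- function(m) {
  ev <- eigen(m, symmetric = TRUE, only.values = TRUE)$values
  sqrt(max(ev) / ev)
}

# "corr"-style: eigenvalues of the correlation matrix
ci_corr <- cond_index(cor(X, method = "pearson"))

# "sscp"-style (assumed form): add an intercept column, scale each column
# to unit length, then decompose the cross-product matrix
Z <- cbind(intercept = 1, X)
Z <- sweep(Z, 2, sqrt(colSums(Z^2)), "/")
ci_sscp <- cond_index(crossprod(Z))

ci_corr
ci_sscp
```

The correlation-based version centers the variables and so ignores the intercept, while the cross-product version keeps it, which is why the second vector has one more entry.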

show_inv_cor_mat

Logical. If TRUE, includes the inverse correlation matrix in the output. Default is FALSE.

Value

A list with the following components:

table

A tibble with the collinearity diagnostics for each variable. Includes VIF, tolerance, R-squared, eigenvalues, and condition indices.
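The per-variable quantities are linked by standard identities: the VIF of a variable is the matching diagonal element of the inverse correlation matrix, tolerance is 1 / VIF, and R-squared is 1 - tolerance. The sketch below works through these in base R on mtcars and checks one R-squared against an explicit regression. It illustrates the definitions and is not necessarily how the function computes them.

```r
# Built-in data so the sketch is self-contained
X <- mtcars[, c("disp", "hp", "wt")]
R <- cor(X)

# VIF: diagonal of the inverse correlation matrix
vif <- diag(solve(R))
names(vif) <- colnames(R)

tolerance <- 1 / vif        # share of variance not explained by the others
r_squared <- 1 - tolerance  # R-squared from regressing each variable on the rest

# Cross-check one variable against an explicit regression
r2_lm <- summary(lm(disp ~ hp + wt, data = X))$r.squared
all.equal(unname(r_squared["disp"]), r2_lm)

data.frame(variable = names(vif), vif, tolerance, r_squared, row.names = NULL)
```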

summary

A tibble summarizing the mean VIF, condition number, and determinant of the correlation matrix.
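The three summary measures can be sketched in base R as follows. The condition number is taken here as the largest condition index, which is an assumption about the definition used. The column names mean_vif, condition_number, and det_cor are hypothetical and may not match the function's output.

```r
# Built-in data so the sketch is self-contained
X <- mtcars[, c("disp", "hp", "wt")]
R <- cor(X)
ev <- eigen(R, symmetric = TRUE, only.values = TRUE)$values

summary_sketch <- data.frame(
  mean_vif = mean(diag(solve(R))),             # average of the per-variable VIFs
  condition_number = sqrt(max(ev) / min(ev)),  # largest condition index
  det_cor = det(R)                             # near 0 signals strong collinearity
)

summary_sketch
```

For uncorrelated variables all three measures equal 1. Mean VIF and the condition number grow, and the determinant shrinks toward 0, as collinearity increases.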

inv_cor_mat

The inverse correlation matrix, if show_inv_cor_mat = TRUE.

Examples

# Example data
library(dplyr)
# Examples from Phil Ender
# http://www.philender.com/courses/categorical/notes2/collin.html

hsbdemo <- read.csv("https://stats.idre.ucla.edu/stat/data/hsbdemo.csv")
dplyr::glimpse(hsbdemo)

calc_collin_diag(data = hsbdemo,
                 female,
                 schtyp,
                 read,
                 write,
                 math,
                 science,
                 socst,
                 method_for_eigen = "corr",
                 method = "pearson")


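# Simulated data: every column is drawn independently of the others,
# so these diagnostics should show little collinearity.
set.seed(123) # Ensure reproducibility

n <- 100 # Number of rows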

lahigh <- tibble(
  id = 1000 + seq_len(n),
  gender = sample(c("male", "female"), n, replace = TRUE),
  ethnic = sample(c("hispanic", "filipino", "afr-amer", "asian", "white"), n, replace = TRUE),
  school = sample(1:2, n, replace = TRUE),
  algebra = sample(0:4, n, replace = TRUE),
  math = sample(0:4, n, replace = TRUE),
  eng95 = sample(0:4, n, replace = TRUE),
  eng94 = sample(0:4, n, replace = TRUE),
  mathnce = runif(n, 1, 100), # Continuous values between 1 and 100
  langnce = runif(n, 1, 100),
  mathpr = sample(1:100, n, replace = TRUE), # Integer percentiles
  langpr = sample(1:100, n, replace = TRUE),
  biling = sample(0:3, n, replace = TRUE),
  engprof = sample(0:4, n, replace = TRUE),
  daysatt = sample(40:90, n, replace = TRUE),
  daysabs = sample(0:35, n, replace = TRUE)
)

dplyr::glimpse(lahigh)

calc_collin_diag(data = lahigh,
                 mathnce,
                 langnce,
                 mathpr,
                 langpr,
                 method_for_eigen = "corr",
                 method = "pearson")

calc_collin_diag(data = lahigh,
                 mathnce,
                 langnce,
                 method_for_eigen = "corr",
                 method = "pearson")


emilelatour/lamisc documentation built on March 29, 2025, 1:23 p.m.