select_mds: Select a Minimum Data Set (MDS) of Soil Quality Indicators

View source: R/mds.R

select_mds    R Documentation

Description

Identifies the most informative subset of soil variables (the Minimum Data Set, MDS) using Principal Component Analysis (PCA). Only variables with high absolute factor loadings on principal components with eigenvalue > 1 (Kaiser criterion) are retained. Where several variables load highly on the same component, the one most strongly correlated with the others on that component is selected, which minimises redundancy.

This approach follows the widely cited method of Andrews et al. (2004) and Sharma et al. (2008), and is equivalent to the PCAIndex algorithm in Wani et al. (2023).

Usage

select_mds(
  data,
  group_cols = "LandUse",
  load_threshold = 0.5,
  vif_threshold = 10,
  n_pc = "auto",
  verbose = TRUE
)

Arguments

data

A data frame of scored or raw soil variables. All columns must be numeric, except the grouping columns named in group_cols.

group_cols

Character vector of grouping columns to exclude from the analysis. Default: "LandUse".

load_threshold

Numeric in (0, 1). Minimum absolute factor loading (after Andrews et al., 2004) for a variable to be considered for MDS membership. Default: 0.5.

vif_threshold

Numeric. Maximum allowable Variance Inflation Factor among MDS variables. Variables exceeding this are iteratively removed. Set to Inf to skip VIF filtering. Default: 10.
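
The iterative VIF filter can be illustrated in base R. This is a minimal sketch, not the SQIpro source: the helper name vif_filter is hypothetical, and dropping the variable with the largest VIF first is an assumption about the removal order. It relies on the standard identity that the VIFs are the diagonal of the inverse correlation matrix.

```r
# Hypothetical helper, not part of SQIpro: drop variables until all VIFs
# are at or below the threshold.
vif_filter <- function(X, vif_threshold = 10) {
  vars <- colnames(X)
  while (length(vars) > 1) {
    # VIF_j = 1 / (1 - R_j^2) = j-th diagonal element of the inverse
    # correlation matrix
    vif <- diag(solve(cor(X[, vars, drop = FALSE])))
    names(vif) <- vars
    if (max(vif) <= vif_threshold) break
    # Assumed removal order: largest VIF first
    vars <- setdiff(vars, names(which.max(vif)))
  }
  vars
}

kept     <- vif_filter(mtcars, vif_threshold = 10)
kept_all <- vif_filter(mtcars, vif_threshold = Inf)  # Inf skips the filter
```

With vif_threshold = Inf the loop exits on its first pass, so every variable is kept. A perfectly collinear set would make solve() fail; this sketch does not handle that case.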

n_pc

Integer or "auto". Number of principal components to consider. "auto" (default) uses the Kaiser criterion (eigenvalue > 1).
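
As an illustration of how the two forms of n_pc might be resolved into a component count, here is a small base R sketch. The helper resolve_n_pc is hypothetical, and capping an explicit count at the number of available components is an assumption, not documented behaviour.

```r
# Hypothetical helper, not part of SQIpro: turn n_pc into a component count.
resolve_n_pc <- function(eigenvalues, n_pc = "auto") {
  if (identical(n_pc, "auto")) {
    return(sum(eigenvalues > 1))  # Kaiser criterion
  }
  # Assumed behaviour: cap an explicit count at the number of components
  min(as.integer(n_pc), length(eigenvalues))
}

# Eigenvalues of the correlation matrix of a built-in data set
eig <- prcomp(USArrests, scale. = TRUE)$sdev^2

k_auto  <- resolve_n_pc(eig)      # components with eigenvalue > 1
k_fixed <- resolve_n_pc(eig, 3)   # explicit request
```

For USArrests the second eigenvalue falls just below 1, so "auto" keeps a single component while an explicit n_pc = 3 keeps three. This shows how sensitive the Kaiser cut-off can be near the boundary.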

verbose

Logical. Print MDS selection summary. Default: TRUE.

Details

Algorithm steps:

  1. Standardise all numeric variables (mean = 0, sd = 1).

  2. Perform PCA; retain components with eigenvalue > 1.

  3. For each retained component, identify variables with absolute loading >= load_threshold.

  4. Among those, select the variable with the highest sum of absolute Pearson correlations to all others in the set (i.e., the variable that best represents the group, so that the others can be dropped as redundant).

  5. Optionally, remove variables with high Variance Inflation Factor (VIF > vif_threshold) among the MDS candidates.
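
Steps 1 to 4 can be sketched in base R as follows. This is an illustrative approximation, not the SQIpro implementation: the function name sketch_mds is hypothetical, loadings are taken as eigenvectors scaled by the component standard deviations, and "the set" in step 4 is assumed to mean the candidates on the same component. The VIF step is omitted.

```r
# Hypothetical helper illustrating steps 1-4; not the SQIpro source.
sketch_mds <- function(data, load_threshold = 0.5) {
  X   <- scale(data)                                # step 1: mean 0, sd 1
  pca <- prcomp(X, center = FALSE, scale. = FALSE)  # step 2: PCA
  eig  <- pca$sdev^2                                # eigenvalues
  keep <- which(eig > 1)                            # Kaiser criterion
  # Loadings: eigenvectors scaled by sqrt(eigenvalue), i.e. the
  # correlations between variables and components
  L <- sweep(pca$rotation[, keep, drop = FALSE], 2, pca$sdev[keep], `*`)
  R <- abs(cor(X))
  picked <- character(0)
  for (j in seq_along(keep)) {
    cand <- rownames(L)[abs(L[, j]) >= load_threshold]   # step 3
    if (length(cand) == 0) next
    # step 4: sum of absolute correlations with the other candidates
    # (subtract 1 to exclude each variable's correlation with itself)
    s <- rowSums(R[cand, cand, drop = FALSE]) - 1
    picked <- c(picked, names(which.max(s)))
  }
  unique(picked)
}

sel <- sketch_mds(mtcars)
```

The sketch returns at most one variable per retained component. A variable chosen on two components appears only once because of unique().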

Value

A list of class sqi_mds with:

mds_vars

Character vector of selected MDS variable names.

all_vars

Character vector of all candidate variable names.

pca

The PCA result object.

loadings

Matrix of factor loadings.

eigenvalues

Numeric vector of eigenvalues.

var_explained

Numeric vector of variance explained (%) per component.
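
The relationship between eigenvalues and var_explained can be reproduced with base R. The sketch below assumes the PCA is run on standardised variables with stats::prcomp; the class of the pca element is not stated here, so treat that as an assumption.

```r
# PCA on standardised variables of a built-in data set
pca <- prcomp(mtcars, scale. = TRUE)

# Eigenvalues are the squared component standard deviations
eigenvalues <- pca$sdev^2

# Percentage of total variance explained by each component
var_explained <- 100 * eigenvalues / sum(eigenvalues)

# Cumulative percentage, useful when choosing n_pc by hand
cum_explained <- cumsum(var_explained)
```

For standardised data the eigenvalues sum to the number of variables, so an eigenvalue above 1 means the component explains more variance than any single original variable.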

References

Andrews, S.S., Karlen, D.L., & Cambardella, C.A. (2004). The soil management assessment framework: A quantitative soil quality evaluation method. Soil Science Society of America Journal, 68(6), 1945–1962. doi:10.2136/sssaj2004.1945

Kaiser, H.F. (1960). The application of electronic computers to factor analysis. Educational and Psychological Measurement, 20(1), 141–151. doi:10.1177/001316446002000116

Sharma, K.L., et al. (2008). Long-term soil management effects on soil quality indices. Geoderma, 144, 290–300. doi:10.1016/j.geoderma.2007.11.019

Examples

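# Load the example data set
data(soil_data)

# Scoring configuration: one entry per variable.
# "opt" variables take an optimum range (opt_low, opt_high);
# "less" and "more" variables leave both bounds as NA.
cfg <- make_config(
  variable = c("pH","EC","BD","OC","MBC","PMN","Clay","WHC","DEH","AP","TN"),
  type     = c("opt","less","less","more","more","more",
               "opt","more","more","more","more"),
  opt_low  = c(6.0, NA, NA, NA, NA, NA, 20, NA, NA, NA, NA),
  opt_high = c(7.0, NA, NA, NA, NA, NA, 35, NA, NA, NA, NA)
)

# Score the variables, then select the MDS from the scored data;
# the grouping columns are excluded from the PCA
scored <- score_all(soil_data, cfg, group_cols = c("LandUse","Depth"))
mds    <- select_mds(scored, group_cols = c("LandUse","Depth"))

# Names of the selected MDS variables
mds$mds_vars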


SQIpro documentation built on April 20, 2026, 5:06 p.m.