select_mds: Select a Minimum Data Set (MDS) of Soil Quality Indicators
In SQIpro: Comprehensive Soil Quality Index Computation and Visualization

select_mds

R Documentation

Select a Minimum Data Set (MDS) of Soil Quality Indicators

Description

Identifies the most informative subset of soil variables (the Minimum Data Set, MDS) using Principal Component Analysis (PCA). Only variables with high factor loadings on principal components with eigenvalue > 1 (Kaiser criterion) are retained. Where several variables load highly on the same component, the one with the highest summed absolute correlation with the others on that component is selected to represent the group, which reduces redundancy in the MDS.

This approach follows the widely cited method of Andrews et al. (2004) and Sharma et al. (2008), and is equivalent to the PCAIndex algorithm in Wani et al. (2023).

Usage

select_mds(
  data,
  group_cols = "LandUse",
  load_threshold = 0.5,
  vif_threshold = 10,
  n_pc = "auto",
  verbose = TRUE
)

Arguments

`data`	A data frame of scored or raw soil variables. All columns must be numeric except the grouping columns named in `group_cols`.
`group_cols`	Character vector of grouping columns to exclude from the analysis. Default: `"LandUse"`.
`load_threshold`	Numeric in (0, 1). Minimum absolute factor loading for a variable to be considered for MDS membership (Andrews et al., 2004). Default: `0.5`, as in the Usage signature.
`vif_threshold`	Numeric. Maximum allowable Variance Inflation Factor among MDS variables. Variables exceeding this are iteratively removed. Set to `Inf` to skip VIF filtering. Default: `10`.
`n_pc`	Integer or `"auto"`. Number of principal components to consider. `"auto"` (default) uses the Kaiser criterion (eigenvalue > 1).
`verbose`	Logical. Print MDS selection summary. Default `TRUE`.
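
The iterative removal controlled by `vif_threshold` can be illustrated with a minimal base-R sketch. `drop_high_vif()` is a hypothetical helper, not part of SQIpro, and it assumes that the variable with the largest VIF is dropped first on each pass; the package may order removals differently. It relies on the standard identity that the VIFs are the diagonal of the inverse correlation matrix.

```r
# Hypothetical helper (not the SQIpro source): drop the variable with the
# largest VIF, one at a time, until every VIF is at or below the threshold.
drop_high_vif <- function(x, vif_threshold = 10) {
  vars <- colnames(x)
  # Inf disables the filter, mirroring the documented behaviour
  if (is.infinite(vif_threshold)) return(vars)
  while (length(vars) > 1) {
    # VIF_j is the j-th diagonal element of the inverse correlation matrix
    vif <- diag(solve(cor(x[, vars, drop = FALSE])))
    if (max(vif) <= vif_threshold) break
    vars <- vars[-which.max(vif)]
  }
  vars
}

# Toy data: column c is almost exactly a + b, so the three are collinear
set.seed(1)
a <- rnorm(100)
b <- rnorm(100)
toy <- cbind(a = a, b = b, c = a + b + rnorm(100, sd = 0.01))
kept <- drop_high_vif(toy, vif_threshold = 10)
```

Here `kept` holds the names of the columns that survive the filter; with `vif_threshold = Inf` all columns are returned unchanged.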

Details

**Algorithm steps:**

1. Standardise all numeric variables (mean = 0, sd = 1).
2. Perform PCA; retain components with eigenvalue > 1.
3. For each retained component, identify variables with absolute loading ≥ `load_threshold`.
4. Among those, select the variable with the highest sum of absolute Pearson correlations to all others in the set (i.e., the variable that best represents the correlated group, so the rest can be dropped as redundant).
5. Optionally, remove variables with a high Variance Inflation Factor (VIF > `vif_threshold`) among the MDS candidates.
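
The first four steps above can be sketched in base R as follows. This is an illustration under stated assumptions, not the SQIpro implementation: `sketch_mds()` is a hypothetical name, PCA is done with `prcomp()`, and factor loadings are taken to be the eigenvectors scaled by the square root of their eigenvalues (the correlation between each variable and each component). The VIF step is omitted.

```r
# Hypothetical helper (not the SQIpro source): base-R sketch of steps 1 to 4.
# Assumes every column of x is numeric.
sketch_mds <- function(x, load_threshold = 0.5) {
  x <- as.data.frame(x)
  # Steps 1-2: PCA on standardised variables, keep eigenvalues > 1 (Kaiser)
  pca  <- prcomp(x, center = TRUE, scale. = TRUE)
  eig  <- pca$sdev^2
  keep <- which(eig > 1)
  # Assumed loading definition: eigenvector * sqrt(eigenvalue)
  loadings <- sweep(pca$rotation, 2, pca$sdev, `*`)
  r <- abs(cor(x))
  selected <- character(0)
  for (k in keep) {
    # Step 3: variables loading at or above the threshold on component k
    cand <- rownames(loadings)[abs(loadings[, k]) >= load_threshold]
    if (length(cand) == 0) next
    # Step 4: highest sum of absolute correlations with the other candidates
    # (subtract 1 to remove each variable's correlation with itself)
    score <- rowSums(r[cand, cand, drop = FALSE]) - 1
    selected <- union(selected, cand[which.max(score)])
  }
  list(mds_vars = selected, eigenvalues = eig, loadings = loadings)
}

# Built-in data set used purely as a stand-in for soil variables
res <- sketch_mds(mtcars, load_threshold = 0.5)
res$mds_vars
```

At most one variable is picked per retained component, so the sketch returns no more variables than there are components with eigenvalue > 1.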

Value

A list of class `sqi_mds` with:

`mds_vars`: Character vector of selected MDS variable names.
`all_vars`: Character vector of all candidate variable names.
`pca`: The PCA result object.
`loadings`: Matrix of factor loadings.
`eigenvalues`: Numeric vector of eigenvalues.
`var_explained`: Numeric vector of variance explained (%) per component.
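
As a rough illustration of how the eigenvalue and variance-explained elements relate, the sketch below derives both from a `prcomp()` fit on standardised data. Using `prcomp()` and the built-in `mtcars` data is an assumption made for the example; the source does not state which PCA routine SQIpro calls.

```r
# Stand-in data; a real call would use the numeric soil variables
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Eigenvalues of the correlation matrix are the squared component sdevs
eigenvalues <- pca$sdev^2

# Percentage of total variance carried by each component
var_explained <- 100 * eigenvalues / sum(eigenvalues)

# Number of components an "auto" (Kaiser) rule would retain
n_kaiser <- sum(eigenvalues > 1)
```

Because the variables are standardised, the eigenvalues sum to the number of variables, which is why an eigenvalue above 1 marks a component that explains more than a single original variable.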

References

Andrews, S.S., Karlen, D.L., & Cambardella, C.A. (2004). The soil management assessment framework: A quantitative soil quality evaluation method. Soil Science Society of America Journal, 68(6), 1945–1962. doi:10.2136/sssaj2004.1945

Kaiser, H.F. (1960). The application of electronic computers to factor analysis. Educational and Psychological Measurement, 20(1), 141–151. doi:10.1177/001316446002000116

Sharma, K.L., et al. (2008). Long-term soil management effects on soil quality indices. Geoderma, 144, 290–300. doi:10.1016/j.geoderma.2007.11.019

Examples

data(soil_data)
cfg <- make_config(
  variable = c("pH","EC","BD","OC","MBC","PMN","Clay","WHC","DEH","AP","TN"),
  type     = c("opt","less","less","more","more","more",
               "opt","more","more","more","more"),
  opt_low  = c(6.0, NA, NA, NA, NA, NA, 20, NA, NA, NA, NA),
  opt_high = c(7.0, NA, NA, NA, NA, NA, 35, NA, NA, NA, NA)
)
scored <- score_all(soil_data, cfg, group_cols = c("LandUse","Depth"))
mds    <- select_mds(scored, group_cols = c("LandUse","Depth"))
mds$mds_vars

SQIpro documentation built on April 20, 2026, 5:06 p.m.