x2y: Ranked Predictive Power of Cross-Features (x2y)

x2y R Documentation

Description

The relative reduction in error when going from a baseline model (the average for continuous features, the most frequent value for categorical features) to a predictive model can measure the strength of the relationship between two features. In other words, x2y measures the ability of x to predict y. We use CART (Classification And Regression Trees) models because they 1) can compare numerical and non-numerical features, 2) detect non-linear relationships, and 3) are easy and quick to train.
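To make the idea concrete, here is a minimal sketch of such a metric. It is not the package's implementation: the helper name x2y_sketch is hypothetical, the rpart package is assumed as the CART engine, and the error measures (mean absolute error for numeric y, misclassification rate for categorical y) are assumptions chosen for illustration.

```r
library(rpart)

# Hypothetical helper (not part of lares): percent reduction in error
# when moving from a baseline prediction to a CART model of y on x.
x2y_sketch <- function(x, y) {
  d <- data.frame(x = x, y = y)
  d <- d[stats::complete.cases(d), ]
  if (!is.numeric(d$x)) d$x <- factor(d$x)
  if (is.numeric(d$y)) {
    # Baseline: always predict the mean; error: mean absolute error (assumed)
    baseline_err <- mean(abs(d$y - mean(d$y)))
    fit <- rpart(y ~ x, data = d, method = "anova")
    model_err <- mean(abs(d$y - predict(fit, d)))
  } else {
    d$y <- factor(d$y)
    # Baseline: always predict the most frequent class;
    # error: misclassification rate (assumed)
    baseline_err <- 1 - max(table(d$y)) / nrow(d)
    fit <- rpart(y ~ x, data = d, method = "class")
    pred <- predict(fit, d, type = "class")
    model_err <- mean(as.character(pred) != as.character(d$y))
  }
  if (baseline_err == 0) return(NA_real_) # y has no variation to explain
  100 * (1 - model_err / baseline_err)
}

# Noisy semicircle: x should predict y far better than y predicts x
set.seed(42)
x <- seq(-1, 1, 0.01)
y <- sqrt(1 - x^2) + rnorm(length(x), mean = 0, sd = 0.05)
xy <- x2y_sketch(x, y)
yx <- x2y_sketch(y, x)
```

Errors are computed on the training data here to keep the sketch short, so the values are optimistic compared with an out-of-sample estimate.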

Usage

x2y(
  df,
  target = NULL,
  symmetric = FALSE,
  target_x = FALSE,
  target_y = FALSE,
  plot = FALSE,
  top = 20,
  quiet = "auto",
  ohse = FALSE,
  corr = FALSE,
  ...
)

x2y_metric(x, y, confidence = FALSE, bootstraps = 20, max_cat = 20)

## S3 method for class 'x2y_preds'
plot(x, corr = FALSE, ...)

## S3 method for class 'x2y'
plot(x, type = 1, ...)

x2y_preds(x, y, max_cat = 10)

Arguments

df

data.frame. Note that variables with no variance will be ignored.

target

Character vector. If you are only interested in the x2y values between particular variable(s) in df, set this to the name(s) of those variable(s). Keep NULL to calculate for every variable (column). Check the target_x and target_y parameters as well.

symmetric

Boolean. x2y metric is not symmetric with respect to x and y. The extent to which x can predict y can be different from the extent to which y can predict x. Set symmetric=TRUE if you wish to average both numbers.

target_x, target_y

Boolean. Force target features to be part of x OR y?

plot

Boolean. Return a plot? If not, only a data.frame with calculated results will be returned.

top

Integer. Show/plot only top N predictive cross-features. Set to NULL to return all.

quiet

Boolean. Keep quiet? If not, show progress bar.

ohse

Boolean. Use lares::ohse() to pre-process the data?

corr

Boolean. Add correlation and pvalue data to compare with? For more custom studies, use lares::corr_cross() directly.

...

Additional parameters passed to x2y_metric()

x, y

Vectors. Categorical or numerical vectors of same length.

confidence

Boolean. Calculate 95% confidence intervals estimated with N bootstraps.

bootstraps

Integer. If confidence=TRUE, how many bootstraps? The more iterations we run, the more precise the confidence interval will be.

max_cat

Integer. Maximum number of unique x or y values when categorical. The most frequent values will be kept and the rest will be passed as "".

type

Integer. Plot type: 1 for tile plot, 2 for ranked bar plot.
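
The max_cat behaviour described above can be illustrated with a short base R sketch. The helper lump_top is hypothetical and only approximates what the argument description states (keep the most frequent values, pass the rest as ""); the package may handle ties and missing values differently.

```r
# Hypothetical helper (not part of lares): keep the max_cat most frequent
# values of a categorical vector and replace all others with `other`.
lump_top <- function(v, max_cat = 20, other = "") {
  v <- as.character(v)
  counts <- sort(table(v), decreasing = TRUE) # NA values are not counted
  keep <- names(counts)[seq_len(min(max_cat, length(counts)))]
  v[!is.na(v) & !(v %in% keep)] <- other
  v
}

v <- c("a", "a", "a", "b", "b", "c", "d")
lumped <- lump_top(v, max_cat = 2)
# lumped is c("a", "a", "a", "b", "b", "", "")
```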

Details

This x2y metric is based on Rama Ramakrishnan's post: An Alternative to the Correlation Coefficient That Works For Numeric and Categorical Variables. This analysis complements our lares::corr_cross() output.
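
The confidence and bootstraps arguments rely on bootstrap resampling. As a rough sketch of that idea, and not the package's code, the example below resamples (x, y) pairs with replacement, recomputes a metric on each resample, and takes percentile bounds. The names boot_ci and mae_reduction are hypothetical, the percentile method is an assumption, and rpart is assumed as the CART engine.

```r
library(rpart)

# Hypothetical numeric-only metric: percent reduction in mean absolute
# error of a regression tree relative to always predicting the mean.
mae_reduction <- function(x, y) {
  d <- data.frame(x = x, y = y)
  fit <- rpart(y ~ x, data = d, method = "anova")
  base <- mean(abs(y - mean(y)))
  100 * (1 - mean(abs(y - predict(fit, d))) / base)
}

# Hypothetical helper: percentile bootstrap interval for any metric(x, y)
boot_ci <- function(x, y, metric, bootstraps = 20, level = 0.95) {
  n <- length(x)
  stats <- vapply(seq_len(bootstraps), function(i) {
    idx <- sample.int(n, n, replace = TRUE) # resample pairs together
    metric(x[idx], y[idx])
  }, numeric(1))
  alpha <- (1 - level) / 2
  quantile(stats, probs = c(alpha, 1 - alpha), names = FALSE)
}

set.seed(42)
x <- runif(300)
y <- 2 * x + rnorm(300, sd = 0.1)
ci <- boot_ci(x, y, mae_reduction, bootstraps = 20)
```

With only 20 resamples the bounds are coarse; raising bootstraps trades run time for a more stable interval.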

Value

Depending on plot input, a plot or a data.frame with x2y results.

Examples


data(dft) # Titanic dataset
x2y_results <- x2y(dft, quiet = TRUE, max_cat = 10, top = NULL)
head(x2y_results, 10)
plot(x2y_results, type = 2)

# Confidence intervals with 10 bootstrap iterations
x2y(dft,
  target = c("Survived", "Age"),
  confidence = TRUE, bootstraps = 10, top = 8
)

# Compare with mean absolute correlations
x2y(dft, "Fare", corr = TRUE, top = 6, target_x = TRUE)

# Plot (symmetric) results
symm <- x2y(dft, target = "Survived", symmetric = TRUE)
plot(symm, type = 1)

# Symmetry: x2y vs y2x
set.seed(42) # make the simulated noise reproducible
x <- seq(-1, 1, 0.01)
y <- sqrt(1 - x^2) + rnorm(length(x), mean = 0, sd = 0.05)

# Knowing x reduces the uncertainty about the value of y a lot more than
# knowing y reduces the uncertainty about the value of x. Note correlation.
plot(x2y_preds(x, y), corr = TRUE)
plot(x2y_preds(y, x), corr = TRUE)


laresbernardo/lares documentation built on Jan. 14, 2025, 2:22 a.m.