knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(relper)
library(dplyr)
library(ggplot2)

x <- rnorm(100)
y <- rexp(100)

calc_ functions compute a certain value.

calc_acf

The goal of calc_acf is to compute the auto-correlation function, given by:

$$\frac{\sum_\limits{t = k+1}^{n}(x_t - \bar{x})(x_{t-k} - \bar{x})}{\sum_\limits{t = 1}^{n} (x_t - \bar{x})^2 },$$ where:

calc_acf(x)

If you pass a second vector in the argument y the cross-correlation will be computed instead:

$$\frac{n \left( \sum_\limits{t = 1}^{n}x_ty_t \right) - \left[\left(\sum_\limits{t = 1}^{n}x_t \right) \left(\sum_\limits{t = 1}^{n}y_t\right) \right]}{\sqrt{\left[n \left( \sum_\limits{t = 1}^{n}x_t^2 \right) - \left( \sum_\limits{t = 1}^{n}x_t \right)^2\right]\left[n \left( \sum_\limits{t = 1}^{n}y_t^2 \right) - \left( \sum_\limits{t = 1}^{n}y_t \right)^2\right]}},$$ where:

calc_acf(x,y)

calc_association

The goal of calc_association is to compute associations metrics.

Contingency

Contingency is a measure of the degree to which two nominal variables are associated. It has a value between 0 and 1, with 0 indicating no relationship and 1 indicating perfect association, and is calculated as follows:

$$\sqrt{\frac{X^2}{n+X^2}},$$

where:

calc_association(mtcars$am,mtcars$vs,type = "contingency")

Cramér's V

Cramér's V is a measure of the degree to which two nominal variables are associated. It has a value between 0 and 1, with 0 indicating no relationship and 1 indicating perfect association, and is calculated as follows:

$$\sqrt{\frac{X^2}{n\min(r-1,c-1)}},$$

where:

calc_association(mtcars$am,mtcars$vs,type = "cramers-v")

Phi

Phi is a measure of association between two nominal dichotomous variables that takes into account a marginal table of the variables given by:

| | y = 0 | y = 1 | Total | |-----------|-----------|------------|-----------| | x = 0 | $n_{00}$ | $n_{01}$ | $n_{0.}$ | | x = 1 | $n_{10}$ | $n_{11}$ | $n_{1.}$ | | Total | $n_{.0}$ | $n_{.1}$ | $n$ |

Then the phi coefficient is given by:

$$\frac{n_{11}n_{00} - n_{10}n_{01} }{\sqrt{n_{1.}n_{0.}n_{.1}*n_{.0}}}.$$

calc_association(mtcars$am,mtcars$vs,type = "phi")

calc_auc

The goal of calc_auc is to compute the area under a curve (AUC).

x <- seq(-3,3,l = 100)

y <- dnorm(x)
ggplot(tibble(x = x,
              y = y),
       aes(x,y))+
  geom_point()+
  plt_theme_y()

The function default compute the area considering the range of x.

#from min to max of x
range(x)

calc_auc(x,y)
df_auc <-
  tibble(
    x = x,
    y = y)

df_auc %>% 
ggplot(aes(x,y))+
  geom_area(alpha = .7, fill = "chocolate2")+
  geom_point()+
  plt_theme_y()+
  annotate("text",
           x = mean(x),y = mean(y),label = round(calc_auc(x,y),3),
           fontface = "bold", size = 7)+
  geom_vline(xintercept = c(-3,3), linetype = "dashed")+
  scale_x_continuous(breaks = -5:5)+
  scale_y_continuous(expand = c(0,0))

But you can define the argument limits to get the AUC of that respective range.

#from -2 to 2
calc_auc(x,y,limits = c(-2,2))
df_auc %>% 
  ggplot(aes(x,y))+
  geom_area(data = df_auc %>% 
              filter(between(x,-2,2)),
              alpha = .7, fill = "chocolate2")+
  geom_point()+
  plt_theme_y()+
  annotate("text",
           x = mean(x),y = mean(y),label = round(calc_auc(x,y,limits = c(-2,2)),3),
           fontface = "bold", size = 7)+
  geom_vline(xintercept = c(-2,2), linetype = "dashed")+
  scale_x_continuous(breaks = -5:5)+
  scale_y_continuous(expand = c(0,0))
#from -1 to 1
calc_auc(x,y,limits = c(-1,1))
df_auc %>% 
  ggplot(aes(x,y))+
  geom_area(data = df_auc %>% 
              filter(between(x,-1,1)),
              alpha = .7, fill = "chocolate2")+
  geom_point()+
  plt_theme_y()+
  annotate("text",
           x = mean(x),y = mean(y),label = round(calc_auc(x,y,limits = c(-1,1)),3),
           fontface = "bold", size = 7)+
  geom_vline(xintercept = c(-1,1), linetype = "dashed")+
  scale_x_continuous(breaks = -5:5)+
  scale_y_continuous(expand = c(0,0))

calc_combination

The goal of calc_combination is to compute the number of combinations/permutations. Given that there are a total of $n$ observations and that $r$ will be chosen.

Order matter with repetition

$$n^r.$$

calc_combination(n = 10,r = 4,order_matter = TRUE,with_repetition = TRUE)

Order matter without repetition

$$\frac{n!}{(n-r)!}.$$

calc_combination(n = 10,r = 4,order_matter = TRUE,with_repetition = FALSE)

Order does not matter with repetition

$$\frac{(n+r-1)!}{r!(n-1)!}.$$

calc_combination(n = 10,r = 4,order_matter = FALSE,with_repetition = TRUE)

Order does not matter without repetition

$$\frac{n!}{r!(n-r)!}.$$

calc_combination(n = 10,r = 4,order_matter = FALSE,with_repetition = FALSE)

calc_correlation

The goal of calc_correlation is to compute associations metrics.

Kendall

The Kendall correlation coefficient, also known as the Kendall's Tau coefficient, measures the relationship between two ranked variables.

Maurice Kendall created it, and it is especially useful for analyzing non-linear relationships or ranked data. The coefficient is calculated by counting the number of concordant pairs (ranks in the same order) and discordant pairs (ranks in opposite order) in the data.

$$\frac{n_c-n_d}{\frac{1}{2}*n(n/1)},$$ where:

calc_correlation(mtcars$hp,mtcars$drat,type = "kendall")

Pearson

The Pearson correlation coefficient quantifies the linear relationship that exists between two continuous variables. It ranges from -1 to 1, indicating the association's strength and direction.

A value of 1 indicates a perfect positive linear relationship, a value of -1 indicates a perfect negative linear relationship, and a value of 0 indicates no linear relationship.

$$\frac{\sigma_{xy}}{\sigma_x\sigma_y},$$ where:

calc_correlation(mtcars$hp,mtcars$drat,type = "pearson")

Spearman

The Spearman correlation coefficient assesses the strength and direction of a monotonic relationship between two variables, regardless of whether it is linear or non-linear.

It also has a value between -1 and 1, with 1 representing a perfect monotonic relationship and -1 representing a perfect inverse monotonic relationship. A value of 0 indicates that there is no monotonic relationship.

$$1- \frac{6\sum\limits_{i=1}^{n}d_i^2}{n(n^2-1)},$$

where:

calc_correlation(mtcars$hp,mtcars$drat,type = "spearman")

calc_cv

The goal of calc_cv is to compute the coefficient of variation (CV), given by:

$$\frac{s}{\bar{x}},$$ where:

set.seed(123);x <- rexp(n = 100)

calc_cv(x)

If you set the argument as_perc to TRUE, the CV will be multiplied by 100.

calc_cv(x,as_perc = TRUE)

calc_error

The goal of calc_error is to compute errors metrics.

Mean Absolute Error (MAE)

MAE measures the average absolute difference between the predicted and actual values:

$$\frac{\sum\limits_{i=1}^{n}|X_i-Y_i|}{n}.$$

Mean Absolute Percentage Error (MAPE)

MAPE measures the average percentage difference between the predicted and actual values relative to the actual values:

$$\frac{\sum\limits_{i=1}^{n}\left|\frac{X_i-Y_i}{X_i}\right|}{n}.$$

Mean Squared Error (MSE)

MSE measures the average of the squared differences between the predicted and actual values:

$$\frac{\sum\limits_{i=1}^{n}(X_i-Y_i)^2}{n}.$$

Root Mean Squared Error (RMSE)

RMSE is the square root of the MSE, providing the measure of average prediction error in the same units as the target variable:

$$\sqrt{\text{MSE}}.$$

Root Mean Squared Percentage Error (RMSPE)

RMSPE is the square root of the average of the squared percentage differences between the predicted and actual values relative to the actual values:

$$\sqrt{\frac{\sum\limits_{i=1}^{n}\left(\frac{X_i-Y_i}{X_i}\right)^2}{n}}.$$

calc_kurtosis

The goal of calc_kurtosis is to compute a kurtosis coefficient.

calc_kurtosis(x = x)

Biased

The biased kurtosis coefficient, is given by:

$$\frac{\sum\limits_{i=1}^n(x_i - \bar{x})^4}{n*s_x^4},$$

where:

calc_kurtosis(x = x,type = "biased")

Excess

The excess kurtosis coefficient, is given by:

$$\frac{\sum\limits_{i=1}^n(x_i - \bar{x})^4}{n*s_x^4}-3,$$

where:

calc_kurtosis(x = x,type = "excess")

Percentile

The percentile kurtosis coefficient, is given by:

$$\frac{Q_3-Q_1}{P_{90}-P_{10}},$$ where:

calc_kurtosis(x = x,type = "percentile")

Unbiased

The unbiased kurtosis coefficient, is given by:

$$\frac{(n+1)n}{(n-1)(n-2)(n-3)}\frac{\sum\limits_{i=1}^n(x_i - \bar{x})^4}{ns_x^4} - 3\frac{(n-1)^2}{(n-2)*(n-3)},$$

where:

calc_kurtosis(x = x,type = "unbiased")

calc_mean

The goal of calc_mean is to compute the mean.

Arithmetic

Simple arithmetic mean

$$\frac{1}{n}\sum\limits_{i=1}^{n}x_i,$$ where:

calc_mean(x = 1:10,type = "arithmetic")

Weighted arithmetic mean

$$\frac{1}{\sum\limits_{i=1}^{n}w_i}\sum\limits_{i=1}^{n}w_ix_i,$$ where:

calc_mean(x = 1:10,type = "arithmetic",weight = 1:10)

Trimmed arithmetic mean

calc_mean(x = 1:10,type = "arithmetic",trim = .4)

Geometric

$$\sqrt[n]{\prod\limits_{i=1}^{n}x_i} = \sqrt[n]{x_1\times x_2 \times...\times x_n},$$

where:

calc_mean(x = 1:10,type = "geometric")

Harmonic

$$\frac{n}{\sum\limits_{i=1}^{n}\frac{1}{x_i}},$$ where:

calc_mean(x = 1:10,type = "harmonic")

calc_modality

The goal of calc_modality is to compute the number of modes.

calc_modality(x = c("a","a","b","b"))

calc_mode

The goal of calc_mode is to compute the mode.

set.seed(123);cat_var <- sample(letters,100,replace = TRUE)

table(cat_var)

We can see that the letter "y" appears the most, indicating that it is the variable's mode.

calc_mode(cat_var)

calc_peak_density

The goal of calc_peak_density is to compute the peak density value of a numeric value.

pd_plot <-
  ggplot2::ggplot(data = dplyr::tibble(x = x),
                  ggplot2::aes(x))+
  ggplot2::geom_density()+
  relper::plt_theme_y()+
  ggplot2::scale_x_continuous(breaks = 0:20, expand = c(0,0))+
  plt_flip_y_title+
  labs(y = "Density (x)")

pd_plot

Assume we want to know what the density's peak value is.

calc_peak_density(x)
x_peak <- relper::calc_peak_density(x)

pd_plot+
  ggplot2::geom_vline(
    xintercept = relper::calc_peak_density(x),
    col = "royalblue4",
    size = 1
    )+
  ggplot2::scale_x_continuous(
    breaks = 0:20,
    expand = c(0,0),
    sec.axis = sec_axis(~.,
                        breaks = x_peak,
                        labels = format_num(x_peak,digits = 3)
                        )
    )

calc_perc

The goal of calc_perc is to compute the percentage.

#without main_var
calc_perc(mtcars,grp_var = c(cyl,vs))

#main_var within grp_var
calc_perc(mtcars,grp_var = c(cyl,vs),main_var = vs)

#main_var not within grp_var
calc_perc(mtcars,grp_var = c(cyl),main_var = vs)

calc_skewness

The goal of calc_skewness is to compute a skewness coefficient.

calc_skewness(x = x)

Where different types of coefficients are provided, they are:

Bowley

The Bowley skewness coefficient, is given by:

$$\frac{Q_3+Q_1-2Q_2}{Q_3-Q_1},$$ where:

calc_skewness(x = x,type = "bowley")

Fisher-Pearson

The Fisher-Pearson skewness coefficient, is given by:

$$\frac{\sum_\limits{i=1}^{n}(x_i - \bar{x})^3}{n*(s_x)^3},$$

where:

calc_skewness(x = x,type = "fisher_pearson")

Kelly

The Kelly skewness coefficient, is given by:

$$\frac{P_{90}+P_{10}-2Q_2}{P_{90}-P_{10}},$$ where:

calc_skewness(x = x,type = "kelly")

Pearson median

The Pearson median skewness coefficent, or second skewness coefficient, is given by:

$$\frac{3(\bar{x}- \tilde{x})}{s_x},$$

where:

calc_skewness(x = x,type = "pearson_median")

Rao

The Rao skewness coefficient, is given by:

$$\frac{n/(n-1)}{\sqrt{(n-2)/n}},$$

where:

calc_skewness(x = x,type = "rao")


vbfelix/relper documentation built on May 10, 2024, 10:50 p.m.