In youngroklee-ml/dmtr: Functions and Data for "Data Mining Techniques"

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(dmtr)
library(purrr)
library(dplyr)
library(stringr)

데이터

data(binaryclass2, package = "dmtr")

knitr::kable(
  binaryclass2,
  align = c('r', 'r', 'r', 'r'),
  caption = '판별분석 학습표본 데이터'
)

피셔 판별 함수

각 객체는 변수벡터 $\mathbf{x} \in \mathbb{R}^p$와 범주 $y \in {1, 2}$로 이루어진다고 하자. 아래는 변수 $\mathbf{x}$의 기대치와 분산-공분산행렬(varinace-covariance matrix)을 나타낸다.

\begin{eqnarray} \boldsymbol\mu_1 = E(\mathbf{x} | y = 1)\ \boldsymbol\mu_2 = E(\mathbf{x} | y = 2)\ \boldsymbol\Sigma = Var(\mathbf{x} | y = 1) = Var(\mathbf{x} | y = 2) \end{eqnarray}

다음과 같이 변수들의 선형조합으로 새로운 변수 $z$를 형성하는 함수를 피셔 판별함수(Fisher's discriminant function)라 한다.

\begin{equation} z = \mathbf{w}^\top \mathbf{x} (#eq:fisher-discriminant-function) \end{equation}

여기서 계수벡터 $\mathbf{w} \in \mathbb{R}^p$는 통상 아래와 같이 변수 $z$의 범주간 평균 차이 대 변수 $z$의 분산의 비율을 최대화하는 것으로 결정한다.

\begin{equation} {\arg!\min}_{\mathbf{w}} \frac{\mathbf{w}^\top ( \boldsymbol\mu_1 - \boldsymbol\mu_2 )}{\mathbf{w}^\top \boldsymbol\Sigma \mathbf{w}} (#eq:fisher-discriminant-function-coef) \end{equation}

위 식 \@ref(eq:fisher-discriminant-function-coef)의 해는

\begin{equation} \mathbf{w} \propto \boldsymbol\Sigma^{-1}(\boldsymbol\mu_1 - \boldsymbol\mu_2) \end{equation}

의 조건을 만족하며, 편의상 비례상수를 1로 두면 아래와 같은 해가 얻어진다.

\begin{equation} \mathbf{w} = \boldsymbol\Sigma^{-1}(\boldsymbol\mu_1 - \boldsymbol\mu_2) (#eq:fisher-discriminant-function-coef-sol) \end{equation}

실제 모집단의 평균 및 분산을 알지 못하는 경우, 학습표본으로부터 $\boldsymbol\mu_1, \boldsymbol\mu_2, \boldsymbol\Sigma$의 추정치를 얻어 식 \@ref(eq:fisher-discriminant-function-coef-sol)에 대입하는 방식으로 판별계수를 추정한다. 자세한 내용은 교재 [@jun2012datamining] 참조.

Table \@ref(tab:fisher-data-print)에 주어진 학습표본을 이용하여 피셔 판별함수를 구해보도록 하자. 우선 각 범주별 관측수 $n_1, n_2$, 평균벡터 $\hat{\boldsymbol\mu}_1, \hat{\boldsymbol\mu}_2$, 그리고 범주별 표본 분산-공분산행렬 $\mathbf{S}_1, \mathbf{S}_2$를 다음과 같이 함수 group_summary()를 사용하여 구한다.

summary_within_group <- group_summary(binaryclass2, class, c(x1, x2))
print(summary_within_group)

위에서 얻은 결과를 이용하여 합동 분산-공분산행렬을 아래와 같이 추정한다.

\begin{equation} \hat{\boldsymbol\Sigma} = \mathbf{S}_p = \frac{(n_1 - 1)\mathbf{S}_1 + (n_2 - 1)\mathbf{S}_2}{n_1 + n_2 - 2} \end{equation}

sigma_hat <- pooled_variance(binaryclass2, class, c(x1, x2))
print(sigma_hat)

위에서 구한 추정치들을 이용하여 아래와 같이 판별함수 계수 추정치 $\hat{\mathbf{w}}$를 구한다. \begin{equation} \hat{\mathbf{w}} = \hat{\boldsymbol\Sigma}^{-1}(\hat{\boldsymbol\mu}_1 - \hat{\boldsymbol\mu}_2) \end{equation}

mu_hat <- map(summary_within_group, ~ .[["mean"]])
w_hat <- solve(sigma_hat) %*% 
  as.matrix(mu_hat[[1]] - mu_hat[[2]], ncol = 1L) %>%
  as.vector()
print(w_hat)

위 피셔 판별함수를 구하는 과정을 하나의 함수 fisher_ld()로 구현하였다.

w_hat <- fisher_ld(binaryclass2, class, c(x1, x2))
print(w_hat)

분류 규칙

피셔 판별함수에 따른 분류 경계값은 학습표본에 대한 판별함수값의 평균으로 아래와 같이 구할 수 있다.

\begin{equation} \bar{z} = \frac{1}{N} \sum_i^N \hat{\mathbf{w}}^\top \mathbf{x}_i \end{equation}

z_mean <- binaryclass2 %>% 
  select(x1, x2) %>% 
  summarize_all(mean) %>%
  `*`(w_hat) %>%
  sum()
print(z_mean)

위 분류 경계값을 구하는 과정을 함수 fisher_ld_threshold()로 구현하였다.

z_mean <- fisher_ld_threshold(binaryclass2, class, c(x1, x2))
print(z_mean)

위 결과를 통해, 분류규칙은 다음과 같이 주어진다.

$\hat{\mathbf{w}}^\top \mathbf{x} \ge \bar{z}$ 이면, $\mathbf{x}$를 범주 1로 분류
$\hat{\mathbf{w}}^\top \mathbf{x} < \bar{z}$ 이면, $\mathbf{x}$를 범주 2로 분류

class_level <- attr(summary_within_group, "group")

prediction_df <- binaryclass2 %>%
  mutate(
    z = w_hat[1] * x1 + w_hat[2] * x2,
    .pred_class = factor(
      if_else(z >= z_mean, class_level[1], class_level[2]), 
      levels = class_level
    )
  )

위 분류 과정을 함수 fisher_ld_prediction()을 이용하여 아래와 같이 수행할 수 있다.

prediction_df <- fisher_ld_prediction(
  .w = w_hat, 
  .z = z_mean, 
  .newdata = binaryclass2, 
  .xvar = c(x1, x2),
  .levels = attr(w_hat, "group")
)

분류 결과는 아래와 같다.

knitr::kable(
  prediction_df,
  align = c('r', 'r', 'r', 'r', 'r', 'r'),
  caption = '학습표본에 대한 피셔 분류 결과'
)

위 결과 객체 r prediction_df %>% filter(class != .pred_class) %>% pull(id) %>% paste(collapse = ", ")가 오분류된다.

의사결정론에 의한 분류규칙

다음과 같이 객체가 각 범주에 속할 사전확률과 각 범주 내에서의 분류변수의 확률밀도함수에 대한 기호를 정의한다.

$\pi_k$: 임의의 객체가 범주 $k$에 속할 사전확률
$f_k(\mathbf{x})$: 범주 $k$에 대한 변수의 확률밀도함수

이 때 통상적으로 $\mathbf{x}$는 다변량 정규분포를 따르는 것으로 가정하여 아래와 같이 평균벡터 $\boldsymbol\mu_k$와 분산-공분산행렬 $\boldsymbol\Sigma$로 확률밀도함수를 정의할 수 있다. 이 때 분산-공분산행렬 $\boldsymbol\Sigma$는 모든 범주에 대해 동일하다고 가정한다.

\begin{equation} f_k(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}|\boldsymbol\Sigma|^{1/2}} \exp { -\frac{1}{2} \left(\mathbf{x} - \boldsymbol\mu_k\right)^\top \boldsymbol\Sigma^{-1} \left(\mathbf{x} - \boldsymbol\mu_k\right) } (#eq:mv-gaussian-dist) \end{equation}

각 범주에 대한 판별함수를

\begin{equation} u_k(\mathbf{x}) = \boldsymbol\mu_k^\top \boldsymbol\Sigma^{-1}\mathbf{x} - \frac{1}{2} \boldsymbol\mu_k^\top \boldsymbol\Sigma^{-1} \boldsymbol\mu_k + \ln \pi_k (#eq:lda-discriminant-function) \end{equation}

로 정의한다. 위 판별함수를 함수 ld_fun()로 구현하였다.

f <- ld_fun(binaryclass2, class, c(x1, x2))

위 함수를 이용하여 학습표본에 대해 판별함수값을 다음과 같이 얻을 수 있다.

u_df <- map_dfc(f, 
  ~ binaryclass2 %>% 
    select(x1, x2) %>%
    transmute(u = apply(., 1, .x))
)

names(u_df) <- str_c(".score", attr(f, "group"), sep = "_")
score_df <- bind_cols(binaryclass2, u_df)

knitr::kable(
  score_df,
  align = c('r', 'r', 'r', 'r', 'r', 'r'),
  digits = 2L,
  caption = '학습표본에 대한 LDA 적용 결과: 판별함수값'
)

베이즈 정리(Bayes's theorem)에 따라 변수 $\mathbf{x}$값이 주어졌을 때 범주 $k$에 속할 사후확률(posterior)은 아래와 같이 구할 수 있다.

\begin{equation} P(y = k \, | \, \mathbf{x}) = \frac{\pi_k f_k(\mathbf{x})}{f(\mathbf{x})} (#eq:lda-posterior) \end{equation}

p_df <- u_df %>%
  mutate_all(exp) %>%
  mutate_all(function(x) x / rowSums(.))

names(p_df) <- str_c(".pred", attr(f, "group"), sep = "_")
posterior_df <- bind_cols(binaryclass2, p_df)

knitr::kable(
  posterior_df,
  align = c('r', 'r', 'r', 'r', 'r', 'r'),
  digits = 2L,
  caption = '학습표본에 대한 LDA 적용 결과: 사후확률값'
)

최종적으로, 각 범주에 대한 사후확률이 높은 쪽으로 범주를 추정한다.

\begin{equation} \hat{y} = \begin{cases} 1, & \text{if } P(y = 1 \, | \, \mathbf{x}) \ge P(y = 2 \, | \, \mathbf{x})\ 2, & \text{otherwise} \end{cases} (#eq:lda-posterior-rule) \end{equation}

yhat_df <- tibble(
  .pred_class = attr(f, "group")[apply(p_df, 1, which.max)]
)

prediction_df <- bind_cols(posterior_df, yhat_df)

knitr::kable(
  prediction_df,
  align = c('r', 'r', 'r', 'r', 'r', 'r', 'r'),
  digits = 2L,
  caption = '학습표본에 대한 LDA 적용 결과: 사후확률값 및 추정범주'
)

위와 같이 새 데이터에 대해 판별함수값, 사후확률 및 범주추정을 수행하는 함수를 predict_da()로 구현하였다.

predict_da(f, .new_data = binaryclass2, c(x1, x2), 
  .include_score = TRUE,
  .include_posterior = TRUE,
  .include_class = TRUE)

이차 판별함수

각 범주의 확률밀도함수는 아래와 같이 다변량 정규분포로 정의된다.

\begin{equation} f_k(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}|\boldsymbol\Sigma_k|^{1/2}} \exp { -\frac{1}{2} \left(\mathbf{x} - \boldsymbol\mu_k\right)^\top \boldsymbol\Sigma_k^{-1} \left(\mathbf{x} - \boldsymbol\mu_k\right) } (#eq:qda-mv-gaussian-dist) \end{equation}

위 식 \@ref(eq:qda-mv-gaussian-dist)이 선형판별함수에서 사용한 식 \@ref(eq:mv-gaussian-dist)과 다른 부분은 분산-공분산분포 $\boldsymbol\Sigma_k$가 범주 $k$에 대해 각각 정의된다는 점이다.

이 경우 각 범주에 대한 판별함수는 아래와 같이 정의된다.

\begin{equation} u_k(\mathbf{x}) = - \frac{1}{2} (\mathbf{x} - \boldsymbol\mu_k)^\top \boldsymbol\Sigma_k^{-1} (\mathbf{x} - \boldsymbol\mu_k) - \frac{1}{2} \ln \left| \boldsymbol\Sigma_k \right| + \ln \pi_k (#eq:qda-discriminant-function) \end{equation}

위 판별함수를 데이터로부터 얻는 함수를 qd_fun()로 구현하였다.

f <- qd_fun(binaryclass2, class, c(x1, x2))

이후, 선형판별함수에서와 같이 함수 predict_df()를 이용하여 판별함수값, 사후확률 및 범주추정값을 얻을 수 있다.

prediction_df <- predict_da(f, .new_data = binaryclass2, c(x1, x2), 
  .include_score = TRUE,
  .include_posterior = TRUE,
  .include_class = TRUE)

prediction_df <- bind_cols(binaryclass2, prediction_df)

knitr::kable(
  prediction_df,
  align = c('r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r'),
  digits = 2L,
  caption = '학습표본에 대한 QDA 적용 결과: 판별함수값, 사후확률 및 추정범주'
)

세 범주 이상의 분류

데이터

3개의 범주를 지닌 붓꽃(iris) 데이터에 대해 판별분석을 수행하고자 한다.

data(iris90, package = "dmtr")

선형 판별 분석

함수 ld_fun()을 이용하여 선형 판별함수를 구하고, predict_da()를 이용하여 판별함수 값을 구하고 범주를 추정한다.

f <- ld_fun(iris90, class, x1:x4)

new_data <- tibble::tribble(
  ~x1, ~x2, ~x3, ~x4, ~class,
  5.1, 3.5, 1.4, 0.2, "setosa",
  7.0, 3.2, 4.7, 1.4, "versicolor",
  6.3, 3.3, 6.0, 2.5, "virginica"
) %>% 
  mutate(
    class = factor(class, levels = c("setosa", "versicolor", "virginica"))
  )

predict_df <- predict_da(f, .new_data = new_data, .xvar = x1:x4, 
  .include_score = TRUE)

predict_df <- bind_cols(new_data, predict_df)

knitr::kable(
  predict_df,
  align = rep('r', ncol(prediction_df)),
  digits = 2L,
  caption = '붗꽃 표본에 대한 LDA 적용 결과: 판별함수값 및 추정범주'
)

youngroklee-ml/dmtr documentation built on June 12, 2022, 6:24 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

youngroklee-ml/dmtr
Functions and Data for "Data Mining Techniques"

In youngroklee-ml/dmtr: Functions and Data for "Data Mining Techniques"

데이터

피셔 판별 함수

분류 규칙

의사결정론에 의한 분류규칙

이차 판별함수

세 범주 이상의 분류

데이터

선형 판별 분석

R Package Documentation

Browse R Packages

We want your feedback!

youngroklee-ml/dmtr Functions and Data for "Data Mining Techniques"

In youngroklee-ml/dmtr: Functions and Data for "Data Mining Techniques"

데이터

피셔 판별 함수

분류 규칙

의사결정론에 의한 분류규칙

이차 판별함수

세 범주 이상의 분류

데이터

선형 판별 분석

R Package Documentation

Browse R Packages

We want your feedback!

youngroklee-ml/dmtr
Functions and Data for "Data Mining Techniques"