In kilhwan/biztextp: Business Text Mining Practice

library(learnr)
library(tidyverse)
library(textdata)
library(tidytext)
library(biztextp)
library(wordcloud)

knitr::opts_chunk$set(echo = FALSE)

# Data for the AFINN examples
tc <- commencement %>% mutate(para_number = row_number()) %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words)
sent_mat <- commencement %>% mutate(para_number = row_number()) %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words) %>%
  inner_join(sentiments) %>%
  count(para_number, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = list(n = 0)) %>%
  mutate(sent = positive - negative) %>%
  select(para_number, sent)

Sentiment Lexicons

The following is the sentiments sentiment lexicon provided by the tidytext package.

sentiments

Print the frequency of each sentiment category (sentiment) in the sentiments lexicon.

sentiments %>% count(...)

sentiments %>% count(sentiment)

The following is the nrc sentiment lexicon provided by the textdata package.

get_sentiments("nrc")

Use the get_sentiments function to load the nrc sentiment lexicon, then print the frequency of each sentiment category (sentiment) in it. Sort the output so that the most frequent sentiments appear first.

get_sentiments(...) %>% count(...)

get_sentiments("nrc") %>% count(sentiment, sort=TRUE)
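As a small aside, the sorted count can be tried on data that needs no lexicon download. The sketch below uses a hypothetical six-word mini-lexicon (toy_lexicon is made up for illustration and is not the real nrc data) and shows that count(..., sort = TRUE) gives the same table as grouping, summarising, and arranging by hand.

```r
library(dplyr)
library(tibble)

# Hypothetical mini-lexicon, for illustration only (not the real nrc data)
toy_lexicon <- tribble(
  ~word,    ~sentiment,
  "happy",  "joy",
  "gift",   "joy",
  "afraid", "fear",
  "trust",  "trust",
  "loyal",  "trust",
  "honest", "trust"
)

# count() with sort = TRUE puts the most frequent category first
by_count <- toy_lexicon %>% count(sentiment, sort = TRUE)

# The same table written out step by step
by_hand <- toy_lexicon %>%
  group_by(sentiment) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

by_count
```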

The following is the afinn sentiment lexicon provided by the textdata package.

get_sentiments("afinn")

Use the get_sentiments function to load the afinn sentiment lexicon, then draw a bar chart of the polarity values (value) of its words.

get_sentiments(...) %>% 
  ggplot() + geom_bar(aes(...))

get_sentiments("afinn") %>% 
  ggplot() + geom_bar(aes(value))

Analyzing Sentiment Change in a Text

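Use the commencement data from the biztextp package to solve the following exercises. The commencement data is the address Steve Jobs gave at the Stanford University commencement ceremony.

author: the author.
text: a character string holding the text, one paragraph per row.

If the text column is not visible, click the right arrow shown at the top of the output.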

commencement

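Modify the following command to number the paragraphs of the commencement data and add the numbers as a column named para_number. If the new column is not visible in the result, click the right arrow shown at the top of the output.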

commencement %>% mutate(... = ...)

commencement %>% mutate(para_number = row_number())

Starting from the result with para_number added, modify the following command to tokenize the text column into words. Name the column of tokens word.

commencement %>% mutate(... = ...) %>%
  unnest_tokens(output = ..., input = ...)

commencement %>% mutate(para_number = row_number()) %>%
  unnest_tokens(output = word, input = text)

From the tokenized result, remove the stop words using the stop_words stop word lexicon.

commencement %>% mutate(... = ...) %>%
  unnest_tokens(output = ..., input = ...) %>%
  anti_join(...)

commencement %>% mutate(para_number = row_number()) %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words)

From the previous result, extract the sentiment words using the sentiments lexicon.

commencement %>% mutate(... = ...) %>%
  unnest_tokens(output = ..., input = ...) %>%
  anti_join(...) %>%
  inner_join(...)

commencement %>% mutate(para_number = row_number()) %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words) %>%
  inner_join(sentiments)
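The two joins do opposite jobs, which is easy to check on toy data. In the sketch below, toy_stop_words and toy_lexicon are hypothetical stand-ins for stop_words and sentiments, and the joins name their key with by = "word" so that no join message is printed.

```r
library(dplyr)
library(tibble)

# Toy tokens, one word per row, shaped like the output of unnest_tokens()
tokens <- tibble(
  para_number = c(1, 1, 1, 2, 2, 2),
  word        = c("the", "love", "dogma", "a", "great", "death")
)

# Hypothetical stand-ins for stop_words and sentiments
toy_stop_words <- tibble(word = c("the", "a"))
toy_lexicon <- tibble(
  word      = c("love", "great", "death"),
  sentiment = c("positive", "positive", "negative")
)

sentiment_words <- tokens %>%
  anti_join(toy_stop_words, by = "word") %>%  # drop rows that match a stop word
  inner_join(toy_lexicon, by = "word")        # keep only rows found in the lexicon

sentiment_words
```

Here "dogma" survives the stop word filter but is dropped by the inner join because the toy lexicon has no entry for it.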

From the previous result, count the number of positive and negative words in each paragraph.

commencement %>% mutate(... = ...) %>%
  unnest_tokens(output = ..., input = ...) %>%
  anti_join(...) %>%
  inner_join(...) %>%
  count(..., ...)

commencement %>% mutate(para_number = row_number()) %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words) %>%
  inner_join(sentiments) %>%
  count(para_number, sentiment)

Convert the previous result to wide format, so that the values of the sentiment column become column names and n supplies the value of each column. Also make cells with no data take the value 0.

commencement %>% mutate(... = ...) %>%
  unnest_tokens(output = ..., input = ...) %>%
  anti_join(...) %>%
  inner_join(...) %>%
  count(..., ...) %>%
  pivot_wider(names_from = ..., values_from = ..., values_fill = list(... = ...))

commencement %>% mutate(para_number = row_number()) %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words) %>%
  inner_join(sentiments) %>%
  count(para_number, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = list(n = 0))

To score the sentiment of each paragraph, append to the previous result a sent column holding the positive column minus the negative column.

commencement %>% mutate(... = ...) %>%
  unnest_tokens(output = ..., input = ...) %>%
  anti_join(...) %>%
  inner_join(...) %>%
  count(..., ...) %>%
  pivot_wider(names_from = ..., values_from = ..., values_fill = list(... = ...)) %>%
  mutate(... = ...)

commencement %>% mutate(para_number = row_number()) %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words) %>%
  inner_join(sentiments) %>%
  count(para_number, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = list(n = 0)) %>%
  mutate(sent = positive - negative)
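The effect of values_fill is easiest to see when some paragraph lacks one of the two sentiments. The following is a minimal sketch on made-up data (the rows in sentiment_words are hypothetical), not the output for the real speech.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy sentiment words per paragraph; paragraphs 2 and 3 have no negative word
sentiment_words <- tibble(
  para_number = c(1, 1, 1, 2, 3, 3),
  sentiment   = c("positive", "negative", "negative",
                  "positive", "positive", "positive")
)

para_scores <- sentiment_words %>%
  count(para_number, sentiment) %>%
  # Missing para_number/sentiment combinations become 0 instead of NA
  pivot_wider(names_from = sentiment, values_from = n,
              values_fill = list(n = 0)) %>%
  mutate(sent = positive - negative)

para_scores
```

Without values_fill the negative cells for paragraphs 2 and 3 would be NA, and so would their sent scores. Recent tidyr versions should also accept the shorter values_fill = 0.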

Using the previous result, draw a graph of how the sentiment changes as the paragraphs progress. Draw a bar chart with para_number on the horizontal axis and sent on the vertical axis.

commencement %>% mutate(... = ...) %>%
  unnest_tokens(output = ..., input = ...) %>%
  anti_join(...) %>%
  inner_join(...) %>%
  count(..., ...) %>%
  pivot_wider(names_from = ..., values_from = ..., values_fill = list(... = ...)) %>%
  mutate(... = ...) %>%
  ggplot() + geom_col(aes(..., ...))

commencement %>% mutate(para_number = row_number()) %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words) %>%
  inner_join(sentiments) %>%
  count(para_number, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = list(n = 0)) %>%
  mutate(sent = positive - negative) %>%
  ggplot() + geom_col(aes(para_number, sent))

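In the previous graph, give the bars different fill colors by labeling a paragraph "긍정" (positive) when its sentiment score (sent) is greater than 0 and "부정" (negative) when it is less than or equal to 0.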

commencement %>% mutate(... = ...) %>%
  unnest_tokens(output = ..., input = ...) %>%
  anti_join(...) %>%
  inner_join(...) %>%
  count(..., ...) %>%
  pivot_wider(names_from = ..., values_from = ..., values_fill = list(... = ...)) %>%
  mutate(... = ...) %>%
  ggplot() + geom_col(aes(..., ..., fill = ifelse(... > 0 , ..., ...)))

commencement %>% mutate(para_number = row_number()) %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words) %>%
  inner_join(sentiments) %>%
  count(para_number, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = list(n = 0)) %>%
  mutate(sent = positive - negative) %>%
  ggplot() + geom_col(aes(para_number, sent, fill = ifelse(sent > 0, "긍정", "부정")))

In the previous graph, change the fill legend label to "감성" (sentiment), the x-axis label to "문단 번호" (paragraph number), and the y-axis label to "감성 점수" (sentiment score).

commencement %>% mutate(... = ...) %>%
  unnest_tokens(output = ..., input = ...) %>%
  anti_join(...) %>%
  inner_join(...) %>%
  count(..., ...) %>%
  pivot_wider(names_from = ..., values_from = ..., values_fill = list(... = ...)) %>%
  mutate(... = ...) %>%
  ggplot() + geom_col(aes(..., ..., fill = ifelse(... > 0 , ..., ...))) +
  labs(... = ..., ... = ..., ... = ...)

commencement %>% mutate(para_number = row_number()) %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words) %>%
  inner_join(sentiments) %>%
  count(para_number, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = list(n = 0)) %>%
  mutate(sent = positive - negative) %>%
  ggplot() + geom_col(aes(para_number, sent, fill = ifelse(sent > 0, "긍정", "부정"))) +
  labs(fill = "감성", x = "문단 번호", y = "감성 점수")

Analyzing Sentiment Change in a Text with the AFINN Lexicon

The following is the result of tokenizing the commencement data in the previous step. It is currently stored in a variable named tc.

tc

Extract the sentiment words from tc using the afinn sentiment lexicon.

tc %>%  inner_join(...)

tc %>% inner_join(get_sentiments("afinn"))

From the previous result, compute a polarity score for each paragraph by summing value within the paragraph.

tc %>%  inner_join(...) %>%
  count(..., wt = ...)

tc %>% inner_join(get_sentiments("afinn")) %>%
  count(para_number, wt = value)
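Passing wt to count() turns the row count into a weighted sum, which is why the result column is still called n. A minimal sketch with hypothetical scores (the value entries below are made up for illustration) compares it with an explicit group_by() and summarise():

```r
library(dplyr)
library(tibble)

# Toy AFINN-style matches: one row per word with a hypothetical score
scored_words <- tibble(
  para_number = c(1, 1, 2, 2, 2),
  word        = c("love", "lost", "great", "wonderful", "fear"),
  value       = c(3, -3, 3, 4, -2)
)

# count() with wt sums value within each paragraph and stores it in n
weighted <- scored_words %>% count(para_number, wt = value)

# The same computation spelled out
summed <- scored_words %>%
  group_by(para_number) %>%
  summarise(n = sum(value))

weighted
```

Paragraph 1 sums to 0 because its positive and negative words cancel, which a plain word count would not reveal.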

Using the previous result, draw a graph of how the sentiment changes as the paragraphs progress. Draw a bar chart with para_number on the horizontal axis and n on the vertical axis.

tc %>%  inner_join(...) %>%
  count(..., wt = ...) %>%
  ggplot() + geom_col(aes(..., ...))

tc %>% inner_join(get_sentiments("afinn")) %>%
  count(para_number, wt = value) %>%
  ggplot() + geom_col(aes(para_number, n))

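In the previous graph, give the bars different fill colors by labeling a paragraph "긍정" (positive) when its sentiment score (n) is greater than 0 and "부정" (negative) when it is less than or equal to 0. Also change the fill legend label to "감성" (sentiment), the x-axis label to "문단 번호" (paragraph number), and the y-axis label to "감성 점수" (sentiment score).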

tc %>%  inner_join(...) %>%
  count(..., wt = ...) %>%
  ggplot() + geom_col(aes(..., ..., fill = ifelse(... > 0 , ..., ...))) +
  labs(... = ..., ... = ..., ... = ...)

tc %>% inner_join(get_sentiments("afinn")) %>%
  count(para_number, wt = value) %>%
  ggplot() + geom_col(aes(para_number, n, fill = ifelse(n > 0, "긍정", "부정"))) +
  labs(fill = "감성", x = "문단 번호", y = "감성 점수")

The following is the result of the paragraph-level sentiment analysis of commencement carried out in the previous step with the sentiments lexicon. It is stored in sent_mat.

sent_mat

Inner join the result of the afinn-based sentiment analysis above with sent_mat by paragraph number (para_number).

tc %>%  inner_join(...) %>%
  count(..., wt = ...) %>%
  inner_join(...)

tc %>% inner_join(get_sentiments("afinn")) %>%
  count(para_number, wt = value) %>%
  inner_join(sent_mat)

From the previous result, draw a scatter plot of the sentiment scores from the afinn and sentiments lexicons. Put the afinn result on the horizontal axis and the sentiments result on the vertical axis.

tc %>%  inner_join(...) %>%
  count(..., wt = ...) %>%
  inner_join(...) %>%
  ggplot() + geom_point(aes(..., ...))

tc %>% inner_join(get_sentiments("afinn")) %>%
  count(para_number, wt = value) %>%
  inner_join(sent_mat) %>%
  ggplot() + geom_point(aes(n, sent))
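Two details of this comparison can be checked on toy data: the inner join keeps only paragraphs scored by both lexicons, and the agreement seen in the scatter plot can be summarised with a correlation. The sketch below uses made-up scores (afinn_scores and lexicon_scores are hypothetical), and cor() is an addition for illustration rather than part of the exercise.

```r
library(dplyr)
library(tibble)

# Hypothetical paragraph scores from two lexicons; paragraph 3 is missing
# from the second table because none of its words matched that lexicon
afinn_scores   <- tibble(para_number = c(1, 2, 3, 4, 5),
                         n = c(4, -2, 0, 6, -5))
lexicon_scores <- tibble(para_number = c(1, 2, 4, 5),
                         sent = c(2, -1, 3, -2))

# Only paragraphs present in both tables survive the inner join
joined <- afinn_scores %>% inner_join(lexicon_scores, by = "para_number")

# Pearson correlation between the two scores
agreement <- cor(joined$n, joined$sent)

joined
```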

Visualizing Positive and Negative Word Frequencies

The following is the result of tokenizing the commencement data in the previous step. It is currently stored in a variable named tc.

tc

Extract the sentiment words from tc using the sentiments lexicon.

tc %>% inner_join(...)

tc %>% inner_join(sentiments)

Using the previous result, print how often each combination of word (word) and sentiment (sentiment) occurs in the whole text.

tc %>% inner_join(...) %>%
  count(..., ...)

tc %>% inner_join(sentiments) %>%
  count(word, sentiment)

From the previous result, keep only the words that occur at least 2 times.

tc %>% inner_join(...) %>%
  count(..., ...) %>%
  filter(... )

tc %>% inner_join(sentiments) %>%
  count(word, sentiment) %>%
  filter(n >= 2)

Using the previous result, draw a bar chart of the frequency of each word, with the horizontal and vertical axes flipped.

tc %>% inner_join(...) %>%
  count(..., ..., sort = ...) %>%
  filter(... ) %>%
  ggplot() + geom_col(aes(..., ...)) +
  coord_....()

tc %>% inner_join(sentiments) %>%
  count(word, sentiment, sort = TRUE) %>%
  filter(n >= 2) %>%
  ggplot() + geom_col(aes(word, n)) +
  coord_flip()

In the previous graph, make words with higher frequency appear nearer the top.

tc %>% inner_join(...) %>%
  count(..., ..., sort = ...) %>%
  filter(... ) %>%
  mutate(word = reorder(..., ...)) %>%
  ggplot() + geom_col(aes(..., ...)) +
  coord_....()

tc %>% inner_join(sentiments) %>%
  count(word, sentiment, sort = TRUE) %>%
  filter(n >= 2) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot() + geom_col(aes(word, n)) +
  coord_flip()
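The reordering works through factor levels: ggplot2 draws a discrete axis in level order, and reorder() sorts the levels by the second argument. A minimal base R sketch with made-up counts (word_counts is hypothetical) shows the change in level order:

```r
# Hypothetical word frequencies, for illustration only
word_counts <- data.frame(
  word = c("death", "love", "great"),
  n    = c(3, 5, 2)
)

# A plain factor sorts its levels alphabetically
alphabetical <- levels(factor(word_counts$word))

# reorder() sorts the levels by increasing n instead
ordered_word <- reorder(word_counts$word, word_counts$n)
by_frequency <- levels(ordered_word)

by_frequency  # "great" "death" "love"
```

After coord_flip() the last level is drawn at the top, so the most frequent word ends up there.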

In the previous graph, show the word frequencies for positive and negative sentiment in separate panels. Allow the vertical axis listing the words to differ between the two panels.

tc %>% inner_join(...) %>%
  count(..., ..., sort = ...) %>%
  filter(... ) %>%
  mutate(word = reorder(..., ...)) %>%
  ggplot() + geom_col(aes(..., ...)) +
  coord_....() +
  facet_wrap(..., scales = ...)

tc %>% inner_join(sentiments) %>%
  count(word, sentiment, sort = TRUE) %>%
  filter(n >= 2) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot() + geom_col(aes(word, n)) +
  coord_flip() + 
  facet_wrap(~ sentiment, scales = "free_y")