```r
# Evaluate the exercise chunks only when the GLOBAL_EVAL environment variable is set
eval_mining <- FALSE
if (Sys.getenv("GLOBAL_EVAL") != "") eval_mining <- as.logical(Sys.getenv("GLOBAL_EVAL"))

library(wordcloud2)
library(sparklyr)
library(dplyr)
```

Text mining with sparklyr

For this example, two files will be analyzed: one contains the full works of Sir Arthur Conan Doyle, and the other the full works of Mark Twain. The files were downloaded from the Project Gutenberg site via the gutenbergr package. Intentionally, no data cleanup was done to the files prior to this analysis. See the appendix below for how the data was downloaded and prepared.

```r
readLines("/usr/share/class/books/arthur_doyle.txt", 30)
```

Data Import

Read the book data into Spark

  1. Load the sparklyr library

    ```r
    library(sparklyr)
    ```

  2. Open a Spark session

    ```r
    sc <- spark_connect(master = "local")
    ```

  3. Use the spark_read_text() function to read the mark_twain.txt file, assign it to a variable called twain

    ```r
    twain <- spark_read_text(sc, "twain", "/usr/share/class/books/mark_twain.txt")
    ```

  4. Use the spark_read_text() function to read the arthur_doyle.txt file, assign it to a variable called doyle
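
    One possible answer, mirroring the twain step above (the file path comes from the readLines() preview earlier):

    ```r
    doyle <- spark_read_text(sc, "doyle", "/usr/share/class/books/arthur_doyle.txt")
    ```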

Tidying data

Prepare the data for analysis

  1. Load the dplyr library

    ```r
    library(dplyr)
    ```

  2. Add a column to twain named author with a value of "twain". Assign it to a new variable called twain_id
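
    One possible answer; mutate() is translated to Spark SQL by dbplyr:

    ```r
    twain_id <- twain %>%
      mutate(author = "twain")
    ```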

  3. Add a column to doyle named author with a value of "doyle". Assign it to a new variable called doyle_id
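
    Same pattern as the previous step:

    ```r
    doyle_id <- doyle %>%
      mutate(author = "doyle")
    ```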

  4. Use sdf_bind_rows() to append the two files together in a variable called both
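
    One possible answer:

    ```r
    both <- sdf_bind_rows(twain_id, doyle_id)
    ```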

  5. Preview both
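
    Printing the variable shows the first few rows of the Spark table:

    ```r
    both
    ```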

  6. Filter out empty lines into a variable called all_lines
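
    One possible answer; spark_read_text() stores the text in a column named line:

    ```r
    all_lines <- both %>%
      filter(nchar(line) > 0)
    ```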

  7. Use Hive's regexp_replace() to remove punctuation, assign it to the same all_lines variable

    ```r
    all_lines <- all_lines %>%
      mutate(line = regexp_replace(line, "[_\"\'():;,.!?\\-]", " "))
    ```

Transform the data

Use feature transformers to make additional preparations

  1. Use ft_tokenizer() to separate each word in the line. Set the output_col to "word_list". Assign to a variable called word_list
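
    One possible answer:

    ```r
    word_list <- all_lines %>%
      ft_tokenizer(input_col = "line", output_col = "word_list")
    ```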

  2. Remove "stop words" with the ft_stop_words_remover() transformer. Set the output_col to "wo_stop_words". Assign to a variable called wo_stop
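
    One possible answer:

    ```r
    wo_stop <- word_list %>%
      ft_stop_words_remover(input_col = "word_list", output_col = "wo_stop_words")
    ```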

  3. Un-nest the tokens inside wo_stop_words using explode(). Assign to a variable called exploded
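
    One possible answer; explode() is not an R function, it is passed through to Hive:

    ```r
    exploded <- wo_stop %>%
      mutate(word = explode(wo_stop_words))
    ```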

  4. Select the word and author columns, and remove any word with fewer than 3 characters. Assign to all_words
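
    One possible answer:

    ```r
    all_words <- exploded %>%
      select(word, author) %>%
      filter(nchar(word) > 2)
    ```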

  5. Cache the all_words variable using compute()

    ```r
    all_words <- all_words %>%
      compute("all_words")
    ```

Data Exploration

Use word clouds to explore the data

  1. Create a variable with the word count by author, name it word_count
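
    One possible answer, sorted so that head() later returns the most frequent words:

    ```r
    word_count <- all_words %>%
      group_by(author, word) %>%
      tally() %>%
      arrange(desc(n))
    ```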

  2. Filter word_count to only retain "twain", assign it to twain_most

    ```r
    twain_most <- word_count %>%
      filter(author == "twain")
    ```

  3. Use wordcloud to visualize the top 50 words used by Twain

    ```r
    twain_most %>%
      head(50) %>%
      collect() %>%
      with(wordcloud::wordcloud(
        word,
        n,
        colors = c("#999999", "#E69F00", "#56B4E9", "#56B4E9")
      ))
    ```

  4. Filter word_count to only retain "doyle", assign it to doyle_most
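
    Same pattern as the twain_most step:

    ```r
    doyle_most <- word_count %>%
      filter(author == "doyle")
    ```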

  5. Use wordcloud to visualize the top 50 words used by Doyle that have more than 5 characters
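
    One possible answer, reusing the Twain word cloud pattern with an extra nchar() filter:

    ```r
    doyle_most %>%
      filter(nchar(word) > 5) %>%
      head(50) %>%
      collect() %>%
      with(wordcloud::wordcloud(
        word,
        n,
        colors = c("#999999", "#E69F00", "#56B4E9", "#56B4E9")
      ))
    ```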

  6. Use anti_join() to figure out which words are used by Doyle but not Twain. Order the results by number of words.
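
    One possible answer:

    ```r
    doyle_unique <- doyle_most %>%
      anti_join(twain_most, by = "word") %>%
      arrange(desc(n))
    ```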

  7. Use wordcloud to visualize the top 50 records from the previous step

    ```r
    doyle_unique %>%
      head(50) %>%
      collect() %>%
      with(wordcloud::wordcloud(
        word,
        n,
        colors = c("#999999", "#E69F00", "#56B4E9", "#56B4E9")
      ))
    ```

  8. Find out how many times Twain used the word "sherlock"
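
    One possible answer; twain_most already holds one row per word with its count in n, and ft_tokenizer() lowercased the words:

    ```r
    twain_most %>%
      filter(word == "sherlock")
    ```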

  9. Against the twain variable, use Hive's lower() to lowercase each line and instr() to look for "sherlock" in it
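
    One possible answer; lower() and instr() are passed through to Hive:

    ```r
    twain %>%
      mutate(line = lower(line)) %>%
      filter(instr(line, "sherlock") > 0) %>%
      select(line)
    ```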

  10. Close the Spark session

    ```r
    spark_disconnect(sc)
    ```

Most of these lines appear in a short story by Mark Twain called A Double Barrelled Detective Story. According to the Wikipedia page about this story, it is a satire by Twain of the mystery novel genre, published in 1902.


