```r
# Evaluate the exercise chunks only when the GLOBAL_EVAL environment variable is set
eval_mining <- FALSE
if (Sys.getenv("GLOBAL_EVAL") != "") eval_mining <- as.logical(Sys.getenv("GLOBAL_EVAL"))

library(wordcloud2)
library(sparklyr)
library(dplyr)
```

Text mining with sparklyr

For this example, two files will be analyzed: one contains the full works of Sir Arthur Conan Doyle, and the other the full works of Mark Twain. The files were downloaded from the Project Gutenberg site via the gutenbergr package. Intentionally, no data cleanup was done to the files prior to this analysis. See the appendix below for how the data was downloaded and prepared.

```r
readLines("/usr/share/class/books/arthur_doyle.txt", 30)
```

Data Import

Read the book data into Spark

  1. Load the sparklyr library

    ```r
    library(sparklyr)
    ```

  2. Open a Spark session

    ```r
    sc <- spark_connect(master = "local")
    ```

  3. Use the spark_read_text() function to read the mark_twain.txt file, assign it to a variable called twain

    ```r
    twain <- spark_read_text(sc, "twain", "/usr/share/class/books/mark_twain.txt")
    ```

  4. Use the spark_read_text() function to read the arthur_doyle.txt file, assign it to a variable called doyle
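
    One possible answer, mirroring the twain step above (the file path comes from the readLines() preview earlier):

    ```r
    doyle <- spark_read_text(sc, "doyle", "/usr/share/class/books/arthur_doyle.txt")
    ```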

Tidying data

Prepare the data for analysis

  1. Load the dplyr library

    ```r
    library(dplyr)
    ```

  2. Add a column to twain named author with a value of "twain". Assign it to a new variable called twain_id
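
    One possible answer; mutate() is translated to Spark SQL by dbplyr:

    ```r
    twain_id <- twain %>%
      mutate(author = "twain")
    ```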

  3. Add a column to doyle named author with a value of "doyle". Assign it to a new variable called doyle_id
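
    Same pattern as the previous step:

    ```r
    doyle_id <- doyle %>%
      mutate(author = "doyle")
    ```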

  4. Use sdf_bind_rows() to append the two files together in a variable called both
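
    One possible answer:

    ```r
    both <- sdf_bind_rows(twain_id, doyle_id)
    ```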

  5. Preview both
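
    Printing the variable shows the first few rows of the Spark table:

    ```r
    both
    ```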

  6. Filter out empty lines into a variable called all_lines
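
    One possible answer; spark_read_text() stores the text in a column named line:

    ```r
    all_lines <- both %>%
      filter(nchar(line) > 0)
    ```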

  7. Use Hive's regexp_replace() to remove punctuation, assign it to the same all_lines variable

    ```r
    all_lines <- all_lines %>%
      mutate(line = regexp_replace(line, "[_\"\'():;,.!?\\-]", " "))
    ```

Transform the data

Use feature transformers to make additional preparations

  1. Use ft_tokenizer() to separate each word in the line. Set the output_col to "word_list". Assign to a variable called word_list
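
    One possible answer:

    ```r
    word_list <- all_lines %>%
      ft_tokenizer(input_col = "line", output_col = "word_list")
    ```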

  2. Remove "stop words" with the ft_stop_words_remover() transformer. Set the output_col to "wo_stop_words". Assign to a variable called wo_stop
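
    One possible answer:

    ```r
    wo_stop <- word_list %>%
      ft_stop_words_remover(input_col = "word_list", output_col = "wo_stop_words")
    ```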

  3. Un-nest the tokens inside wo_stop_words using explode(). Assign to a variable called exploded
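
    One possible answer; explode() is not an R function, it is passed through to Hive:

    ```r
    exploded <- wo_stop %>%
      mutate(word = explode(wo_stop_words))
    ```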

  4. Select the word and author columns, and remove any word with fewer than 3 characters. Assign to all_words
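
    One possible answer:

    ```r
    all_words <- exploded %>%
      select(word, author) %>%
      filter(nchar(word) > 2)
    ```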

  5. Cache the all_words variable using compute()

    ```r
    all_words <- all_words %>%
      compute("all_words")
    ```

Data Exploration

Use word clouds to explore the data

  1. Create a variable with the word count by author, name it word_count
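
    One possible answer, sorted so that head() later returns the most frequent words:

    ```r
    word_count <- all_words %>%
      group_by(author, word) %>%
      tally() %>%
      arrange(desc(n))
    ```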

  2. Filter word_count to only retain "twain", assign it to twain_most

    ```r
    twain_most <- word_count %>%
      filter(author == "twain")
    ```

  3. Use wordcloud to visualize the top 50 words used by Twain

    ```r
    twain_most %>%
      head(50) %>%
      collect() %>%
      with(wordcloud::wordcloud(
        word,
        n,
        colors = c("#999999", "#E69F00", "#56B4E9", "#56B4E9")
      ))
    ```

  4. Filter word_count to only retain "doyle", assign it to doyle_most
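
    Same pattern as the twain_most step:

    ```r
    doyle_most <- word_count %>%
      filter(author == "doyle")
    ```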

  5. Use wordcloud to visualize the top 50 words used by Doyle that have more than 5 characters
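
    One possible answer, reusing the Twain word cloud pattern with an extra nchar() filter:

    ```r
    doyle_most %>%
      filter(nchar(word) > 5) %>%
      head(50) %>%
      collect() %>%
      with(wordcloud::wordcloud(
        word,
        n,
        colors = c("#999999", "#E69F00", "#56B4E9", "#56B4E9")
      ))
    ```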

  6. Use anti_join() to figure out which words are used by Doyle but not Twain. Order the results by number of words.
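
    One possible answer:

    ```r
    doyle_unique <- doyle_most %>%
      anti_join(twain_most, by = "word") %>%
      arrange(desc(n))
    ```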

  7. Use wordcloud to visualize the top 50 records from the previous step

    ```r
    doyle_unique %>%
      head(50) %>%
      collect() %>%
      with(wordcloud::wordcloud(
        word,
        n,
        colors = c("#999999", "#E69F00", "#56B4E9", "#56B4E9")
      ))
    ```

  8. Find out how many times Twain used the word "sherlock"
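
    One possible answer; twain_most already holds one row per word with its count in n, and ft_tokenizer() lowercased the words:

    ```r
    twain_most %>%
      filter(word == "sherlock")
    ```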

  9. Against the twain variable, use Hive's lower() to lowercase each line and instr() to look for "sherlock" in it
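
    One possible answer; lower() and instr() are passed through to Hive:

    ```r
    twain %>%
      mutate(line = lower(line)) %>%
      filter(instr(line, "sherlock") > 0) %>%
      select(line)
    ```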

  10. Close the Spark session

    ```r
    spark_disconnect(sc)
    ```

Most of these lines appear in a short story by Mark Twain called A Double Barrelled Detective Story. According to the Wikipedia page about this story, it is a satire by Twain of the mystery novel genre, published in 1902.


