```r
eval_mining <- FALSE
if (Sys.getenv("GLOBAL_EVAL") != "") eval_mining <- Sys.getenv("GLOBAL_EVAL")
```
```r
library(wordcloud2)
library(sparklyr)
library(dplyr)
```
Text mining with sparklyr

For this example, two files will be analyzed: the full works of Sir Arthur Conan Doyle and the full works of Mark Twain. The files were downloaded from the Project Gutenberg site via the gutenbergr package. Intentionally, no data cleanup was done to the files prior to this analysis. See the appendix below for how the data was downloaded and prepared.
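The appendix itself is not reproduced in this section, but the download step might be sketched roughly as follows. This is a hedged sketch, not the course's actual appendix code: the author filter string and the output path are assumptions (the path mirrors the one used later in this exercise).

```r
# Sketch only: fetching one author's works with gutenbergr.
library(gutenbergr)
library(dplyr)

# gutenberg_works() filters the Gutenberg metadata; the resulting
# data frame of IDs can be piped into gutenberg_download().
twain_works <- gutenberg_works(author == "Twain, Mark") %>%
  gutenberg_download()

# Write the raw text lines to the file read later in the exercise
# (assumed path).
writeLines(twain_works$text, "/usr/share/class/books/mark_twain.txt")
```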
```r
readLines("/usr/share/class/books/arthur_doyle.txt", 30)
```
Read the book data into Spark
Load the sparklyr library

```r
library(sparklyr)
```
Open a Spark session

```r
sc <- spark_connect(master = "local")
```
Use the spark_read_text() function to read the mark_twain.txt file, and assign it to a variable called twain

```r
twain <- spark_read_text(sc, "twain", "/usr/share/class/books/mark_twain.txt")
```
Use the spark_read_text() function to read the arthur_doyle.txt file, and assign it to a variable called doyle

```r
doyle <- spark_read_text(sc, "doyle", "/usr/share/class/books/arthur_doyle.txt")
```
Prepare the data for analysis
Load the dplyr library

```r
library(dplyr)
```
Add a column to twain named author, with a value of "twain". Assign it to a new variable called twain_id

```r
twain_id <- twain %>%
  mutate(author = "twain")
```
Add a column to doyle named author, with a value of "doyle". Assign it to a new variable called doyle_id

```r
doyle_id <- doyle %>%
  mutate(author = "doyle")
```
Use sdf_bind_rows() to append the two tables together in a variable called both

```r
both <- sdf_bind_rows(twain_id, doyle_id)
```
Preview both

```r
both
```
Filter out empty lines into a variable called all_lines

```r
all_lines <- both %>%
  filter(nchar(line) > 0)
```
Use Hive's regexp_replace() to remove punctuation, and assign the result back to the same all_lines variable

```r
all_lines <- all_lines %>%
  mutate(line = regexp_replace(line, "[_\"\'():;,.!?\\-]", " "))
```
Use feature transformers to make additional preparations
Use ft_tokenizer() to separate each word in the line. Set the output_col to "word_list". Assign to a variable called word_list

```r
word_list <- all_lines %>%
  ft_tokenizer(input_col = "line", output_col = "word_list")
```
Remove "stop words" with the ft_stop_words_remover() transformer. Set the output_col to "wo_stop_words". Assign to a variable called wo_stop

```r
wo_stop <- word_list %>%
  ft_stop_words_remover(input_col = "word_list", output_col = "wo_stop_words")
```
Un-nest the tokens inside wo_stop_words using explode(). Assign to a variable called exploded

```r
exploded <- wo_stop %>%
  mutate(word = explode(wo_stop_words))
```
Select the word and author columns, and remove any word with fewer than three characters. Assign to all_words

```r
all_words <- exploded %>%
  select(word, author) %>%
  filter(nchar(word) > 2)
```
Cache the all_words variable using compute()

```r
all_words <- all_words %>%
  compute("all_words")
```
Use word clouds to explore the data
Create a variable with the word count by author, and name it word_count

```r
word_count <- all_words %>%
  count(author, word) %>%
  arrange(desc(n))
```
Filter word_count to only retain "twain", and assign it to twain_most

```r
twain_most <- word_count %>%
  filter(author == "twain")
```
Use wordcloud to visualize the top 50 words used by Twain

```r
twain_most %>%
  head(50) %>%
  collect() %>%
  with(wordcloud::wordcloud(
    word,
    n,
    colors = c("#999999", "#E69F00", "#56B4E9", "#56B4E9")
  ))
```
Filter word_count to only retain "doyle", and assign it to doyle_most

```r
doyle_most <- word_count %>%
  filter(author == "doyle")
```
Use wordcloud to visualize the top 50 words used by Doyle that have more than 5 characters

```r
doyle_most %>%
  filter(nchar(word) > 5) %>%
  head(50) %>%
  collect() %>%
  with(wordcloud::wordcloud(
    word,
    n,
    colors = c("#999999", "#E69F00", "#56B4E9", "#56B4E9")
  ))
```
Use anti_join() to figure out which words are used by Doyle but not by Twain. Order the results by word count.

```r
doyle_unique <- doyle_most %>%
  anti_join(twain_most, by = "word") %>%
  arrange(desc(n))
```
Use wordcloud to visualize the top 50 records from the previous step

```r
doyle_unique %>%
  head(50) %>%
  collect() %>%
  with(wordcloud::wordcloud(
    word,
    n,
    colors = c("#999999", "#E69F00", "#56B4E9", "#56B4E9")
  ))
```
Find out how many times Twain used the word "sherlock"

```r
twain_most %>%
  filter(word == "sherlock")
```
Against the twain variable, use Hive's instr() and lower() to lowercase every line and then look for "sherlock" in the line

```r
twain %>%
  mutate(line = lower(line)) %>%
  filter(instr(line, "sherlock") > 0)
```
Close the Spark session

```r
spark_disconnect(sc)
```
Most of these lines are from a short story by Mark Twain called "A Double Barrelled Detective Story". According to the Wikipedia page about this story, it is a satire by Twain on the mystery novel genre, published in 1902.