J. Harmse, T. Haley, S. Sethi, V. Mulholland 2018-04-15
library(watsonNLU)
The steps below are only necessary for the authentication in this manual. No need to replicate.
# credentials saved locally
credentials <- readRDS("../tests/testthat/credentials.rds")
username <- credentials$username
password <- credentials$password
The watsonNLU R wrapper package integrates with the IBM Watson Natural Language Understanding service to produce a variety of outputs, including keyword sentiment, keyword emotions, keyword relevance, and text categories.
Natural language processing analyses semantic features of the text while the Watson API cleans the HTML content so that the information can be handled by the R wrapper to produce a neat data frame output for each of the functions.
The following examples will help demonstrate the application of the NLU services through the use of a sample of text provided by IBM.
IBMtext <- "In the rugged Colorado Desert of California, there lies buried a treasure ship sailed there hundreds of years ago by either Viking or Spanish explorers. Some say this is legend; others insist it is fact. A few have even claimed to have seen the ship, its wooden remains poking through the sand like the skeleton of a prehistoric beast. Among those who say they’ve come close to the ship is small-town librarian Myrtle Botts. In 1933, she was hiking with her husband in the Anza-Borrego Desert, not far from the border with Mexico. It was early March, so the desert would have been in bloom, its washed-out yellows and grays beaten back by the riotous invasion of wildflowers. Those wildflowers were what brought the Bottses to the desert, and they ended up near a tiny settlement called Agua Caliente. Surrounding place names reflected the strangeness and severity of the land: Moonlight Canyon, Hellhole Canyon, Indian Gorge. Try Newsweek for only $1.25 per week To enter the desert is to succumb to the unknowable. One morning, a prospector appeared in the couple’s camp with news far more astonishing than a new species of desert flora: He’d found a ship lodged in the rocky face of Canebrake Canyon. The vessel was made of wood, and there was a serpentine figure carved into its prow. There were also impressions on its flanks where shields had been attached—all the hallmarks of a Viking craft. Recounting the episode later, Botts said she and her husband saw the ship but couldn’t reach it, so they vowed to return the following day, better prepared for a rugged hike. That wasn’t to be, because, several hours later, there was a 6.4 magnitude earthquake in the waters off Huntington Beach, in Southern California."
The authentication function will take the credentials generated here (you must be signed into your personal account).
With the credentials provided by IBM, enter your username and password. This step should be performed at the beginning of every new instance. The following arguments are populated in auth_NLU:
# Authenticate using Watson NLU API Credentials
auth_NLU(username, password)
## [1] "Valid credentials provided."
As credentials expire, you will have to create new ones following the steps delineated in the Installation Manual. Before you create new credentials, try re-running auth_NLU.
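As an alternative to reading credentials from a local .rds file, they can be kept in environment variables. The sketch below is only one possible approach: the helper get_nlu_credentials and the variable names WATSON_NLU_USERNAME and WATSON_NLU_PASSWORD are hypothetical and are not part of watsonNLU.

```r
# Hypothetical helper (not part of watsonNLU): read credentials from
# environment variables and fail early if either one is missing.
get_nlu_credentials <- function(user_var = "WATSON_NLU_USERNAME",
                                pass_var = "WATSON_NLU_PASSWORD") {
  username <- Sys.getenv(user_var, unset = NA)
  password <- Sys.getenv(pass_var, unset = NA)
  if (is.na(username) || is.na(password) || username == "" || password == "") {
    stop("Set ", user_var, " and ", pass_var, " before authenticating.")
  }
  list(username = username, password = password)
}

# Usage (needs valid credentials and the watsonNLU package, so it is
# left commented out here):
# creds <- get_nlu_credentials()
# auth_NLU(creds$username, creds$password)
```

Keeping the check inside the helper means a missing variable is reported before any request is sent to the service.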
The keyword_sentiment function takes a text or URL input, along with the input type. The function then returns a dataframe containing the sentiments of the keywords extracted from the input, and the likelihood that the input is described by that sentiment.
Argument      Description
input         Either a text string or a website URL.
input_type    The type of input that was entered. Either text or url has to be specified, but not both.
version       The release date of the API version to use. Default value is version="?version=2018-03-16".

The keyword_sentiment function is a useful tool for measuring the tone of a body of text. For instance, it could be used to assess the subjectivity of articles by setting a threshold for neutral/objective text and comparing the polarization of articles on a similar topic.
# Find the keywords and related sentiment score in the given text input.
sentiments <- keyword_sentiment(input = IBMtext, input_type='text')
head(sentiments)
## keyword key_relevance score label
## 1 rugged Colorado Desert 0.976091 -0.246817 negative
## 2 librarian Myrtle Botts 0.969207 -0.500747 negative
## 3 Anza-Borrego Desert 0.702170 0.000000 neutral
## 4 riotous invasion 0.698587 0.559545 positive
## 5 treasure ship 0.695311 0.000000 neutral
## 6 Moonlight Canyon 0.693246 0.313185 positive
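The threshold idea mentioned earlier can be sketched in base R. The data frame below re-enters the six rows printed above by hand so that the example runs without calling the API; the cutoff of 0.25 is an arbitrary assumption chosen for illustration, not a value recommended by IBM or by the package.

```r
# First six rows of the keyword_sentiment() output shown above, typed in by hand
sentiments_head <- data.frame(
  keyword = c("rugged Colorado Desert", "librarian Myrtle Botts",
              "Anza-Borrego Desert", "riotous invasion",
              "treasure ship", "Moonlight Canyon"),
  key_relevance = c(0.976091, 0.969207, 0.702170, 0.698587, 0.695311, 0.693246),
  score = c(-0.246817, -0.500747, 0.000000, 0.559545, 0.000000, 0.313185),
  label = c("negative", "negative", "neutral", "positive", "neutral", "positive"),
  stringsAsFactors = FALSE
)

# Assumed cutoff: keywords with |score| at or below it are treated as neutral
neutral_threshold <- 0.25

# Flag polarized keywords and compute the share of keywords that are polarized
sentiments_head$polarized <- abs(sentiments_head$score) > neutral_threshold
polarized_share <- mean(sentiments_head$polarized)

# Overall tone: sentiment score averaged with keyword relevance as the weight
weighted_score <- weighted.mean(sentiments_head$score, sentiments_head$key_relevance)

print(sentiments_head[, c("keyword", "score", "polarized")])
print(polarized_share)
print(weighted_score)
```

Comparing polarized_share or weighted_score across articles on the same topic is one way to apply the comparison described above.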
The keyword_emotions function takes a text or URL input, along with the input type. The function then returns a dataframe containing the emotions of the keywords extracted from the input, and the likelihood that the input is described by that emotion.
A standard use case for keyword_emotions is to expand on the positive versus negative sentiments.
# Find the keywords and related emotions in the given text input.
emotions <- keyword_emotions(input = IBMtext, input_type='text')
head(emotions)
## keyword key_relevance sadness joy fear disgust
## 1 rugged Colorado Desert 0.976091 0.146294 0.221358 0.066544 0.076084
## 2 librarian Myrtle Botts 0.969207 0.448906 0.407977 0.065132 0.016412
## 3 Anza-Borrego Desert 0.702170 0.283731 0.281543 0.054903 0.138030
## 4 riotous invasion 0.698587 0.257926 0.085886 0.286756 0.266443
## 5 treasure ship 0.695311 0.468585 0.168619 0.029721 0.067962
## 6 Moonlight Canyon 0.693246 0.132472 0.189404 0.080563 0.074394
## anger
## 1 0.104871
## 2 0.082902
## 3 0.104731
## 4 0.258685
## 5 0.302241
## 6 0.083085
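One quick way to read a table like the one above is to pick the highest-scoring emotion for each keyword. The following base R sketch re-enters the six printed rows by hand so it runs without API access; the column names dominant and dominant_score are introduced here for illustration only.

```r
# First six rows of the keyword_emotions() output shown above, typed in by hand
emotions_head <- data.frame(
  keyword = c("rugged Colorado Desert", "librarian Myrtle Botts",
              "Anza-Borrego Desert", "riotous invasion",
              "treasure ship", "Moonlight Canyon"),
  key_relevance = c(0.976091, 0.969207, 0.702170, 0.698587, 0.695311, 0.693246),
  sadness = c(0.146294, 0.448906, 0.283731, 0.257926, 0.468585, 0.132472),
  joy     = c(0.221358, 0.407977, 0.281543, 0.085886, 0.168619, 0.189404),
  fear    = c(0.066544, 0.065132, 0.054903, 0.286756, 0.029721, 0.080563),
  disgust = c(0.076084, 0.016412, 0.138030, 0.266443, 0.067962, 0.074394),
  anger   = c(0.104871, 0.082902, 0.104731, 0.258685, 0.302241, 0.083085),
  stringsAsFactors = FALSE
)

emotion_cols <- c("sadness", "joy", "fear", "disgust", "anger")
scores <- as.matrix(emotions_head[emotion_cols])

# max.col() returns, for every row, the index of the column holding the maximum
emotions_head$dominant <- emotion_cols[max.col(scores, ties.method = "first")]
emotions_head$dominant_score <- unname(apply(scores, 1, max))

print(emotions_head[, c("keyword", "dominant", "dominant_score")])
```

For these six rows the dominant emotions are joy, sadness, sadness, fear, sadness and joy, in that order.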
The output provides a wealth of information that needs to be wrangled to display the highlights. We use the dplyr and tidyr packages to gather the emotions per keyword and display their scores, which facilitates plotting with ggplot2. First, let's summarize the emotions of the whole document, weighting each keyword's emotions by its relevance:
library(dplyr)
library(ggplot2)
library(tidyr)
# wrangle the keywords to display a mean score proportional to the relevance
weighed_relevance <- emotions %>%
  gather(key = emotion, value = score, sadness, joy, fear, disgust, anger) %>%
  group_by(emotion) %>%
  summarize(mean.score = mean(score * key_relevance)) %>%
  mutate(mean.score = mean.score / sum(mean.score))
# display the results
ggplot(weighed_relevance, aes(x = emotion, y = mean.score, fill = emotion)) +
  geom_bar(stat = 'identity', position = "dodge") +
  labs(x = 'Emotions', y = 'Emotion Score', title = 'Emotions of IBMtext', subtitle = "Word relevance weighted average") +
  scale_fill_discrete('Emotion') +
  # hide the fill legend, which duplicates the x axis labels
  guides(fill = "none") +
  theme(axis.text.x = element_text(angle = 25, hjust = 0.7, vjust = 0.8))
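The arithmetic behind the relevance-weighted summary can be checked on a tiny example in base R. The numbers below are toy values invented to keep the arithmetic easy to follow; they are not API output. Because the scores are normalized at the end, using a sum instead of a mean per emotion gives the same shares.

```r
# Toy data (invented values, not API output): two keywords, two emotions
toy <- data.frame(
  keyword = c("a", "b"),
  key_relevance = c(1.0, 0.5),
  joy = c(0.2, 0.4),
  sadness = c(0.6, 0.4),
  stringsAsFactors = FALSE
)

emotion_cols <- c("joy", "sadness")

# Multiply every emotion score by the relevance of its keyword (row-wise weights),
# then add up the weighted scores per emotion
weighted_scores <- as.matrix(toy[emotion_cols]) * toy$key_relevance
weighted_totals <- colSums(weighted_scores)

# Normalize so that the shares of all emotions sum to 1
emotion_share <- weighted_totals / sum(weighted_totals)
print(emotion_share)
```

Here joy totals 0.2 * 1.0 + 0.4 * 0.5 = 0.4 and sadness totals 0.6 * 1.0 + 0.4 * 0.5 = 0.8, so the shares are 1/3 and 2/3.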
# gather and summarize the data grouped by most relevant keywords
emotions_long <- emotions %>%
  arrange(desc(key_relevance)) %>%
  head(5) %>%
  gather(key = emotion, value = score, sadness, joy, fear, disgust, anger) %>%
  group_by(keyword, emotion) %>%
  arrange(desc(score)) %>%
  summarize(mean.score = mean(score))
# display the 5 most relevant keywords and their emotion scores
ggplot(emotions_long, aes(x = keyword, y = mean.score, fill = emotion)) +
  geom_bar(stat = 'identity', position = "dodge") +
  labs(x = '5 Most Relevant Keywords', y = 'Emotion Score', title = 'Emotions of IBMtext Keywords', subtitle = "Filtered for the 5 most relevant keywords") +
  scale_fill_discrete('Emotion') +
  theme(axis.text.x = element_text(angle = 25, hjust = 0.7, vjust = 0.8))
The keyword_relevance function takes a text or URL input, along with the input type. The function then returns a dataframe that contains keywords and their likelihood of being a keyword, from the given input.
Keyword relevance can be useful for determining the most recurring and pertinent terms in a document. For convenience, the limit argument can be set to return up to a specific number of keywords.
# Top 5 keywords from the text input.
keyword_relevance(input = IBMtext, input_type='text', limit = 5)
## keyword relevance
## 1 rugged Colorado Desert 0.976091
## 2 librarian Myrtle Botts 0.969207
## 3 Anza-Borrego Desert 0.702170
## 4 riotous invasion 0.698587
## 5 treasure ship 0.695311
# Top 5 keywords from the URL input.
keyword_relevance(input = 'http://www.nytimes.com/guides/well/how-to-be-happy', input_type='url', limit = 5)
## keyword relevance
## 1 happiness 0.962492
## 2 people 0.790289
## 3 World Happiness Report 0.610324
## 4 happier life 0.586591
## 5 so-called happiness ladder 0.540760
As we can see, the keywords extracted from IBMtext are locations and adventure-related terms, while those extracted from the URL revolve around happiness.
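Where limit caps the number of keywords, a relevance cutoff keeps only the keywords that score above a chosen value. The sketch below applies such a filter to the five IBMtext rows printed earlier, re-entered by hand; the cutoff of 0.7 is an assumption made for illustration.

```r
# keyword_relevance() output for IBMtext shown above, typed in by hand
relevance_text <- data.frame(
  keyword = c("rugged Colorado Desert", "librarian Myrtle Botts",
              "Anza-Borrego Desert", "riotous invasion", "treasure ship"),
  relevance = c(0.976091, 0.969207, 0.702170, 0.698587, 0.695311),
  stringsAsFactors = FALSE
)

# Assumed cutoff: keep keywords whose relevance is strictly above this value
min_relevance <- 0.7

top_keywords <- relevance_text[relevance_text$relevance > min_relevance, ]
print(top_keywords)
```

With this cutoff, three of the five keywords remain.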
The text_categories function takes a text or URL input along with the input type. The function then returns a dataframe that contains the likelihood that the contents of the URL or text belong to a particular category.
Users may be interested in quickly gathering the general topics of a text or the contents of a site.
# Find the categories that describe the text input.
text_categories(input = IBMtext, input_type='text')
## score category_level_1 category_level_2
## 1 0.480819 home and garden <NA>
## 2 0.464129 travel tourist destinations
## 3 0.360872 science geology
## category_level_3 category_level_4
## 1 <NA> <NA>
## 2 mexico and central america <NA>
## 3 seismology earthquakes
The results will return a variable number of themes that can be drilled down into category levels. The hierarchy will go from general topics to more specific subject matter as the level number increases.
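To make the hierarchy easier to scan, the level columns can be collapsed into a single path per row. The sketch below rebuilds the three rows printed above by hand; the column names path and depth are introduced here for illustration and are not returned by text_categories.

```r
# text_categories() output for IBMtext shown above, typed in by hand
categories <- data.frame(
  score = c(0.480819, 0.464129, 0.360872),
  category_level_1 = c("home and garden", "travel", "science"),
  category_level_2 = c(NA, "tourist destinations", "geology"),
  category_level_3 = c(NA, "mexico and central america", "seismology"),
  category_level_4 = c(NA, NA, "earthquakes"),
  stringsAsFactors = FALSE
)

level_cols <- grep("^category_level_", names(categories), value = TRUE)

# Join the non-missing levels of each row, from general to specific
categories$path <- unname(apply(categories[level_cols], 1, function(levels) {
  paste(levels[!is.na(levels)], collapse = " / ")
}))

# Number of levels that are filled in for each row
categories$depth <- unname(rowSums(!is.na(categories[level_cols])))

print(categories[, c("score", "path", "depth")])
```

The deepest row here is "science / geology / seismology / earthquakes", which uses all four levels.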