Approximate the sentiment (polarity) of text by sentence. This function allows
the user to easily alter (add, change, replace) the default polarity and
valence shifters dictionaries to suit the context-dependent needs of a particular
data set. See the polarity_dt and valence_shifters_dt arguments
for more information. Other hyperparameters add fine-tuned
control of the algorithm that may boost performance in different contexts.
sentiment(
text.var,
polarity_dt = lexicon::hash_sentiment_jockers_rinker,
valence_shifters_dt = lexicon::hash_valence_shifters,
hyphen = "",
amplifier.weight = 0.8,
n.before = 5,
n.after = 2,
question.weight = 1,
adversative.weight = 0.25,
neutral.nonverb.like = FALSE,
missing_value = 0,
retention_regex = "\\d:\\d|\\d\\s|[^[:alpha:]',;: ]",
...
)
text.var
The text variable. Can be a get_sentences object or a raw character vector, though get_sentences is preferred as it avoids the repeated cost of doing sentence boundary disambiguation every time sentiment is run.
polarity_dt
A data.table of positive/negative words and weights with x and y as column names. The lexicon package has several dictionaries that can be used, including the default lexicon::hash_sentiment_jockers_rinker. Additionally, the as_key function can be used to make a sentiment key suitable for this argument (a brief sketch follows the argument list).
valence_shifters_dt
A data.table of valence shifters that can alter a polarized word's meaning and an integer key for negators (1), amplifiers [intensifiers] (2), de-amplifiers [downtoners] (3), and adversative conjunctions (4), with x and y as column names.
hyphen
The character string to replace hyphens with. The default replaces with nothing, so 'sugar-free' becomes 'sugarfree'. Setting hyphen = " " keeps the two parts as separate words.
amplifier.weight
The weight to apply to amplifiers/de-amplifiers [intensifiers/downtoners] (values from 0 to 1). This value will multiply the polarized terms by 1 + this value.
n.before
The number of words to consider as valence shifters before the polarized word. To consider the entire beginning portion of a sentence, use n.before = Inf.
n.after
The number of words to consider as valence shifters after the polarized word. To consider the entire ending portion of a sentence, use n.after = Inf.
question.weight
The weighting of questions (values from 0 to 1). Default is 1. A 0 corresponds with the belief that questions (pure questions) are not polarized. A weight may be applied based on the evidence that questions function with polarized sentiment. In an opinion task such as a course evaluation, the questions are more likely polarized, not designed to gain information. On the other hand, in a setting with more natural dialogue, a question is less likely polarized and more likely to function as a means to gather information.
adversative.weight
The weight to give to adversative conjunctions or contrasting conjunctions (e.g., "but") that overrule the previous clause (Halliday & Hasan, 2013). Weighting a contrasting statement stems from the belief that adversative conjunctions like "but", "however", and "although" amplify the current clause and/or down-weight the prior clause. If an adversative conjunction is located before the polarized word in the context cluster, the cluster is up-weighted by 1 plus the number of occurrences of adversative conjunctions before the polarized word times the weight given (1 + N_{adversative\,conjunctions} * z_2, where z_2 is the adversative.weight).
neutral.nonverb.like
logical. If TRUE, the word 'like', when not functioning as a verb (e.g., "It's like a tree" vs. "I like it"), is treated as neutral rather than as a positive polarized word.
missing_value
A value to replace NA/NaN with. Use NULL to retain missing values.
retention_regex
A regex of what characters to keep. All other characters will be removed. Note that when this is used all text is in lower case format. Only adjust this parameter if you really understand how it is used. Note that swapping the [[:alpha:]] class for \\p{L} may be required for languages with non-ASCII letters (see the non-ASCII example in the Examples section).
...
Ignored.
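As a brief illustration of supplying a custom polarity_dt, the sketch below uses update_polarity_table (used later in the Examples) to add a made-up slang term to the default key; the term and its weight are purely illustrative and not part of any shipped dictionary.

library(sentimentr)

## add a hypothetical positive slang term ("fire") to the default polarity key
my_polarity <- update_polarity_table(
    lexicon::hash_sentiment_jockers_rinker,
    x = data.frame(x = "fire", y = 1, stringsAsFactors = FALSE)
)

sentiment(get_sentences("That talk was fire!"), polarity_dt = my_polarity)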
The equation used by the algorithm to assign value to the polarity of each sentence first utilizes the sentiment dictionary to tag polarized words. Each paragraph (p_i = {s_1, s_2, ..., s_n}), composed of sentences, is broken into element sentences (s_{i,j} = {w_1, w_2, ..., w_n}) where w are the words within sentences. Each sentence (s_j) is broken into an ordered bag of words. Punctuation is removed with the exception of pause punctuations (commas, colons, semicolons), which are considered a word within the sentence. I will denote pause words as cw (comma words) for convenience. We can represent these words in an i,j,k notation as w_{i,j,k}. For example, w_{3,2,5} would be the fifth word of the second sentence of the third paragraph. While I use the term paragraph, this merely represents a complete turn of talk. For example, it may be a cell-level response in a questionnaire composed of sentences.
The words in each sentence (w_{i,j,k}) are searched and compared to a dictionary of polarized words (e.g., the Jockers (2017) dictionary found in the lexicon package). Positive (w_{i,j,k}^{+}) and negative (w_{i,j,k}^{-}) words are tagged with a +1 and -1 respectively. I will denote polarized words as pw for convenience. These will form a polar cluster (c_{i,j,l}), which is a subset of the sentence (c_{i,j,l} \subseteq s_{i,j}).
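As a concrete illustration of this tagging step, the sketch below assumes sentimentr's extract_sentiment_terms helper to show which words of a sentence the default dictionary picks up as positive or negative.

library(sentimentr)

## inspect which words are tagged as polarized by the default dictionary
sents <- get_sentences("I hate really bad dogs, but I love my cat.")
extract_sentiment_terms(sents)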
The polarized context cluster (c_{i,j,l}) of words is pulled from around
the polarized word (pw) and defaults to 5 words before and 2 words
after pw (the n.before and n.after defaults shown in the Usage above) to be
considered as valence shifters. The cluster can be represented as
(c_{i,j,l} = {pw_{i,j,k - nb}, ..., pw_{i,j,k}, ..., pw_{i,j,k + na}}),
where nb and na are the parameters n.before and n.after
set by the user. The words in this polarized context cluster are tagged as
neutral (w_{i,j,k}^{0}), negator (w_{i,j,k}^{n}),
amplifier [intensifier] (w_{i,j,k}^{a}), or de-amplifier
[downtoner] (w_{i,j,k}^{d}). Neutral words hold no value in
the equation but do affect word count (n). Each polarized word is then
weighted (w) based on the weights from the polarity_dt argument
and then further weighted by the function and number of the valence shifters
directly surrounding the positive or negative word (pw). Pause
(cw) locations (punctuation that denotes a pause, including commas,
colons, and semicolons) are indexed and considered in calculating the upper
and lower bounds of the polarized context cluster. This is because these marks
indicate a change in thought, and words prior are not necessarily connected
with words after these punctuation marks. The lower bound of the polarized
context cluster is constrained to
\max \{pw_{i,j,k - nb}, 1, \max \{cw_{i,j,k} < pw_{i,j,k}\}\} and the upper bound is
constrained to \min \{pw_{i,j,k + na}, w_{i,jn}, \min \{cw_{i,j,k} > pw_{i,j,k}\}\},
where w_{i,jn} is the number of words in the sentence.
The core value in the cluster, the polarized word, is acted upon by the valence shifters. Amplifiers (intensifiers) increase the polarity by 1.8 (.8 is the default weight (z)). Amplifiers (w_{i,j,k}^{a}) become de-amplifiers if the context cluster contains an odd number of negators (w_{i,j,k}^{n}). De-amplifiers (downtoners) work to decrease the polarity. Negation (w_{i,j,k}^{n}) acts on amplifiers/de-amplifiers as discussed but also flips the sign of the polarized word. Negation is determined by raising -1 to the power of the number of negators (w_{i,j,k}^{n}) plus 2. Simply, this is a result of the belief that two negatives equal a positive, three negatives a negative, and so on.
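A small base R sketch of the negation parity and amplifier arithmetic described above (values chosen only for illustration):

## negation parity: an odd number of negators flips the sign of the polarized word
n_neg <- 1                # e.g., "not good"
(-1) ^ (2 + n_neg)        # -1: sign flipped
n_neg <- 2                # e.g., a double negative
(-1) ^ (2 + n_neg)        #  1: sign restored

## amplification: a +1 polarized word preceded by one amplifier
z <- 0.8                  # the default amplifier.weight
1 * (1 + z)               # 1.8, as stated above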
The adversative conjunctions (i.e., 'but', 'however', and 'although') also weight the context cluster. An adversative conjunction before the polarized word (w_{adversative\,conjunction}, ..., w_{i,j,k}^{p}) up-weights the cluster by 1 + z_2 * \{|w_{adversative\,conjunction}|, ..., w_{i,j,k}^{p}\} (.85 is the default weight (z_2)). An adversative conjunction after the polarized word down-weights the cluster by 1 + \{w_{i,j,k}^{p}, ..., |w_{adversative\,conjunction}| * -1\} * z_2. The number of occurrences before and after the polarized word are multiplied by 1 and -1 respectively and then summed within the context cluster. It is this value that is multiplied by the weight and added to 1. This corresponds to the belief that an adversative conjunction makes the next clause of greater value while lowering the value placed on the prior clause.
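A minimal arithmetic sketch of this adversative up-/down-weighting, using the adversative.weight value shown in the Usage section (the counts are made up for illustration):

z2       <- 0.25          # adversative.weight from the Usage section
n_before <- 1             # adversative conjunctions before the polarized word
n_after  <- 0             # adversative conjunctions after the polarized word
w_b <- 1 + z2 * (n_before * 1 + n_after * -1)
w_b                       # 1.25: the cluster containing the polarized word is up-weighted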
The researcher may provide a weight z to be utilized with amplifiers/de-amplifiers (default is .8; the de-amplifier weight is constrained to a lower bound of -1). Last, these weighted context clusters (c_{i,j,l}) are summed (c'_{i,j}) and divided by the square root of the word count (\sqrt{w_{i,jn}}), yielding an unbounded polarity score (C) for each sentence.
C = c'_{i,j} / \sqrt{w_{i,jn}}

Where:

c'_{i,j} = \sum{((1 + w_{amp} + w_{deamp}) \cdot w_{i,j,k}^{p} (-1)^{2 + w_{neg}})}

w_{amp} = (w_{b} > 1) + \sum{(w_{neg} \cdot (z \cdot w_{i,j,k}^{a}))}

w_{deamp} = \max(w_{deamp'}, -1)

w_{deamp'} = (w_{b} < 1) + \sum{(z (- w_{neg} \cdot w_{i,j,k}^{a} + w_{i,j,k}^{d}))}

w_{b} = 1 + z_2 * w_{b'}

w_{b'} = \sum{(|w_{adversative\,conjunction}|, ..., w_{i,j,k}^{p}, w_{i,j,k}^{p}, ..., |w_{adversative\,conjunction}| \cdot -1)}

w_{neg} = \left(\sum{w_{i,j,k}^{n}}\right) \bmod 2
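To build intuition for these equations, the following is a deliberately simplified toy function (not the package's implementation): it scores a single polarized word's cluster, ignores the comma-word bounds and adversative weighting, and uses made-up dictionary weights.

## toy, simplified scoring of one polarized word's context cluster
toy_score <- function(polarized, n_neg = 0, n_amp = 0, n_deamp = 0,
                      n_words, z = 0.8) {
    w_neg   <- n_neg %% 2                                   # 1 if odd number of negators
    w_amp   <- if (w_neg == 0) z * n_amp else 0             # amplifiers add weight unless negated
    w_deamp <- max(-(z * n_deamp + z * n_amp * w_neg), -1)  # negated amplifiers de-amplify
    cluster <- (1 + w_amp + w_deamp) * polarized * (-1) ^ (2 + w_neg)
    cluster / sqrt(n_words)
}

## e.g., "I do not really like bad dogs" (7 words): 'like' (+1) is negated once and
## amplified by 'really'; 'bad' (-1) is unshifted
toy_score(1, n_neg = 1, n_amp = 1, n_words = 7) + toy_score(-1, n_words = 7)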
Returns a data.table of:
element_id - The id number of the original vector passed to sentiment
sentence_id - The id number of the sentences within each element_id
word_count - Word count
sentiment - Sentiment/polarity score (note: sentiment scores less than zero are negative, 0 is neutral, and greater than zero are positive polarity)
The polarity score is dependent upon the polarity dictionary used.
This function defaults to a combined and augmented version of Jockers' (2017)
[originally exported by the syuzhet package] and Rinker's augmented Hu & Liu (2004)
dictionaries in the lexicon package; however, this may not be appropriate, for
example, in the context of children in a classroom. The user may (and is
encouraged to) provide/augment the dictionary (see the as_key
function). For instance, the word "sick" in a high school setting may mean
that something is good, whereas "sick" used by a typical adult indicates
something is not right or has a negative connotation (deixis).
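A minimal sketch of building such a context-specific key with as_key (the terms and weights are made up for illustration):

library(sentimentr)

## a tiny custom key where the slang sense of "sick" is positive
teen_key <- as_key(data.frame(
    x = c("sick", "boring"),
    y = c(1, -1),
    stringsAsFactors = FALSE
))

sentiment(get_sentences("That skate trick was sick!"), polarity_dt = teen_key)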
Jockers, M. L. (2017). Syuzhet: Extract sentiment and plot arcs from text. Retrieved from https://github.com/mjockers/syuzhet
Hu, M., & Liu, B. (2004). Mining opinion features in customer reviews. National Conference on Artificial Intelligence.
Halliday, M. A. K. & Hasan, R. (2013). Cohesion in English. New York, NY: Routledge.
https://www.slideshare.net/jeffreybreen/r-by-example-mining-twitter-for
http://hedonometer.org/papers.html Links to papers on hedonometrics
Original URL: https://github.com/trestletech/Sermon-Sentiment-Analysis
Other sentiment functions:
sentiment_by()
mytext <- c(
'do you like it? But I hate really bad dogs',
'I am the best friend.',
"Do you really like it? I'm not a fan",
"It's like a tree."
)
## works on a character vector, but this is not the preferred method as it pays
## the repeated cost of doing sentence boundary disambiguation every time
## `sentiment` is run. For small batches the loss is minimal.
## Not run:
sentiment(mytext)
## End(Not run)
## preferred method that avoids repeatedly paying that cost
mytext <- get_sentences(mytext)
sentiment(mytext)
sentiment(mytext, question.weight = 0)
sam_dat <- get_sentences(gsub("Sam-I-am", "Sam I am", sam_i_am))
(sam <- sentiment(sam_dat))
plot(sam)
plot(sam, scale_range = TRUE, low_pass_size = 5)
plot(sam, scale_range = TRUE, low_pass_size = 10)
## Not run: ## legacy transform functions from syuzhet
plot(sam, transformation.function = syuzhet::get_transformed_values)
plot(sam, transformation.function = syuzhet::get_transformed_values,
scale_range = TRUE, low_pass_size = 5)
## End(Not run)
y <- get_sentences(
"He was not the sort of man that one would describe as especially handsome."
)
sentiment(y)
sentiment(y, n.before=Inf)
## Not run: ## Categorize the polarity (tidyverse vs. data.table):
library(dplyr)
sentiment(mytext) %>%
as_tibble() %>%
mutate(category = case_when(
sentiment < 0 ~ 'Negative',
sentiment == 0 ~ 'Neutral',
sentiment > 0 ~ 'Positive'
) %>%
factor(levels = c('Negative', 'Neutral', 'Positive'))
)
library(data.table)
dt <- sentiment(mytext)[, category := factor(fcase(
sentiment < 0, 'Negative',
sentiment == 0, 'Neutral',
sentiment > 0, 'Positive'
), levels = c('Negative', 'Neutral', 'Positive'))][]
dt
## End(Not run)
dat <- data.frame(
w = c('Person 1', 'Person 2'),
x = c(paste0(
"Mr. Brown is nasty! He says hello. i give him rage. i will ",
"go at 5 p. m. eastern time. Angry thought in between!go there"
), "One more thought for the road! I am going now. Good day and good riddance."),
y = state.name[c(32, 38)],
z = c(.456, .124),
stringsAsFactors = FALSE
)
sentiment(get_sentences(dat$x))
sentiment(get_sentences(dat))
## Not run:
## tidy approach
library(dplyr)
library(magrittr)
hu_liu_cannon_reviews %>%
mutate(review_split = get_sentences(text)) %$%
sentiment(review_split)
## End(Not run)
## Emojis
## Not run:
## Load R twitter data
x <- read.delim(system.file("docs/r_tweets.txt", package = "textclean"),
stringsAsFactors = FALSE)
x
library(dplyr); library(magrittr)
## There are 2 approaches
## Approach 1: Replace with words
x %>%
mutate(Tweet = replace_emoji(Tweet)) %$%
sentiment(Tweet)
## Approach 2: Replace with identifier token
combined_emoji <- update_polarity_table(
lexicon::hash_sentiment_jockers_rinker,
x = lexicon::hash_sentiment_emojis
)
x %>%
mutate(Tweet = replace_emoji_identifier(Tweet)) %$%
sentiment(Tweet, polarity_dt = combined_emoji)
## Use With Non-ASCII
## Warning: sentimentr has not been tested with languages other than English.
## The example below is how one might use sentimentr if you believe the
## language you are working with is similar enough in grammar to English for
## sentimentr to be viable (likely Germanic languages)
## english_sents <- c(
## "I hate bad people.",
## "I like yummy cookie.",
## "I don't love you anymore; sorry."
## )
## Roughly equivalent to the above English
danish_sents <- stringi::stri_unescape_unicode(c(
"Jeg hader d\\u00e5rlige mennesker.",
"Jeg kan godt lide l\\u00e6kker is.",
"Jeg elsker dig ikke mere; undskyld."
))
danish_sents
## Polarity terms
polterms <- stringi::stri_unescape_unicode(
c('hader', 'd\\u00e5rlige', 'undskyld', 'l\\u00e6kker', 'kan godt', 'elsker')
)
## Make polarity_dt
danish_polarity <- as_key(data.frame(
  x = polterms,  ## already unescaped above
  y = c(-1, -1, -1, 1, 1, 1)
))
## Make valence_shifters_dt
danish_valence_shifters <- as_key(
data.frame(x='ikke', y="1"),
sentiment = FALSE,
comparison = NULL
)
sentiment(
danish_sents,
polarity_dt = danish_polarity,
valence_shifters_dt = danish_valence_shifters,
retention_regex = "\\d:\\d|\\d\\s|[^\\p{L}',;: ]"
)
## A way to test if you need [:alpha:] vs \p{L} in `retention_regex`:
## 1. Does it wreck some of the non-ascii characters by default?
sentimentr:::make_sentence_df2(danish_sents)
## 2. Does this?
sentimentr:::make_sentence_df2(danish_sents, "\\d:\\d|\\d\\s|[^\\p{L}',;: ]")
## If you answer yes to #1 but no to #2 you likely want \p{L}
## End(Not run)