knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "80%",
  fig.align = 'center'
)
# load the package before its bundled data set
library(GSPtext)
library(tidyverse)
data(sample_reviews)
The goal of GSPtext is to quickly and easily gather and analyze reviews from Amazon.
See the vignette for more detail about the functions used below.
You can install the development version of GSPtext from GitHub with:
# install.packages("devtools")
devtools::install_github("taylorgrant/GSPtext")
Below is a simple workflow for pulling and analyzing review data. Any product URL from Amazon will work. Don't worry about cleaning the tacked-on parameters: the function will clean the URL, determine the number of pages to crawl, scrape the reviews, and return a tidy data frame.
If you want user-generated (UGC) imagery included with the reviews, specify get_images = "true" as an argument, and a composite will be created and saved to your desktop.
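The URL cleanup described above can be approximated in a few lines of base R. This is only a minimal sketch, not GSPtext's internal code: the helper names `clean_amzn_url()` and `pages_to_crawl()` are hypothetical, and it assumes the product ID is the 10-character code after `/dp/` and that one review page holds 10 reviews.

```r
# Hypothetical helper (not part of GSPtext): strip the tacked-on
# parameters and pull the product ID out of an Amazon product URL.
clean_amzn_url <- function(url) {
  # drop everything from the first "?" onward
  base <- sub("\\?.*$", "", url)
  # assumption: the product ID is the 10-character code after "/dp/"
  m <- regmatches(base, regexpr("/dp/[A-Z0-9]{10}", base))
  list(url = base, asin = sub("^/dp/", "", m))
}

# Hypothetical helper: pages needed, assuming 10 reviews per page
pages_to_crawl <- function(n_reviews, per_page = 10) {
  ceiling(n_reviews / per_page)
}

cleaned <- clean_amzn_url(
  "https://www.amazon.com/Fashion-Focus-Sweaters-Chihuahua-Clothing/dp/B07L1LHNGN/?_encoding=UTF8"
)
cleaned$asin
pages_to_crawl(137)
```

In practice amzn_get_reviews() handles all of this for you; the sketch is only meant to show why a raw URL copied from the browser is acceptable input.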
library(GSPtext)
library(tidyverse)

# specify the url
url <- "https://www.amazon.com/Fashion-Focus-Sweaters-Chihuahua-Clothing/dp/B07L1LHNGN/?_encoding=UTF8"

# pull reviews and images
data <- amzn_get_reviews(url, get_images = "false") # either "true" or "false" for images
sample_n(data, 6)[,1:4] # dropping the link
sample_n(sample_reviews, 4)[,1:4] %>% data.frame()
Tokenize the review text and sum by star rating. We can specify how many top terms to keep. Returns a data frame with the top n per star rating and a bar graph, faceted by star rating.
freq_terms <- amzn_frequent_terms(sample_reviews, 15)
# can access the data
# freq_terms$data
# plot the graph
freq_terms$graph
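To make the "tokenize, count, keep the top n per star" idea concrete, here is a small base R sketch on toy data. It is an illustration of the technique rather than the package's implementation: the column names `stars` and `text` and the helpers `tokenize()` and `top_terms()` are assumptions, and stop-word removal is left out for brevity.

```r
# Toy data; the column names `stars` and `text` are assumptions
reviews <- data.frame(
  stars = c(5, 5, 1),
  text  = c("Perfect fit, perfect color", "Great fit",
            "Poor fit and poor quality"),
  stringsAsFactors = FALSE
)

# lower-case, then split on anything that is not a letter or apostrophe
tokenize <- function(x) {
  toks <- strsplit(tolower(x), "[^a-z']+")
  lapply(toks, function(t) t[nzchar(t)])
}

top_terms <- function(df, n = 2) {
  toks <- tokenize(df$text)
  # one row per token, carrying the star rating of its review
  long <- data.frame(
    stars = rep(df$stars, lengths(toks)),
    term  = unlist(toks),
    stringsAsFactors = FALSE
  )
  counts <- as.data.frame(table(stars = long$stars, term = long$term),
                          responseName = "n", stringsAsFactors = FALSE)
  counts <- counts[counts$n > 0, ]
  # sort by star, then by descending count, and keep the first n per star
  counts <- counts[order(counts$stars, -counts$n), ]
  do.call(rbind, lapply(split(counts, counts$stars), head, n))
}

res <- top_terms(reviews, n = 2)
res
```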
Track ratings over the time the product has been on Amazon. Specify whether to aggregate annually or monthly, whether to draw a bar or line plot, and whether to include a trend line (either "loess" or "lm").
library(patchwork)
p1 <- amzn_ratings_over_time(sample_reviews, time = "year", viz_type = 'bar')
p2 <- amzn_ratings_over_time(sample_reviews, time = "month", viz_type = 'line', trend = "lm")
p1 + p2
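The aggregation behind a ratings-over-time plot can be sketched in base R as follows. The toy data and the column names `date` and `stars` are assumptions for illustration, and the `lm()` call is only analogous in spirit to what a trend = "lm" option would draw.

```r
# Toy data; the column names `date` and `stars` are assumptions
reviews <- data.frame(
  date  = as.Date(c("2019-03-01", "2019-11-15", "2020-06-10",
                    "2020-08-02", "2021-01-20")),
  stars = c(5, 3, 4, 2, 5)
)

# bucket each review by calendar year
reviews$year <- as.integer(format(reviews$date, "%Y"))

# mean rating and number of reviews per year
by_year <- aggregate(stars ~ year, data = reviews, FUN = mean)
by_year$n <- as.vector(table(reviews$year))

# a simple linear trend through the yearly means
fit <- lm(stars ~ year, data = by_year)
by_year
coef(fit)
```

Swapping `"%Y"` for `"%Y-%m"` would give monthly buckets instead.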
For every term, calculate how frequently it's used in the data set, estimate the average rating of the reviews that include that term, and then plot the two against each other as a scatter plot. The overall average rating for the product is represented by a dashed line, so terms above (below) that line are more often associated with positive (negative) reviews. This is useful when looking for language to tie back to a product.
amzn_terms_by_rating(sample_reviews)
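The calculation behind this kind of plot can be sketched on toy data. This is an illustration under stated assumptions, not the package's code: the column names `stars` and `text` are made up, and for simplicity a term is counted once per review rather than once per occurrence.

```r
# Toy data; the column names `stars` and `text` are assumptions
reviews <- data.frame(
  stars = c(5, 4, 1, 2),
  text  = c("Soft and warm", "Warm but small", "Too small", "Small and itchy"),
  stringsAsFactors = FALSE
)

# distinct terms per review, so a term counts once per review
toks <- lapply(strsplit(tolower(reviews$text), "[^a-z']+"),
               function(t) unique(t[nzchar(t)]))
long <- data.frame(
  stars = rep(reviews$stars, lengths(toks)),
  term  = unlist(toks),
  stringsAsFactors = FALSE
)

# number of reviews containing each term, and their mean rating
n   <- tapply(long$stars, long$term, length)
avg <- tapply(long$stars, long$term, mean)
term_stats <- data.frame(term = names(n), n = as.vector(n),
                         avg_rating = as.vector(avg),
                         stringsAsFactors = FALSE)

# overall mean rating: the dashed reference line in the plot
overall <- mean(reviews$stars)
term_stats$above_avg <- term_stats$avg_rating > overall
term_stats[order(-term_stats$n), ]
```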
Some people want word clouds, so there are three types: overall across all ratings; comparative, based on low (1 and 2 star) versus high (4 and 5 star) ratings; and comparative, based on positive versus negative sentiment. Sentiment is estimated using the Bing lexicon.
# amzn_review_wc(sample_reviews, type = "overall") # total
# amzn_review_wc(sample_reviews, type = "comparison", comp_type = "sentiment") # by sentiment
amzn_review_wc(sample_reviews, type = "comparison", comp_type = "rating") # by hi and lo rating
Convenience function that wraps the kwic() function from the quanteda package (https://quanteda.io/). Dig into how specific words are being used via word match, regex, or phrase matching. Control the window, that is, the number of terms returned on either side of the key term.
# term_context(sample_reviews, pattern = "perfect", window = 8, valuetype = "glob") # standard
# term_context(sample_reviews, pattern = "perf", valuetype = "regex", window = 4) # regex
term_context(sample_reviews, pattern = "perfect fit", window = 4) # phrase
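For reference, the three matching modes look roughly like this when calling quanteda's kwic() directly. The two example sentences are made up, and how term_context() passes its arguments through to kwic() is not shown in this README, so treat this as a sketch of the underlying function only.

```r
library(quanteda)

# made-up example text
txt <- c(r1 = "This sweater is a perfect fit for my chihuahua.",
         r2 = "Not perfect, but the fit is fine.")
toks <- tokens(txt, remove_punct = TRUE)

# single-word match ("glob" is the default valuetype)
kw_word <- kwic(toks, pattern = "perfect", window = 3)

# regex match on the start of a token
kw_regex <- kwic(toks, pattern = "^perf", valuetype = "regex", window = 3)

# multi-word phrase: wrap the pattern in phrase()
kw_phrase <- kwic(toks, pattern = phrase("perfect fit"), window = 3)

kw_word
kw_phrase
```

The `window` argument sets how many tokens of context are kept on each side of the match.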
Use the NRC lexicon to match terms to emotions. Which emotions are more prevalent in 5-star reviews than in 1-star reviews?
library(patchwork)
emotion_star <- text_to_emotion(sample_reviews, "stars")
# emotion_star$data
p1 <- emotion_star$graph1
p2 <- emotion_star$graph2
p1 / p2
The emotional valence of reviews can also be split out by the year of the review.
library(patchwork)
emotion_year <- text_to_emotion(sample_reviews, "year")
# emotion_year$data
p1 <- emotion_year$graph1
p2 <- emotion_year$graph2
p1 / p2
Estimate the sentiment of each review using the sentimentr package from Rinker. Sentiment is calculated at the sentence level, and the "review sentiment" is then the weighted average across all sentences in that review. The package accounts for "valence shifters" by adding weights for terms that can negate or amplify sentiment.
The returned data is the full review data frame with sentiment data added - word count for each review, standard deviation for the sentiment estimate for each review (only if the review includes more than 1 sentence), and the estimated sentiment for each review.
sentiment <- amzn_review_sentiment(sample_reviews)
sentiment$graph
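The per-review columns described above correspond to what sentimentr's own sentiment_by() returns. A minimal sketch on two made-up reviews, calling sentimentr directly rather than through GSPtext:

```r
library(sentimentr)

# made-up reviews: one with two sentences, one with a single sentence
txt <- c("I love this sweater. It fits perfectly and looks great.",
         "Terrible quality.")

# split into sentences, score each sentence, then aggregate per review
out <- sentiment_by(get_sentences(txt))

# one row per review: word_count, sd, ave_sentiment
out
```

The `sd` column is `NA` for the second review because a standard deviation cannot be computed from a single sentence.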
Visualize which terms most frequently co-occur within reviews. Co-occurrences are split out by star rating, so you must specify which star rating to plot and the minimum number of co-occurrences a pair of terms needs in order to be included.
# term co-occurrence
# amzn_cooccur_net(sample_reviews, star = 1, nn = 4)
amzn_cooccur_net(sample_reviews, star = 5, nn = 15)
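The counting step behind a co-occurrence network can be sketched in base R. The review texts are made up, and `floor_n` is a hypothetical stand-in that plays the same role as the `nn` floor; this is not the package's implementation.

```r
# made-up reviews, all assumed to share one star rating
reviews <- c("soft warm sweater", "warm sweater fits", "sweater fits small")

# distinct terms per review, sorted so each pair has one canonical order
toks <- lapply(strsplit(tolower(reviews), "[^a-z']+"),
               function(t) sort(unique(t[nzchar(t)])))

# every unordered pair of terms that appears within the same review
pairs <- unlist(lapply(toks, function(t) {
  if (length(t) < 2) return(character(0))
  combn(t, 2, FUN = paste, collapse = " | ")
}))

# how many reviews each pair appears in
pair_counts <- sort(table(pairs), decreasing = TRUE)

# keep only pairs at or above the co-occurrence floor
floor_n <- 2
kept <- pair_counts[pair_counts >= floor_n]
kept
```

The surviving pairs are the edges of the network; raising the floor thins the graph to the strongest associations.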