knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "80%",
  fig.align = 'center'
)
library(GSPtext)
library(tidyverse)
data(sample_reviews) # load the bundled example data after attaching the package

GSPtext

The goal of GSPtext is to quickly and easily gather and analyze reviews from Amazon.

See the vignette for more detail about the functions used below.

Installation

You can install the development version of GSPtext from GitHub with:

# install.packages("devtools")
devtools::install_github("taylorgrant/GSPtext")

Basic Example

Below is a simple workflow for pulling and analyzing review data. Any product URL from Amazon will work, and there is no need to strip the tacked-on query parameters. The function cleans the URL, determines the number of pages to crawl, scrapes the reviews, and returns a tidy data frame.

If you want user-generated (UGC) imagery from the reviews, pass get_images = "true" as an argument; a composite image will be created and saved to your desktop.

library(GSPtext)
library(tidyverse)

# specify the url 
url <- "https://www.amazon.com/Fashion-Focus-Sweaters-Chihuahua-Clothing/dp/B07L1LHNGN/?_encoding=UTF8"

# pull reviews and images 
data <- amzn_get_reviews(url, get_images = "false") # either "true" or "false" for images
sample_n(data, 6)[,1:4] # dropping the link
sample_n(sample_reviews, 4)[,1:4] %>% 
  data.frame()
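
The URL clean-up mentioned above can be pictured with a minimal base R sketch. This is an assumption about the general idea, not GSPtext's actual implementation: clean_amazon_url() is a hypothetical helper that keeps only the ten-character product ID following /dp/ and discards the slug and query parameters.

```r
# Hypothetical helper, not part of GSPtext: reduce an Amazon product URL
# to a bare /dp/<product ID> form.
clean_amazon_url <- function(url) {
  # Match "/dp/" followed by a ten-character alphanumeric product ID
  m <- regmatches(url, regexpr("/dp/[A-Z0-9]{10}", url))
  if (length(m) == 0) stop("No product ID found in URL")
  paste0("https://www.amazon.com", m)
}

url <- "https://www.amazon.com/Fashion-Focus-Sweaters-Chihuahua-Clothing/dp/B07L1LHNGN/?_encoding=UTF8"
clean_amazon_url(url)
#> [1] "https://www.amazon.com/dp/B07L1LHNGN"
```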

Most frequent terms

Tokenize the review text and count terms by star rating. The second argument sets how many top terms to keep. The function returns a data frame with the top n terms per star rating and a bar graph faceted by star rating.

freq_terms <- amzn_frequent_terms(sample_reviews, 15)
# can access the data 
# freq_terms$data

# plot the graph
freq_terms$graph
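
For readers curious about what such a count involves, here is a rough sketch using dplyr and tidytext on a toy data frame. It is not the package's own code. The column names stars and text, the toy reviews, and the tiny stop word list are all assumptions made for illustration.

```r
library(dplyr)
library(tidytext)

# Toy reviews; the column names `stars` and `text` are assumptions
toy_reviews <- tibble(
  stars = c(5, 5, 1, 1),
  text  = c("Perfect fit and very warm",
            "Warm sweater, perfect for my dog",
            "Too small and poor quality",
            "Poor stitching, too small")
)

# A tiny illustrative stop word list (a real analysis would use a fuller one)
stop_list <- tibble(word = c("and", "very", "for", "my", "too"))

top_terms <- toy_reviews %>%
  unnest_tokens(word, text) %>%               # one lowercase token per row
  anti_join(stop_list, by = "word") %>%       # drop the stop words
  count(stars, word, sort = TRUE) %>%         # term counts within each rating
  group_by(stars) %>%
  slice_max(n, n = 2, with_ties = FALSE) %>%  # keep the top 2 terms per rating
  ungroup()

top_terms
```

In this toy data the top terms are "perfect" and "warm" for 5 star reviews and "small" and "poor" for 1 star reviews, each appearing twice.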

How have ratings changed over time

Track ratings over the time the product has been on Amazon. Specify whether to aggregate by year or by month, whether to draw a bar or line plot, and whether to include a trend line (either "loess" or "lm").

library(patchwork)
p1 <- amzn_ratings_over_time(sample_reviews, time = "year", viz_type = "bar")
p2 <- amzn_ratings_over_time(sample_reviews, time = "month", viz_type = "line", trend = "lm")
p1 + p2
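
As a rough sketch of the aggregation behind a ratings-over-time plot (not the package's own code), the snippet below averages star ratings by year on a toy data frame. The column names date and stars are assumptions made for illustration.

```r
library(dplyr)

# Toy data; the column names `date` and `stars` are assumptions
toy_reviews <- tibble(
  date  = as.Date(c("2021-03-01", "2021-07-15", "2022-01-10", "2022-02-20")),
  stars = c(5, 3, 4, 2)
)

ratings_by_year <- toy_reviews %>%
  mutate(year = as.integer(format(date, "%Y"))) %>%  # extract the calendar year
  group_by(year) %>%
  summarise(avg_stars = mean(stars), n_reviews = n(), .groups = "drop")

ratings_by_year
```

Here 2021 averages 4 stars and 2022 averages 3 stars, with two reviews in each year. Grouping on format(date, "%Y-%m") instead would give a monthly series.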

Plotting terms by their average ratings

For every term, calculate how frequently it is used in the data set, estimate the average rating of the reviews that include it, and plot the result as a scatter plot. The overall average rating for the product is shown as a dashed line, so terms above (below) that line are more often associated with positive (negative) reviews. This is useful when looking for language to tie back to a product.

amzn_terms_by_rating(sample_reviews)
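
The calculation described above can be sketched on toy data as follows. This is an illustration under assumed column names (id, stars, text), not the function's actual source.

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# Toy reviews; the column names are assumptions
toy_reviews <- tibble(
  id    = 1:4,
  stars = c(5, 4, 1, 2),
  text  = c("perfect fit", "nice fit", "poor fit", "poor quality")
)

overall_avg <- mean(toy_reviews$stars)  # overall average rating

term_ratings <- toy_reviews %>%
  unnest_tokens(word, text) %>%
  distinct(id, word, stars) %>%         # count each term once per review
  group_by(word) %>%
  summarise(n_reviews = n(), avg_rating = mean(stars), .groups = "drop") %>%
  mutate(above_average = avg_rating > overall_avg)

# Scatter of term frequency against average rating, with the overall
# average drawn as a dashed reference line
ggplot(term_ratings, aes(n_reviews, avg_rating, label = word)) +
  geom_text() +
  geom_hline(yintercept = overall_avg, linetype = "dashed")
```

In this toy example "fit" appears in three reviews with an average rating of about 3.33, just above the overall average of 3, while "poor" averages 1.5 and falls below the line.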

Wordclouds

Some people want them, so there are three types: overall across all ratings, comparative based on low (1 and 2 star) and high (4 and 5 star) ratings, and comparative based on positive or negative sentiment. Sentiment is estimated using the Bing lexicon.

# amzn_review_wc(sample_reviews, type = "overall") # total
# amzn_review_wc(sample_reviews, type = "comparison", comp_type = "sentiment") # by sentiment
amzn_review_wc(sample_reviews, type = "comparison", comp_type = "rating") # by hi and lo rating

Put the terms in context

A convenience function that wraps the kwic() function from the quanteda package (https://quanteda.io/). Dig into how specific words are being used via word match, regex, or phrase matching. The window argument controls the number of terms returned on either side of the key term.

# term_context(sample_reviews, pattern = "perfect", window = 8, valuetype = "glob") # standard
# term_context(sample_reviews, pattern = "perf", valuetype = "regex", window = 4) # regex
term_context(sample_reviews, pattern = "perfect fit", window = 4) # phrase
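
For reference, the three matching modes can be tried directly with quanteda's kwic() on a couple of made-up sentences. How term_context() passes its arguments to kwic() is not shown in this README, so the sketch below only demonstrates kwic() itself; the example text is invented.

```r
library(quanteda)

# Two invented review snippets
txt <- c(r1 = "This sweater is a perfect fit for my small dog.",
         r2 = "Not a perfect fit, but the color is perfectly fine.")

toks <- tokens(txt, remove_punct = TRUE)

# Whole-word match (kwic() defaults to valuetype = "glob")
kw_word <- kwic(toks, pattern = "perfect", window = 3)

# Regular expression: matches both "perfect" and "perfectly"
kw_regex <- kwic(toks, pattern = "^perf", valuetype = "regex", window = 3)

# Multi-word phrase: wrap the pattern in phrase()
kw_phrase <- kwic(toks, pattern = phrase("perfect fit"), window = 3)

kw_phrase
```

Each result has one row per match, with the window of tokens before and after the keyword in separate columns.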

Track emotional valence of reviews

Uses the NRC lexicon to match terms to emotions. Which emotions are more prevalent in 5 star reviews than in 1 star reviews?

library(patchwork)
emotion_star <- text_to_emotion(sample_reviews, "stars")
# emotion_star$data
p1 <- emotion_star$graph1
p2 <- emotion_star$graph2
p1 / p2

The emotional valence of reviews can also be split out by the year of the review.

library(patchwork)
emotion_year <- text_to_emotion(sample_reviews, "year")
# emotion_year$data
p1 <- emotion_year$graph1
p2 <- emotion_year$graph2
p1 / p2

Estimate the sentiment of the reviews

Estimate the sentiment of each review using the sentimentr package by Tyler Rinker. Sentiment is calculated at the sentence level, and the "review sentiment" is then the weighted average across all sentences in that review. The package accounts for "valence shifters" by applying additional weights to terms that can negate or amplify sentiment.

The returned data is the full review data frame with sentiment data added - word count for each review, standard deviation for the sentiment estimate for each review (only if the review includes more than 1 sentence), and the estimated sentiment for each review.

sentiment <- amzn_review_sentiment(sample_reviews)
sentiment$graph
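
To see the per-review columns described above, sentimentr can be called directly on a pair of invented reviews. This is a minimal sketch of the underlying package, not of amzn_review_sentiment() itself, whose internals are not shown here.

```r
library(sentimentr)

# Two invented reviews
reviews <- c("Great sweater. My dog loves it.",
             "Not good. The stitching came apart after one wash.")

# Split each review into sentences, then average sentence-level
# sentiment back up to one row per review
sentences <- get_sentences(reviews)
review_sentiment <- sentiment_by(sentences)

review_sentiment
```

The result has one row per review with the columns word_count, sd, and ave_sentiment, matching the word count, standard deviation, and estimated sentiment described above.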

Co-occurrence networks

Visualize which terms most frequently co-occur within reviews. Networks are split out by star rating, so the user must specify which star rating to plot and the minimum number of co-occurrences (the floor) required for a pair to be included.

# term co-occurrence 
# amzn_cooccur_net(sample_reviews, star = 1, nn = 4)
amzn_cooccur_net(sample_reviews, star = 5, nn = 15)
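
The counting step behind a co-occurrence network can be sketched with the widyr package. Whether GSPtext uses widyr internally is not stated in this README, so treat this as one possible approach; the toy data and the column names id and text are assumptions.

```r
library(dplyr)
library(tidytext)
library(widyr)

# Toy reviews for a single star rating; column names are assumptions
toy_reviews <- tibble(
  id   = 1:3,
  text = c("perfect fit warm", "perfect fit", "warm fit")
)

min_cooccur <- 2  # floor on the number of co-occurrences

term_pairs <- toy_reviews %>%
  unnest_tokens(word, text) %>%
  # count how many reviews each pair of terms appears in together,
  # keeping each unordered pair only once
  pairwise_count(word, id, sort = TRUE, upper = FALSE) %>%
  filter(n >= min_cooccur)

term_pairs
```

Two pairs survive the floor here: "perfect" with "fit" and "fit" with "warm", each co-occurring in two reviews. The resulting edge list could then be handed to a graph plotting package.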


taylorgrant/GSPtext documentation built on April 15, 2023, 10:56 p.m.