The goal of the grocerycart package is to provide:

+ A suite of collection functions that scrape data from 2 online grocery services: elGrocer & Ocado.
+ A suite of cleaning functions for the data collected from the 2 websites.
+ Datasets containing details from real grocery stores (e.g., products, prices, reviews).
+ Ready-to-use grocery data: customer, order and basket datasets generated using real products. See this vignette for how to quickly generate more grocery store data.
This vignette is a tutorial that walks you through the entire process of scraping, cleaning and analyzing grocery data from elGrocer and Ocado. Feel free to skip directly to the 'Available Datasets' section if you simply want to learn about the built-in datasets.
# Load packages
library(grocerycart)
library(RSelenium)
library(robotstxt)
# Start the Selenium server
remDr <- RSelenium::rsDriver(
  port = netstat::free_port(),
  browser = "firefox",
  verbose = FALSE
)$client
# Check which webpages are not bot friendly
eg_url <- "https://www.elgrocer.com"
# oc_url <- "https://www.ocado.com"

eg_rtxt <- robotstxt(domain = eg_url)
eg_rtxt$comments
eg_rtxt$crawl_delay
eg_rtxt$permissions
# oc_rtxt <- robotstxt(domain = oc_url)

# Can we collect data from the specific webpages that we are interested in?
paths_allowed(domain = eg_url, paths = c("/store", "/stores"))
# paths_allowed(domain = oc_url, paths = c("/browse"))

# Navigate to website
remDr$navigate(eg_url)
# remDr$navigate(oc_url)
Note: In order to play nice with the 2 websites, the scraper functions have a built-in 'sleep functionality'. This means that the functions will stop executing (i.e., go to sleep) for a random time interval, usually between 5 and 10 seconds, whenever the sleep function, nytnyt(), is called within the scraper functions. Also, you can tell the functions to sleep for longer after each iteration by overriding the default arguments sleep_min (default 0) and sleep_max (default 1). An iteration is defined depending on what the function is doing. For example, setting sleep_min = 4 and sleep_max = 8 in oc_collect_product_reviews() will trigger the function to suspend execution for an additional 4 to 8 seconds (time is chosen randomly by the runif() function) after collecting reviews from a product's webpage.
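The idea behind nytnyt() can be pictured roughly like this (a minimal sketch, not the package's exact implementation):

# Sketch of a nytnyt()-style helper: pause execution for a random
# interval between `min` and `max` seconds (illustrative only)
nytnyt_sketch <- function(min = 5, max = 10) {
  wait <- runif(n = 1, min = min, max = max)
  message(sprintf("Sleeping for %.1f seconds...", wait))
  Sys.sleep(wait)
}

nytnyt_sketch()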
The 5 functions that are used to scrape different parts of the elGrocer website all have the same prefix eg_collect_*. Use them in the chronological order presented below.
Consistent features for all collector/scraper functions:

+ Functional programming: via the map() function in the purrr package
+ Output: returns a tibble/table of the data collected
+ Verbose: the cat and crayon packages print to the console the progress being made
+ Beep: the beepr package sounds a 'Work Complete' audio once the required data is collected
The name of the function indicates the type of data that is scraped and returned (e.g., eg_collect_categories() scrapes/returns category data). These functions are verbose, allowing the user to get a sense of the progress being made.
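As a loose illustration of this shared design (helper names here are hypothetical), each collector essentially maps a per-page scraping step over a vector of links and row-binds the results into a tibble:

# Hypothetical sketch of the shared collector pattern: map over links,
# scrape each page, return one tibble. scrape_page() is a stand-in for
# the real per-page logic inside each eg_collect_*() function.
library(purrr)
library(tibble)

scrape_page <- function(link) {
  tibble(link = link, collected_at = Sys.time())
}

collect_sketch <- function(links) {
  map_dfr(links, scrape_page)
}

collect_sketch(c("https://www.elgrocer.com/store", "https://www.elgrocer.com/stores"))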
First, let's grab the links for the locations/areas that contain a store on the elGrocer app.
eg_location <- eg_collect_location_links(remDr = remDr, url = "https://www.elgrocer.com")
eg_location[1:3,]
Next, let's collect the store details from 5 locations. To scrape the store details from all locations, simply drop '[1:5]' from the code below.
The store details data is only visible after clicking on the 'i' icon for a store. To see an example of this, visit the JLT grocery stores webpage and then click on the 'i' icon next to the store card. This will reveal the data that the function below collects (i.e., minimum order amount).
Notice that one of the arguments used is the column of location links that was collected above.
eg_store <- eg_collect_stores_details(remDr, eg_location$location_link[1:5])
eg_store[1:3,]
Next, let's collect the product categories from only 3 stores. Notice that one of the arguments used is the column of store links that was collected above. It is important that you keep the same object name - 'eg_category' - as it is used internally in the eg_collect_subcategories() function mentioned next.
eg_category <- eg_collect_categories(remDr, eg_store$store_link[1:3])
eg_category[1:3,]
Next, let's grab 3 subcategories from the categories that were returned from the function above.
# Randomly choose 3 categories to collect the subcategories from
random_category_links <- sample(x = seq_along(eg_category$category_link), size = 3, replace = FALSE)
eg_subcategory <- eg_collect_subcategories(remDr, eg_category$category_link[random_category_links])
eg_subcategory[1:3,]
Finally, let's collect product data from 2 subcategories. The function uses JavaScript to actively scroll to the bottom of each subcategory page to check for (and potentially load) more products. It stops scrolling when all the products have loaded.
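The scrolling logic can be sketched roughly as follows, using RSelenium's executeScript() (a simplified illustration; the package's internal implementation may differ):

# Rough sketch: keep scrolling to the bottom until the page height stops
# growing, i.e., until no more products load
scroll_to_bottom <- function(remDr) {
  last_height <- -1
  repeat {
    remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
    Sys.sleep(2)  # give newly loaded products time to render
    new_height <- remDr$executeScript("return document.body.scrollHeight;")[[1]]
    if (new_height == last_height) break
    last_height <- new_height
  }
}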
# Randomly choose 2 subcategories to collect the product data from
random_subcategory_links <- sample(x = seq_along(eg_subcategory$subcategory_link), size = 2, replace = FALSE)
eg_product <- eg_collect_items(remDr, eg_subcategory$subcategory_link[random_subcategory_links])

eg_product[1:3,]
That is what you need to know to successfully scrape the elGrocer website for grocery data. Here's a summary of elGrocer data collected:
+ Location (UAE) of grocery stores that elGrocer delivers from
+ Details of each store (i.e., delivery times, minimum order amount)
+ Categories & subcategories of products available in each store: all 3,279 distinct listed categories, plus 1,164 subcategories collected from 300 randomly chosen categories
+ Product details (i.e., name, price, weight, image link) for 17,114 products, collected from 1,000 randomly chosen subcategories
The 5 functions that are used to scrape different parts of the Ocado website all have the same prefix oc_collect_*. Use them in the chronological order presented below.
The name of the function indicates the type of data that is scraped and returned (e.g., oc_collect_product_reviews() scrapes/returns product reviews). These functions are verbose, allowing the user to get a sense of the progress being made.
First, let's grab the category links from the dropdown menu on the Ocado website.
oc_category <- oc_collect_categories(remDr = remDr)
oc_category[1:3,]
Now we can collect general product details (i.e., name, price, image). This function interacts with the JavaScript elements on the webpage (i.e., it clicks on 'show more' until there is no more 'show more') and slowly scrolls down and up the webpage to ensure that all products are loaded before scraping begins.

Here, we will collect the data from only 1 category. The more categories you scrape, the longer the code will run. The function will keep you updated by printing verbose milestones, but it will not predict how much time the collection will take.
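The 'show more' loop described above could look something like this (a hypothetical sketch; the button selector is assumed for illustration):

# Rough sketch: click 'show more' until the button disappears
click_show_more <- function(remDr) {
  repeat {
    btn <- tryCatch(
      remDr$findElement(using = "xpath", "//button[contains(text(), 'Show more')]"),
      error = function(e) NULL  # findElement() errors when no button is left
    )
    if (is.null(btn)) break
    btn$clickElement()
    Sys.sleep(2)  # allow the next batch of products to load
  }
}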
chosen_category_links <- 7
oc_product_general <- oc_collect_product_general(oc_category$link[chosen_category_links])
oc_product_general[1:3,]
We can also collect extra product details such as the country of origin and rating. We will do that for 3 random products in the code below.
random_product_links <- sample(x = seq_along(oc_product_general$product_link), size = 3, replace = FALSE)
oc_product_extra <- oc_collect_product_extra(oc_product_general$product_link[random_product_links[1:3]])
oc_product_extra[1:3,]
If a product has reviews, then we can collect those too. The function will check how many times it needs to click on the next arrow ('>') on each product page in order to collect all the reviews associated with that product. If no reviews exist, then it will return NA and move on to the next product. The verbose output will print to the console how many reviews the function has found as it visits each product page.
oc_product_review <- oc_collect_product_reviews(oc_product_general$product_link[random_product_links[1:3]])
oc_product_review[1:3,]
It is also possible to grab the nutrition table, if it exists, associated with the products. If it does not exist, then the function returns NA and moves on to the next product.
oc_nutrition_table <- oc_collect_nutrition_table(oc_product_general$product_link[random_product_links[1:3]])
oc_nutrition_table[1:3,]
Finally, we also collected country names and flags from Worldometers. The purpose of this was to make it possible to extract the country of origin for the products on the Ocado website.
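For instance, a country list makes it possible to match an origin out of free-text product details (a simplified illustration; the names below are stand-ins, not the package's internal code):

# Illustrative only: extract a country of origin by matching product text
# against a vector of country names
countries <- c("Spain", "Italy", "France")  # stand-in for the full scraped list
origin_text <- "Produce of Spain"
stringr::str_extract(origin_text, paste(countries, collapse = "|"))
#> [1] "Spain"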
This covers the section on scraping data from the Ocado website. Here's a summary of Ocado data collected:
+ All available categories
+ Product details for 1,000 products (i.e., name, price, weight, nutrition table, ingredients, country of origin, rating, text reviews)
+ The 1,000 products were randomly selected from 3 (of the 13) categories due to the large number of products available. Collecting all products would have taken > 11 hours (regardless of hardware) because the system/bot was instructed to sleep within each collector function to prevent overloading the website. The time could be reduced with parallel processing (i.e., opening multiple RSelenium servers at once and using parallel functional programming via the future package in R).
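Such a parallel version is not implemented in the package, but the idea could be sketched with the future/furrr packages (hypothetical; in practice each worker would need to start its own RSelenium server):

# Hypothetical sketch of parallel collection: split the links into chunks
# and let each worker process one chunk. collect_chunk() is a stand-in
# for starting a server and calling the relevant oc_collect_*() function.
library(furrr)
plan(multisession, workers = 3)

collect_chunk <- function(links) {
  tibble::tibble(link = links)  # placeholder for the real scraping step
}

links <- paste0("https://www.ocado.com/products/", 1:9)  # placeholder links
link_chunks <- split(links, cut(seq_along(links), 3, labels = FALSE))
results <- future_map_dfr(link_chunks, collect_chunk)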
# Shut down the Selenium server
remDr$close()
rm(remDr)
gc()
This ends the data collection part of the process. Next up is cleaning the data that you scraped.
A lot of the data cleaning process can be handled with the dplyr package. However, some data wrangling functions were created specifically to clean the data that was scraped from the 2 websites.
For example, the 2 functions extract_energy() and extract_kcal() can be used sequentially to extract the number of kcals in a product from its nutrition table, even if the calories are listed in different units, like kJ.
# Extract product kcals from the nutrition table
data("oc_data")
calories <- extract_energy(oc_data, item = "product", nutrition = "nutrition")
kcal <- extract_kcal(calories)
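The unit handling boils down to a simple conversion; for instance (illustrative arithmetic, not the package's internal code):

# 1 kcal is about 4.184 kJ, so energy listed in kJ converts to kcal like so
kj_to_kcal <- function(kj) kj / 4.184
kj_to_kcal(1046)
#> [1] 250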
Check the files raw_eg_data.R and raw_oc_data.R for more information on how to clean the collected data.
What if you only want to use grocery data, without having to scrape the websites? Well, this package also comes with 16 built-in grocery related datasets.
The elGrocer and Ocado websites were partially scraped, and the data collected was put into different tibbles that can be further analyzed (e.g., joined and visualized).
Datasets collected from elGrocer have the prefix eg_, while Ocado's have the prefix oc_. View the help page for each dataset for more info (e.g., ?oc_data). Continue below to see the available datasets in this package.
# Run the following command to load any of the datasets (in the global environment)
# data("name of any dataset listed below")
data(eg_location)
+ eg_location: names and links of 131 locations that have grocery stores listed on the online grocery delivery service's app.
+ eg_store: details for 184 grocery stores that provide online delivery services.
+ eg_category: 3,278 product categories in different grocery stores.
+ eg_subcategory: 1,156 product subcategories chosen randomly from 300 categories in different grocery stores.
+ eg_product: name, weight, price and image link of more than 15,000 grocery products.
+ eg_data: names and other attributes of over 15,000 grocery products. This table was built by joining eg_product, eg_subcategory and eg_category.

eg_data[c(5, 10, 1000, 2000, 2005),] %>% str()
+ oc_category: 13 category names and links.
+ oc_product_general: general info for almost 9,000 grocery products.
+ oc_product_extra: extra info (e.g., rating, brand) for almost 1,000 grocery products.
+ oc_product_review: reviews for almost 1,000 grocery products.
+ oc_nutrition_table: nutrition tables for almost 1,000 grocery products.
+ oc_data: names and other attributes of almost 9,000 grocery products. This table was built by joining oc_product_general, oc_product_extra, oc_category, oc_product_review and oc_nutrition_table.

oc_data[5006:5010,] %>% str()
The datasets below were generated to mimic simple databases of a fake grocery store, which we will call 'funmart':
1. customer_db_funmart: customer id, name, age, household size and location (4,996 customers).
2. order_db_funmart: order id, customer id, order date, payment method and order time (12,000 orders).
3. basket_db_funmart: basket id, order id, products purchased in each basket and price of products. The fake grocery store, 'funmart', offered 200 products to select from, each with a different selection probability. Over 140,000 products were bought across all baskets combined.
4. grocery_cart: grocery dataset that was created by joining the 3 databases above (see the sketch after this list).
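The join behind grocery_cart could be sketched like this (the key column names below are assumed for illustration and may differ from the actual datasets):

library(dplyr)

data(customer_db_funmart)
data(order_db_funmart)
data(basket_db_funmart)

# Assumed keys: baskets link to orders via an order id, and orders link
# to customers via a customer id
grocery_cart_sketch <- basket_db_funmart %>%
  inner_join(order_db_funmart, by = "order_id") %>%
  inner_join(customer_db_funmart, by = "customer_id")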
Here's a brief overview of the data generation rules:
+ 40% of orders are from 2020 and 60% from 2021
+ 30% of orders are from the 1st half of the year (Jan - June) and 70% from the 2nd half of the year (Jul - Dec)
+ The probability of shopping at each store was calculated according to the number of products (i.e., more products available in a store -> higher probability of ordering from that store)
+ The probability of ordering a product (total products = 12,539) was based on a 'score' metric = number of reviews for that product + % of customers that recommend it (i.e., higher score for a product -> higher probability of ordering that product)
+ The number of products in each basket is normally distributed with a mean of 16 and standard deviation of 4 (minimum is 5 products/basket)
+ 5% of orders from 00:00 to 8:00 am
+ 20% of orders from 8:00 to 10:00 am
+ 25% of orders from 10:00 am to 12:00 pm
+ 25% of orders from 12:00 to 6:00 pm
+ 15% of orders from 6:00 to 10:00 pm
+ 10% of orders from 10:00 pm to 12:00 am
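As a rough illustration (not the package's actual generator), a couple of these rules could be simulated like this:

# Illustrative simulation of two of the rules above
set.seed(2021)
n_orders <- 12000

# 40% of orders from 2020, 60% from 2021
order_year <- sample(c(2020, 2021), size = n_orders, replace = TRUE, prob = c(0.4, 0.6))

# Basket sizes: Normal(mean = 16, sd = 4), truncated at 5 products/basket
basket_size <- pmax(5, round(rnorm(n_orders, mean = 16, sd = 4)))

table(order_year) / n_orders
summary(basket_size)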
In addition to the 4 datasets above, you are able to generate new grocery store data to use in your analysis using the project associated with this package: the grocery project's GitHub.
A myriad of analyses can be conducted on the data in this package. Here are some ideas of what you can do:
1. Analyze text from the product reviews and/or ingredients.
2. Build interactive tables.
3. Create all kinds of graphs to summarize the data.
4. Deploy a recommendation system.
5. Employ a market basket analysis algorithm (e.g., the Apriori or FP-Growth algorithms), as sketched below.
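As a quick illustration of idea 5, a market basket analysis could start from a sketch like this (assuming grocery_cart holds one row per product per basket, with the column names `basket_id` and `product` assumed for illustration):

# Hypothetical market basket analysis sketch with the arules package
library(arules)
data(grocery_cart)

# Convert long-format baskets into arules transactions (column names assumed)
transactions <- as(split(grocery_cart$product, grocery_cart$basket_id), "transactions")

# Mine association rules and inspect the 5 with the highest lift
rules <- apriori(transactions, parameter = list(supp = 0.01, conf = 0.5))
inspect(head(sort(rules, by = "lift"), 5))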
Here are 2 practical examples:
library(tidyverse)
library(ggimage)
library(ggrepel)

data(oc_data)

blue_palette <- c("#99D8EB", "#81C3D7", "#62A7C1", "#3A7CA5",
                  "#285F80", "#16425B", "#0C2C3E", "#051E2C")

# Grab the top 5 most reviewed products from the Ocado data
oc_top5_rev <- oc_data %>%
  select(product, rating, num_of_reviews, recommend, image_link) %>%
  slice_max(n = 5, order_by = num_of_reviews) %>%
  mutate(product = product %>% fct_reorder(num_of_reviews) %>% fct_rev()) %>%
  bind_cols(palette = c("#DFBF61", blue_palette[7], "#BFB394", "#D85252", "#D87B3D"))

# Graph the images of the products and add labels
oc_top5_rev %>%
  ggplot(aes(x = product, y = num_of_reviews)) +
  geom_image(aes(image = image_link), size = .2) +
  geom_label_repel(aes(label = glue::glue("{num_of_reviews} reviews\n{recommend}% recommend"),
                       fill = product),
                   colour = "white",
                   segment.colour = oc_top5_rev$palette,
                   segment.curvature = -0.5,
                   segment.ncp = 3,
                   segment.angle = 20,
                   fontface = "bold",
                   box.padding = unit(2, "cm"),
                   point.padding = unit(2, "cm")) +
  labs(x = "Product", y = "Reviews",
       title = "5 Most Reviewed Products",
       subtitle = "Customer recommendation rate (%)") +
  hrbrthemes::theme_ipsum(grid = FALSE) +
  coord_cartesian(ylim = c(0, 1000)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 15)) +
  scale_fill_manual(values = setNames(oc_top5_rev$palette, levels(oc_top5_rev$product))) +
  theme(legend.position = "none")
library(reactable)

data(oc_data)

# Create palette
oc_palette <- c("#D3CAEC", "#B3A2E7", "#9F8BDC", "#7D67BD",
                "#664EAB", "#513C90", "#36246C", "#281956")

# Number of products per brand
oc_pro <- oc_data %>%
  select(brand, product) %>%
  filter(!is.na(brand)) %>%
  count(brand, name = "products") %>%
  arrange(desc(products))

# Create an interactive table that highlights the average price for each brand
oc_top_pro <- oc_data %>%
  inner_join(oc_pro, by = "brand") %>%
  group_by(brand) %>%
  summarise(products = n(),
            avg_price = round(mean(price, na.rm = TRUE), 2),
            median_price = round(median(price, na.rm = TRUE), 2),
            max_price = round(max(price, na.rm = TRUE), 2),
            min_price = round(min(price, na.rm = TRUE), 2))

oc_pal <- function(x) rgb(colorRamp(c(oc_palette[1], oc_palette[6]))(x), maxColorValue = 255)

oc_top_pro %>%
  reactable(
    defaultSortOrder = "desc",
    defaultSorted = c("products", "avg_price"),
    columns = list(
      avg_price = colDef(style = function(.x) {
        norm_avg_price <- (.x - min(oc_top_pro$avg_price)) /
          (max(oc_top_pro$avg_price) - min(oc_top_pro$avg_price))
        color <- oc_pal(norm_avg_price)
        list(background = color)
      })
    ),
    defaultColDef = colDef(
      header = function(.x) { str_replace(.x, "_", " ") %>% str_to_title() },
      cell = function(.x) format(.x, nsmall = 1),
      align = "center",
      minWidth = 70,
      headerStyle = list(background = "lightgrey")
    ),
    defaultPageSize = 20,
    bordered = TRUE,
    striped = TRUE,
    highlight = TRUE
  )
This is the end of the vignette. For more information, visit the GitHub repository of the grocery project that is associated with this package.