The goal of the grocerycart package is to provide:

+ A suite of collection functions that scrape data from 2 online grocery services: elGrocer & Ocado.
+ A suite of cleaning functions for the data collected from the 2 websites.
+ Datasets containing details from real grocery stores (e.g., products, prices, reviews).
+ Ready-to-use grocery data: customer, order and basket datasets generated using real products. See this vignette for how to quickly generate more grocery store data.
This vignette is a tutorial that walks you through the entire process of scraping, cleaning and analyzing grocery data from elGrocer and Ocado. Feel free to skip directly to the 'Available Datasets' section if you simply want to learn about the built-in datasets.
# Load packages
library(grocerycart)
library(RSelenium)
library(robotstxt)
# Start the Selenium server
remDr <- RSelenium::rsDriver(
  port = netstat::free_port(),
  browser = "firefox",
  verbose = FALSE
)$client
# Check which webpages are not bot friendly
eg_url <- "https://www.elgrocer.com"
# oc_url <- "https://www.ocado.com"

eg_rtxt <- robotstxt(domain = eg_url)
eg_rtxt$comments
eg_rtxt$crawl_delay
eg_rtxt$permissions
# oc_rtxt <- robotstxt(domain = oc_url)

# Can we collect data from the specific webpages that we are interested in?
paths_allowed(domain = eg_url, paths = c("/store", "/stores"))
# paths_allowed(domain = oc_url, paths = c("/browse"))

# Navigate to website
remDr$navigate(eg_url)
# remDr$navigate(oc_url)
Note: In order to play nice with the 2 websites, the scraper functions have a built-in 'sleep functionality'. This means that the functions will stop executing (i.e., go to sleep) for a random time interval, usually between 5 and 10 seconds, whenever the sleep function, nytnyt(), is called within the scraper functions. Also, you can tell the functions to sleep for longer after each iteration by overriding the default arguments sleep_min (default 0) and sleep_max (default 1). An iteration is defined depending on what the function is doing. For example, setting sleep_min = 4 and sleep_max = 8 in oc_collect_product_reviews() will trigger the function to suspend execution for an additional 4 to 8 seconds (time is chosen randomly by the runif() function) after collecting reviews from a product's webpage.
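The idea behind nytnyt() can be pictured roughly like this (a minimal sketch, not the package's exact implementation):

# Sketch of a nytnyt()-style helper: pause execution for a random
# interval between `min` and `max` seconds (illustrative only)
nytnyt_sketch <- function(min = 5, max = 10) {
  wait <- runif(n = 1, min = min, max = max)
  message(sprintf("Sleeping for %.1f seconds...", wait))
  Sys.sleep(wait)
}

nytnyt_sketch()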
The 5 functions that are used to scrape different parts of the elGrocer website all have the same prefix eg_collect_*. Use them in the chronological order presented below.
Consistent features for all collector/scraper functions:

+ Functional programming: via the map() function in the purrr package
+ Output: returns a tibble/table of the data collected
+ Verbose: the cat and crayon packages print to the console the progress being made
+ Beep: the beepr package sounds a 'Work Complete' audio once the required data is collected
The name of the function indicates the type of data that is scraped and returned (e.g., eg_collect_categories() scrapes/returns category data). These functions are verbose, allowing the user to get a sense of the progress being made.
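As a loose illustration of this shared design (helper names here are hypothetical), each collector essentially maps a per-page scraping step over a vector of links and row-binds the results into a tibble:

# Hypothetical sketch of the shared collector pattern: map over links,
# scrape each page, return one tibble. scrape_page() is a stand-in for
# the real per-page logic inside each eg_collect_*() function.
library(purrr)
library(tibble)

scrape_page <- function(link) {
  tibble(link = link, collected_at = Sys.time())
}

collect_sketch <- function(links) {
  map_dfr(links, scrape_page)
}

collect_sketch(c("https://www.elgrocer.com/store", "https://www.elgrocer.com/stores"))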
First, let's grab the links for the locations/areas that contain a store on the elGrocer app.
eg_location <- eg_collect_location_links(remDr = remDr, url = "https://www.elgrocer.com")
eg_location[1:3,]
Next, let's collect the store details from 5 locations. To scrape the store details from all locations, simply drop '[1:5]' from the code below.
The store details data is only visible after clicking on the 'i' icon for a store. To see an example of this, visit the JLT grocery stores webpage and then click on the 'i' icon next to the store card. This will reveal the data that the function below collects (i.e., minimum order amount).
Notice that one of the arguments used is the column of location links that was collected above.
eg_store <- eg_collect_stores_details(remDr, eg_location$location_link[1:5])
eg_store[1:3,]
Next, let's collect the product categories from only 3 stores. Notice that one of the arguments used is the column of store links that was collected above. It is important that you keep the same object name - 'eg_category' - as it is used internally in the eg_collect_subcategories() function mentioned next.
eg_category <- eg_collect_categories(remDr, eg_store$store_link[1:3])
eg_category[1:3,]
Next, let's grab 3 subcategories from the categories that were returned from the function above.
# Randomly choose 3 categories to collect the subcategories from
random_category_links <- sample(x = seq_along(eg_category$category_link), size = 3, replace = FALSE)
eg_subcategory <- eg_collect_subcategories(remDr, eg_category$category_link[random_category_links])
eg_subcategory[1:3,]
Finally, let's collect product data from 2 subcategories. The function uses JavaScript to actively scroll to the bottom of each subcategory page to check for (and potentially load) more products. It stops scrolling when all the products have loaded.
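The scrolling logic can be sketched roughly as follows, using RSelenium's executeScript() (a simplified illustration; the package's internal implementation may differ):

# Rough sketch: keep scrolling to the bottom until the page height stops
# growing, i.e., until no more products load
scroll_to_bottom <- function(remDr) {
  last_height <- -1
  repeat {
    remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
    Sys.sleep(2)  # give newly loaded products time to render
    new_height <- remDr$executeScript("return document.body.scrollHeight;")[[1]]
    if (new_height == last_height) break
    last_height <- new_height
  }
}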
# Randomly choose 2 subcategories to collect the product data from
random_subcategory_links <- sample(x = seq_along(eg_subcategory$subcategory_link), size = 2, replace = FALSE)
eg_product <- eg_collect_items(remDr, eg_subcategory$subcategory_link[random_subcategory_links])

eg_product[1:3,]
That is what you need to know to successfully scrape the elGrocer website for grocery data. Here's a summary of elGrocer data collected:
+ Location (UAE) of grocery stores that elGrocer delivers from
+ Details of each store (i.e., delivery times, minimum order amount)
+ Categories & subcategories of products available in each store: all 3,279 distinct listed categories, plus 1,164 subcategories collected from 300 randomly chosen categories
+ Product details (i.e., name, price, weight, image link) for 17,114 products, collected from 1,000 randomly chosen subcategories
The 5 functions that are used to scrape different parts of the Ocado website all have the same prefix oc_collect_*. Use them in the chronological order presented below.
The name of the function indicates the type of data that is scraped and returned (e.g., oc_collect_product_reviews() scrapes/returns product reviews). These functions are verbose, allowing the user to get a sense of the progress being made.
First, let's grab the category links from the dropdown menu on the Ocado website.
oc_category <- oc_collect_categories(remDr = remDr)
oc_category[1:3,]
Now we can collect general product details (i.e., name, price, image). This function interacts with the JavaScript elements on the webpage (i.e., it clicks on 'show more' until there is no more 'show more') and slowly scrolls down and up the webpage to ensure that all products are loaded before scraping begins.

Here, we will collect the data from only 1 category. The more categories you scrape, the longer the code will run. The function will keep you updated by printing verbose milestones, but it will not predict how much time the collection will take.
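The 'show more' loop described above could look something like this (a hypothetical sketch; the button selector is assumed for illustration):

# Rough sketch: click 'show more' until the button disappears
click_show_more <- function(remDr) {
  repeat {
    btn <- tryCatch(
      remDr$findElement(using = "xpath", "//button[contains(text(), 'Show more')]"),
      error = function(e) NULL  # findElement() errors when no button is left
    )
    if (is.null(btn)) break
    btn$clickElement()
    Sys.sleep(2)  # allow the next batch of products to load
  }
}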
chosen_category_links <- 7
oc_product_general <- oc_collect_product_general(oc_category$link[chosen_category_links])
oc_product_general[1:3,]
We can also collect extra product details such as the country of origin and rating. We will do that for 3 random products in the code below.
random_product_links <- sample(x = seq_along(oc_product_general$product_link), size = 3, replace = FALSE)
oc_product_extra <- oc_collect_product_extra(oc_product_general$product_link[random_product_links[1:3]])
oc_product_extra[1:3,]
If a product has reviews, then we can collect those too. The function will check how many times it needs to click on the next arrow ('>') on each product page in order to collect all the reviews associated with that product. If no reviews exist, then it will return NA and move on to the next product. The verbose output will print to the console how many reviews the function has found as it visits each product page.
oc_product_review <- oc_collect_product_reviews(oc_product_general$product_link[random_product_links[1:3]])
oc_product_review[1:3,]
It is also possible to grab the nutrition table, if it exists, associated with the products. If it does not exist, then the function returns NA and moves on to the next product.
oc_nutrition_table <- oc_collect_nutrition_table(oc_product_general$product_link[random_product_links[1:3]])
oc_nutrition_table[1:3,]
Finally, we also collected country names and flags from Worldometers. The purpose of this was to make it possible to extract the country of origin for the products on the Ocado website.
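For instance, a country list makes it possible to match an origin out of free-text product details (a simplified illustration; the names below are stand-ins, not the package's internal code):

# Illustrative only: extract a country of origin by matching product text
# against a vector of country names
countries <- c("Spain", "Italy", "France")  # stand-in for the full scraped list
origin_text <- "Produce of Spain"
stringr::str_extract(origin_text, paste(countries, collapse = "|"))
#> [1] "Spain"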
This covers the section on scraping data from the Ocado website. Here's a summary of Ocado data collected:
+ All available categories
+ Product details for 1,000 products (i.e., name, price, weight, nutrition table, ingredients, country of origin, rating, text reviews)
+ The 1,000 products were randomly selected from 3 (of the 13) categories due to the large number of products available. Collecting all products would have taken > 11 hours (regardless of hardware) because the system/bot was instructed to sleep within each collector function to prevent overloading the website. The time could be reduced with parallel processing (i.e., opening multiple RSelenium servers at once and using parallel functional programming via the future package in R).
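Such a parallel version is not implemented in the package, but the idea could be sketched with the future/furrr packages (hypothetical; in practice each worker would need to start its own RSelenium server):

# Hypothetical sketch of parallel collection: split the links into chunks
# and let each worker process one chunk. collect_chunk() is a stand-in
# for starting a server and calling the relevant oc_collect_*() function.
library(furrr)
plan(multisession, workers = 3)

collect_chunk <- function(links) {
  tibble::tibble(link = links)  # placeholder for the real scraping step
}

links <- paste0("https://www.ocado.com/products/", 1:9)  # placeholder links
link_chunks <- split(links, cut(seq_along(links), 3, labels = FALSE))
results <- future_map_dfr(link_chunks, collect_chunk)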
# Shut down the Selenium server
remDr$close()
rm(remDr)
gc()
This ends the data collection part of the process. Next up is cleaning the data that you scraped.
A lot of the data cleaning process can be handled with the dplyr package. However, some data wrangling functions were created specifically to clean the data that was scraped from the 2 websites.
For example, the 2 functions extract_energy() and extract_kcal() can be used sequentially to extract the number of kcals in a product from its nutrition table, even if the calories are listed in different units, like kJ.
# Extract product kcals from the nutrition table
data("oc_data")
calories <- extract_energy(oc_data, item = "product", nutrition = "nutrition")
kcal <- extract_kcal(calories)
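The unit handling boils down to a simple conversion; for instance (illustrative arithmetic, not the package's internal code):

# 1 kcal is about 4.184 kJ, so energy listed in kJ converts to kcal like so
kj_to_kcal <- function(kj) kj / 4.184
kj_to_kcal(1046)
#> [1] 250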
Check the files raw_eg_data.R and raw_oc_data.R for more information on how to clean the collected data.
What if you only want to use grocery data, without having to scrape the websites? Well, this package also comes with 16 built-in grocery related datasets.
The elGrocer and Ocado websites were partially scraped, and the data collected was put into different tibbles that can be further analyzed (e.g., joined and visualized).
Datasets collected from elGrocer have the prefix eg_, while Ocado's have the prefix oc_. View the help page for each dataset for more info (e.g., ?oc_data). Continue below to see the available datasets in this package.
# Run the following command to load any of the datasets (in the global environment)
# data("name of any dataset listed below")
data(eg_location)
+ eg_location: names and links of 131 locations that have grocery stores listed on the online grocery delivery service's app.
+ eg_store: details for 184 grocery stores that provide online delivery services.
+ eg_category: 3,278 product categories in different grocery stores.
+ eg_subcategory: 1,156 product subcategories chosen randomly from 300 categories in different grocery stores.
+ eg_product: name, weight, price and image link of more than 15,000 grocery products.
+ eg_data: names and other attributes of over 15,000 grocery products. This table was built by joining eg_product, eg_subcategory and eg_category.

eg_data[c(5, 10, 1000, 2000, 2005),] %>% str()
+ oc_category: 13 category names and links.
+ oc_product_general: general info for almost 9,000 grocery products.
+ oc_product_extra: extra info (e.g., rating, brand) for almost 1,000 grocery products.
+ oc_product_review: reviews for almost 1,000 grocery products.
+ oc_nutrition_table: nutrition tables for almost 1,000 grocery products.
+ oc_data: names and other attributes of almost 9,000 grocery products. This table was built by joining oc_product_general, oc_product_extra, oc_category, oc_product_review and oc_nutrition_table.

oc_data[5006:5010,] %>% str()
The datasets below were generated to mimic simple databases of a fake grocery store, which we will call 'funmart':
1. customer_db_funmart: customer id, name, age, household size and location (4,996 customers).
2. order_db_funmart: order id, customer id, order date, payment method and order time (12,000 orders).
3. basket_db_funmart: basket id, order id, products purchased in each basket and price of products. The fake grocery store, 'funmart', offered 200 products to select from, each with a different selection probability. Over 140,000 products were bought across all baskets combined.
4. grocery_cart: grocery dataset that was created by joining the 3 databases above (see the sketch after this list).
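The join behind grocery_cart could be sketched like this (the key column names below are assumed for illustration and may differ from the actual datasets):

library(dplyr)

data(customer_db_funmart)
data(order_db_funmart)
data(basket_db_funmart)

# Assumed keys: baskets link to orders via an order id, and orders link
# to customers via a customer id
grocery_cart_sketch <- basket_db_funmart %>%
  inner_join(order_db_funmart, by = "order_id") %>%
  inner_join(customer_db_funmart, by = "customer_id")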
Here's a brief overview of the data generation rules:
+ 40% of orders are from 2020 and 60% from 2021
+ 30% of orders are from the 1st half of the year (Jan - June) and 70% from the 2nd half of the year (Jul - Dec)
+ The probability of shopping at each store was calculated according to the number of products (i.e., more products available in a store -> higher probability of ordering from that store)
+ The probability of ordering a product (total products = 12,539) was based on a 'score' metric = number of reviews for that product + % of customers that recommend it (i.e., higher score for a product -> higher probability of ordering that product)
+ The number of products in each basket is normally distributed with a mean of 16 and standard deviation of 4 (minimum is 5 products/basket)
+ 5% of orders from 00:00 to 8:00 am
+ 20% of orders from 8:00 to 10:00 am
+ 25% of orders from 10:00 am to 12:00 pm
+ 25% of orders from 12:00 to 6:00 pm
+ 15% of orders from 6:00 to 10:00 pm
+ 10% of orders from 10:00 pm to 12:00 am
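As a rough illustration (not the package's actual generator), a couple of these rules could be simulated like this:

# Illustrative simulation of two of the rules above
set.seed(2021)
n_orders <- 12000

# 40% of orders from 2020, 60% from 2021
order_year <- sample(c(2020, 2021), size = n_orders, replace = TRUE, prob = c(0.4, 0.6))

# Basket sizes: Normal(mean = 16, sd = 4), truncated at 5 products/basket
basket_size <- pmax(5, round(rnorm(n_orders, mean = 16, sd = 4)))

table(order_year) / n_orders
summary(basket_size)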
In addition to the 4 datasets above, you are able to generate new grocery store data to use in your analysis using the project associated with this package: the grocery project's GitHub.
A myriad of analyses can be conducted on the data in this package. Here are some ideas of what you can do:
1. Analyze text from the product reviews and/or ingredients.
2. Build interactive tables.
3. Create all kinds of graphs to summarize the data.
4. Deploy a recommendation system.
5. Employ a market basket analysis algorithm (e.g., the Apriori or FP-Growth algorithms), as sketched below.
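As a quick illustration of idea 5, a market basket analysis could start from a sketch like this (assuming grocery_cart holds one row per product per basket, with the column names `basket_id` and `product` assumed for illustration):

# Hypothetical market basket analysis sketch with the arules package
library(arules)
data(grocery_cart)

# Convert long-format baskets into arules transactions (column names assumed)
transactions <- as(split(grocery_cart$product, grocery_cart$basket_id), "transactions")

# Mine association rules and inspect the 5 with the highest lift
rules <- apriori(transactions, parameter = list(supp = 0.01, conf = 0.5))
inspect(head(sort(rules, by = "lift"), 5))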
Here are 2 practical examples:
library(tidyverse)
library(ggimage)
library(ggrepel)

data(oc_data)

blue_palette <- c("#99D8EB", "#81C3D7", "#62A7C1", "#3A7CA5",
                  "#285F80", "#16425B", "#0C2C3E", "#051E2C")

# Grab the top 5 most reviewed products from the Ocado data
oc_top5_rev <- oc_data %>%
  select(product, rating, num_of_reviews, recommend, image_link) %>%
  slice_max(n = 5, order_by = num_of_reviews) %>%
  mutate(product = product %>% fct_reorder(num_of_reviews) %>% fct_rev()) %>%
  bind_cols(palette = c("#DFBF61", blue_palette[7], "#BFB394", "#D85252", "#D87B3D"))

# Graph the images of the products and add labels
oc_top5_rev %>%
  ggplot(aes(x = product, y = num_of_reviews)) +
  geom_image(aes(image = image_link), size = .2) +
  geom_label_repel(aes(label = glue::glue("{num_of_reviews} reviews\n{recommend}% recommend"),
                       fill = product),
                   colour = "white",
                   segment.colour = oc_top5_rev$palette,
                   segment.curvature = -0.5,
                   segment.ncp = 3,
                   segment.angle = 20,
                   fontface = "bold",
                   box.padding = unit(2, "cm"),
                   point.padding = unit(2, "cm")) +
  labs(x = "Product", y = "Reviews",
       title = "5 Most Reviewed Products",
       subtitle = "Customer recommendation rate (%)") +
  hrbrthemes::theme_ipsum(grid = FALSE) +
  coord_cartesian(ylim = c(0, 1000)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 15)) +
  scale_fill_manual(values = setNames(oc_top5_rev$palette, levels(oc_top5_rev$product))) +
  theme(legend.position = "none")
library(reactable)

data(oc_data)

# Create palette
oc_palette <- c("#D3CAEC", "#B3A2E7", "#9F8BDC", "#7D67BD",
                "#664EAB", "#513C90", "#36246C", "#281956")

# Number of products per brand
oc_pro <- oc_data %>%
  select(brand, product) %>%
  filter(!is.na(brand)) %>%
  count(brand, name = "products") %>%
  arrange(desc(products))

# Create an interactive table that highlights the average price for each brand
oc_top_pro <- oc_data %>%
  inner_join(oc_pro, by = "brand") %>%
  group_by(brand) %>%
  summarise(products = n(),
            avg_price = round(mean(price, na.rm = TRUE), 2),
            median_price = round(median(price, na.rm = TRUE), 2),
            max_price = round(max(price, na.rm = TRUE), 2),
            min_price = round(min(price, na.rm = TRUE), 2))

oc_pal <- function(x) rgb(colorRamp(c(oc_palette[1], oc_palette[6]))(x), maxColorValue = 255)

oc_top_pro %>%
  reactable(
    defaultSortOrder = "desc",
    defaultSorted = c("products", "avg_price"),
    columns = list(
      avg_price = colDef(style = function(.x) {
        norm_avg_price <- (.x - min(oc_top_pro$avg_price)) /
          (max(oc_top_pro$avg_price) - min(oc_top_pro$avg_price))
        color <- oc_pal(norm_avg_price)
        list(background = color)
      })
    ),
    defaultColDef = colDef(
      header = function(.x) { str_replace(.x, "_", " ") %>% str_to_title() },
      cell = function(.x) format(.x, nsmall = 1),
      align = "center",
      minWidth = 70,
      headerStyle = list(background = "lightgrey")
    ),
    defaultPageSize = 20,
    bordered = TRUE,
    striped = TRUE,
    highlight = TRUE
  )
This is the end of the vignette. For more information, visit the GitHub repository of the grocery project that is associated with this package.