knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(familjelivscrapR)

Introduction

This vignette's goal is to enable you to use familjelivscrapR in a meaningful way. To achieve this, it describes the package's functionality and the rationale behind it. First, a brief overview of the structure of gamla.familjeliv.se is given. Please note that familjeliv.se and gamla.familjeliv.se are identical in structure and content, but not in their "scrapability": familjeliv.se, for instance, limits the number of thread pages you can go back in time to 100, while gamla.familjeliv.se does not. Second, the functions of familjelivscrapR are described. Third, examples are provided which show how to use them.

The structure of gamla.familjeliv.se

Figure 1 depicts the structure of www.familjeliv.se.

Fig. 1: structure of www.familjeliv.se

It is extremely simple: sections contain sub-sections, which in turn contain threads.

The functions

A tibble containing the main sections' names and URLs can be obtained using get_main_sections(); get_sub_sections() takes a main section's suffix and returns the sub-sections it contains.
get_threads() returns the threads within one section. The function needs at least one argument: the section's suffix (e.g., "/Forum-19-89/"). The arguments thread_start_date and latest_entry can be used to narrow down the search. The former corresponds to the date the thread was started, the latter to the point in time at which the latest posting was added to the thread. Both need to be entered as character strings formatted as "YYYY-MM-DD". scrape_thread() then returns a thread's content. Its main argument, suffix, is the suffix of the thread to be scraped (e.g., "Forum-27-260/m49908859.html").
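
Taken together, a minimal sketch of these calls could look as follows. The suffixes are the examples from above, the dates are arbitrary placeholders, and I assume here that the suffix can be passed as the first positional argument:

sections <- get_main_sections()  # tibble with names and suffixes of the main sections

# threads in one sub-section, narrowed down by start date and latest entry
threads_example <- get_threads("/Forum-19-89/",
                               thread_start_date = "2020-02-01",
                               latest_entry = "2020-03-31")

# content of a single thread
thread_content <- scrape_thread("Forum-27-260/m49908859.html")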

Exemplary workflow

Due to its simple forum structure, there are not many possibilities when it comes to scraping familjeliv. In the following, I will show how you could go about scraping postings that relate to the COVID-19 pandemic in Sweden. I will perform data wrangling using dplyr and other packages from the tidyverse. As I want to scrape multiple links, I need functions that iterate over them; I will use the purrr package for this. An introduction to this tidy approach to using R can be found in Hadley Wickham's and Garrett Grolemund's "R for Data Science", which is also available online. However, you could also perform these operations using base R and the apply family (or simple loops).

I am interested in postings from the economy and society sections. First, I need to acquire these sections' suffixes. There are multiple main sections in the familjeliv forum. I will scrape them all and extract the suffixes of the sub-sections whose names contain the terms "samhälle" or "ekonomi":

# get all sub-sections and keep those whose names mention "samhälle" or "ekonomi"
(section_suffix <- get_main_sections() %>%
  dplyr::pull(suffix) %>% 
  purrr::map_dfr(get_sub_sections) %>% 
  dplyr::filter(stringr::str_detect(name, stringr::fixed("samhälle", ignore_case = TRUE)) |
                  stringr::str_detect(name, stringr::fixed("ekonomi", ignore_case = TRUE))) %>% 
  dplyr::pull(suffix))

The second step is to get all the threads within these sections. Thereafter, I can filter out the ones whose names do not contain relevant keywords. The Corona pandemic only really started to gain traction in the Swedish public discourse in mid-February. Hence, I do not need to scrape threads that were started before February 1; I can achieve this by providing the date in the thread_start_date argument. Thereafter, I look for keywords that hint at COVID-19-related content:

# keep threads started on or after 2020-02-01 whose names contain a COVID-related keyword
covid_keywords <- paste(c("corona", "pandemi", "covid"), collapse = "|")
(threads <- purrr::map_dfr(section_suffix, get_threads, thread_start_date = "2020-02-01") %>% 
  dplyr::filter(stringr::str_detect(thread_name, stringr::regex(covid_keywords, ignore_case = TRUE))) %>% 
  dplyr::pull(thread_link))

Now the threads can be scraped (disclaimer: because they would have to be scraped anew every time this document is knit, I will just exemplify the process by scraping one of them):

# scrape a single thread; map() returns a list, pluck(1) extracts the resulting tibble
(covid_content <- purrr::map(threads[2], scrape_thread) %>% purrr::pluck(1))
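
If you wanted the full sample instead of a single thread, the same pattern scales to all links. A sketch, assuming (as above) that scrape_thread() returns one tibble per thread:

# scrape every matched thread and bind the resulting tibbles row-wise
covid_content_all <- purrr::map_dfr(threads, scrape_thread)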

Some notes on the scraping

In my experience, the scraping functions work well for approximately 95 percent of threads (and more than 99 percent of the respective thread pages). Still, even though familjeliv is an incredibly basic web page, it contains some bugs. An example would be this page: http://gamla.familjeliv.se/Forum-26-434/m61706935-157.html. Here, the "normal" quote function the forum provides its users with failed, and an entire posting was copied into the quote. This results in the scraper counting more postings, dates, and times than there actually are, which would probably go totally unnoticed if it weren't for the author column, which is still correct and only of length 10. I have not found a way to remove these postings cleanly. I could have forced the tibble columns to be of the same length (e.g., by truncating them to the length of the shortest vector), but this would result in misassigned postings and, hence, has the potential to distort the results. Therefore, I chose a defensive approach: if there is a problem with a page, the function only fills in the thread URL column and inserts "broken thread page, approximately 10 postings are missing" in the posting column.
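
Such placeholder rows are easy to remove afterwards. A minimal sketch, assuming the scraped tibble's posting column is indeed named posting as described above:

# drop rows that the scraper flagged as broken thread pages
covid_content_clean <- covid_content %>% 
  dplyr::filter(!stringr::str_detect(posting, stringr::fixed("broken thread page")))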


