knitr::opts_chunk$set(
  tidy = TRUE,
  tidy.opts = list(blank = FALSE, width.cutoff = 50),
  cache = 1
)
knitr::knit_hooks$set(
  source = function(x, options) {
    if (options$engine == 'R') {
      # format R code
      x = highr::hilight(x, format = 'html')
    } else if (options$engine == 'bash') {
      # format bash code
      x = paste0('<span class="hl std">$</span> ',
                 unlist(stringr::str_split(x, '\\n')),
                 '\n',
                 collapse = '')
    }
    x = paste(x, collapse = "\n")
    sprintf(
      "<div class=\"%s\"><pre class=\"%s %s\"><code class=\"%s %s\">%s</code></pre></div>\n",
      'sourceCode',
      'sourceCode',
      tolower(options$engine),
      'sourceCode',
      tolower(options$engine),
      x
    )
  }
)
library(tidyverse)
library(rvest)

Today

  1. Now: Scraping data from web sites

    • HTML and CSS selectors
    • rvest
  2. Later: Social media

    • JSON and OAuth
    • rtweet

Crash course in HTML

Basic structure of an HTML document

<html>
  <head>
    <title>This appears in the title window</title>
    ... CSS and JavaScript typically go here ...
  </head>
  <body>
    ... Most of what you see goes here ...
  </body>
</html>
html_doc <- xml2::read_html('<html>
  <head>
    <title>This appears in the title window</title>
  </head>
  <body>
  </body>
</html>')
library(tidyverse)
library(rvest)
html_doc %>% html_nodes('head title') %>% html_text()

HTML Body

HTML is very flexible about what goes between the <body> and </body> tags.

Originally, HTML comprised tags like

    • <h1> for header-level-1
    • <p> for paragraph break
    • <a> for hyperlinks (a = "anchor"), etc.

The standard was made a bit more formal (so computers could read HTML more quickly)

    • E.g., <p> was paired with </p>, whereas before it could stand alone
    • <br/> became legal, and is the same as <br></br> (br = "line break")

And the standard was made a bit more flexible (so designers could create better-looking pages)

    • <div>'s contain blocks of content
    • <span>'s contain small sections of content (usually individual words) that should be styled in a special way
    • Cascading style sheets (CSS) provide a way to define how content should look and/or behave
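The tags described above can be pulled out of a parsed document by name. The following is a minimal sketch, assuming rvest is installed; the page fragment and its contents are invented for illustration and do not come from any real site.

```r
library(rvest)

# Hypothetical page fragment, invented for illustration only.
page <- read_html('<html><body>
  <h1>Course notes</h1>
  <p>First paragraph<br/>with a line break.</p>
  <p>See <a href="https://example.com">the example site</a>.</p>
  <div class="note">A block with a <span class="hl">highlighted</span> word.</div>
</body></html>')

# Select elements by tag name and extract their text or attributes.
heading      <- page %>% html_nodes('h1') %>% html_text()
n_paragraphs <- page %>% html_nodes('p') %>% length()
link_target  <- page %>% html_nodes('a') %>% html_attr('href')

# A <span> nested inside a <div>, matched with a descendant selector.
highlighted  <- page %>% html_nodes('div span') %>% html_text()
```

Each call starts from the whole document, so the order of the calls does not matter.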

Tags and attributes

CSS

<div style="background-color: black; color: white;" id="story-id-43234">
... the headline goes here ...
</div>
<style>
div.headline {
  background-color: black;
  color: white;
}
</style>

<div class="headline" id="story-id-43234">
... the headline goes here ...
</div>

<style>
div.headline {
  background-color: black;
  color: white;
}
</style>
<div class="headline ptw" id="story-id-43234"></div>

This <div> could be matched by

    • div.headline
    • div
    • div.headline.ptw
    • div#story-id-43234

and many other selectors.

snip <- read_html( '<div class="headline ptw" id="story-id-43234"></div>' ) %>% html_node('body')
snip %>% html_nodes('div.headline')
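The selectors listed above can be checked one at a time against the same element. This sketch counts the matches for each; the last two selectors (`.ptw` and `#story-id-43234`) are not in the list above and are added here only as examples of the "many other selectors".

```r
library(rvest)

snip <- read_html('<div class="headline ptw" id="story-id-43234"></div>')

selectors <- c(
  'div.headline',
  'div',
  'div.headline.ptw',
  'div#story-id-43234',
  '.ptw',              # extra example: class only
  '#story-id-43234'    # extra example: id only
)

# Number of nodes matched by each selector (named by selector).
n_matches <- sapply(selectors, function(s) length(html_nodes(snip, s)))
n_matches
```

Every selector should match exactly one node, because the snippet contains a single <div>.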

Nesting tags

<div class="articleBody">
<div class="mainBody">
<div class="quotedBlock">
A quote by a famous thinker
</div>
</div>
</div>
div.quotedBlock
.quotedBlock
div.articleBody > div.mainBody > div.quotedBlock
html_body <- read_html('<div class="articleBody">
<div class="mainBody">
<div class="quotedBlock">
A quote by a famous thinker
</div>
</div>
</div>
') %>% html_nodes('body')
html_body %>% html_nodes('div.articleBody > div.mainBody > div.quotedBlock')
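The three selectors shown earlier for the nested example all reach the same innermost <div>. A short sketch that confirms this, and also shows that the child combinator (`>`) fails when a nesting level is skipped:

```r
library(rvest)

html_body <- read_html('<div class="articleBody">
<div class="mainBody">
<div class="quotedBlock">
A quote by a famous thinker
</div>
</div>
</div>')

selectors <- c(
  'div.quotedBlock',
  '.quotedBlock',
  'div.articleBody > div.mainBody > div.quotedBlock'
)

# Text matched by each selector, with surrounding whitespace trimmed.
texts <- sapply(selectors, function(s) {
  html_body %>% html_nodes(s) %>% html_text(trim = TRUE)
})

# quotedBlock is a grandchild of articleBody, not a direct child,
# so this selector matches nothing.
skipped <- html_body %>% html_nodes('div.articleBody > div.quotedBlock')
```

The shorter selectors are easier to write, while the fully qualified one is less likely to match unrelated elements on a larger page.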

CSS Selectors

| Selector | Example | Example description |
|----------|---------|---------------------|
| .class | .headline | Selects all elements with class="headline" |
| #id | #firstname | Selects the element with id="firstname" |
| * | * | Selects all elements |
| element | p | Selects all <p> elements |
| element element | div p | Selects all <p> elements inside <div> elements |
| element > element | div > p | Selects all <p> elements where the parent is a <div> element |
| element + element | div + p | Selects all <p> elements that are placed immediately after <div> elements |
| :first-child | p:first-child | Selects every <p> element that is the first child of its parent |
| :first-of-type | p:first-of-type | Selects every <p> element that is the first <p> element of its parent |
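Several rows of the table can be tried on one small document. This is a sketch only: the HTML below and the helper name `pick` are invented for illustration.

```r
library(rvest)

# Hypothetical document, invented for illustration only.
doc <- read_html('<html><body>
<div id="intro"><p>one</p><p>two</p></div>
<p class="headline">three</p>
<blockquote><h2>title</h2><p>four</p></blockquote>
</body></html>')

# Hypothetical helper: return the text of every node matching a selector.
pick <- function(css) doc %>% html_nodes(css) %>% html_text()

by_class      <- pick('.headline')        # the <p> with class="headline"
by_id         <- pick('#intro p')         # <p> elements inside the element with id="intro"
child         <- pick('div > p')          # <p> elements whose parent is a <div>
adjacent      <- pick('div + p')          # the <p> immediately after a <div>
first_child   <- pick('p:first-child')    # <p> that is the first child of its parent
first_of_type <- pick('p:first-of-type')  # first <p> under each parent
```

The last two differ because the <p> inside the <blockquote> follows an <h2>: it is the first <p> of its parent but not the first child.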

CSS examples

```r
cat_html <- . %>% as.character %>% stringr::str_replace( '.*\\n','') %>% knitr::asis_output()
cat_break_html <- . %>% as.character %>% stringr::str_replace_all('><','>\n<') %>% stringr::str_replace( '.*\\n','') %>% knitr::asis_output()

html_snippet <- read_html( '<html><body>
<div class="outer">
<div class="inner">
This is the target
</div>
</div>
</body></html>' )
html_snippet %>% cat_html
```
library(rvest)
html_snippet %>% html_nodes(css='div')
html_snippet %>% html_nodes(css='div.inner')

```r
html_snippet %>% cat_html
```
html_snippet %>% html_nodes('div.outer > div.inner')
html_snippet %>% html_nodes('div.outer > div.inner') %>% html_text()
html_snippet %>% html_nodes('.inner') %>% html_text()

```r
html_snippet <- read_html( '<html><body>
<div class="ads">...</div>
<div class="foo">
<p>This is the target</p>
</div>
<div class="foo">
<p>This is not the target</p>
</div>
</body></html>' )
html_snippet %>% cat_html
```
html_snippet %>% html_nodes('div.foo')
html_snippet %>% html_nodes('div.ads + div.foo > p') %>% html_text()

A useful tool

```r
html_snippet <- read_html('<html><body><div class="ads"><div class="dyn-ad"/><div class="dyn-ad"/></div><div class="headline"><h1>Bad yogurt</h1></div><div class="article"><p>It is everywhere these days.</p><p>Get some now.</p></div></body></html>')
html_snippet %>% cat_break_html
```

html_snippet %>% xml2::html_structure()

Tags, attributes, text

html_snippet <- read_html('<a href="http://eur.nl">Erasmus</a>') %>% html_nodes('body')
html_snippet %>% html_nodes('a')
html_snippet %>% html_nodes('a') %>% html_attr('href')
html_snippet %>% html_nodes('a') %>% html_text()
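Tag name, attribute, and text can be collected side by side, which is usually the shape wanted when scraping links. This is a minimal sketch: the first link is the one used above, while the other two list items are invented for illustration.

```r
library(rvest)

# The second and third links are hypothetical, added for illustration.
links_doc <- read_html('<ul>
  <li><a href="http://eur.nl">Erasmus</a></li>
  <li><a href="/en/about-eur">About</a></li>
  <li><a>No target</a></li>
</ul>')

anchors <- links_doc %>% html_nodes('a')

# One row per <a> element: tag name, visible text, and href attribute.
links <- data.frame(
  tag  = html_name(anchors),
  text = html_text(anchors),
  href = html_attr(anchors, 'href'),
  stringsAsFactors = FALSE
)
links
```

`html_attr()` returns `NA` when an element lacks the requested attribute, so the three columns stay the same length.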

Task 1

library(rvest)
library(tidyverse)
p <- read_html('https://www.eur.nl/en/about-eur/faculties-and-schools')
schools <- p %>% '[your code goes here]'

in order to yield this result:

[output omitted]

Task 2

program_urls <- p %>% '[your code goes here]'

in order to yield this result:

[output omitted]

Task 3: Job browsing demo

See https://www.dropbox.com/s/garkmh5hxlsxlmu/job_browsing_handout.html?dl=1 or http://bit.ly/2HrzRHf

Reading tabular data with html_table()

https://www.imdb.com/title/tt0081777/fullcredits
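As an offline illustration of `html_table()`, the sketch below parses an inline document rather than the page linked above. The table contents and class names are invented and do not reflect the structure of that page.

```r
library(rvest)

# Hypothetical page with two small tables (all names and values are invented).
doc <- read_html('<html><body>
<table class="cast">
  <tr><th>Actor</th><th>Role</th></tr>
  <tr><td>Actor A</td><td>Role X</td></tr>
  <tr><td>Actor B</td><td>Role Y</td></tr>
</table>
<table class="crew">
  <tr><th>Name</th><th>Job</th></tr>
  <tr><td>Person C</td><td>Editor</td></tr>
</table>
</body></html>')

# html_nodes() returns every match, so html_table() gives a list of tables.
all_tables <- doc %>% html_nodes('table') %>% html_table()

# html_node() returns only the first match, so html_table() gives one table.
cast <- doc %>% html_node('table.cast') %>% html_table()
```

Header cells (<th>) become column names, and each following row becomes one row of the result.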


Task 4:

p <- read_html('https://en.wikipedia.org/wiki/Most_common_words_in_English')
p %>% html_nodes('table.wikitable') %>% map(~html_table(.)) %>% bind_rows
p <- read_html('https://en.wikipedia.org/wiki/Most_common_words_in_English')
p %>% 
  html_node('table.wikitable') %>% 
  html_table() %>%
  as_tibble()


jasonmtroos/rook documentation built on May 24, 2020, 3:16 p.m.