In jasonmtroos/rook: Introduction to Data Visualization, Web Scraping, and Text Analysis in R

knitr::opts_chunk$set(tidy = TRUE, tidy.opts=list(blank=FALSE, width.cutoff=50), cache=TRUE)
knitr::opts_chunk$set(
  tidy = TRUE,
  tidy.opts = list(blank = FALSE, width.cutoff = 50),
  cache = 1
)
knitr::knit_hooks$set(
  source = function(x, options) {
    if (options$engine == 'R') {
      # format R code
      x = highr::hilight(x, format = 'html')
    } else if (options$engine == 'bash') {
      # format bash code
      x = paste0('<span class="hl std">$</span> ',
                 unlist(stringr::str_split(x, '\\n')),
                 '\n',
                 collapse = '')
    }
    x = paste(x, collapse = "\n")
    sprintf(
      "<div class=\"%s\"><pre class=\"%s %s\"><code class=\"%s %s\">%s</code></pre></div>\n",
      'sourceCode',
      'sourceCode',
      tolower(options$engine),
      'sourceCode',
      tolower(options$engine),
      x
    )
  }
)

library(tidyverse)
library(rvest)

Today

Now: Scraping data from web sites
- HTML and CSS selectors
- rvest
Later: Social media
- JSON and OATH
- rtweet

Crash course in HTML

Basic structure of an HTML document

<html>
  <head>
    <title>This appears in the title window</title>
    ... CSS and JavaScript typically goes here ...
  </head>
  <body>
    ... Most of what you see goes here ...
  </body>
</html>

html_doc <- xml2::read_html('<html>
  <head>
    <title>This appears in the title window</title>
  </head>
  <body>
  </body>
</html>')

library(tidyverse)
library(rvest)
html_doc %>% html_nodes('head title') %>% html_text

HTML Body

HTML is very flexible about what goes between the <body> and </body> tags Originally, HTML comprised tags like * <h1> for header-level-1 * <p> for paragraph break * <a> for hyperlinks (a = "anchor"), etc. The standard was made a bit more formal (so computers could read HTML more quickly) * E.g., <p> was paired with </p>, whereas before it could stand alone * <br/> became legal, and is the same as <br></br> (br = "line break") And the standard was made a bit more flexible (so designers could create better looking pages) * <div>'s contain blocks of content * <span>'s contain small sections of content (usually individual words) that should be styled in a special way * Cascading style sheets (CSS) provide a way to define how content should look and/or behave

Tags and attributes

Tag
- <div>
- <img>
- <a>
Attribute
- <div class="headline" id="story-id-43234">
- <img src="/img/another-cat.gif">
- <a href="http://www.google.com">

CSS

<div style="{background-color: black; color: white;}" id="story-id-43234">
... the headline goes here ...
</div>

This requires a lot of typing and is prone to mistakes
A better approach is provided by CSS:

<style>
div .headline {
  background-color: black;
  color: white;
}
</style>

<div class="headline" id="story-id-43234">
... the headline goes here ...
</div>

With CSS, tags often include classes that indicate their purpose
When scraping content, these class definitions act as roadmaps for where the interesting content might be

In this style definition:

<style>
div.headline {
  background-color: black;
  color: white;
}
</style>

The style for div.headline will be applied to any object that matches this selector
In CSS, div.headline means "any div tag with class="headline"
You can have more than one class for a given tag, hence

<div class="headline ptw" id="story-id-43234"></div>

could be matched by div.headline div div.headline.ptw div#story-id-43234 and many other selectors

snip <- read_html( '<div class="headline ptw" id="story-id-43234"></div>' ) %>% html_node('body')
snip %>% html_nodes('div.headline')

Nesting tags

HTML tags can be nested

<div class="articleBody">
<div class="mainBody">
<div class="quotedBlock">
A quote by a famous thinker
</div>
</div>
</div>

The div containing the famous quote text is matched by all of the following selectors

div.quotedBlock
.quotedBlock
div.articleBody > div.mainBody > div.quotedBlock

html_body <- read_html('<div class="articleBody">
<div class="mainBody">
<div class="quotedBlock">
A quote by a famous thinker
</div>
</div>
</div>
') %>% html_nodes('body')
html_body %>% html_nodes('div.articleBody > div.mainBody > div.quotedBlock')

CSS Selectors

See http://www.w3schools.com/cssref/css_selectors.asp for a complete listing

|Selector |Example |Example description | |---------|--------|-------------------------------------------------------------| |.class |.headline |Selects all elements with class="headline"| |#id |#firstname |Selects the element with id="firstname"| | |* |Selects all elements | |element |p |Selects all <p> | |element element|div p |Selects all <p> elements inside <div>| |element > element|div > p |Selects all <p> elements where the parent is a <div> element| |element + element*|div + p |Selects all <p> elements that are placed immediately after <div> elements| |:first-child |p:first-child |Selects every <p> element that is the first child of its parent| |:first-of-type |p:first-of-type|Selects every <p> element that is the first <p> element of its parent|

Play this game to learn CSS (if you haven't already)

CSS examples

```r
cat_html <- . %>% as.character %>% stringr::str_replace( '.*\\n','') %>% knitr::asis_output()
cat_break_html <- . %>% as.character %>% stringr::str_replace_all('><','>\n<') %>% stringr::str_replace( '.*\\n','') %>% knitr::asis_output()

html_snippet <- read_html( '<html><body>
<div class="outer">
<div class="inner">
This is the target
</div>
</div>
</body></html>' )
html_snippet %>% cat_html
```

library(rvest)
html_snippet %>% html_nodes(css='div')
html_snippet %>% html_nodes(css='div.inner')

```r
html_snippet %>% cat_html
```

html_snippet %>% html_nodes('div.outer > div.inner')
html_snippet %>% html_nodes('div.outer > div.inner') %>% html_text()
html_snippet %>% html_nodes('.inner') %>% html_text()

```r
html_snippet <- read_html( '<html><body>
<div class="ads">...</div>
<div class="foo">
<p>This is the target</p>
</div>
<div class="foo">
<p>This is not the target</p>
</div>
</body></html>' )
html_snippet %>% cat_html
```

html_snippet %>% html_nodes('div.foo')
html_snippet %>% html_nodes('div.ads + div.foo > p') %>% html_text()

A useful tool

```r
html_snippet <- read_html('<html><body><div class="ads"><div class="dyn-ad"/><div class="dyn-ad"/></div><div class="headline"><h1>Bad yogurt</h1></div><div class="article"><p>It is everywhere these days.</p><p>Get some now.</p></div></body></html>')
html_snippet %>% cat_break_html
```

html_snippet %>% html_structure

wzxhzdk:21

Tags, attributes, text

html_snippet <- read_html('<a href="http://eur.nl">Erasmus</a>') %>% html_nodes('body')

<a href="http://eur.nl">Erasmus</a> produces this: Erasmus
<a> is a tag (node) indicating a link should be produced

html_snippet %>% html_nodes('a')

href="http://eur.nl" is the attribute determining the link target

html_snippet %>% html_nodes('a') %>% html_attr('href')

Erasmus is text embedded in this tag; it is highlighted and clickable

html_snippet %>% html_nodes('a') %>% html_text()

Task 1

library(rvest)
library(tidyverse)
p <- read_html('https://www.eur.nl/en/about-eur/faculties-and-schools')

Complete this code:

schools <- p %>% '[your code goes here]'

in order to yield this result:

wzxhzdk:28

Task 2

Complete this code:

program_urls <- p %>% '[your code goes here]'

in order to yield this result:

wzxhzdk:30 wzxhzdk:31

Task 3: Job browsing demo

See https://www.dropbox.com/s/garkmh5hxlsxlmu/job_browsing_handout.html?dl=1 or http://bit.ly/2HrzRHf

Reading tabular data with `html_table()`

https://www.imdb.com/title/tt0081777/fullcredits

wzxhzdk:32

Task 4:

p <- read_html('https://en.wikipedia.org/wiki/Most_common_words_in_English')
p %>% html_nodes('table.wikitable') %>% map(~html_table(.)) %>% bind_rows

Use html_table() to extract the first table at https://en.wikipedia.org/wiki/Most_common_words_in_English

p <- read_html('https://en.wikipedia.org/wiki/Most_common_words_in_English')
p %>% 
  html_node('table.wikitable') %>% 
  html_table() %>%
  as_tibble()

Hint: Use html_node() to select a table node with class wikitable---the CSS selector is therefore table.wikitable
Hint: Then use html_table() to extract the table

jasonmtroos/rook documentation built on May 24, 2020, 3:16 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

jasonmtroos/rook
Introduction to Data Visualization, Web Scraping, and Text Analysis in R

In jasonmtroos/rook: Introduction to Data Visualization, Web Scraping, and Text Analysis in R

Today

Crash course in HTML

HTML Body

Tags and attributes

CSS

Nesting tags

CSS Selectors

CSS examples

A useful tool

Tags, attributes, text

Task 1

Task 2

Task 3: Job browsing demo

Reading tabular data with `html_table()`

Task 4:

R Package Documentation

Browse R Packages

We want your feedback!

jasonmtroos/rook Introduction to Data Visualization, Web Scraping, and Text Analysis in R

In jasonmtroos/rook: Introduction to Data Visualization, Web Scraping, and Text Analysis in R

Today

Crash course in HTML

HTML Body

Tags and attributes

CSS

Nesting tags

CSS Selectors

CSS examples

A useful tool

Tags, attributes, text

Task 1

Task 2

Task 3: Job browsing demo

Reading tabular data with html_table()

Task 4:

R Package Documentation

Browse R Packages

We want your feedback!

jasonmtroos/rook
Introduction to Data Visualization, Web Scraping, and Text Analysis in R

Reading tabular data with `html_table()`