knitr::opts_chunk$set(
  tidy = TRUE,
  tidy.opts = list(blank = FALSE, width.cutoff = 50),
  cache = 1
)
knitr::knit_hooks$set(
  source = function(x, options) {
    if (options$engine == 'R') {
      # format R code
      x = highr::hilight(x, format = 'html')
    } else if (options$engine == 'bash') {
      # format bash code
      x = paste0('<span class="hl std">$</span> ', unlist(stringr::str_split(x, '\\n')), '\n', collapse = '')
    }
    x = paste(x, collapse = "\n")
    sprintf(
      "<div class=\"%s\"><pre class=\"%s %s\"><code class=\"%s %s\">%s</code></pre></div>\n",
      'sourceCode', 'sourceCode', tolower(options$engine),
      'sourceCode', tolower(options$engine), x
    )
  }
)
library(tidyverse)
library(rvest)
Now: Scraping data from web sites (rvest)
Later: Social media (rtweet)
Basic structure of an HTML document
<html>
  <head>
    <title>This appears in the title window</title>
    ... CSS and JavaScript typically go here ...
  </head>
  <body>
    ... Most of what you see goes here ...
  </body>
</html>
html_doc <- xml2::read_html('<html>
  <head>
    <title>This appears in the title window</title>
  </head>
  <body>
  </body>
</html>')
library(tidyverse)
library(rvest)
html_doc %>% html_nodes('head title') %>% html_text()
Most of what you see on a page sits between the <body> and </body> tags
Originally, HTML comprised tags like
* <h1> for header-level-1
* <p> for paragraph break
* <a> for hyperlinks (a = "anchor"), etc.
The standard was made a bit more formal (so computers could read HTML more quickly)
* E.g., <p> was paired with </p>, whereas before it could stand alone
* <br/> became legal, and is the same as <br></br> (br = "line break")
And the standard was made a bit more flexible (so designers could create better-looking pages)
* <div>'s contain blocks of content
* <span>'s contain small sections of content (usually individual words) that should be styled in a special way
* Cascading style sheets (CSS) provide a way to define how content should look and/or behave
<div>
<img>
<a>
<div class="headline" id="story-id-43234">
<img src="/img/another-cat.gif">
<a href="http://www.google.com">
<div style="background-color: black; color: white;" id="story-id-43234">
  ... the headline goes here ...
</div>
<style>
div.headline { background-color: black; color: white; }
</style>
<div class="headline" id="story-id-43234">
  ... the headline goes here ...
</div>
Tags are often given classes that indicate their purpose
class definitions act as roadmaps for where the interesting content might be
<style>
div.headline { background-color: black; color: white; }
</style>
The style defined for div.headline will be applied to any object that matches this selector
div.headline means "any div tag with class="headline""
<div class="headline ptw" id="story-id-43234"></div> could be matched by
* div.headline
* div
* div.headline.ptw
* div#story-id-43234
* and many other selectors
snip <- read_html(
  '<div class="headline ptw" id="story-id-43234"></div>'
) %>% html_node('body')
snip %>% html_nodes('div.headline')
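To check the claim that several different selectors reach the same element, the sketch below counts the matches for each selector listed above. It assumes only that rvest is installed; the names `selectors` and `match_counts` are introduced here for illustration and are not part of the original slides.

```r
library(rvest)

# The same single-div snippet used above.
snip <- read_html('<div class="headline ptw" id="story-id-43234"></div>')

# The selectors listed above; each one should find the same div.
selectors <- c('div.headline', 'div', 'div.headline.ptw', 'div#story-id-43234')

# Count how many nodes each selector matches.
match_counts <- sapply(selectors, function(s) length(html_nodes(snip, s)))
match_counts
```

Every entry of `match_counts` being 1 means each selector matched the single div, so the choice between them is mostly a question of how specific you need to be on a real page.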
<div class="articleBody">
  <div class="mainBody">
    <div class="quotedBlock">
      A quote by a famous thinker
    </div>
  </div>
</div>
The div containing the famous quote text is matched by all of the following selectors
* div.quotedBlock
* .quotedBlock
* div.articleBody > div.mainBody > div.quotedBlock
html_body <- read_html('<div class="articleBody">
  <div class="mainBody">
    <div class="quotedBlock">
      A quote by a famous thinker
    </div>
  </div>
</div>
') %>% html_nodes('body')
html_body %>% html_nodes('div.articleBody > div.mainBody > div.quotedBlock')
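As a quick check that the short and the fully qualified selectors reach the same node, the following sketch extracts the text with each of the three selectors and compares the results. The variable names `quote_doc`, `selectors` and `quote_texts` are assumptions made for this example.

```r
library(rvest)

# Nested divs mirroring the example above.
quote_doc <- read_html('<div class="articleBody">
  <div class="mainBody">
    <div class="quotedBlock">A quote by a famous thinker</div>
  </div>
</div>')

selectors <- c(
  'div.quotedBlock',
  '.quotedBlock',
  'div.articleBody > div.mainBody > div.quotedBlock'
)

# Extract the text found by each selector, trimming surrounding whitespace.
quote_texts <- sapply(selectors, function(s) {
  trimws(html_text(html_nodes(quote_doc, s)))
})
quote_texts
```

The longer selector is more robust when a page contains other elements with class quotedBlock outside the article body; the shorter ones are easier to write and survive small layout changes.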
|Selector |Example |Example description |
|---------|--------|-------------------------------------------------------------|
|.class |.headline |Selects all elements with class="headline" |
|#id |#firstname |Selects the element with id="firstname" |
|* |* |Selects all elements |
|element |p |Selects all <p> elements |
|element element |div p |Selects all <p> elements inside <div> elements |
|element > element |div > p |Selects all <p> elements where the parent is a <div> element |
|element + element |div + p |Selects all <p> elements that are placed immediately after <div> elements |
|:first-child |p:first-child |Selects every <p> element that is the first child of its parent |
|:first-of-type |p:first-of-type |Selects every <p> element that is the first <p> element of its parent |
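The differences between the combinators in the table are easiest to see on a small page. The sketch below uses a hypothetical HTML fragment (invented for this example, not taken from a real site) and a small helper, `texts_for()`, which is likewise an assumption, to show which paragraphs each selector picks up.

```r
library(rvest)

# Hypothetical page: two <p> directly inside a div, one <p> directly after
# that div, and one <p> nested inside a blockquote within a second div.
demo <- read_html('<html><body>
  <div class="headline" id="firstname"><p>one</p><p>two</p></div>
  <p>three</p>
  <div class="box"><blockquote><p>four</p></blockquote></div>
</body></html>')

# Helper (illustrative): text of every node matched by a CSS selector.
texts_for <- function(css) html_text(html_nodes(demo, css))

texts_for('p')                # every <p> on the page
texts_for('div p')            # <p> anywhere inside a <div>
texts_for('div > p')          # <p> whose parent is a <div>
texts_for('div + p')          # <p> placed immediately after a <div>
texts_for('p:first-child')    # <p> that is the first child of its parent
texts_for('p:first-of-type')  # <p> that is the first <p> of its parent
```

Comparing `div p` with `div > p` shows the descendant versus child distinction: the paragraph inside the blockquote is a descendant of a div but not a direct child of one.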
```r
cat_html <- . %>%
  as.character %>%
  stringr::str_replace('.*\\n', '') %>%
  knitr::asis_output()
cat_break_html <- . %>%
  as.character %>%
  stringr::str_replace_all('><', '>\n<') %>%
  stringr::str_replace('.*\\n', '') %>%
  knitr::asis_output()
html_snippet <- read_html(
  '<html><body>
  <div class="outer">
    <div class="inner">
      This is the target
    </div>
  </div>
  </body></html>'
)
html_snippet %>% cat_html
```
library(rvest)
html_snippet %>% html_nodes(css='div')
html_snippet %>% html_nodes(css='div.inner')
```r
html_snippet %>% cat_html
```
html_snippet %>% html_nodes('div.outer > div.inner')
html_snippet %>% html_nodes('div.outer > div.inner') %>% html_text()
html_snippet %>% html_nodes('.inner') %>% html_text()
```r
html_snippet <- read_html(
  '<html><body>
  <div class="ads">...</div>
  <div class="foo">
    <p>This is the target</p>
  </div>
  <div class="foo">
    <p>This is not the target</p>
  </div>
  </body></html>'
)
html_snippet %>% cat_html
```
html_snippet %>% html_nodes('div.foo')
html_snippet %>% html_nodes('div.ads + div.foo > p') %>% html_text()
```r
html_snippet <- read_html('<html><body><div class="ads"><div class="dyn-ad"/><div class="dyn-ad"/></div><div class="headline"><h1>Bad yogurt</h1></div><div class="article"><p>It is everywhere these days.</p><p>Get some now.</p></div></body></html>')
html_snippet %>% cat_break_html
```
html_snippet %>% html_structure
html_snippet <- read_html('<a href="http://eur.nl">Erasmus</a>') %>% html_nodes('body')
<a href="http://eur.nl">Erasmus</a> produces this: Erasmus
<a> is a tag (node) indicating a link should be produced
html_snippet %>% html_nodes('a')
href="http://eur.nl" is the attribute determining the link target
html_snippet %>% html_nodes('a') %>% html_attr('href')
Erasmus is text embedded in this tag; it is highlighted and clickable
html_snippet %>% html_nodes('a') %>% html_text()
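In practice the link text and the link target are usually wanted side by side. The following is a minimal sketch of that pattern, assuming a made-up fragment with two links; the second link and the names `links_doc`, `anchors` and `links` are illustrative only.

```r
library(rvest)

# Hypothetical fragment containing two links.
links_doc <- read_html('<p>
  <a href="http://eur.nl">Erasmus</a>
  <a href="http://www.google.com">Google</a>
</p>')

# Select the <a> nodes once, then pull text and attribute from the same set
# so that the rows stay aligned.
anchors <- html_nodes(links_doc, 'a')
links <- data.frame(
  text = html_text(anchors),
  url  = html_attr(anchors, 'href'),
  stringsAsFactors = FALSE
)
links
```

Selecting the nodes once and deriving both columns from that one node set avoids mismatched rows, which can happen if two separate selectors return different numbers of nodes.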
library(rvest)
library(tidyverse)
p <- read_html('https://www.eur.nl/en/about-eur/faculties-and-schools')
schools <- p %>% '[your code goes here]'
in order to yield this result:
program_urls <- p %>% '[your code goes here]'
in order to yield this result:
See https://www.dropbox.com/s/garkmh5hxlsxlmu/job_browsing_handout.html?dl=1 or http://bit.ly/2HrzRHf
html_table()
https://www.imdb.com/title/tt0081777/fullcredits
p <- read_html('https://en.wikipedia.org/wiki/Most_common_words_in_English')
p %>%
  html_nodes('table.wikitable') %>%
  map(~html_table(.)) %>%
  bind_rows
Use html_table() to extract the first table at https://en.wikipedia.org/wiki/Most_common_words_in_English
p <- read_html('https://en.wikipedia.org/wiki/Most_common_words_in_English')
p %>%
  html_node('table.wikitable') %>%
  html_table() %>%
  as_tibble()
* html_node() to select a table node with class wikitable; the CSS selector is therefore table.wikitable
* html_table() to extract the table
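The same two steps can be tried without a network connection. The sketch below runs html_node() and html_table() on a small inline table; the table contents are a hypothetical miniature written for this example, not data scraped from the Wikipedia page, and the names `table_doc` and `words` are assumptions.

```r
library(rvest)

# Hypothetical miniature of a table with class "wikitable".
table_doc <- read_html('<table class="wikitable">
  <tr><th>Rank</th><th>Word</th></tr>
  <tr><td>1</td><td>the</td></tr>
  <tr><td>2</td><td>be</td></tr>
</table>')

# html_node() returns the first match only, so html_table() yields a single
# table rather than a list of tables.
words <- html_table(html_node(table_doc, 'table.wikitable'))
words
```

With html_nodes() instead of html_node(), html_table() would return a list with one element per matched table, which is why the earlier multi-table example maps over the result before binding rows.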