node_which: Find the positions of nodes in a 'xml_nodeset' that match a...
In m-g-h/scrapurrr: Functional Programming Style Webscraping à la purrr

node_which

R Documentation

Find the positions of nodes in an `xml_nodeset` that match a regex pattern

Description

Find the positions of nodes in an `xml_nodeset` that match a regex pattern.

Usage

node_which(nodelist, regex, inc = 0)

Arguments

`nodelist`	An `xml_nodeset`, e.g. as returned by `html_elements()`.
`regex`	`string scalar` giving the regular expression to search for. For regex syntax, see the stringr cheatsheet at https://www.rstudio.com/resources/cheatsheets/
`inc`	`numeric scalar`. Increment added to the returned index. See examples for a use case.

Value

Returns a numeric scalar or vector giving the positions of the matching nodes, each shifted by `inc`.
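
As a rough mental model of the returned value, here is a minimal base R sketch. It assumes that `node_which()` matches the regex against the text of each node and then adds `inc` to every matching position; the actual implementation may differ. The helper `which_text()` is a hypothetical stand-in that works on plain strings and is not part of scrapurrr. The strings are the cell texts of the table used in the Examples.

```r
# Hypothetical stand-in for node_which(), operating on plain strings instead
# of an xml_nodeset. Assumption: matching positions are found first, then
# `inc` is added to each of them.
which_text <- function(texts, regex, inc = 0) {
  which(grepl(regex, texts)) + inc
}

# Cell texts in document order
cells <- c("Alfreds Futterkiste", "Maria Anders", "Germany",
           "Centro comercial Moctezuma", "Francisco Chang", "Mexico")

# Position of the company cell itself
company_pos <- which_text(cells, "Alfreds")

# Position of the cell directly after it (the contact)
contact_pos <- which_text(cells, "Alfreds", inc = 1)
cells[contact_pos]

# A pattern that matches several cells yields a vector of positions
multi_pos <- which_text(cells, "^(Alfreds|Centro)", inc = 1)
cells[multi_pos]
```

The shifted positions can be used directly for subsetting, which is why a pattern with several matches is convenient: each match yields the position of its neighbouring cell.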

Examples

library(rvest)
library(scrapurrr)

# Let's suppose we want to know the owner (the "Contact" column) of
# "Alfreds Futterkiste":
html = "<table>
  <tr>
    <th>Company</th>
    <th>Contact</th>
    <th>Country</th>
  </tr>
  <tr>
    <td>Alfreds Futterkiste</td>
    <td>Maria Anders</td>
    <td>Germany</td>
  </tr>
  <tr>
    <td>Centro comercial Moctezuma</td>
    <td>Francisco Chang</td>
    <td>Mexico</td>
  </tr>
</table>" %>%
  read_html()

# Searching for `td` elements returns an `xml_nodeset` with every table cell:
html_elements(x = html, "td")
# We could pick the cell by a fixed position, but that position may change if
# the page has many tables. Let's use `node_which()` instead. Since the
# "owner" cell always comes directly after the "company" cell, we increment
# by 1:
html_elements(x = html, "td") %>%
  node_which("Alfreds Futterkiste", inc = 1)

m-g-h/scrapurrr documentation built on Aug. 2, 2022, 9:43 a.m.