In cuche27/CSS: Extract information from an html document with CSS selectors

CSS

Extracting data from an HTML file can be done in R with the XML package, but it is quite tedious. This package addresses that by providing wrapper functions that take CSS paths as input and translate them into XPath queries. Some functions also make it easier to extract specific kinds of information, such as numbers or URLs.

In the R console, run the following commands to install the package from GitHub and try it out:

install.packages("devtools")
library(devtools)
# Current devtools versions expect the repository as "user/repo"
install_github("cuche27/CSS")

library(CSS)

# Let's create a fake html page
doc <- "<html>
<head></head>
<body>
  <div id='character1' class='character'>
    <span class='name'>Mike</span>
    <span class='level digit'>10</span>
    <a href='http://someurl.com'>Complete profile</a>
  </div>
  <div id='character2' class='character'>
    <span class='name'>Stan</span>
    <a href='http://someurl2.com'>Complete profile</a>
  </div>
</body>
</html>"

# parse the html
doc <- htmlParse(doc)

# Extract the names of the characters
cssApply(doc, ".character>.name", cssCharacter)

# Extract the name of character1
cssApply(doc, "#character1>.name", cssCharacter)

# Urls of the profiles
cssApply(doc, ".character>a", cssLink)

# Level of characters
cssApply(doc, ".character>.level", cssNumeric)

# Character 2 has no level; to get an NA for it instead of a missing entry,
# apply the ".level" selector inside each ".character" node
cssApplyInNodeSet(doc, ".character", ".level", cssNumeric)
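
To illustrate what translating a CSS path into an XPath query means, here is a minimal sketch that runs a hand-written XPath equivalent of `.character>.name` with the XML package directly. It is not the package's actual translation code: the `has_class` helper and the variable names are hypothetical, introduced only for this example.

```r
library(XML)

# A reduced copy of the fake page used above
html <- "<html><body>
  <div id='character1' class='character'>
    <span class='name'>Mike</span>
    <span class='level digit'>10</span>
  </div>
  <div id='character2' class='character'>
    <span class='name'>Stan</span>
  </div>
</body></html>"
page <- htmlParse(html, asText = TRUE)

# Hypothetical helper (not part of CSS): builds an XPath predicate that
# matches one class token inside a space-separated class attribute
has_class <- function(cls) {
  sprintf("contains(concat(' ', normalize-space(@class), ' '), ' %s ')", cls)
}

# ".character>.name" written by hand as XPath: any element with class
# "name" that is a direct child of an element with class "character"
xpath <- sprintf("//*[%s]/*[%s]", has_class("character"), has_class("name"))

# Extract the text content of each matching node
character_names <- xpathSApply(page, xpath, xmlValue)
print(character_names)
```

The class test is the verbose part: a plain `@class='level'` comparison would miss the `level digit` span, which is why the predicate pads the attribute with spaces and looks for the token. Writing this by hand for every query is the tedium the wrapper functions are meant to remove.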

cuche27/CSS documentation built on Oct. 20, 2019, 1:09 a.m.
