htmltab: Assemble a data frame from HTML table data


Robust and flexible methods for extracting structured information from HTML tables.

htmltab(
  doc,
  which = NULL,
  header = NULL,
  headerFun = function(node) XML::xmlValue(node),
  headerSep = " >> ",
  body = NULL,
  bodyFun = function(node) XML::xmlValue(node),
  complementary = TRUE,
  fillNA = NA,
  rm_superscript = TRUE,
  rm_escape = " ",
  rm_footnotes = TRUE,
  rm_nodata_cols = TRUE,
  rm_nodata_rows = TRUE,
  rm_invisible = TRUE,
  rm_whitespace = TRUE,
  colNames = NULL,
  ...
)

`doc`	the HTML document, given as a file name, a URL, or a document already parsed with the XML package's parsing functions
`which`	a vector of length one that identifies the table in the document: either a number giving the table's rank or a character string containing an XPath expression for the table
`header`	the header formula, see details for specifics
`headerFun`	a function that is executed over the header cell nodes
`headerSep`	a character vector that is used as a separator in the construction of the table's variable names (default ' >> ')
`body`	a vector that specifies which table rows should be used as body information. A numeric vector can be specified where each element corresponds to a table row. A character vector may be specified that describes an XPath for the body rows. If left unspecified, htmltab tries to use semantic information from the HTML code
`bodyFun`	a function that is executed over the body cell nodes
`complementary`	logical, should htmltab ensure complementarity of header, in-body header and body elements (default TRUE)?
`fillNA`	character vector of symbols that are replaced by NA (documented default c(''); the usage above shows NA)
`rm_superscript`	logical, should superscript information be removed from header and body cells (default TRUE)?
`rm_escape`	a character vector that, if specified, is used to replace escape sequences in header and body cells (default ' ')
`rm_footnotes`	logical, should semantic footer information be removed (default TRUE)?
`rm_nodata_cols`	logical, should columns that have no alphanumeric data be removed (default TRUE)?
`rm_nodata_rows`	logical, should rows that have no alphanumeric data be removed (default TRUE)?
`rm_invisible`	logical, should nodes that are not visible be removed (default TRUE)? This includes elements with class 'sortkey' and 'display:none' style.
`rm_whitespace`	logical, should leading/trailing whitespace be removed from cell values (default TRUE)?
`colNames`	a character vector of column names, or a function that can be used to replace specific column names (default NULL)
`...`	additional arguments passed to HTML parsers
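
To make the role of `headerSep` concrete, here is a minimal base R sketch of how labels from two header rows could be joined into one name per column. It only illustrates the naming scheme described above and is not htmltab's internal code; the vectors `top` and `bottom` are made-up data.

```r
# Hypothetical two-row header: 'top' holds the spanning labels,
# 'bottom' holds the labels directly above each column.
top    <- c("League", "League", "Cup", "Cup")
bottom <- c("Apps", "Goals", "Apps", "Goals")

# Same separator as the documented default of headerSep
header_sep <- " >> "

# Join the two levels into a single name per column
col_names <- paste(top, bottom, sep = header_sep)
print(col_names)
```

Changing `header_sep` changes only the glue between levels, which is the effect the `headerSep` argument is documented to have on the variable names.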

The header formula has the following format: level1 + level2 + level3 + ... . level1 specifies the main header dimension (the column names) and must refer to table rows. level2 and deeper levels specify header dimensions that appear throughout the body; they must refer to cell elements, not rows. Each level may be one of the following types:

The NULL value (default): no information is passed, and htmltab tries to identify header elements through heuristics (heuristics only work for the main header)
A numeric vector that retrieves rows in the respective position
A character string of an XPath expression
A function that when evaluated produces a numeric or character vector
0, when the process of finding the main header should be skipped (only works for main header)
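
The idea behind level2 and deeper headers can be sketched in base R: a label that appears in the body is carried down to the rows beneath it and becomes an extra variable. This is a conceptual illustration under assumed data, not htmltab's implementation; the data frame `rows` and the column names `season` and `game` are hypothetical.

```r
# Hypothetical table body in which some rows are in-body headers
rows <- data.frame(
  label     = c("2019", "Game 1", "Game 2", "2020", "Game 1"),
  is_header = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Number each block of rows by the header that precedes it
# (assumes the first row is a header)
group_id <- cumsum(rows$is_header)

# Carry each header label down to the rows in its block
season <- rows$label[rows$is_header][group_id]

# Keep only the data rows, with the header as its own variable
result <- data.frame(
  season = season[!rows$is_header],
  game   = rows$label[!rows$is_header],
  stringsAsFactors = FALSE
)
print(result)
```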

An R data frame
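
To inspect the returned data frame without network access, the following sketch parses a small inline table with `XML::htmlParse()` and passes the parsed document as `doc`. The HTML string and the helper `strip_commas` are invented for illustration, and the call assumes that both the htmltab and XML packages are installed; the code is skipped otherwise.

```r
# Made-up HTML used only for this illustration
html <- "<html><body><table>
<tr><th>Country</th><th>Population</th></tr>
<tr><td>A</td><td>1,000</td></tr>
<tr><td>B</td><td>2,500</td></tr>
</table></body></html>"

# Hypothetical body function: read the cell text and drop thousands separators
strip_commas <- function(node) {
  gsub(",", "", XML::xmlValue(node), fixed = TRUE)
}

# Run only when both packages are available
if (requireNamespace("htmltab", quietly = TRUE) &&
    requireNamespace("XML", quietly = TRUE)) {
  doc <- XML::htmlParse(html, asText = TRUE)  # already parsed document
  tab <- htmltab::htmltab(doc = doc, which = 1, bodyFun = strip_commas)
  str(tab)
}
```

Cell values come back as text from `bodyFun`, so a conversion such as `as.numeric()` on the relevant column may still be needed afterwards.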

## Not run: 
# When no spans are present, htmltab produces output close to XML's readHTMLTable(),
# but it removes many types of non-data elements (footnotes, non-visible HTML elements, etc.)

 url <- "http://en.wikipedia.org/wiki/World_population"
 xp <- "//caption[starts-with(text(),'World historical')]/ancestor::table"
 htmltab(doc = url, which = xp)

 # Body function that strips thousands separators from cell values
 popFun <- function(node) {
   x <- XML::xmlValue(node)
   gsub(',', '', x, fixed = TRUE)
 }

 htmltab(doc = url, which = xp, bodyFun = popFun)

# This table lacks header information, so we provide it through colNames.
# We also need to set header = 0 to indicate that no header is present.
doc <- "http://en.wikipedia.org/wiki/FC_Bayern_Munich"
xp2 <- "//td[text() = 'Head coach']/ancestor::table"
htmltab(doc = doc, which = xp2, header = 0, encoding = "UTF-8", colNames = c("name", "role"))

# htmltab recognizes column spans and collapses the header into a one-dimensional
# vector of variable names. It also removes superscript information automatically,
# since this is usually not of use.

 doc <- "http://en.wikipedia.org/wiki/Usage_share_of_web_browsers"
 xp3 <-  "//table[7]"
 bFun <- function(node) {
   x <- XML::xmlValue(node)
   gsub('%$', '', x)
 }

 htmltab(doc = doc, which = xp3, bodyFun = bFun)


htmltab("https://en.wikipedia.org/wiki/Arjen_Robben", which = 3,
header = 1:2)


# When header information appears throughout the body, you can specify its
# position in the header formula

htmltab(url, which = "//table[@id='team_gamelogs']", header = . + "//td[./strong]")

## End(Not run)