htmltab: Assemble Data Frames from HTML Tables

HTML tables are a valuable data source, but extracting and recasting these data into a useful format can be tedious. This package collects structured information from HTML tables. It is similar to readHTMLTable() of the XML package but provides three major advantages. First, the function automatically expands row and column spans in the header and body cells. Second, users are given more control over the identification of the header and body rows that end up in the R table, including semantic header information that appears throughout the body. Third, the function preprocesses the table code, corrects common types of malformations, and removes unneeded parts, which reduces the need for tedious post-processing.
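
To make the first advantage concrete, here is a minimal sketch of what expanding row and column spans means. It is a language-agnostic illustration in Python, not the package's R implementation: the function name expand_spans and the (text, rowspan, colspan) cell representation are assumptions made for this example only.

```python
def expand_spans(rows):
    """Expand rowspan/colspan cells into a rectangular grid.

    `rows` is a list of table rows; each row is a list of
    (text, rowspan, colspan) tuples, mirroring the attributes
    of <td>/<th> cells. This layout is hypothetical.
    """
    grid = {}  # (row_index, col_index) -> cell text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already filled by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            # Copy the cell's text into every position it covers.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Positions never covered by any cell stay None.
    return [[grid.get((r, c)) for c in range(n_cols)] for r in range(n_rows)]


# A two-row header: "Region" spans two rows, "Sales" spans two columns.
table = [
    [("Region", 2, 1), ("Sales", 1, 2)],
    [("2015", 1, 1), ("2016", 1, 1)],
    [("North", 1, 1), ("10", 1, 1), ("12", 1, 1)],
]
expanded = expand_spans(table)
for line in expanded:
    print(line)
```

After expansion every row has the same number of cells, so the header rows can be combined column by column and the body maps directly onto data frame columns.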

Author
Christian Rubba [aut, cre]
Date of publication
2016-05-28 14:16:29
Maintainer
Christian Rubba <christian.rubba@gmail.com>
License
MIT + file LICENSE
Version
0.7.0

Man pages

check_type
Produce the table node
create_inbody
Reshape in-table header information into wide format
eval_body
Evaluate and deparse the body argument
eval_header
Evaluate and deparse the header argument
get_body_xpath
Return body XPath
get_cell_element
Extracts cell elements
get_header_elements
Extracts header elements
get_head_xpath
Return header XPath
get_span
Extracts rowspan information
get_trindex
Return trindex given an XPath
htmltab
Assemble a data frame from HTML table data
identify_elements
Assemble XPath expressions for header and body
normalize_tr
Normalizes rows to be nested in tr tags, header in thead,...
num_xpath
Generate numeric XPath expression
rm_empty_cols
Remove columns which do not have data values
rm_empty_rows
Remove rows which do not have data values
rm_nuisance
Remove nuisance elements from the table code
select_tab
Selects the table from the HTML code

Files in this package

htmltab
htmltab/inst
htmltab/inst/doc
htmltab/inst/doc/htmltab.html
htmltab/inst/doc/htmltab.Rmd
htmltab/inst/doc/htmltab.R
htmltab/tests
htmltab/tests/testthat.R
htmltab/tests/testthat
htmltab/tests/testthat/test_multi-dim-header.R
htmltab/tests/testthat/test_find_header.R
htmltab/tests/testthat/test_inputs.R
htmltab/tests/testthat/test_expand_spans.R
htmltab/NAMESPACE
htmltab/NEWS
htmltab/R
htmltab/R/utils.R
htmltab/R/header.R
htmltab/R/identify_rows.R
htmltab/R/setup_and_checks.R
htmltab/R/body.R
htmltab/R/colnames.R
htmltab/R/inbody_header.R
htmltab/R/htmltab.R
htmltab/R/zzz.R
htmltab/vignettes
htmltab/vignettes/htmltab.Rmd
htmltab/MD5
htmltab/build
htmltab/build/vignette.rds
htmltab/DESCRIPTION
htmltab/man
htmltab/man/eval_body.Rd
htmltab/man/rm_empty_rows.Rd
htmltab/man/check_type.Rd
htmltab/man/normalize_tr.Rd
htmltab/man/num_xpath.Rd
htmltab/man/eval_header.Rd
htmltab/man/get_header_elements.Rd
htmltab/man/get_trindex.Rd
htmltab/man/select_tab.Rd
htmltab/man/create_inbody.Rd
htmltab/man/rm_empty_cols.Rd
htmltab/man/get_cell_element.Rd
htmltab/man/rm_nuisance.Rd
htmltab/man/identify_elements.Rd
htmltab/man/get_body_xpath.Rd
htmltab/man/get_span.Rd
htmltab/man/htmltab.Rd
htmltab/man/get_head_xpath.Rd
htmltab/LICENSE