pdf2df: Convert pdf tables to data.frames
In cstubben/pmcXML: Tools to parse full text XML documents from PMC Open Access


Converts pdf tables loaded using readLines to a data.frame

pdf2df(x, split, captionRow = 1, headerRow = 2, labels, subset)

`x`	a vector of pdf text containing a structured table
`split`	a space-delimited string defining the columns, where w = a single word (no spaces), s = a single letter, d = a decimal number (digits 0-9 plus the characters used in scientific notation, [Ee.-]), and a = any characters, including spaces.
`captionRow`	row(s) containing caption
`headerRow`	row(s) containing header
`labels`	an optional vector specifying which words in headerRow to assign to column names (see the note for details)
`subset`	an optional vector for indexing x, so that the attributes of x are not dropped
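
To make the `split` codes concrete, here is a minimal sketch of how a specification such as "w w d" could be matched against table rows. The helper `spec_to_regex` and the regular expressions it uses are assumptions for illustration only; they are not taken from pmcXML and may differ from what pdf2df does internally.

```r
# Hypothetical helper (not part of pmcXML): build one regular expression
# from a space-delimited column specification.
spec_to_regex <- function(split) {
  # Assumed pattern for each code described in the arguments above
  patterns <- c(
    w = "(\\S+)",        # single word, no spaces
    s = "([A-Za-z])",    # single letter
    d = "([0-9Ee.-]+)",  # digits plus scientific-notation characters
    a = "(.+)"           # any characters, including spaces
  )
  codes <- strsplit(split, " ", fixed = TRUE)[[1]]
  stopifnot(all(codes %in% names(patterns)))
  paste0("^\\s*", paste(patterns[codes], collapse = "\\s+"), "\\s*$")
}

# Two made-up rows standing in for lines of PDF text
rows <- c("fwd1 ACGTACGT 55.2", "rev1 TTGACCAA 6.1e1")
rx <- spec_to_regex("w w d")

# Extract the captured fields of each row and bind them into a data.frame
m <- regmatches(rows, regexec(rx, rows, perl = TRUE))
fields <- do.call(rbind, lapply(m, function(z) z[-1]))
df <- data.frame(fields, stringsAsFactors = FALSE)
df$X3 <- as.numeric(df$X3)  # the "d" column holds numbers
df
```

One capture group per code keeps the mapping between the specification and the resulting columns easy to follow.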

See pmcSupp to read supplementary tables in PDF format. This function converts the resulting character vector into a data.frame.

A data.frame

If the headerRow contains more words than there are columns, the labels option specifies which words to assign to column names. For example, if a two-column table has a header row containing "Primer name Sequence", there are three ways to assign column names:
1) list the positions of the header words to keep: labels = c(1, 3) returns "Primer" "Sequence"
2) list the number of words in each column name: labels = c(2, 1) returns "Primer name" "Sequence"
3) assign new column names: labels = c("id", "seq")
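
The three options can be sketched with a small stand-alone function. `header_labels` is a hypothetical helper, not part of pmcXML. In particular, the rule it uses to tell options 1 and 2 apart (a numeric vector whose sum equals the number of header words is read as word counts) is an assumption and may not match how pdf2df decides.

```r
# Hypothetical helper (not part of pmcXML): resolve column names from a
# header row according to the three options described in the note.
header_labels <- function(header, labels) {
  words <- strsplit(header, "\\s+")[[1]]
  if (is.character(labels)) {
    return(labels)                      # option 3: new column names
  }
  if (sum(labels) == length(words)) {   # assumed rule, see lead-in
    # option 2: number of words in each column name
    ends <- cumsum(labels)
    starts <- ends - labels + 1
    return(mapply(function(a, b) paste(words[a:b], collapse = " "),
                  starts, ends))
  }
  words[labels]                         # option 1: positions of words to keep
}

header <- "Primer name Sequence"
opt1 <- header_labels(header, c(1, 3))
opt2 <- header_labels(header, c(2, 1))
opt3 <- header_labels(header, c("id", "seq"))
```

With this rule a vector such as c(1, 2) would be ambiguous for a three-word header, so the sketch should be read only as an illustration of the three outcomes.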

Chris Stubben

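## Not run: 
id <- "PMC2231364"
doc <- pmcOAI(id)
# load a supplementary table in PDF format (see pmcSupp)
s2 <- pmcSupp(doc, 3)
s2 <- gsub("For ", "", s2) # hack to keep subheader in 1st column only
# two single-word columns; keep header words 1 and 3 as column names
s2 <- pdf2df(s2, "w w", labels = c(1, 3))
head(s2)
attributes(s2)
repeatSub(s2, "For")

## End(Not run)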