pdf2df: Convert pdf tables to data.frames


View source: R/pdf2df.R

Description

Converts PDF tables loaded using readLines to a data.frame.

Usage

pdf2df(x, split, captionRow = 1, headerRow = 2, labels, subset)

Arguments

x

a vector of pdf text containing a structured table

split

a space-delimited string defining the columns, where w = a single word (no spaces), s = a single letter, d = a number (digits 0-9 plus the characters used in decimal and scientific notation, [Ee.-]), and a = any characters, including spaces.

captionRow

row(s) containing caption

headerRow

row(s) containing header

labels

an optional vector specifying which words in headerRow to assign to column names (see Note for details)

subset

optional vector for indexing x to avoid dropping attributes
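
To make the split codes more concrete, here is a minimal sketch of how a split string could be turned into a regular expression with one capture group per column. It is not the package's implementation: the helper names split_to_regex and lines_to_df are hypothetical, and the token-to-pattern mapping is an assumption based only on the argument description above.

```r
# Hypothetical helper (not part of pmcXML): translate a split string such as
# "w w d" into one regular expression with a capture group per column.
split_to_regex <- function(split) {
  # Assumed token-to-pattern mapping, inferred from the argument description
  patterns <- c(
    w = "(\\S+)",        # single word, no spaces
    s = "([A-Za-z])",    # single letter
    d = "([0-9Ee.-]+)",  # digits plus decimal/scientific-notation characters
    a = "(.+)"           # any characters, including spaces
  )
  tokens <- strsplit(trimws(split), "\\s+")[[1]]
  unknown <- setdiff(tokens, names(patterns))
  if (length(unknown) > 0) {
    stop("unknown split token(s): ", paste(unknown, collapse = ", "))
  }
  # Assumption: columns are separated by runs of whitespace
  paste0("^\\s*", paste(patterns[tokens], collapse = "\\s+"), "\\s*$")
}

# Hypothetical helper: apply the pattern to each line and bind the captured
# pieces into a data.frame of character columns.
lines_to_df <- function(x, split) {
  m <- regmatches(x, regexec(split_to_regex(split), x))
  matched <- lengths(m) > 0                      # lines that do not fit are dropped
  rows <- lapply(m[matched], function(r) r[-1])  # drop the full-match element
  as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
}

# Made-up input lines, for illustration only
x <- c("fwd1 ACGTTGCA 55.2", "rev1 TTGACCAA 6.1e1")
df <- lines_to_df(x, "w w d")
df  # two rows, three character columns
```

In this sketch the d column stays as text; converting it with as.numeric would be a separate step.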

Details

This function converts a character vector of PDF text into a data.frame. See pmcSupp to read supplementary tables in PDF format.

Value

A data.frame

Note

If the headerRow contains more words than columns, the labels option is used to specify which words to assign to column names. For example, if a two-column table has a header row containing "Primer name Sequence", then there are three options for assigning column names: 1) list the positions of the header words to keep, e.g. labels = c(1, 3) returns "Primer" "Sequence"; 2) list the number of words in each column name, e.g. labels = c(2, 1) returns "Primer name" "Sequence"; or 3) assign new column names, e.g. labels = c("id", "seq").
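
The three labels options can be sketched in a few lines of base R. The helper resolve_labels below is hypothetical and is not the package's code. In particular, the rule for telling option 1 from option 2 (a numeric vector counts as word counts only when it sums to the number of header words) is an assumption made for this illustration.

```r
# Hypothetical helper (not part of pmcXML): resolve column names from the
# words in a header row, following the three options described in the Note.
resolve_labels <- function(header, labels) {
  words <- strsplit(trimws(header), "\\s+")[[1]]
  if (is.character(labels)) {
    # Option 3: labels are the new column names
    return(labels)
  }
  if (sum(labels) == length(words)) {
    # Option 2 (assumed rule): labels give the number of words per column name
    ends <- cumsum(labels)
    starts <- ends - labels + 1
    return(mapply(function(i, j) paste(words[i:j], collapse = " "),
                  starts, ends))
  }
  # Option 1: labels are the positions of the header words to keep
  words[labels]
}

header <- "Primer name Sequence"
resolve_labels(header, c(1, 3))          # keeps words 1 and 3
resolve_labels(header, c(2, 1))          # groups two words, then one
resolve_labels(header, c("id", "seq"))   # replaces the names outright
```

Under this rule a vector such as c(1, 2) would be read as word counts, because it sums to the three header words.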

Author(s)

Chris Stubben

Examples

## Not run: 
id <- "PMC2231364"
doc <- pmcOAI(id)
s2 <- pmcSupp(doc, 3)
s2 <- gsub("For ", "", s2)  # hack to keep subheader in 1st column only
s2 <- pdf2df(s2, "w w", labels = c(1, 3))
head(s2)
attributes(s2)
repeatSub(s2, "For")
## End(Not run)

cstubben/pmcXML documentation built on May 14, 2019, 12:25 p.m.