ft_extract: Extract text from a single pdf document

Description

ft_extract attempts to make it easy to extract text from PDFs, using a variety of extraction tools. Inputs can be either paths to PDF files or the output of ft_get (class ft_data).

Usage

ft_extract(x, which = "xpdf", ...)

## S3 method for class 'gs_char'
print(x, ...)

## S3 method for class 'xpdf_char'
print(x, ...)

Arguments

x

Path to a PDF file, or an object of class ft_data (the output of ft_get).

which

One of gs or xpdf (default).

...

Further arguments passed on. For xpdf, these are command-line flags; see Details.

Details

For xpdf, you can pass additional options via flags. See Examples. Right now, you can't pass options to Ghostscript if you're using the gs option.
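The flag strings used in Examples ("-layout", "-table", "-f 3", "-l 2") can be assembled programmatically. The following is a minimal sketch; xpdf_flags is a hypothetical helper, not part of fulltext, and it only covers the four flags that appear in Examples.

```r
# Hypothetical helper (not part of fulltext): build an xpdf flag string
# from the options shown in the Examples section.
xpdf_flags <- function(layout = FALSE, table = FALSE, first = NULL, last = NULL) {
  flags <- character(0)
  if (layout) flags <- c(flags, "-layout")                        # preserve page layout
  if (table) flags <- c(flags, "-table")                          # preserve table structure
  if (!is.null(first)) flags <- c(flags, paste("-f", as.integer(first)))  # first page
  if (!is.null(last)) flags <- c(flags, paste("-l", as.integer(last)))    # last page
  paste(flags, collapse = " ")
}

xpdf_flags(layout = TRUE)        # "-layout"
xpdf_flags(first = 3, last = 5)  # "-f 3 -l 5"
```

A single-flag result such as xpdf_flags(layout = TRUE) matches the string passed in ft_extract(path, "xpdf", "-layout"). Whether ft_extract accepts several flags combined in one string is an assumption here; the Examples only show one flag per call.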

xpdf installation: See http://www.foolabs.com/xpdf/download.html for instructions on how to download and install xpdf. On OS X, you can also get xpdf via Homebrew.

Ghostscript installation: See http://www.ghostscript.com/doc/9.16/Install.htm for instructions on how to download and install Ghostscript.
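Before calling ft_extract, it can help to confirm that the external tools are actually on the PATH. This is a rough sketch using base R's Sys.which; extraction_tools_available is a hypothetical helper, and the binary names are assumptions (pdftotext for the xpdf text extractor, gs for Ghostscript; on Windows the Ghostscript executable usually has a different name, such as gswin64c).

```r
# Hypothetical helper (not part of fulltext): report which extraction
# tools can be found on the PATH. Binary names are assumptions.
extraction_tools_available <- function() {
  tools <- c(xpdf = "pdftotext", gs = "gs")
  found <- unname(Sys.which(tools))  # "" when a command is not found
  res <- found != ""
  names(res) <- names(tools)         # label results by the `which` values
  res
}

extraction_tools_available()
```

The result is a named logical vector whose names match the values accepted by the which argument, so it can be used to decide between "xpdf" and "gs".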

Value

An object of class gs_char or xpdf_char.

Examples

## Not run: 
path <- system.file("examples", "example1.pdf", package = "fulltext")

(res_xpdf <- ft_extract(path)) # xpdf is the default
(res_xpdf <- ft_extract(path, "xpdf"))
(res_gs <- ft_extract(path, "gs"))

# pass on options to xpdf
## preserve layout from pdf
ft_extract(path, "xpdf", "-layout")
## preserve table structure as much as possible
ft_extract(path, "xpdf", "-table")
## last page to convert is page 2
ft_extract(path, "xpdf", "-l 2")
## first page to convert is page 3
ft_extract(path, "xpdf", "-f 3")

# use on output of ft_get() to extract pdf to text
## arxiv
res <- ft_get('cond-mat/9309029', from = "arxiv")
res2 <- ft_extract(res)
res$arxiv$data
res2$arxiv$data
res2$arxiv$data$data[[1]]$data

## biorxiv
res <- ft_get('10.1101/012476')
res2 <- ft_extract(res)
res$biorxiv$data
res2$biorxiv$data
res2$biorxiv$data$data[[1]]$data

## End(Not run)
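
Since ft_extract works on one PDF path at a time, several files can be handled by looping over the paths. The sketch below is one possible approach, not something fulltext provides: extract_each is a hypothetical helper, and fake_extract is a stand-in extractor so the code runs without fulltext or xpdf installed.

```r
# Hypothetical helper (not part of fulltext): apply a single-file extractor
# to several paths, capturing errors per file instead of stopping early.
extract_each <- function(paths, extractor) {
  results <- lapply(paths, function(p) {
    tryCatch(extractor(p), error = function(e) e)
  })
  names(results) <- basename(paths)
  results
}

# Stand-in extractor used only to make this sketch self-contained
fake_extract <- function(p) {
  if (!grepl("\\.pdf$", p)) stop("not a PDF: ", p)
  paste("text of", basename(p))
}

out <- extract_each(c("a.pdf", "notes.txt"), fake_extract)

# TRUE for every input whose extraction raised an error
failed <- vapply(out, inherits, logical(1), what = "error")
```

With fulltext installed, the stand-in could be swapped for function(p) ft_extract(p, "xpdf").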

fulltext documentation built on May 29, 2017, 12:09 p.m.