readpdffuns: read text of pdf-file by cell, line or table

read_pdf    R Documentation

read text of pdf-file by cell, line or table

Description

The function read_pdf reads the text of a pdf-file at cell level.
This makes all attributes of the data available for study (e.g. for use in read_pdf_cut).
The output of all read_pdf* functions is a data.frame.

The function read_pdf_line can use the output of read_pdf (i.e. a data.frame) and collect all data per line as a character string.
By specifying the argument by="line" in read_pdf the read_pdf_line function is called automatically.

The functions read_pdf_fields and read_pdf_cut read text of a table from a page of a pdf-file.

The function read_pdf_fields tries to do this automatically by assuming that the header of a field starts before the corresponding data (the x-value of the header is not greater than that of the data, within the given htolerance). This does not always work.
Input for the function is the actual pdf-file.

The function read_pdf_cut uses a description of the fields that gives the lowest x-value of the data of each field. The description is contained in a data.frame that also specifies the name of each field and whether the field can have missing values.
Input for the function is the output of read_pdf, which has to be studied to determine the 'lowest x-value'.

Usage

read_pdf(filename, vtolerance = 6, frame_table = NULL, by = "cell")

read_pdf_line(read_pdf_df)

read_pdf_cut(read_pdf_df, pdf_df, no_data_lines = c(1, 2), id = NULL)

read_pdf_fields(
  filename,
  vtolerance = 2,
  htolerance = 2,
  header_line = 1,
  pageno = 1
)

Arguments

filename

Character string with path of the pdf-file

vtolerance

Numeric scalar with vertical tolerance to fix vertical mismatches

frame_table

data.frame indicating the location of frames on pages. See Details

by

Character string with value "line" or "cell" indicating if text is gathered by text line or cell

read_pdf_df

data.frame created by the read_pdf function with by="cell"

pdf_df

data.frame describing the fields and their lowest x-value. If a field can have missing values then set optmissing to TRUE

no_data_lines

integer vector with line numbers of lines to be deleted

id

Named character or NULL. When not NULL, id will be inserted as a field in the resulting data.frame. The name of the field is the name attribute of id.

htolerance

Numeric scalar with horizontal tolerance to fix mismatches between header and field contents (field contents start before the header)

header_line

Integer indicating which line contains the headers of the table

pageno

Integer indicating the number of the page to read

Value

read_pdf
returns a data.frame with the fields:
"page", "seqnr", "framenr", "width", "height", "space", "x", "y" and "text"
When by = "line" the fields are:
"page", "framenr", "seqnr", "x", "y" and "text"

read_pdf_line always returns a data.frame with the fields:
"page", "framenr", "seqnr", "x", "y" and "text"

read_pdf_fields and read_pdf_cut return a data.frame with the table.
All fields have character values.

Details

The actual reading of a pdf-file uses pdftools::pdf_data as workhorse.

read_pdf
The frame_table is a data.frame that indicates the location of the frames in the pages. The function cut3d() is used to assign a frame number to each cell. See that function for a description.

read_pdf_line
The fields 'seqnr' and 'x' in the output of read_pdf_line are the attributes of the first cell that contributed to 'text'.
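
The per-line collection described here can be illustrated with a minimal base R sketch. It is not the package's implementation: the data.frame cells is a made-up stand-in for read_pdf (by="cell") output, collect_lines is a hypothetical helper, and taking the cell with the lowest x as the "first" cell of a line is an assumption.

```r
# Made-up stand-in for the output of read_pdf(by = "cell"); values are invented.
cells <- data.frame(
  page    = c(1, 1, 1, 1, 1),
  framenr = c(1, 1, 1, 1, 1),
  seqnr   = 1:5,
  x       = c(54, 86, 122, 54, 86),
  y       = c(100, 100, 100, 112, 112),
  text    = c("Team", "Klasse", "Rating", "A1", "3e"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per (page, framenr, y). 'seqnr' and 'x' are
# taken from the first cell of the line (assumed here: the lowest x).
collect_lines <- function(df) {
  df  <- df[order(df$page, df$framenr, df$y, df$x), ]
  key <- paste(df$page, df$framenr, df$y, sep = "_")
  groups <- split(df, factor(key, levels = unique(key)))
  out <- lapply(groups, function(g) {
    data.frame(
      page = g$page[1], framenr = g$framenr[1], seqnr = g$seqnr[1],
      x = g$x[1], y = g$y[1],
      text = paste(g$text, collapse = " "),
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, out)
  rownames(out) <- NULL
  out
}

lines_df <- collect_lines(cells)
print(lines_df)
```

The sketch groups on exact y-values; it does not model vtolerance, which in read_pdf is meant to absorb small vertical mismatches.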

read_pdf_cut
read_pdf_cut uses the output of read_pdf (by="cell") and fills the fields of a table according to the specification in the data.frame pdf_df. See the examples.
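
The idea of assigning cells to fields by their lowest x-value can be sketched in base R with findInterval(). This is only an illustration under stated assumptions, not the code of read_pdf_cut: the x-values and texts are invented, only the columns field and low of pdf_df are used, and the handling of optmissing is left out.

```r
# Field description in the spirit of pdf_df: lowest x-value per field.
pdf_df <- data.frame(
  field = c("Team", "Klasse", "Speler"),
  low   = c(54, 86, 219),
  stringsAsFactors = FALSE
)

# Invented cells of one text line (x is the left edge of each word).
cells <- data.frame(
  x    = c(55, 90, 220, 251),
  text = c("1", "3e", "Jan", "Jansen"),
  stringsAsFactors = FALSE
)

# For each x, findInterval() gives the index of the last 'low' that is <= x.
idx <- findInterval(cells$x, pdf_df$low)
cells$field <- pdf_df$field[idx]

# Join the words that landed in the same field with a single blank.
row <- sapply(
  split(cells$text, factor(cells$field, levels = pdf_df$field)),
  paste, collapse = " "
)
print(row)
```

Studying the x column of the read_pdf output, as the Description suggests, is what supplies sensible values for low.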

read_pdf_fields
read_pdf_fields assumes that the table occupies a whole page and that the columns are defined by the words in the header.
In the following example:

field1         field2           field3
v1a v1b        v2a  v2b     v2c  v3a   v3b

field1 will be filled with "v1a v1b", field2 with "v2a v2b v2c" (v2c starts before the header field3) and field3 with "v3a v3b".
Multiple words in a field are separated by only one blank (even when the original data contains more than one blank).
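
The header rule can be tried out on the two example lines with a small base R sketch. It is an illustration only: character columns stand in for the x-values that a pdf-file would provide, words_with_pos is a hypothetical helper, and reading htolerance as "a word may start up to htolerance before its header" is an assumption about the intended meaning.

```r
header <- "field1         field2           field3"
data   <- "v1a v1b        v2a  v2b     v2c  v3a   v3b"

# Hypothetical helper: the words of a line with their start columns.
words_with_pos <- function(line) {
  m <- gregexpr("[^ ]+", line)
  data.frame(
    x    = as.integer(m[[1]]),
    text = regmatches(line, m)[[1]],
    stringsAsFactors = FALSE
  )
}

h <- words_with_pos(header)
d <- words_with_pos(data)

# Assumed reading of htolerance: a word belongs to the last header whose
# start is not greater than the word's start plus the tolerance.
htolerance <- 2
idx <- findInterval(d$x + htolerance, h$x)

# Collect the words per field, separated by a single blank.
fields <- sapply(
  split(d$text, factor(h$text[idx], levels = h$text)),
  paste, collapse = " "
)
print(fields)
```

With these inputs v2c starts to the left of field3, so the sketch places it in field2.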

Examples

## Not run: 
df1 <- read_pdf (r"(D:\data\R\TTVA\inputs\TTV Amstelveen teamindeling Senioren VJ22.pdf)", by= "line")
names(df1) # [1] "page"    "framenr" "seqnr"   "x"       "y"       "text"
df1 <- read_pdf (r"(D:\data\R\TTVA\inputs\TTV Amstelveen teamindeling Senioren VJ22.pdf)", by= "cell")
names(df1) # [1] "page"    "framenr" "seqnr"   "width"   "height"  "space"   "x"       "y"       "text"

## End(Not run)

## Not run: 
df1 <- read_pdf (r"(D:\data\R\TTVA\inputs\TTV Amstelveen teamindeling Senioren VJ22.pdf)", by= "cell")
df2 <- read_pdf_line(df1) # collect the cells of df1 per line
names(df2) # [1] "page"    "framenr" "seqnr"   "x"       "y"       "text"

## End(Not run)
## Not run: 
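# description of the fields: name, lowest x-value and whether values can be missing
pdf_df <- tibble::tribble(
 ~field, ~low, ~optmissing,
 "Team", 54, FALSE,
 "Klasse", 86, FALSE,
 "Team_Rating", 122, TRUE,
 "Captain", 178, TRUE,
 "Speler", 219, FALSE,
 "Rating", 334, TRUE,
 "Thuis", 369, FALSE
)
# infileS: character string with the path of the pdf-file (not defined in this example)
myfields <- HOQCutil::read_pdf (infileS, vtolerance=2,by="cell")
xx1      <- read_pdf_cut(myfields,pdf_df,no_data_lines = c(1,2),id=c(id="sen"))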

## End(Not run)
## Not run: 
df1 <- read_pdf_fields (r"(D:\data\R\TTVA\inputs\TTV Amstelveen teamindeling Senioren VJ22.pdf)" )
names(df1) # [1] "Teamnr."    "Klasse"     "Teamrating" "Captain"    "Speler"     "Rating"     "Speeldag"

## End(Not run)


HanOostdijk/HOQCutil documentation built on July 28, 2023, 5:56 p.m.