read_pdf | R Documentation |
The function read_pdf reads the text of a pdf-file at cell level. In this way all attributes of the data are available and can be studied (e.g. for use in read_pdf_cut).
The output of all read_pdf* functions is a data.frame.
The function read_pdf_line takes the output of read_pdf (i.e. a data.frame) and collects all data per line as a character string. When the argument by="line" is specified in read_pdf, the read_pdf_line function is called automatically.
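The collapsing of cells into lines can be illustrated with a minimal base R sketch. It is not the package's implementation: the data.frame cells is hypothetical (shaped like part of the read_pdf output), and cells are grouped on exactly equal y-values, whereas read_pdf has a vtolerance argument to fix vertical mismatches.

```r
# Hypothetical cell-level data, shaped like part of the output of read_pdf(by = "cell")
cells <- data.frame(
  page    = c(1L, 1L, 1L, 1L),
  framenr = c(1L, 1L, 1L, 1L),
  seqnr   = 1:4,
  x       = c(54, 86, 54, 86),
  y       = c(100, 100, 112, 112),
  text    = c("Team", "Klasse", "1", "3e"),
  stringsAsFactors = FALSE
)

# Sort so that cells on the same line are adjacent and ordered left to right
cells <- cells[order(cells$page, cells$y, cells$x), ]

# One key per text line; factor levels follow the order of appearance
key <- paste(cells$page, cells$y)
key <- factor(key, levels = unique(key))

# Keep the attributes of the first cell of each line ...
lines_df <- cells[!duplicated(key), c("page", "framenr", "seqnr", "x", "y")]
# ... and paste the texts of all cells on that line together
lines_df$text <- unname(
  vapply(split(cells$text, key), paste, character(1), collapse = " ")
)
rownames(lines_df) <- NULL
print(lines_df)
```

In this sketch 'seqnr' and 'x' of each line come from the first cell on that line.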
The functions read_pdf_fields and read_pdf_cut read the text of a table from a page of a pdf-file.
The function read_pdf_fields tries to do this automatically by assuming that the header of a field starts before the corresponding data (the x-value of the header is not greater, within a given htolerance). This does not always work. Input for the function is the actual pdf-file.
The function read_pdf_cut uses a description of the fields that gives the lowest x-value of the data of each field. The description is contained in a data.frame that also specifies the name of each field and whether the field can have missing values. Input for the function is the output of read_pdf, which has to be studied to determine the 'lowest x-value'.
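The idea of cutting on the lowest x-value per field can be sketched in base R with findInterval. This is an illustration under stated assumptions, not the code of read_pdf_cut: the data.frame cells is hypothetical, and only the columns field and low of the description are used.

```r
# Field description: name and lowest x-value of the data of each field
pdf_df <- data.frame(
  field = c("Team", "Klasse", "Speler"),
  low   = c(54, 86, 219),
  stringsAsFactors = FALSE
)

# Hypothetical cells of one table row (x-position and text only)
cells <- data.frame(
  x    = c(54, 90, 219, 260),
  text = c("1", "3e", "Jan", "Jansen"),
  stringsAsFactors = FALSE
)

# A cell belongs to the last field whose lower bound is <= its x-value
idx <- findInterval(cells$x, pdf_df$low)
idx[idx == 0] <- NA   # cells left of the first field belong to no field
cells$field <- factor(pdf_df$field[idx], levels = pdf_df$field)

# Collect the text per field; several cells in one field are joined by a blank
table_row <- vapply(split(cells$text, cells$field), paste, character(1),
                    collapse = " ")
print(table_row)
```

Studying the x-values in the cell-level output first is what makes it possible to choose suitable values for low.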
read_pdf(filename, vtolerance = 6, frame_table = NULL, by = "cell")
read_pdf_line(read_pdf_df)
read_pdf_cut(read_pdf_df, pdf_df, no_data_lines = c(1, 2), id = NULL)
read_pdf_fields(
  filename,
  vtolerance = 2,
  htolerance = 2,
  header_line = 1,
  pageno = 1
)
filename | Character string with path of the pdf-file |
vtolerance | Numeric scalar with vertical tolerance to fix vertical mismatches |
frame_table | data.frame indicating data frames on pages. See Details |
by | Character string with value "line" or "cell" indicating if text is gathered by text line or by cell |
read_pdf_df | data.frame created by |
pdf_df | data.frame describing fields and their lower position. If a field can have missing values then set |
no_data_lines | Integer vector with line numbers of lines to be deleted |
id | Named character or |
htolerance | Numeric scalar with horizontal tolerance to fix mismatches in field contents (field starts before header) |
header_line | Integer indicating which lines contain the headers of the table |
pageno | Integer indicating the number of the page to read |
read_pdf returns a data.frame with the fields "page", "seqnr", "framenr", "width", "height", "space", "x", "y" and "text". When by == "line" the fields are "page", "framenr", "seqnr", "x", "y" and "text".
read_pdf_line always returns a data.frame with the fields "page", "framenr", "seqnr", "x", "y" and "text".
read_pdf_fields and read_pdf_cut return a data.frame with the table. All fields have character values.
The actual reading of a pdf-file uses pdftools::pdf_data as workhorse.
read_pdf
The frame_table is a data.frame that indicates the location of the frames in the pages. The function cut3d() is used to assign a frame number to each cell. See that function for a description.
read_pdf_line
The fields 'seqnr' and 'x' in the output of read_pdf_line are the attributes of the first cell that contributed to 'text'.
read_pdf_cut
read_pdf_cut uses the output of read_pdf (by="cell") and fills the fields of a table according to the specification in the data.frame pdf_df. See the examples.
read_pdf_fields
With read_pdf_fields it is assumed that the table occupies a whole page and that the columns are defined by the words in the header.
In the following example

field1       field2          field3
v1a   v1b    v2a v2b   v2c   v3a v3b

field1 will be filled with "v1a v1b", field2 with "v2a v2b v2c" and field3 with "v3a v3b".
Multiple words in a field are separated by only one blank (even when the original data contains more than one blank).
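The header-based assignment can be sketched in base R as follows. This is only an illustration of the rule described above (a header x-value may exceed the x-value of its data by at most htolerance), not the code of read_pdf_fields; the header and data_cells data.frames and their x-values are hypothetical.

```r
htolerance <- 2

# Hypothetical header words with their x-positions
header <- data.frame(
  text = c("Team", "Speler", "Rating"),
  x    = c(50, 120, 200),
  stringsAsFactors = FALSE
)

# Hypothetical data words of one line; "Jan" starts one unit before its header
data_cells <- data.frame(
  text = c("1", "Jan", "de", "Vries", "1500"),
  x    = c(50, 119, 140, 160, 201),
  stringsAsFactors = FALSE
)

# A word belongs to the last header with header x <= word x + htolerance
idx <- findInterval(data_cells$x + htolerance, header$x)
idx[idx == 0] <- NA   # words left of the first header belong to no field
field <- factor(header$text[idx], levels = header$text)

# Words in the same field are joined by exactly one blank
fields <- vapply(split(data_cells$text, field), paste, character(1),
                 collapse = " ")
print(fields)
```

Without the tolerance the word "Jan" (x = 119) would be assigned to the field before "Speler" (x = 120), which is the kind of mismatch htolerance is meant to fix.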
## Not run:
df1 <- read_pdf(r"(D:\data\R\TTVA\inputs\TTV Amstelveen teamindeling Senioren VJ22.pdf)", by = "line")
names(df1) # [1] "page" "framenr" "seqnr" "x" "y" "text"
df1 <- read_pdf(r"(D:\data\R\TTVA\inputs\TTV Amstelveen teamindeling Senioren VJ22.pdf)", by = "cell")
names(df1) # [1] "page" "framenr" "seqnr" "width" "height" "space" "x" "y" "text"
## End(Not run)
## Not run:
df1 <- read_pdf(r"(D:\data\R\TTVA\inputs\TTV Amstelveen teamindeling Senioren VJ22.pdf)", by = "cell")
df2 <- read_pdf_line(df1)
names(df2) # [1] "page" "framenr" "seqnr" "x" "y" "text"
## End(Not run)
## Not run:
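# Field description: name, lowest x-value of the data, and whether values may be missing
pdf_df <- tibble::tribble(
  ~field,        ~low, ~optmissing,
  "Team",          54, FALSE,
  "Klasse",        86, FALSE,
  "Team_Rating",  122, TRUE,
  "Captain",      178, TRUE,
  "Speler",       219, FALSE,
  "Rating",       334, TRUE,
  "Thuis",        369, FALSE
)
# infileS is assumed to hold the path of the pdf-file (it is not defined in this example)
myfields <- HOQCutil::read_pdf(infileS, vtolerance = 2, by = "cell")
xx1 <- read_pdf_cut(myfields, pdf_df, no_data_lines = c(1, 2), id = c(id = "sen"))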
## End(Not run)
## Not run:
df1 <- read_pdf_fields(r"(D:\data\R\TTVA\inputs\TTV Amstelveen teamindeling Senioren VJ22.pdf)")
names(df1) # [1] "Teamnr." "Klasse" "Teamrating" "Captain" "Speler" "Rating" "Speeldag"
## End(Not run)