extract_table | R Documentation |
If we use Tesseract with HOCR output (e.g. with scan_with_hocr()) to scan a table, we will have all data in a data.frame. This function converts this data.frame to a proper table when we indicate which data.frame elements to use as headers. See Details.
extract_table(df, headers = NULL, lastline = Inf, desc_above = T)
df |
A data.frame like the result of |
headers |
A character vector or NULL indicating how the headers will be determined. See Details. |
lastline |
An integer indicating the last line (number) in |
desc_above |
A Boolean indicating whether a description is above or on the same line as its value (TRUE), or below or on the same line as its value (FALSE) |
A data.frame with table contents.
df should contain the columns line, word, x1, x2 and fld_nr (when headers is specified), or headers_col and headers_col_seq (or aliases) when it is not. From these columns it can be derived which data values belong to which headers.
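As a rough illustration of the expected input shape, the sketch below builds a small hand-made data.frame with the columns named above and checks that they are all present. The values, and the reading of fld_nr as the word number within a line and of x1/x2 as horizontal word boundaries, are assumptions made for this example; real input would come from scan_with_hocr().

```r
# Hypothetical input: two header lines (2 and 3) and one data line (4).
# All values are invented for illustration only.
df1 <- data.frame(
  line   = c(2, 2, 3, 3, 4, 4),
  fld_nr = c(1, 2, 1, 2, 1, 2),          # assumed: word number within the line
  word   = c("Item", "Unit", "name", "price", "Apples", "1.20"),
  x1     = c(10, 200, 10, 200, 10, 200), # assumed: left edge of the word
  x2     = c(60, 240, 60, 250, 80, 240), # assumed: right edge of the word
  stringsAsFactors = FALSE
)

# Columns the Details section lists for the case where headers is specified.
required <- c("line", "word", "x1", "x2", "fld_nr")
missing_cols <- setdiff(required, names(df1))
if (length(missing_cols) > 0) {
  stop("df is missing columns: ", paste(missing_cols, collapse = ", "))
}
```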
The specification of headers can be done in two ways:
Specify a list as in the first example, where we indicate, among other things, that the first header is composed of the first words on lines 2 and 3, and the fourth header of the fourth and fifth words on line 2 and the fourth word on line 3. We use this list as the headers argument in the first example.
Use the column headers_col to indicate to which header a df element belongs. With the headers_col_seq column we can indicate which element will be used first; normally there is no need to specify this. In this case we can specify headers=NULL. When these columns are not named headers_col and headers_col_seq but e.g. h1 and h2, we can specify them with headers=c('h1','h2').
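To make the list form of the header specification more concrete, the sketch below resolves such a list into header texts on a toy data.frame. It assumes that each pair is c(line, word number) and that the word number is matched against fld_nr. The helper lookup_word() is hypothetical and not part of the package, and this is not necessarily how extract_table() works internally.

```r
# Toy input with two header lines; values are invented for illustration.
df1 <- data.frame(
  line   = c(2, 2, 3, 3),
  fld_nr = c(1, 2, 1, 2),
  word   = c("Item", "Unit", "name", "price"),
  stringsAsFactors = FALSE
)

# Each header is a list of positions, assumed to be c(line, word number).
hdr_desc <- list(
  list(c(2, 1), c(3, 1)),   # first header: first words on lines 2 and 3
  list(c(2, 2), c(3, 2))    # second header: second words on lines 2 and 3
)

# Hypothetical helper: fetch the word at a given (line, fld_nr) position.
lookup_word <- function(df, pos) {
  df$word[df$line == pos[1] & df$fld_nr == pos[2]]
}

# Paste the words of each header together in the order they are listed.
header_names <- vapply(hdr_desc, function(parts) {
  words <- vapply(parts, function(pos) lookup_word(df1, pos), character(1))
  paste(words, collapse = " ")
}, character(1))

print(header_names)   # "Item name" "Unit price"
```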
From the header information we can assign the data to the headers. Everything before the first assigned data value is considered description. We assume that data values are always on one line, but descriptions can take more than one line. In that case the argument desc_above is used to determine whether a description line is coupled with the data value below or above it.
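The coupling rule for multi-line descriptions can be pictured with a small base R sketch. It is only an interpretation of the text above, not the package's implementation: each line is flagged as carrying a data value or not, and the hypothetical function couple_lines() returns, for every line, the number of the value line it would be attached to.

```r
# One entry per line: TRUE when the line carries a data value,
# FALSE when it holds description text only (invented example).
has_value <- c(FALSE, TRUE, FALSE, FALSE, TRUE)

# Hypothetical helper illustrating the desc_above idea.
couple_lines <- function(has_value, desc_above = TRUE) {
  idx <- seq_along(has_value)
  value_lines <- idx[has_value]
  vapply(idx, function(i) {
    if (desc_above) {
      # description sits above (or on) its value: take the next value line
      cand <- value_lines[value_lines >= i]
      if (length(cand)) min(cand) else NA_integer_
    } else {
      # description sits below (or on) its value: take the previous value line
      cand <- value_lines[value_lines <= i]
      if (length(cand)) max(cand) else NA_integer_
    }
  }, integer(1))
}

couple_lines(has_value, desc_above = TRUE)    # 2 2 5 5 5
couple_lines(has_value, desc_above = FALSE)   # NA 2 2 2 5
```

With desc_above = FALSE the first line has no value line above it, so the sketch returns NA for it.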
scanner_functions, cleanup_bw() and scan_with_hocr()
## Not run:
# example1: header definition in list
hdr_desc = list(
list(c(2,1),c(3,1)),
list(c(2,2),c(3,2)),
list(c(2,3),c(3,3)),
list(c(2,4),c(2,5),c(3,4))
)
# pass the list defined above as the headers argument
df2 = extract_table(df1,
  headers = hdr_desc, lastline = Inf, desc_above = T)
# example2: header definition in column headers_col
df1 = edit(df1) # change headers_col
df2= extract_table(df1,
headers=NULL,lastline = Inf, desc_above=T)
## End(Not run)