extract_table: extract a table from Tesseract HOCR scan

extract_table    R Documentation

extract a table from Tesseract HOCR scan

Description

If we use Tesseract with HOCR output (e.g. with scan_with_hocr()) to scan a table, all the scanned data ends up in a data.frame. This function converts that data.frame to a proper table once we indicate which data.frame elements to use as headers. See Details.

Usage

extract_table(df, headers = NULL, lastline = Inf, desc_above = T)

Arguments

df

A data.frame like the result of scan_with_hocr(). See Details.

headers

A character vector or NULL indicating how the headers will be determined. See Details.

lastline

An integer indicating the last line (number) in df that will be used.

desc_above

A Boolean indicating whether a description is above or on the same line as its value (TRUE), or below or on the same line as its value (FALSE).

Value

A data.frame with table contents.

Details

df should contain the columns line, word, x1, x2 and fld_nr (when headers is specified) or headers_col and headers_col_seq (or aliases) when it is not. From these columns it can be derived which data values belong to which headers. The headers can be specified in two ways:

  • specify a list as in the first example, where we indicate, among other things, that the first header is composed of the first words on lines 2 and 3, and the fourth header of the fourth and fifth words on line 2 and the fourth word on line 3. This list is passed as the headers argument.

  • use the column headers_col to indicate to which header a df element belongs. The headers_col_seq column can be used to indicate which element will be used first; normally there is no need to specify it. In this case we can specify headers=NULL. When these columns are not named headers_col and headers_col_seq but e.g. h1 and h2, we can indicate them with headers=c('h1','h2').
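The list form of headers can be illustrated with a small base R sketch. It is not the package's implementation: the toy data.frame, its text column and the helper header_label() are assumptions made up for this illustration, and only the c(line, word) pairs follow the description above.

```r
# Toy stand-in for a scan result; the `text` column name is an assumption.
df1 <- data.frame(
  line = c(2, 2, 2, 2, 2, 3, 3, 3, 3),
  word = c(1, 2, 3, 4, 5, 1, 2, 3, 4),
  text = c("Item", "Unit", "Qty", "Total", "price",
           "name", "cost", "sold", "(EUR)"),
  stringsAsFactors = FALSE
)

# Each header is a list of c(line, word) pairs, as in the Examples section.
hdr_desc <- list(
  list(c(2, 1), c(3, 1)),
  list(c(2, 2), c(3, 2)),
  list(c(2, 3), c(3, 3)),
  list(c(2, 4), c(2, 5), c(3, 4))
)

# Hypothetical helper: look up every c(line, word) pair and paste the words.
header_label <- function(df, spec) {
  words <- vapply(spec, function(p) {
    df$text[df$line == p[1] & df$word == p[2]]
  }, character(1))
  paste(words, collapse = " ")
}

labels <- vapply(hdr_desc, function(s) header_label(df1, s), character(1))
print(labels)
```

In this toy data the fourth header is assembled from three words, two on line 2 and one on line 3, which is the situation the list specification is meant to cover.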

From the header information we can assign the data to the headers. Everything before the first assigned data value is considered description. We assume that data values are always on one line, but descriptions can span more than one line. In that case the argument desc_above determines whether a description line is coupled with the data value below it or above it.
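One way to read the desc_above rule is sketched below in base R. The function couple_lines() and its has_data argument are hypothetical names for this illustration only; the sketch assumes we already know which lines carry a data value, and it is not taken from the package source.

```r
# Hypothetical sketch: for every line, return the index of the data line
# it is coupled with (NA when no such line exists).
couple_lines <- function(has_data, desc_above = TRUE) {
  data_idx <- which(has_data)
  vapply(seq_along(has_data), function(i) {
    cand <- if (desc_above) {
      data_idx[data_idx >= i]        # same line or first data line below
    } else {
      rev(data_idx[data_idx <= i])   # same line or nearest data line above
    }
    if (length(cand) == 0) NA_integer_ else cand[1]
  }, integer(1))
}

# Lines 2 and 5 hold data values; the other lines hold description only.
has_data <- c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)
above <- couple_lines(has_data, desc_above = TRUE)
below <- couple_lines(has_data, desc_above = FALSE)
print(above)
print(below)
```

With desc_above = TRUE the description-only lines 3 and 4 attach to the value on line 5; with desc_above = FALSE they attach to the value on line 2.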

See Also

scanner_functions, cleanup_bw() and scan_with_hocr()

Examples

## Not run: 
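# example1: header definition in a list
# each header is a list of c(line, word) pairs
hdr_desc = list(
  list(c(2,1),c(3,1)),
  list(c(2,2),c(3,2)),
  list(c(2,3),c(3,3)),
  list(c(2,4),c(2,5),c(3,4))
)
# df1 is assumed to be a data.frame like the result of scan_with_hocr()
df2 = extract_table(df1,
   headers = hdr_desc, lastline = Inf, desc_above = TRUE)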

# example2: header definition in column headers_col
df1 = edit(df1) # change headers_col interactively
df2 = extract_table(df1,
   headers = NULL, lastline = Inf, desc_above = TRUE)


## End(Not run)

HanOostdijk/HOQCutil documentation built on July 28, 2023, 5:56 p.m.