GetSmudges: An Approach to Identifying Irrelevant Noisy Elements on the...

GetSmudges    R Documentation

An Approach to Identifying Irrelevant Noisy Elements on the Page

Description

This function attempts to find rows in the bounding box matrix/data.frame that might be smudges/specks left by the scanning process. It treats an element as a candidate if both its height and its width are smaller than those of a typical character. This is ad hoc, to say the least. It is offered only as a utility; one can implement additional or alternative approaches.
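The size test described above can be sketched as follows. This is only an illustration, not the package's implementation: the helper name findSmallBoxes is hypothetical, and the column names left, bottom, right and top are an assumption about the layout of the bounding box matrix/data.frame.

```r
# Hypothetical helper, not part of Rtesseract: flag rows whose box is
# smaller than a typical character in BOTH width and height.
# Assumes numeric columns named left, bottom, right and top.
findSmallBoxes <- function(bbox, charWidth, charHeight) {
    width  <- abs(bbox[, "right"] - bbox[, "left"])
    height <- abs(bbox[, "top"] - bbox[, "bottom"])
    which(width < charWidth & height < charHeight)
}

# Made-up boxes: two character-sized elements and one tiny mark (row 2).
bbox <- cbind(left   = c(10, 200, 400),
              bottom = c(10,  50, 300),
              right  = c(22, 202, 415),
              top    = c(30,  53, 322))

small <- findSmallBoxes(bbox, charWidth = 10, charHeight = 18)
```

Here only the second row is narrower than 10 and shorter than 18, so small holds the single index 2.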

Usage

GetSmudges(bbox, threshold = 5, charWidth = GetCharWidth(bbox),
             charHeight = GetCharHeight(bbox), anywhere = FALSE)

Arguments

bbox

the bounding box matrix/data.frame for the elements under consideration.

threshold

currently ignored

charWidth

a number for the typical character width on the page

charHeight

a number giving the typical character height on the page

anywhere

if FALSE, we compute the distance between each potential speck and the other elements and check whether it is sufficiently far from any other element (currently 3 characters away in any direction)
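One possible reading of this isolation check is sketched below. It is an assumption about the intent, not the package's code: the helper isIsolated, its nchars argument and the column names left, bottom, right and top are all hypothetical, and the distance is taken to be the gap between box edges along each axis.

```r
# Hypothetical helper: is box i at least nchars characters away from
# every other box, horizontally or vertically?
isIsolated <- function(i, bbox, charWidth, charHeight, nchars = 3) {
    # Normalise the edges so the code does not depend on axis orientation.
    x0 <- pmin(bbox[, "left"], bbox[, "right"])
    x1 <- pmax(bbox[, "left"], bbox[, "right"])
    y0 <- pmin(bbox[, "bottom"], bbox[, "top"])
    y1 <- pmax(bbox[, "bottom"], bbox[, "top"])

    j <- setdiff(seq_len(nrow(bbox)), i)

    # Gap between box i and each other box along each axis
    # (0 when the boxes overlap along that axis).
    hgap <- pmax(0, x0[j] - x1[i], x0[i] - x1[j])
    vgap <- pmax(0, y0[j] - y1[i], y0[i] - y1[j])

    all(hgap > nchars * charWidth | vgap > nchars * charHeight)
}

# Made-up boxes: a character (row 1), a distant speck (row 2) and a
# small mark sitting right next to the character (row 3).
bbox <- cbind(left   = c(10, 200, 24),
              bottom = c(10,  50, 12),
              right  = c(22, 202, 26),
              top    = c(30,  53, 14))

farSpeck  <- isIsolated(2, bbox, charWidth = 10, charHeight = 18)
nearSpeck <- isIsolated(3, bbox, charWidth = 10, charHeight = 18)
```

With these values the distant mark in row 2 counts as isolated, while the mark in row 3 lies within 3 character widths of row 1 and does not.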

Value

An integer vector giving the indices of any rows in the bounding box matrix/data.frame that are considered specks/smudges by this approach.
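When using such an index vector to drop rows, note that negative indexing with an empty vector selects no rows at all in R. The sketch below shows one way to handle this; the helper dropRows and the sample values are hypothetical, and idx stands in for whatever the function returned.

```r
# Hypothetical helper: remove the rows listed in idx, keeping all rows
# when idx is empty.
dropRows <- function(bbox, idx) {
    keep <- setdiff(seq_len(nrow(bbox)), idx)
    bbox[keep, , drop = FALSE]
}

# Made-up boxes for illustration.
bbox <- cbind(left   = c(10, 200, 400),
              bottom = c(10,  50, 300),
              right  = c(22, 202, 415),
              top    = c(30,  53, 322))

noSmudges  <- dropRows(bbox, integer(0))   # nothing flagged: all rows kept
oneSmudge  <- dropRows(bbox, 2L)           # row 2 flagged: two rows remain

# The pitfall: bbox[-integer(0), ] is bbox[integer(0), ], i.e. zero rows.
pitfall <- bbox[-integer(0), , drop = FALSE]
```

Using setdiff() on the row numbers avoids a separate length check on the index vector.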

Author(s)

Duncan Temple Lang

See Also

GetBoxes


duncantl/Rtesseract documentation built on March 25, 2022, 5:50 a.m.