ocr_dictionary: Check the words in text against a dictionary


View source: R/ocr_dictionary.R

Description

This function checks the quality of OCR text against a dictionary. For each element of the input it returns a number between 0 and 1: the ratio of words found in the dictionary to the total number of words. The higher the number, the better the OCR quality is likely to be. Do not read these scores in an absolute sense: a score of 1 does not indicate perfect OCR. Use them only to compare the relative quality of OCR within a corpus of texts. You can pass a character vector of any length, so if you split a text into chunks, you can evaluate the OCR quality of each chunk.
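The ratio described above can be illustrated with a minimal base R sketch. This is not the package's implementation: `toy_dictionary` and `dictionary_ratio()` are hypothetical names made up for illustration, the tokenization rule (lowercase, split on non-letters) is an assumption, and the real dictionary is far larger than the eight words used here.

```r
# Hypothetical stand-in for the package's dictionary; the real word list is not shown here.
toy_dictionary <- c("four", "score", "and", "seven", "years", "ago", "our", "fathers")

# Illustrative helper (not part of ocrquality): for each element of `text`,
# the share of its words that appear in `dictionary`.
dictionary_ratio <- function(text, dictionary) {
  vapply(text, function(x) {
    # Assumed tokenization: lowercase, then split on anything that is not a letter.
    words <- unlist(strsplit(tolower(x), "[^a-z]+"))
    words <- words[nzchar(words)]
    if (length(words) == 0L) return(NA_real_)
    mean(words %in% dictionary)
  }, numeric(1), USE.NAMES = FALSE)
}

clean <- "Four score and seven years ago"
noisy <- "Fourr score and sleven years ago"
scores <- dictionary_ratio(c(clean, noisy), toy_dictionary)
scores
```

In this sketch the clean chunk scores 1 and the noisy chunk scores 4/6, because "fourr" and "sleven" are not in the toy dictionary. The comparison between the two chunks is what matters, not either number on its own.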

Usage

ocr_dictionary(text, sample_size = -1L)

Arguments

text

A character vector.

sample_size

If this value is positive, this many words from the text will be selected for comparison, which is useful for large texts. The default, -1L, is not positive, so no sample is taken.
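One way to picture the effect of `sample_size` is the base R sketch below. It assumes the words are drawn at random without replacement, which the documentation does not state; `sample_words()` is a hypothetical helper, not a function exported by the package.

```r
# Illustrative helper (not part of ocrquality): keep at most `sample_size` words.
# Assumption: random sampling without replacement; the package may select words differently.
sample_words <- function(words, sample_size = -1L) {
  if (sample_size > 0L && sample_size < length(words)) {
    words <- sample(words, sample_size)
  }
  # A non-positive sample_size (the default) leaves the words untouched.
  words
}

words <- c("four", "score", "and", "seven", "years", "ago")
subset_words <- sample_words(words, sample_size = 3L)
all_words <- sample_words(words)
```

Checking only a sample keeps the dictionary lookup cheap on long texts, at the cost of some noise in the resulting score.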

Value

A vector of numeric values between 0 and 1.

Examples

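# A paragraph with deliberate OCR-style errors ("Fourr", "sleven", "tlhe")
paragraph <- "Fourr score and sleven years ago our fathers brought
  forth on this continent, a new nation, conceived in Liberty,
  and dedicated to tlhe proposition that all men are created equal."

# Ratio of dictionary words to total words in the paragraph
ocr_dictionary(paragraph)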

lmullen/ocrquality documentation built on May 21, 2019, 7:35 a.m.