Find Optical Character Recognition Errors and Corrections

batch_get_gold              Extract text from a set of PDFs with embedded text.
batch_get_ngrams            Get n-grams from one or more texts
batch_simulate_degrade_set  Simulate degraded PDFs from a set of input PDFs
char_ngrams                 Return a df with counts of all characters in df
check_embed                 Check if embedded text is not from OCR
create_dirs                 Create directories for 'ocrerrs'
degrade_blur                Degrade PDF quality by simulating blurred text
degrade_complex             Degrade PDF quality by combining degradation parameters
degrade_density             Degrade PDF quality by manipulating pixel density
degrade_fax                 Degrade PDF quality by simulating a fax
degrade_pages               Wrap degrade functions of split PDF files
degrade_rotate              Degrade PDF quality by simulating page rotation
find_errors                 Find errors from OCR by comparing to gold standard
find_min_dists              Find the minimum string edit for each bad word
get_bg_1grams               Get ngrams and counts for bad and gold strings
get_delta_words             Get words with different frequencies between bad and gold...
get_dist_mat                Return a matrix of optimal string alignment distances for...
get_embed_pages             Return a vector of pages with embedded text
get_file_base               Return the base name of a file
get_gold                    Extract text from a PDF with embedded text.
get_ngrams                  Get a set of n-grams from text
get_POS                     Return a table of parts of speech
hello                       Hello, World!
hunspell_errors             Use hunspell to find errors
label_delta_words           Label words as correct or errors
make_gold_path              Create a path to which 'gold standard' results are written
normalize_text              Clean EOL characters from
ocr_pages                   Wrap optical character recognition around a set of files
save_gold_text              Save the extracted text as a .rda
simulate_degrade_set        Simulate degraded PDFs from an input PDF
split_pdf                   Split a PDF into multiple pages
summarize_gold              Summarize the text from a gold-standard PDF
tess_ocr                    Perform optical character recognition with tesseract
write_gold_text             Write the extracted text to file
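
The index mentions a matrix of optimal string alignment distances and a minimum string edit for each bad word. As a rough illustration of that idea only, here is a minimal, language-neutral sketch in Python. It is not the package's R code: the names osa_distance, dist_matrix, and nearest_gold are hypothetical, and the actual signatures and return types of get_dist_mat and find_min_dists are not shown in this index.

```python
# Illustrative sketch only: NOT the 'ocrerrs' implementation.
# All function names here are hypothetical.

def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions,
    substitutions, and transpositions of adjacent characters
    (each substring may be edited at most once)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ie" <-> "ei".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def dist_matrix(bad_words, gold_words):
    """Rows are OCR ('bad') words, columns are gold-standard words."""
    return [[osa_distance(b, g) for g in gold_words] for b in bad_words]


def nearest_gold(bad_words, gold_words):
    """Map each bad word to its closest gold word and that distance."""
    result = {}
    for bad, row in zip(bad_words, dist_matrix(bad_words, gold_words)):
        best = min(range(len(gold_words)), key=row.__getitem__)
        result[bad] = (gold_words[best], row[best])
    return result


if __name__ == "__main__":
    print(nearest_gold(["tlie", "recieve"], ["the", "receive", "there"]))
```

Ties are broken by taking the first gold word with the smallest distance; a real pipeline might instead keep all candidates or weight them by word frequency.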
