Extract Text from Text- and Image-based PDFs

asciimostly       An example tesseract config file
cat_pages         Concatenate OCR'd pages to a single file
check_embed       Check if text embed is not from OCR
check_pdf         Check if file is a PDF
convert_to_imgs   Convert a file (PDF) to per-page images (PNG)
get_sorted_files  Return a list of correctly sorted imgs for Tesseract OCR
load_text         Load text extracted from a pdf to a list
make_main_dirs    Create main directories expected by 'pdftext'
ocr_pages         Perform optical character recognition on PNGs
ocr_pdf           Perform optical character recognition on a PDF
pdftext           pdftext: A package to extract text from PDFs
pdf_to_txt        Extract text from a pdf and write to a txt file
run_unpaper       Run 'unpaper' to fix rotation angles
save_imgs         Save the images directory from options()$pdftext.wkdir
save_pages        Save the pages directory from tempdir
save_txts         Save the text directory from tempdir
set_tess_conf     Set a custom 'tesseract' config for OCR
set_wkdir         Set the option for the working directory
test_embed        Test if a PDF has embedded text
