Man pages for jacob-ogre/pdftext
Extract Text from Text- and Image-based PDFs

asciimostly	An example tesseract config file
cat_pages	Concatenate OCR'd pages to a single file
check_embed	Check that embedded text is not from OCR
check_pdf	Check if file is a PDF
convert_to_imgs	Convert a file (PDF) to per-page images (PNG)
get_sorted_files	Return a list of correctly sorted images for Tesseract OCR
load_text	Load text extracted from a PDF into a list
make_main_dirs	Create main directories expected by 'pdftext'
ocr_pages	Perform optical character recognition on PNGs
ocr_pdf	Perform optical character recognition on a PDF
pdftext	pdftext: A package to extract text from PDFs
pdf_to_txt	Extract text from a PDF and write it to a txt file
run_unpaper	Run 'unpaper' to fix rotation angles
save_imgs	Save the images directory from options()$pdftext.wkdir
save_pages	Save the pages directory from tempdir
save_txts	Save the text directory from tempdir
set_tess_conf	Set a custom 'tesseract' config for OCR
set_wkdir	Set the option for the working directory
test_embed	Test if a PDF has embedded text
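
Two of the steps listed above, checking that a file is a PDF and putting per-page images in page order before OCR, can be illustrated in base R. This is only a rough sketch: looks_like_pdf() and sort_by_page_number() are hypothetical names, not functions from 'pdftext', and the index does not say how check_pdf or get_sorted_files are actually implemented. The sorting helper assumes the only digits in each file name are the page number.

```r
# Hypothetical helper (not the package's check_pdf): report whether a file
# starts with the "%PDF-" signature that PDF files begin with.
looks_like_pdf <- function(path) {
  if (!file.exists(path)) return(FALSE)
  header <- readBin(path, what = "raw", n = 5L)
  identical(header, charToRaw("%PDF-"))
}

# Hypothetical helper (not the package's get_sorted_files): order page images
# by the number in the file name, so "page_10.png" sorts after "page_2.png".
# Assumes the only digits in each file name are the page number.
sort_by_page_number <- function(files) {
  page_numbers <- as.integer(gsub("\\D", "", basename(files)))
  files[order(page_numbers)]
}
```

Plain alphabetical sorting would place "page_10.png" before "page_2.png", which is why the sketch orders on the extracted number instead of the file name itself.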