asciimostly | An example tesseract config file |
cat_pages | Concatenate OCR'd pages to a single file |
check_embed | Check whether embedded text is not from OCR |
check_pdf | Check if file is a PDF |
convert_to_imgs | Convert a file (PDF) to per-page images (PNG) |
get_sorted_files | Return a list of correctly sorted imgs for Tesseract OCR |
load_text | Load text extracted from a PDF to a list |
make_main_dirs | Create main directories expected by 'pdftext' |
ocr_pages | Perform optical character recognition on PNGs |
ocr_pdf | Perform optical character recognition on a PDF |
pdftext | pdftext: A package to extract text from PDFs |
pdf_to_txt | Extract text from a PDF and write to a txt file |
run_unpaper | Run 'unpaper' to fix rotation angles |
save_imgs | Save the images directory from options()$pdftext.wkdir |
save_pages | Save the pages directory from tempdir |
save_txts | Save the text directory from tempdir |
set_tess_conf | Set a custom 'tesseract' config for OCR |
set_wkdir | Set the option for the working directory |
test_embed | Test if a PDF has embedded text |