get_sorted_files: Return a correctly sorted list of image files for Tesseract OCR


Description

The list of PNGs must be in page order, but list.files returns its results sorted alphabetically, as strings. A document with more than 10 pages therefore comes back as 'file-0.png file-1.png file-10.png file-2.png ...', and concatenating the OCR'd TXT files in that order would put the pages out of sequence. This function returns the files in the correct page order for OCR, and the same order is then used to concatenate the OCR'd pages.
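The ordering problem can be illustrated with a minimal sketch. This is not the package's implementation: `sort_by_page_number` is a hypothetical helper, and it assumes each file name ends in a page number immediately before the extension (as in `doc1-10.png`).

```r
# Hypothetical helper for illustration only; get_sorted_files may work differently.
# Assumes every file name ends with its page number just before the extension.
sort_by_page_number <- function(files) {
  # Strip the directory and the extension, e.g. "OCR_tmp/doc1-10.png" -> "doc1-10"
  stems <- tools::file_path_sans_ext(basename(files))
  # Remove everything up to the last non-digit, leaving the trailing page number
  pages <- as.integer(sub(".*[^0-9]", "", stems))
  # Reorder the original names by numeric page number
  files[order(pages)]
}

# The order a plain string sort produces: page 10 lands before page 2
naive <- c("doc1-0.png", "doc1-1.png", "doc1-10.png", "doc1-2.png")

sorted <- sort_by_page_number(naive)
print(sorted)
# "doc1-0.png"  "doc1-1.png"  "doc1-2.png"  "doc1-10.png"
```

Sorting on the extracted integer rather than on the file name is what keeps page 10 after page 2; zero-padding the page numbers when the files are written would be another way to get the same effect.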

Usage

get_sorted_files(path, ext)

Arguments

path

Path to the directory containing the page files to sort (the PNG images passed to Tesseract, or the TXT files it produces)

ext

The extension of the files (e.g., png) to match

Value

A vector of the files matching ext, sorted in page order

Examples

## Not run: 
get_sorted_files("OCR_tmp/doc1/", "png")

# Returns a vector of files such as c(doc1-0.png, doc1-1.png, doc1-2.png,
# ..., doc1-10.png) rather than c(doc1-0.png, doc1-1.png, doc1-10.png,
# doc1-2.png, ...)

## End(Not run)

jacob-ogre/pdftext documentation built on May 18, 2019, 8:01 a.m.