NEWS

Releases will be numbered with the following semantic versioning format:

major.minor.patch

And constructed with the following guidelines:

Breaking backward compatibility bumps the major (and resets the minor and patch)
New additions without breaking backward compatibility bump the minor (and reset the patch)
Bug fixes and misc changes bump the patch
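
The bumping rules above can be sketched in a few lines of base R. `bump_version` is a hypothetical helper written for illustration, not part of textreadr, and the version strings are arbitrary examples rather than actual release numbers.

```r
# Hypothetical helper: bump one component of a "major.minor.patch" string
bump_version <- function(version, part = c("patch", "minor", "major")) {
  part <- match.arg(part)
  v <- as.integer(strsplit(version, ".", fixed = TRUE)[[1]])
  stopifnot(length(v) == 3L, !anyNA(v))
  names(v) <- c("major", "minor", "patch")
  v <- switch(part,
    # breaking change: bump major, reset minor and patch
    major = c(v[["major"]] + 1L, 0L, 0L),
    # backward-compatible addition: bump minor, reset patch
    minor = c(v[["major"]], v[["minor"]] + 1L, 0L),
    # bug fix or misc change: bump patch only
    patch = c(v[["major"]], v[["minor"]], v[["patch"]] + 1L)
  )
  paste(v, collapse = ".")
}

bump_version("1.2.3", "major")  # "2.0.0"
bump_version("1.2.3", "minor")  # "1.3.0"
bump_version("1.2.3", "patch")  # "1.2.4"
```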

BUG FIXES

read_docx would return a single word as two separate words if different characters within the word had different styling (pseudocode example: '<w:p><bold>h</bold>ello world</w:p>' returned 'h ello world').
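
A toy illustration of this failure mode, assuming a simplified XML fragment in which differently styled characters sit in separate runs inside one paragraph. This is not textreadr's implementation: base R regular expressions stand in for a real XML parser, and `extract_paragraphs` and its `sep` argument are hypothetical. Joining the text pieces of a paragraph with a space reproduces the bug; joining them with an empty string gives the intended result.

```r
# Toy fragment: the word "hello" is split across two runs because
# the first letter is bold
xml <- paste0(
  "<w:p>",
  "<w:r><w:rPr><w:b/></w:rPr><w:t>h</w:t></w:r>",
  "<w:r><w:t>ello world</w:t></w:r>",
  "</w:p>"
)

# Hypothetical helper: one string per paragraph, built from its w:t pieces
extract_paragraphs <- function(xml, sep = "") {
  paragraphs <- regmatches(xml, gregexpr("<w:p>.*?</w:p>", xml, perl = TRUE))[[1]]
  vapply(paragraphs, function(p) {
    runs <- regmatches(p, gregexpr("(?<=<w:t>).*?(?=</w:t>)", p, perl = TRUE))[[1]]
    paste(runs, collapse = sep)
  }, character(1), USE.NAMES = FALSE)
}

extract_paragraphs(xml, sep = " ")  # "h ello world" (the reported bug)
extract_paragraphs(xml, sep = "")   # "hello world"
```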

NEW FEATURES

BUG FIXES

read_pdf threw an error when ocr = TRUE but the tesseract package was unavailable. This has been fixed.
read_xxx functions failed when a URL was provided for the path. This behavior has been corrected. Thanks to Brent Brewington for the spot in issue #18.
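
One common way to support URLs in a reader that expects a local path is to download the file to a temporary location first. The sketch below shows that pattern under stated assumptions: `is_url` and `localize_path` are hypothetical names, the URL is a placeholder, and textreadr's own handling may differ.

```r
# Hypothetical helper: does `path` look like a remote location?
is_url <- function(path) {
  grepl("^(https?|ftp)://", path, ignore.case = TRUE)
}

# Hypothetical helper: return a local file path, downloading first
# when `path` is a URL
localize_path <- function(path) {
  if (!is_url(path)) return(path)
  # drop any query string or fragment before reading the file extension
  ext <- tools::file_ext(sub("[?#].*$", "", path))
  dest <- tempfile(fileext = if (nzchar(ext)) paste0(".", ext) else "")
  utils::download.file(path, dest, mode = "wb", quiet = TRUE)
  dest
}

is_url("https://example.com/report.docx")  # TRUE
is_url("data/report.docx")                 # FALSE
localize_path("data/report.docx")          # returned unchanged, no download
```

Downloading in binary mode (`mode = "wb"`) matters for formats such as .docx and .pdf, which are not plain text.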

NEW FEATURES

un_zip & un_tar added as convenience functions (wrappers for ?utils::unzip & ?utils::untar) to make the functions more pipe-able.
read_pptx added to read in .pptx files.
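
To illustrate what "pipe-able" can mean for the un_zip and un_tar wrappers above: utils::untar returns a status code invisibly, which is of little use to the next step in a chain. A wrapper that returns the extraction directory instead can be fed straight into the next call. The sketch below is an assumption about the general design, not textreadr's code; `untar_to` is a hypothetical name and un_tar's actual signature and return value may differ.

```r
# Hypothetical pipe-friendly wrapper: extract, then return the directory
untar_to <- function(tarfile, exdir = tempfile("untar_"), ...) {
  dir.create(exdir, recursive = TRUE, showWarnings = FALSE)
  utils::untar(tarfile, exdir = exdir, ...)  # extra arguments go to untar
  exdir  # returning the location lets the next step pick it up
}

# Usage (hypothetical archive name):
# files <- list.files(untar_to("transcripts.tar.gz"), recursive = TRUE)
```

Because the archive is the first argument and the result is a path, the call also composes with a pipe operator.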

MINOR FEATURES

Basic read_xml functionality added and made part of read_document.
Looping utilities loop_counter, base_name, and try_limit added for use inside of loops. They make loop reporting and error handling easier and more readable.
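
For readers unfamiliar with this kind of loop helper, the sketch below shows the general retry-with-reporting pattern in base R. It is not textreadr's try_limit, whose signature is not given here; `retry`, `max_tries`, and `flaky` are hypothetical names used only for illustration.

```r
# Hypothetical helper: call `f` until it succeeds or `max_tries` is reached,
# reporting each failed attempt along the way
retry <- function(f, max_tries = 3L) {
  for (i in seq_len(max_tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message(sprintf("attempt %d of %d failed: %s",
                    i, max_tries, conditionMessage(result)))
  }
  stop("all ", max_tries, " attempts failed")
}

# A function that fails twice before succeeding
calls <- 0L
flaky <- function() {
  calls <<- calls + 1L
  if (calls < 3L) stop("temporary failure")
  "ok"
}

retry(flaky)  # "ok", after two reported failures
```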

IMPROVEMENTS

read_docx would return non-text formatting information. Issue #19 provides a demonstration of this issue. This behavior has been corrected to grab only text (w:t) tags within paragraph (w:p) tags.

NEW FEATURES

peek picks up a strings.left argument to align strings to the left. This is the default because this is a text reading package that deals primarily with strings.
read_pdf picks up an ocr argument to extract text from image-based .pdf files, a task that requires optical character recognition (OCR). The tesseract package provides the back-end for processing these types of .pdfs.
browse added to open files and directories.
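
Because the OCR back-end described above is an optional package, code that requests OCR typically checks that the package is installed before using it. A minimal sketch of such a guard, assuming a warn-and-continue policy; `can_ocr` and its `backend` argument are hypothetical and do not reflect how read_pdf is written.

```r
# Hypothetical guard: TRUE only when OCR was requested AND the back-end
# package is installed; otherwise warn and fall back to FALSE
can_ocr <- function(ocr = TRUE, backend = "tesseract") {
  if (!isTRUE(ocr)) return(FALSE)
  if (!requireNamespace(backend, quietly = TRUE)) {
    warning(sprintf("'%s' is not installed; continuing without OCR", backend),
            call. = FALSE)
    return(FALSE)
  }
  TRUE
}

can_ocr(ocr = FALSE)  # FALSE, no package check needed
```

Using requireNamespace rather than library keeps the back-end optional: the check returns FALSE instead of raising an error when the package is missing.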

BUG FIXES

NEW FEATURES

read_document picks up explicit skip, remove.empty, and trim arguments like the other read_ functions.
read_rtf added to the document forms that can be parsed. This relies on the striprtf package as a back-end. read_document and read_transcript pick up the ability to read rich text format as well.

MINOR FEATURES

as_transcript added for coercion of internal strings to transcript. This function adds the ability to call out the person variable via a regex. For example, one may split after a leading string in all caps.
read_dir and read_dir_transcript pick up an ignore.case argument for pattern. pattern becomes more powerful because matching was moved out of the dir command and into a grep call.
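
The all-caps split mentioned for as_transcript can be illustrated with a plain regular expression. This is a sketch of the idea in base R, not a call to as_transcript itself; the sample lines, the pattern, and the column names are assumptions made for the example.

```r
# Toy transcript lines: speaker name in all caps, then the dialogue
lines <- c("JOHN DOE I think we should start.", "MARY Agreed, let's go.")

# Leading run of all-caps words (2+ letters each) = person; the rest = dialogue.
# Requiring 2+ letters keeps a dialogue-initial "I" out of the name.
pattern <- "^((?:[A-Z]{2,}\\s)+)(.*)$"

transcript <- data.frame(
  Person   = trimws(sub(pattern, "\\1", lines, perl = TRUE)),
  Dialogue = sub(pattern, "\\2", lines, perl = TRUE),
  stringsAsFactors = FALSE
)

transcript$Person    # "JOHN DOE" "MARY"
transcript$Dialogue  # "I think we should start." "Agreed, let's go."
```

A pattern like this misassigns dialogue that itself begins with an all-caps word of two or more letters, so real transcripts may need a stricter delimiter.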

BUG FIXES

The README.md called for ex_ functions from qdapRegex, which were only available in the dev version of qdapRegex. They are now in the CRAN version, so the README works for users.

NEW FEATURES

read_html added for reading in the text from the body of .html documents. read_document inherits this ability as well.

MINOR FEATURES

The low level read functions all now have consistent arguments: skip, remove.empty, & trim to make their use more interoperable.
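
A rough sketch of what these three shared arguments might do to a character vector of lines. The semantics here are assumed from the argument names (skip drops leading lines, trim strips surrounding whitespace, remove.empty drops empty elements) and are not taken from textreadr's source; `clean_lines` is a hypothetical helper.

```r
# Hypothetical post-processing step with assumed argument semantics
clean_lines <- function(x, skip = 0, remove.empty = TRUE, trim = TRUE) {
  if (skip > 0) x <- utils::tail(x, -skip)  # drop the first `skip` lines
  if (trim) x <- trimws(x)                  # strip leading/trailing whitespace
  if (remove.empty) x <- x[nzchar(x)]       # drop empty elements
  x
}

raw <- c("Header line", "  first paragraph  ", "", "second paragraph")
clean_lines(raw, skip = 1)
# returns c("first paragraph", "second paragraph")
```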

IMPROVEMENTS

textreadr no longer uses the antiword program directly; instead, the R antiword package is called for read_doc. This makes installation across operating systems more standardized.

CHANGES

The logo has been moved to tools to conform to CRAN standards.
read_doc's argument format is now FALSE by default rather than TRUE to be consistent with the other read functions.
read_docx no longer uses the XML package but now uses xml2 as suggested by Jeroen Ooms (see issue #7).