Releases will be numbered with the following semantic versioning format:
<major>.<minor>.<patch>
And constructed with the following guidelines:
BUG FIXES
read_docx would return the same word as two separate words if different characters within the word had different styling (pseudocode example: '<w:p><bold>h</bold>ello world</w:p>' returned 'h ello world').
NEW FEATURES
read_odt added to read in .odt files.
BUG FIXES
read_pdf threw an error when ocr = TRUE but the tesseract package was unavailable. This has been fixed.
read_xxx functions failed when a URL was provided for the path. This behavior has been corrected. Thanks to Brent Brewington for spotting this in issue #18.
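The URL handling mentioned above can be sketched roughly as follows. This is a minimal illustration in base R, not the package's actual implementation: the helper names is_url and localize_path are hypothetical, and the set of recognized URL schemes is an assumption.

```r
# Hypothetical helpers for illustration; not part of the package's API.

# TRUE when `path` looks like a remote resource rather than a local file.
is_url <- function(path) {
  grepl("^(https?|ftp)://", path, ignore.case = TRUE)
}

# Return a local file path. URLs are downloaded to a temporary file that
# keeps the original file extension; local paths pass through unchanged.
localize_path <- function(path) {
  if (!is_url(path)) return(path)
  # Drop any query string or fragment before looking for an extension.
  ext <- tools::file_ext(sub("[?#].*$", "", path))
  dest <- tempfile(fileext = if (nzchar(ext)) paste0(".", ext) else "")
  utils::download.file(path, dest, mode = "wb", quiet = TRUE)
  dest
}
```

A reader function could call localize_path on its path argument first and then proceed exactly as it would for a local file.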
NEW FEATURES
un_zip & un_tar added as convenience functions (wrappers for ?utils::unzip & ?utils::untar) to make the functions more pipe-able.
read_pptx added to read in .pptx files.
MINOR FEATURES
read_xml basic functionality added and made part of read_document.
Looping utilities loop_counter, base_name, and try_limit added for use inside of loops. These make loop reporting and error handling easier and more readable.
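To give a sense of the kind of error handling such utilities simplify, here is a minimal base R sketch of a retry helper. The function retry and its arguments are hypothetical and are not the signature of try_limit, which may work differently.

```r
# Hypothetical helper; the real try_limit may differ in name and arguments.
# Call `f()` up to `max_tries` times and return the first successful result.
retry <- function(f, max_tries = 3) {
  for (i in seq_len(max_tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message(sprintf("attempt %d of %d failed: %s",
                    i, max_tries, conditionMessage(result)))
  }
  stop("all ", max_tries, " attempts failed", call. = FALSE)
}

# Example: a function that fails twice before succeeding.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("temporary failure")
  "ok"
}
out <- retry(flaky, max_tries = 5)
```

Here retry reports each failed attempt through message and only raises an error once every attempt has failed, which keeps the body of a loop free of nested tryCatch calls.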
IMPROVEMENTS
read_docx would return non-text formatting information. Issue #19 provides a demonstration of this issue. This behavior has been corrected to grab text (w:t) tags within paragraphs (w:p).
NEW FEATURES
peek picks up a strings.left argument to align strings to the left. This is the default because this is a text reading package that deals primarily with strings.
read_pdf picks up an ocr argument to properly handle image-based .pdf files. Extracting the text from these files requires optical character recognition (OCR). The tesseract package provides the back-end for processing these types of .pdfs.
browse added to open files and directories.
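Because the ocr argument of read_pdf depends on an optional back-end, a reader function has to decide what to do when that back-end is missing. The sketch below shows one possible guard in base R. The helper use_ocr is hypothetical, and falling back with a warning is an assumption rather than the package's documented behavior.

```r
# Hypothetical helper; illustrates guarding an optional dependency.
# Returns TRUE only when OCR was requested and the named package is installed.
use_ocr <- function(ocr = TRUE, pkg = "tesseract") {
  if (!ocr) return(FALSE)
  if (!requireNamespace(pkg, quietly = TRUE)) {
    warning("'", pkg, "' is not installed; continuing without OCR.",
            call. = FALSE)
    return(FALSE)
  }
  TRUE
}
```

Checking with requireNamespace before touching the back-end lets the function degrade to plain text extraction instead of stopping with an error.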
BUG FIXES
read_dir did not handle errored read-ins correctly, resulting in an R error.
NEW FEATURES
read_document picks up explicit skip, remove.empty, and trim arguments like the other read_ functions.
read_rtf added to the document forms that can be parsed. This relies on the striprtf package as a back-end. read_document and read_transcript pick up the ability to read rich text format as well.
MINOR FEATURES
as_transcript added for coercion of internal strings to transcript. This function adds the ability to call out the person variable via a regex. For example, one may split after all caps as the leading string.
read_dir and read_dir_transcript pick up an ignore.case argument for pattern. Pattern becomes more powerful in that it was moved outside of the dir command via a grep call.
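The idea of filtering file names with grep after listing them, rather than inside the dir call, can be sketched as below. This is a base R illustration under assumptions: list_matching is a hypothetical name, and the package's own code may be organized differently.

```r
# Sketch only: list every file first, then filter the names with grep so
# that grep options such as ignore.case are available.
list_matching <- function(path, pattern = NULL, ignore.case = FALSE) {
  files <- dir(path, full.names = TRUE)
  if (is.null(pattern)) return(files)
  files[grep(pattern, basename(files), ignore.case = ignore.case)]
}

# Example with a throwaway directory.
d <- tempfile("docs")
dir.create(d)
file.create(file.path(d, c("Notes.TXT", "report.txt", "image.png")))
basename(list_matching(d, pattern = "\\.txt$", ignore.case = TRUE))
```

With ignore.case = TRUE both Notes.TXT and report.txt are returned; without it only report.txt matches.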
BUG FIXES
ex_ functions from qdapRegex. This was the dev version of qdapRegex. This is now the CRAN version and now works for users.
NEW FEATURES
read_html added for reading in the text from the body of .html documents. read_document inherits this ability as well.
MINOR FEATURES
skip, remove.empty, & trim to make their use more interoperable.
IMPROVEMENTS
read_doc. This makes installation across operating systems more standardized.
CHANGES
The logo has been moved to tools to conform to CRAN standards.
read_doc's argument format is now FALSE by default rather than TRUE to be consistent with the other read functions.
read_docx no longer uses the XML package but now uses xml2 as suggested by Jeroen Ooms (see issue #7).
NEW FEATURES
read_dir_transcript added to complement read_dir; it is aimed at a directory of transcripts.
This package is a collection of convenience tools for reading text documents into R.