knitr::opts_chunk$set(collapse = TRUE, comment = "##")
# Load readtext package
library(readtext)
readtext also handles multiple files and file types using, for instance, a "glob" expression, files from a URL, or an archive file (.zip, .tar, .tar.gz, .tar.bz). Usually, you do not have to specify the format of the files explicitly: readtext infers it from the file extension.
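To make the glob idea concrete, here is a minimal base-R sketch of how a pattern such as `*.txt` expands to a set of files. It uses `Sys.glob()` on a throwaway directory; the directory and file names are invented for illustration, and the sketch makes no claim about how readtext expands patterns internally.

```r
# Create a throwaway directory with three hypothetical files
demo_dir <- tempfile("glob_demo_")
dir.create(demo_dir)
invisible(file.create(file.path(demo_dir, c("a.txt", "b.txt", "c.csv"))))

# "*" matches any run of characters within a file name
txt_files <- Sys.glob(file.path(demo_dir, "*.txt"))
basename(txt_files)

# A broader pattern picks up every file, regardless of type
all_files <- Sys.glob(file.path(demo_dir, "*"))
basename(all_files)
```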
The readtext package comes with a data directory called extdata that contains examples of all files listed above. In the vignette, we use this data directory.
# Get the data directory from readtext
DATA_DIR <- system.file("extdata/", package = "readtext")
The extdata directory contains several subfolders that include different text files. In the following examples, we load one or more files stored in each of these folders. The paste0 command is used to concatenate the extdata folder from the readtext package with the subfolders. When reading in your own text files, you will need to specify your own data directory.
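As a small illustration of the path construction described above, the sketch below builds the same glob path with `paste0()` and with base R's `file.path()`. The directory `/path/to/extdata` is a placeholder, not a real location; substitute the directory that holds your own files.

```r
# Placeholder base directory (hypothetical; replace with your own)
base_dir <- "/path/to/extdata"

# Concatenate directory and subfolder by hand, as in this vignette
path_paste <- paste0(base_dir, "/txt/UDHR/*")

# file.path() inserts the "/" separators for you
path_fp <- file.path(base_dir, "txt", "UDHR", "*")

path_paste
identical(path_paste, path_fp)
```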
The folder "txt" contains a subfolder named UDHR with .txt files of the Universal Declaration of Human Rights in 13 languages.
# Read in all files from a folder
readtext(paste0(DATA_DIR, "/txt/UDHR/*"))
We can specify document-level metadata (docvars) based on the file names or on a separate data.frame. Below we take the docvars from the filenames (docvarsfrom = "filenames") and set the names for each variable (docvarnames = c("unit", "context", "year", "language", "party")). The argument dvsep = "_" specifies the separator (a regular expression character string) used in the filenames to delimit the docvar elements.
# Manifestos with docvars from filenames
readtext(paste0(DATA_DIR, "/txt/EU_manifestos/*.txt"),
         docvarsfrom = "filenames",
         docvarnames = c("unit", "context", "year", "language", "party"),
         dvsep = "_",
         encoding = "ISO-8859-1")
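To show what taking docvars from filenames amounts to, here is a base-R sketch that splits file names on a separator and names the resulting pieces. The two file names are hypothetical, chosen only to fit the five-part pattern above, and the code illustrates the idea rather than readtext's actual implementation.

```r
# Hypothetical file names following unit_context_year_language_party
filenames <- c("EU_euro_2004_de_PSE.txt", "EU_euro_2004_en_PPE.txt")

# Drop the extension, then split each name on the separator
# (strsplit() treats "_" as a regular expression, like dvsep)
stems <- sub("\\.txt$", "", filenames)
parts <- strsplit(stems, "_")

# Bind the pieces into one row per file and name the columns
docvars <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(docvars) <- c("unit", "context", "year", "language", "party")
docvars$doc_id <- filenames
docvars
```

Note that every piece is a character string at this point; a column such as year would need `as.integer()` before numeric use in this sketch.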
readtext can also recurse through subdirectories. In our example, the folder txt/movie_reviews contains two subfolders (including one called pos). We can load all texts included in both folders.
# Recurse through subdirectories
readtext(paste0(DATA_DIR, "/txt/movie_reviews/*"))
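For readers who want to see what recursing through subdirectories involves, the following base-R sketch collects text files from two nested folders into a data.frame with doc_id and text columns. The folder names `sub_a` and `sub_b` and the file contents are made up for the example; this is an illustration of the idea, not readtext's own code.

```r
# Build a throwaway directory tree with two hypothetical subfolders
demo_dir <- tempfile("reviews_demo_")
dir.create(file.path(demo_dir, "sub_a"), recursive = TRUE)
dir.create(file.path(demo_dir, "sub_b"))
writeLines("A short first text.", file.path(demo_dir, "sub_a", "doc1.txt"))
writeLines("A short second text.", file.path(demo_dir, "sub_b", "doc2.txt"))

# recursive = TRUE descends into every subfolder
found <- list.files(demo_dir, pattern = "\\.txt$",
                    recursive = TRUE, full.names = TRUE)

# Read each file into a single string, one row per document
texts <- vapply(found, function(f) paste(readLines(f), collapse = "\n"),
                character(1))
reviews <- data.frame(doc_id = basename(found), text = unname(texts),
                      stringsAsFactors = FALSE)
reviews
```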
Read in comma-separated values (.csv files) that contain textual data. We specify the texts variable in our .csv file as the text_field. This is the column that contains the actual text. The other columns of the original .csv file (such as FirstName) are by default treated as document-level variables.
# Read in comma-separated values
readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"), text_field = "texts")
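The split between the text column and the document-level variables can be sketched in base R as follows. The CSV contents, the column names `Year` and `Speaker`, and the doc_id naming are assumptions made up for this illustration; readtext's own output may label documents differently.

```r
# Write a small hypothetical CSV file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "texts,Year,Speaker",
  "\"Fellow citizens, thank you.\",1901,Smith",
  "\"My friends, we begin again.\",1905,Jones"
), csv_path)

raw <- read.csv(csv_path, stringsAsFactors = FALSE)
text_field <- "texts"

# The chosen column becomes the text; all remaining columns become docvars
result <- data.frame(
  doc_id = paste0(basename(csv_path), ".", seq_len(nrow(raw))),
  text = raw[[text_field]],
  raw[setdiff(names(raw), text_field)],
  stringsAsFactors = FALSE
)
names(result)
```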
The same procedure applies to tab-separated values.
# Read in tab-separated values
readtext(paste0(DATA_DIR, "/tsv/dailsample.tsv"), text_field = "speech")
You can also read .json data. Again, you need to specify the text_field.
## Read in JSON data
readtext(paste0(DATA_DIR, "/json/inaugural_sample.json"), text_field = "texts")
readtext can also read in and convert .pdf files.
In the example below, we load all .pdf files stored in the UDHR folder and take the docvars from the filenames. We call the document-level variables document and language, and specify the delimiter (dvsep).
## Read in Universal Declaration of Human Rights pdf files
(rt_pdf <- readtext(paste0(DATA_DIR, "/pdf/UDHR/*.pdf"),
                    docvarsfrom = "filenames",
                    docvarnames = c("document", "language"),
                    dvsep = "_"))
Microsoft Word formatted files are converted through the package antiword for older .doc files, and using XML for newer .docx files.
## Read in Word data (.docx)
readtext(paste0(DATA_DIR, "/word/*.docx"))
You can also read in text directly from a URL.
# Note: Example required: which URL should we use?
Finally, it is possible to include text from archives.
# Note: Archive file required. The only zip archive included in readtext has
# different encodings and is difficult to import (see section 4.2).
readtext was originally developed in early versions of the quanteda package for the quantitative analysis of textual data. It was spawned from the textfile() function from that package, and now lives exclusively in readtext. Because quanteda's corpus constructor recognizes the data.frame format returned by readtext(), it can construct a corpus directly from a readtext object, preserving all docvars and other metadata.
You can easily construct a corpus from a readtext object.
# Load quanteda, which provides corpus()
library(quanteda)

# Read in comma-separated values with readtext
rt_csv <- readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"), text_field = "texts")

# Create quanteda corpus
corpus_csv <- corpus(rt_csv)
summary(corpus_csv, 5)
When a document contains page numbers, they are imported as well. If you want to remove them, you can use a regular expression. We strongly recommend using the stringi package. For the most common regular expressions you can look at this cheatsheet.
You first need to check in the original file in which format the page numbers occur (e.g., "1", "-1-", "page 1" etc.). We can make use of the fact that page numbers are almost always preceded and followed by a linebreak (\n). After loading the text with readtext, you can replace the page numbers.
# Load stringi package
library(stringi)
In the first example, the page numbers have the format "page X".
# Make some text with page numbers
sample_text_a <- "The quick brown fox named Seamus jumps over the lazy dog also named Seamus,
page 1
with the newspaper from a boy named quick Seamus, in his mouth.
page 2
The quicker brown fox jumped over 2 lazy dogs."

sample_text_a

# Remove "page" and respective digit
sample_text_a2 <- unlist(stri_split_fixed(sample_text_a, '\n'), use.names = FALSE)
sample_text_a2 <- stri_replace_all_regex(sample_text_a2, "page \\d*", "")
sample_text_a2 <- stri_trim_both(sample_text_a2)
sample_text_a2 <- sample_text_a2[sample_text_a2 != '']
stri_paste(sample_text_a2, collapse = '\n')
In the second example we remove page numbers which have the format "- X -".
# Make some text with page numbers in the format "- X -"
sample_text_b <- "The quick brown fox named Seamus
- 1 -
jumps over the lazy dog also named Seamus, with
- 2 -
the newspaper from a boy named quick Seamus, in his mouth.
- 33 -
The quicker brown fox jumped over 2 lazy dogs."

sample_text_b

# Remove the dashes and the digits between them
sample_text_b2 <- unlist(stri_split_fixed(sample_text_b, '\n'), use.names = FALSE)
sample_text_b2 <- stri_replace_all_regex(sample_text_b2, "[-] \\d* [-]", "")
sample_text_b2 <- stri_trim_both(sample_text_b2)
sample_text_b2 <- sample_text_b2[sample_text_b2 != '']
stri_paste(sample_text_b2, collapse = '\n')
Such stringi functions can also be applied to readtext objects.
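As a rough sketch of that idea, the example below cleans the text column of a data.frame that stands in for a readtext object. It assumes the object exposes its documents in a column named text alongside doc_id; the two documents are invented, and a plain data.frame is used so the snippet runs without any files.

```r
library(stringi)

# Stand-in for a readtext object: a data.frame with doc_id and text columns
docs <- data.frame(
  doc_id = c("doc1.txt", "doc2.txt"),
  text = c("First line.\npage 1\nSecond line.",
           "Only one line here."),
  stringsAsFactors = FALSE
)

# (?m) lets ^ match at the start of every line, so only lines that consist
# of "page" plus digits are removed, together with their line break
docs$text <- stri_replace_all_regex(docs$text, "(?m)^page \\d+\\n", "")
docs$text
```

Because the pattern is anchored to the start of a line, a sentence that merely mentions a page number elsewhere in the line is left untouched.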
Sometimes files of the same type have different encodings. If the encoding of a file is included in the file name, we can extract this information and import the texts correctly.
# Create a temporary directory to extract the .zip file
FILEDIR <- tempdir()

# Unzip file
unzip(system.file("extdata", "data_files_encodedtexts.zip", package = "readtext"),
      exdir = FILEDIR)
Here, we will get the encoding from the filenames themselves.
# Get encoding from filename
filenames <- list.files(FILEDIR, "^(Indian|UDHR_).*\\.txt$")
head(filenames)

# Strip the extension (the dot is escaped so it matches literally)
filenames <- gsub("\\.txt$", "", filenames)
parts <- strsplit(filenames, "_")
fileencodings <- sapply(parts, "[", 3)
head(fileencodings)

# Check whether certain file encodings are not supported
notAvailableIndex <- which(!(fileencodings %in% iconvlist()))
fileencodings[notAvailableIndex]
If we read the text files without specifying the encoding, we get erroneously formatted text. To avoid this, we set the encoding argument to the character vector fileencodings created above.
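To see why the declared encoding matters, here is a small base-R sketch that converts a Latin-1 byte sequence to UTF-8 with `iconv()`. The word "café" is just an example string; the point is that the same bytes only decode correctly when the source encoding is named.

```r
# Bytes for "café" in ISO-8859-1: the é is the single byte 0xE9
latin1_bytes <- as.raw(c(0x63, 0x61, 0x66, 0xe9))
raw_text <- rawToChar(latin1_bytes)

# Declaring the correct source encoding yields a valid UTF-8 string
converted <- iconv(raw_text, from = "ISO-8859-1", to = "UTF-8")
converted
nchar(converted)

# In UTF-8 the é occupies two bytes (c3 a9) instead of one
charToRaw(converted)
```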
We can also add docvars based on the filenames.
txts <- readtext(paste0(DATA_DIR, "/data_files_encodedtexts.zip"),
                 encoding = fileencodings,
                 docvarsfrom = "filenames",
                 docvarnames = c("document", "language", "input_encoding"))
print(txts, n = 50)
From this file we can easily create a quanteda corpus object.
corpus_txts <- corpus(txts)
summary(corpus_txts, 5)