In lvaudor/textoteR: Format text corpora

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(textoteR)

Why use the textoteR package?

This package makes it easier to convert text corpora from one format to another. The available formats right now are:

IRaMuTeQ format (a single local .txt file with starred tags indicating distinct texts and metadata variables)
TXM format (as many local .txt files as there are texts in the corpus, as well as a metadata.csv file)
R tibble (i.e. a table within the R environment that will contain both the metadata variables and the text variable itself)

Example: from a TXM corpus

The package is provided with a small example of TXM corpus (multiple .txt files + one metadata.csv file): 9 famous fables of La Fontaine.

Here I get the path to the corpus according to where the package is installed:

path_to_txm_corpus=system.file("extdata/fables",
                        package="textoteR")
print(path_to_txm_corpus)

Here are the files in the directory:

list.files(path_to_txm_corpus)

Here is how you can convert this corpus into an R tibble:

txm_to_rtibble(from_dir=path_to_txm_corpus)

The format of the corpus can also be changed from TXM to IRaMuTeQ through:

txm_to_iramuteq(from_dir=path_to_txm_corpus,
                filename="fables_iramuteq.txt")

See the first 10 lines of the created file "fables_iramuteq.txt" (removed from local files right afterwards):

cat(readLines("fables_iramuteq.txt")[1:10], sep="\n")
file.remove("fables_iramuteq.txt")

Example: from an IRaMuTeQ corpus

The package is provided with a small example of IRaMuTeQ corpus (single .txt file with starred tags): 5 speeches pronounced by French President Macron during the COVID-19 crisis in 2020.

path_to_iramuteq_corpus=system.file("extdata", package="textoteR")

iramuteq_to_rtibble(from_dir=path_to_iramuteq_corpus,
                    filename="macron_covid.txt")

The format of the corpus can also be changed from IRaMuTeQ to TXM through:

iramuteq_to_txm(from_dir=path_to_iramuteq_corpus,
                filename="macron_covid.txt",
                to_dir="macron_covid_corpus")

See the content of directory "macron_covid_corpus" (removed from local files right afterwards), and the content of file txt1.txt :

list.files("macron_covid_corpus")
cat(readLines("macron_covid_corpus/txt1.txt"))
unlink("macron_covid_corpus",recursive=TRUE)

Example: from an R tibble

The package contains an R data tibble LVtweets, with tweets, that contains both metadata variables and text.

head(LVtweets)

Here is how you can export such data into an IRaMuTeQ or TXM format:

rtibble_to_txm(rtibble=LVtweets,
               to_dir="LVtweets_txm")
list.files("LVtweets_txm")

# remove directory:
unlink("LVtweets_txm", recursive=TRUE)

rtibble_to_iramuteq(rtibble=LVtweets,
                    filename="LVtweets_ira.txt")
# remove file: 
file.remove("LVtweets_ira.txt")

Note that tweets contain a certain number of special characters (e.g. emojis) and links that might cause TXM or IRaMuTeQ imports to fail. Such text data should probably be cleaned in R before conversion to TXM or IRaMuTeQ formats.

lvaudor/textoteR documentation built on April 5, 2025, 3:03 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com