knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(textoteR)

Why use the textoteR package?

This package makes it easier to convert text corpora from one format to another. The formats currently supported are TXM, IRaMuTeQ, and R tibbles.

Example: from a TXM corpus

The package ships with a small example TXM corpus (several .txt files plus one metadata.csv file): 9 famous fables by La Fontaine.

The path to this corpus depends on where the package is installed, so it is retrieved with system.file():

path_to_txm_corpus <- system.file("extdata/fables",
                                  package="textoteR")
print(path_to_txm_corpus)

Here are the files in the directory:

list.files(path_to_txm_corpus)

Here is how you can convert this corpus into an R tibble:

txm_to_rtibble(from_dir=path_to_txm_corpus)
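
To make the directory layout described above more concrete, here is a minimal base-R sketch of reading a TXM-style corpus into a data frame. It is not the implementation of txm_to_rtibble(); it assumes that metadata.csv has an `id` column matching the .txt file names without their extension, and the toy corpus, file names and column names below are invented for illustration.

```r
# Build a tiny, hypothetical TXM-style corpus in a temporary directory
corpus_dir <- file.path(tempdir(), "toy_txm_corpus")
dir.create(corpus_dir, showWarnings = FALSE)
writeLines("The crow sat in a tree.", file.path(corpus_dir, "fable1.txt"))
writeLines("The cicada sang all summer.", file.path(corpus_dir, "fable2.txt"))
write.csv(data.frame(id = c("fable1", "fable2"), book = c(1, 1)),
          file.path(corpus_dir, "metadata.csv"), row.names = FALSE)

# Read the metadata (assumption: an "id" column names each text file)
meta <- read.csv(file.path(corpus_dir, "metadata.csv"), stringsAsFactors = FALSE)

# Read every .txt file into a single string, keyed by its file name
txt_files <- list.files(corpus_dir, pattern = "\\.txt$", full.names = TRUE)
texts <- data.frame(
  id = sub("\\.txt$", "", basename(txt_files)),
  text = vapply(txt_files,
                function(f) paste(readLines(f, encoding = "UTF-8"), collapse = "\n"),
                character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

# One row per text: metadata variables plus the text itself
corpus <- merge(meta, texts, by = "id")
print(corpus)
```

The merge on `id` is what ties each text to its metadata, which is why the file names and the metadata identifiers have to agree.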

The corpus can also be converted from TXM to IRaMuTeQ format with:

txm_to_iramuteq(from_dir=path_to_txm_corpus,
                filename="fables_iramuteq.txt")

Here are the first 10 lines of the resulting file, "fables_iramuteq.txt" (the file is deleted right afterwards):

cat(readLines("fables_iramuteq.txt", n=10), sep="\n")
file.remove("fables_iramuteq.txt")
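
For readers unfamiliar with the IRaMuTeQ format, the following base-R sketch shows the general idea: each text is introduced by a line starting with four stars, followed by starred `*variable_value` tags, and then the text itself. This is only an illustration under simple assumptions (one row per text, metadata values without spaces); the function `to_iramuteq_lines()` and the toy data are hypothetical and are not part of textoteR.

```r
# Hypothetical toy data: one row per text, metadata columns plus a text column
toy <- data.frame(
  author = c("lafontaine", "lafontaine"),
  year   = c(1668, 1678),
  text   = c("The crow sat in a tree.", "The cicada sang all summer."),
  stringsAsFactors = FALSE
)

# Turn each row into a starred header line followed by the text
to_iramuteq_lines <- function(df, text_col = "text") {
  meta_cols <- setdiff(names(df), text_col)
  # One "*variable_value" tag per metadata column, for every row
  tag_parts <- lapply(meta_cols, function(m) paste0("*", m, "_", df[[m]]))
  tags <- do.call(paste, c(tag_parts, sep = " "))
  header <- paste("****", tags)
  # Interleave: header 1, text 1, header 2, text 2, ...
  as.vector(rbind(header, df[[text_col]]))
}

lines <- to_iramuteq_lines(toy)
cat(lines, sep = "\n")
```

The resulting character vector could then be written to a single .txt file with writeLines().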

Example: from an IRaMuTeQ corpus

The package ships with a small example IRaMuTeQ corpus (a single .txt file with starred tags): 5 speeches delivered by French President Macron during the COVID-19 crisis in 2020.

path_to_iramuteq_corpus <- system.file("extdata", package="textoteR")
iramuteq_to_rtibble(from_dir=path_to_iramuteq_corpus,
                    filename="macron_covid.txt")
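
As a rough illustration of what reading such a file involves, here is a base-R sketch that parses starred header lines into metadata columns and gathers the lines that follow each header into one text. It is not the code behind iramuteq_to_rtibble(); the sample lines and variable names are invented, and the sketch assumes every header is followed by at least one line of text and that all headers carry the same variables.

```r
# Hypothetical content of a small IRaMuTeQ file
lines <- c(
  "**** *speaker_a *year_2020",
  "First text.",
  "Still first.",
  "**** *speaker_b *year_2021",
  "Second text."
)

# Header lines start with four stars; each one opens a new text
is_header <- grepl("^\\*\\*\\*\\*", lines)
text_id <- cumsum(is_header)

# Paste together the body lines belonging to each text
texts <- tapply(lines[!is_header], text_id[!is_header], paste, collapse = " ")

# Split a header into named values: "*speaker_a" becomes c(speaker = "a")
parse_tags <- function(h) {
  tags <- strsplit(sub("^\\*{4}\\s*", "", h), "\\s+")[[1]]
  tags <- sub("^\\*", "", tags)
  setNames(sub("^[^_]*_", "", tags), sub("_.*$", "", tags))
}
meta <- do.call(rbind, lapply(lines[is_header], parse_tags))

# One row per text; metadata values are kept as character strings
corpus <- data.frame(meta, text = as.vector(texts), stringsAsFactors = FALSE)
print(corpus)
```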

The corpus can also be converted from IRaMuTeQ to TXM format with:

iramuteq_to_txm(from_dir=path_to_iramuteq_corpus,
                filename="macron_covid.txt",
                to_dir="macron_covid_corpus")

Here are the contents of the directory "macron_covid_corpus" (deleted right afterwards) and of the file txt1.txt:

list.files("macron_covid_corpus")
cat(readLines("macron_covid_corpus/txt1.txt"))
unlink("macron_covid_corpus",recursive=TRUE)

Example: from an R tibble

The package includes a tibble, LVtweets, which holds tweets together with their metadata variables.

head(LVtweets)

Here is how you can export such data to TXM or IRaMuTeQ format:

rtibble_to_txm(rtibble=LVtweets,
               to_dir="LVtweets_txm")
list.files("LVtweets_txm")

# remove directory:
unlink("LVtweets_txm", recursive=TRUE)

rtibble_to_iramuteq(rtibble=LVtweets,
                    filename="LVtweets_ira.txt")
# remove file:
file.remove("LVtweets_ira.txt")

Note that tweets often contain special characters (e.g. emojis) and links that may cause TXM or IRaMuTeQ imports to fail. Such text should probably be cleaned in R before conversion to either format.
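
One possible way to do such cleaning is sketched below in base R. The helper `clean_text()` is hypothetical (it is not provided by textoteR), and the choice of what to strip is an assumption: it removes http(s) links and characters outside the Basic Multilingual Plane (where most emojis live), then tidies whitespace. Accented letters are kept, but some symbols that sit inside the Basic Multilingual Plane would survive, so the rules may need adapting to the target software.

```r
# Hypothetical helper: strip links and most emojis from a character vector
clean_text <- function(x) {
  x <- enc2utf8(x)
  # Remove http(s) links
  x <- gsub("https?://\\S+", "", x, perl = TRUE)
  # Drop code points above U+FFFF (most emojis), one string at a time
  drop_astral <- function(s) {
    cp <- utf8ToInt(s)
    intToUtf8(cp[cp < 0x10000])
  }
  x <- vapply(x, drop_astral, character(1), USE.NAMES = FALSE)
  # Collapse runs of whitespace left behind and trim the ends
  x <- gsub("\\s+", " ", x, perl = TRUE)
  trimws(x)
}

# Invented example tweets
tweets <- c("Hello \U0001F600 https://t.co/abc world", "caf\u00e9 ouvert")
cleaned <- clean_text(tweets)
print(cleaned)
```

Applied to the text column of a tibble before calling the export functions, a helper of this kind would leave the metadata untouched and only alter the text.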



lvaudor/textoteR documentation built on April 5, 2025, 3:03 a.m.