importCorpusDlg: Import a corpus and process it
In RcmdrPlugin.temis: Graphical Integrated Text Mining Solution

Description

Import a corpus, process it and extract a document-term matrix.

Details

This dialog allows creating a tm corpus from various sources. Once the documents have been loaded, they are processed according to the chosen settings, and a document-term matrix is extracted.

The first source, “Directory containing plain text files”, creates one document for each .txt file found in the specified directory. The documents are named after the file they were loaded from. When choosing the directory where the .txt files can be found, please note that the file browser lists only directories, not files; the files will be loaded nevertheless.
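
As an illustration of the "one document per .txt file" rule, here is a minimal base R sketch. It is not the code used by the dialog: the helper name read_txt_dir and the demo files are hypothetical, and the tm package provides its own source for this purpose (DirSource).

```r
# Hypothetical helper (not part of RcmdrPlugin.temis): read every .txt file
# in a directory into a named character vector, one element per file.
read_txt_dir <- function(dir, encoding = "UTF-8") {
  files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  texts <- vapply(files, function(f) {
    paste(readLines(f, encoding = encoding, warn = FALSE), collapse = "\n")
  }, character(1))
  # Name each document after the file it was loaded from
  names(texts) <- basename(files)
  texts
}

# Usage with a throwaway directory
d <- tempfile("corpus_demo")
dir.create(d)
writeLines(c("First text.", "Second line."), file.path(d, "a.txt"))
writeLines("Another text.", file.path(d, "b.txt"))
writeLines("ignored", file.path(d, "notes.md"))  # not a .txt file, so skipped
docs <- read_txt_dir(d)
names(docs)
```

Only the two .txt files become documents; the .md file is left out because of the file name pattern.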

The second source, “Spreadsheet file”, creates one document for each row of a file containing tabular data, typically an Excel (.xls) or Open Document Spreadsheet (.ods), comma-separated values (.csv) or tab-separated values (.tsv, .txt, .dat) file. One column must be specified as containing the text of the document, while the remaining columns are added as variables describing each document. For the CSV format, “,” or “;” is used as separator, whichever is more frequent in the first 50 lines of the file.
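
The separator rule for CSV files can be sketched in base R as follows. This is an assumption-laden illustration, not the package's implementation: the helper name guess_csv_sep is hypothetical, ties are resolved in favour of “,” (the documentation does not say what happens on a tie), and separators inside quoted fields are counted like any other.

```r
# Hypothetical helper: guess the CSV separator by counting "," and ";"
# in the first n lines of the file.
guess_csv_sep <- function(path, n = 50) {
  lines <- readLines(path, n = n, warn = FALSE)
  count <- function(ch) {
    sum(lengths(regmatches(lines, gregexpr(ch, lines, fixed = TRUE))))
  }
  # Assumption: fall back to "," when both are equally frequent
  if (count(";") > count(",")) ";" else ","
}

# Usage: a semicolon-separated file whose text column contains a comma
f <- tempfile(fileext = ".csv")
writeLines(c("id;text;year", "1;Hello, world;2010", "2;Good bye;2011"), f)
sep <- guess_csv_sep(f)
dat <- read.csv(f, sep = sep, stringsAsFactors = FALSE)
```

Here the text column would be the one designated as containing the document, and id and year would become variables describing each document.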

The third, fourth and fifth sources, “Factiva XML or HTML file(s)”, “LexisNexis HTML file(s)” and “Europresse HTML file(s)”, load articles exported from the corresponding website in the XML or HTML formats (for Factiva, the former is recommended if you can choose it). Various meta-data variables describing the articles are automatically extracted. If the corpus is split into several .xml or .html files, you can put them in the same directory and select them together by holding the Ctrl key, which concatenates them into a single corpus. Please note that some articles from Factiva are known to contain invalid characters that trigger an error when loading. If this happens, try to identify the problematic article, for example by removing half of the documents and retrying until only one document is left in the corpus; then report the problem to Factiva Customer Service, or ask the maintainers of this package for help.

The sixth source, “Alceste file(s)”, loads texts and variables from a single file in the Alceste format, which uses asterisks to separate texts and code variables.

The original texts can optionally be split into smaller chunks, which will then be considered as the real unit (called ‘documents’) for all analyses. In order to get meaningful chunks, texts are only split into paragraphs. These are defined by the import filter: when importing a directory of text files, a new paragraph starts with a line break; when importing Factiva files, paragraphs are defined by the content provider itself, so they may vary in size (the heading is always a separate paragraph); splitting has no effect when importing from a spreadsheet file. A corpus variable called “Document” is created, which identifies the original text the chunk comes from.
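
For plain text files, the splitting rule can be pictured with the base R sketch below. The helper name split_into_chunks and the sample texts are hypothetical, and dropping empty paragraphs is an assumption of this sketch; only the “Document” variable name comes from the description above.

```r
# Hypothetical helper: split each text into paragraphs at line breaks and
# record which original text every chunk comes from.
split_into_chunks <- function(texts) {
  parts <- strsplit(texts, "\n+")
  # Assumption: paragraphs containing only whitespace are dropped
  parts <- lapply(parts, function(p) p[nzchar(trimws(p))])
  data.frame(
    Document = rep(names(texts), lengths(parts)),
    text = as.character(unlist(parts, use.names = FALSE)),
    stringsAsFactors = FALSE
  )
}

texts <- c(a.txt = "Heading\nFirst paragraph.\n\nSecond paragraph.",
           b.txt = "Only one paragraph.")
chunks <- split_into_chunks(texts)
chunks
```

Each row of chunks is then treated as a document in its own right, while the Document column keeps the link to the source text.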

For all sources, a data set called corpusVariables is created, with one row for each document in the corpus: it contains meta-data that could be extracted from the source, if any, and can be used to enter further meta-data about the corpus. This can also be done by importing an existing data set via the Data->Load data set or Data->Import data menus. Whichever way you choose, use the Text mining->Set corpus meta-data command afterwards to set or update the corpus's meta-data that will be used by later analyses (see setCorpusVariables).

The dialog also provides a few processing options that will most likely all be run in order to get a meaningful set of terms from a text corpus. Among them, stopword removal and stemming require you to select the language used in the corpus. If you tick “Edit stemming manually”, enabled processing steps will be applied to the terms before presenting you with a list of all words originally found in the corpus, together with their stemmed forms. Terms with an empty stemmed form will be excluded from the document-term matrix; the “Stopword” column is only presented as an indication and is not taken into account when deciding whether to keep a term.
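
The effect of an edited stemming table on the document-term matrix can be sketched in base R. Everything below is illustrative: the table, its column names and the toy counts are hypothetical and do not reproduce the dialog's internal data structures. The sketch assumes that words sharing a stemmed form are merged by summing their counts.

```r
# Hypothetical edited stemming table: original word, stemmed form, stopword flag
stems <- data.frame(
  Term     = c("mining", "mined", "texts", "text", "the", "xyz"),
  Stemmed  = c("mine",   "mine",  "text",  "text", "the", ""),
  Stopword = c(FALSE,    FALSE,   FALSE,   FALSE,  TRUE,  FALSE),
  stringsAsFactors = FALSE
)

# Toy document-term counts (rows = documents, columns = original words)
dtm <- matrix(c(1, 0, 2, 1, 3, 1,
                0, 2, 0, 1, 1, 0),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("doc1", "doc2"), stems$Term))

# Keep terms with a non-empty stemmed form; the Stopword column is ignored
keep <- nzchar(stems$Stemmed)

# Sum the columns of words that share the same stemmed form
stemmed_dtm <- t(rowsum(t(dtm[, keep, drop = FALSE]),
                        group = stems$Stemmed[keep]))
stemmed_dtm
```

Note that “the” survives even though it is flagged as a stopword, because only an empty stemmed form excludes a term; “xyz” is dropped for that reason.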

By default, the program tries to detect the encoding used by plain text (usually .txt) and comma/tab-separated values files (.csv, .tsv, .dat...). If the import fails or the imported texts contain strange characters, specify the encoding manually (a tooltip gives suggestions based on the selected language).
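
To see why a wrong encoding produces strange characters, the base R sketch below builds a Latin-1 string by hand and converts it to UTF-8 by naming its source encoding explicitly. It only illustrates the general mechanism and says nothing about how the dialog detects encodings; the variable names are hypothetical.

```r
# Bytes of "café" encoded as Latin-1 (0xE9 is "é" in that encoding)
raw_text <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

# Read as UTF-8, these bytes are invalid, which is what shows up as
# strange characters or import errors
validUTF8(raw_text)

# Declaring the source encoding explicitly yields a valid UTF-8 string
fixed <- iconv(raw_text, from = "latin1", to = "UTF-8")
validUTF8(fixed)
```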

Once the corpus has been imported, its document-term matrix is extracted.

References

Ingo Feinerer, Kurt Hornik, and David Meyer. Text mining infrastructure in R. Journal of Statistical Software, 25(5):1-54, March 2008. Available at https://www.jstatsoft.org/v25/i05.

Ingo Feinerer. An introduction to text mining in R. R News, 8(2):19-22, October 2008. Available at https://cran.r-project.org/doc/Rnews/Rnews_2008-2.pdf.
