TextData: Building textual and contextual tables (TextData)
In Xplortext: Statistical Analysis of Textual Data

TextData

R Documentation

Building textual and contextual tables (TextData)

Description

Creates a textual and contextual working-base (TextData format) from a source-base (data frame format).

Usage

TextData(base, var.text=NULL, var.agg=NULL, context.quali=NULL, context.quanti= NULL,
 selDoc="ALL", lower=TRUE, remov.number=TRUE,lminword=1, Fmin=Dmin,Dmin=1, Fmax=Inf,
 stop.word.tm=FALSE, idiom="en", stop.word.user=NULL, segment=FALSE,
 sep.weak="default",
 sep.strong="\u005B()\u00BF?./:\u00A1!=;{}\u005D\u2026", seg.nfreq=10, seg.nfreq2=10,
 seg.nfreq3=10, graph=FALSE)

Arguments

`base`	source data frame with at least one textual column
`var.text`	vector with index(es) or name(s) of the selected textual column(s) (by default NULL)
`var.agg`	index or name of the aggregation categorical variable (by default NULL)
`context.quali`	vector with index(es) or name(s) of the selected categorical variable(s) (by default NULL)
`context.quanti`	vector with index(es) or name(s) of the selected quantitative variable(s) (by default NULL)
`selDoc`	vector with index(es) or name(s) of the selected source-documents (rows of the source-base) (by default "ALL")
`lower`	if TRUE, the corpus is converted into lowercase (by default TRUE)
`remov.number`	if TRUE, numbers are removed (by default TRUE)
`lminword`	minimum length of a word to be selected (by default 1)
`Fmin`	minimum frequency of a word to be selected (by default Dmin)
`Dmin`	a word has to be used in at least Dmin source-documents to be selected (by default 1)
`Fmax`	maximum frequency of a word to be selected (by default Inf)
`stop.word.tm`	if TRUE, stoplist automatically provided in accordance with the idiom (by default FALSE)
`idiom`	declared idiom for the textual column(s) (by default English "en", see IETF language in package NLP)
`stop.word.user`	stoplist provided by the user
`segment`	if TRUE, the repeated segments are identified (by default FALSE)
`sep.weak`	string with the characters marking out the terms (by default punctuation characters, space and control). See details
`sep.strong`	string with the characters marking out the repeated segments (by default "[()??./:?!=+;-]\")
`seg.nfreq`	minimum frequency of a more-than-three-words-long repeated segment (by default 10)
`seg.nfreq2`	minimum frequency of a two-words-long repeated segment (by default 10)
`seg.nfreq3`	minimum frequency of a three-words-long repeated segment (by default 10)
`graph`	if TRUE, documents, words and repeated segments barcharts are displayed; use plot.TextData to use more options (by default FALSE)

Details

Each row of the source-base is considered as a source-document. TextData function builds the working-documents-by-words table, submitted to the analysis.

sep.weak contains the string with the characters marking out the terms (by default punctuation characters, space and control). Backslash or double backslash are used to start an escape sequence defining special characters. Each special character must by separated the symbol | (or) in sep.weak and sep.strong. The default is: ⁠ sep.weak = ("[%`:*$&#/^|<=>;'+@.,~?(){}|[[:space:]]| \u2014|\u002D|\u00A1|\u0021|\u00BF|\u00AB|\u00BB|\u2026|\u0022|\u005D|\u0097") ⁠ Some special characters can be introduced as unicode characters. Back slash (escape contol) is not allowed.

Information related to context.quanti and context.quali arguments:

If numeric, contextual variables can be included in both vectors. The function TextData converts the numeric variable into factor to include it in context.quali vector. This possibility is interesting in some cases. For example, when treating open-ended questions, we can be interested in computing the correlation between the contextual variable "Age" and the axes and, at the same time, to draw the trajectory of the different values of "Age" (year by year) on the CA maps.
In the case of one or several columns with textual data not selected in vector var.text, if the argument context.quali is equal to "ALL", these columns will be considered as categorical variables.

Non-aggregate table versus aggregate table.

If var.agg=NULL:

The work-documents are the non-empty-source-documents.
DocTerm: non-aggregate lexical table with:

as many rows as non-empty source-documents

as many columns as words are selected.
context$quali: data frame crossing the non-empty source-documents (rows) and the categorical contextual-variables (columns).
context$quanti: data frame crossing the non-empty source-documents (rows) and the quantitative contextual-variables (columns). Both contextual tables can be juxtaposed row-wise to DocTerm table.

If var.agg is NON-NULL:

The work-documents are aggregate-documents, issued from aggregating the source-documents depending on the categories of the aggregation variable; the aggregate-documents inherit the names of the corresponding categories.
DocTerm is an aggregate table with:

as many rows as as categories the aggregation variable has

as many columns as words are selected.
context$quali$qualitable: juxtaposes as many supplementary aggregate tables as categorical contextual variables. Each table has:

as many rows as categories the contextual categorical variable has

as many columns as selected words, i.e. as many columns as DocTerm has.
context$quali$qualivar: names of categories of the supplementary categorical variables.
context$quanti: data frame crossing the working aggregate-documents (rows) and the quantitative contextual-variables (columns). The value for an active aggregate-document is the mean-value of the source-documents belonging to this aggregate-document.

Value

A list including:

`summGen`	general summary
`summDoc`	document summary
`indexW`	index of words
`DocTerm`	working lexical table (non-aggregate or aggregate table depending on var.agg value); working-documents by words table in slam package compressed format
`context`	contextual variables if context.quali or context.quanti are non-NULL; the structure greatly differs in accordance with the nature of DocTerm table (non-aggregate/ aggregate), see details
`info`	information about the selection of words
`var.agg`	a one-column data frame with the values of the aggregation variable; NULL if non-aggregate analysis
`SourceTerm`	in the case of DocTerm being an aggregate analysis, the source-documents by words table is kept in this data structure, in slam package compressed format
`indexS`	working-documents by repeated-segments table, in slam package compressed format
`remov.docs`	vector with the names of the removed empty source-documents
`VCr`	Cramer's V coefficient of document x term matrix
`Inertia`	total inertia of document x term matrix

Author(s)

Ramón Alvarez-Esteban ramon.alvarez@unileon.es, Monica Bécue-Bertaut, Josep-Antón Sánchez-Espigares

References

Lebart, L., Salem, A., & Berry, L. (1998). Exploring textual data. (D. Kluwer, Ed.). \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1007/978-94-017-1525-6")}.

Examples


# Non aggregate analysis
data(open.question)
res.TD<-TextData(open.question, var.text=c(9,10), remov.number=TRUE, Fmin=10, Dmin=10,  
 stop.word.tm=TRUE, context.quali=c("Gender","Age_Group","Education"), context.quanti=c("Age"))

# Aggregate analysis and repeated segments
data(open.question)
res.TD<-TextData(open.question, var.text=c(9,10), var.agg="Gen_Age", remov.number=TRUE, 
 Fmin=10, Dmin=10, stop.word.tm=TRUE, context.quali=c("Gender","Age_Group","Education"), 
 context.quanti=c("Age"), segment=TRUE)

Xplortext documentation built on April 12, 2025, 2:13 a.m.