create_q_matrix: Elaborate a Document Frequency Matrix in either quanteda or...
In chriskirkhub/etea: This package enables the classification of unstructured textual data into a structured, segmented, temporal, document frequency matrix for use as input into predictive modelling systems such as neural networks or state-space models.


In text analysis, skip-grams can increase coverage without unduly increasing the size of the training data. In multi-state and decision modelling, it is often necessary to aggregate documents by state ID and to return a time series for use by neural networks, decision models and ensemble packages such as RTextTools. This function is a wrapper for the quanteda package: it builds a Document Frequency Matrix and, by default, returns it in tm package format for use by other functions. Its arguments are passed on to quanteda functionality, namely docvars, ngrams and skipgrams. The function can be used on its own but also acts as a feeder to classify_etea. It wraps, with permission, functions from the original quanteda package by Ken Benoit, Paul Nulty et al.
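To make the skip-gram idea concrete, here is a minimal base R sketch. It is not the quanteda or etea implementation: `skip_bigrams` is a hypothetical helper, restricted to bigrams, that pairs each token with the token `skip + 1` positions later, for every value in `skip`.

```r
# Hypothetical helper (not part of etea or quanteda): build bigrams whose two
# tokens are separated by exactly k intervening tokens, for each k in `skip`.
skip_bigrams <- function(tokens, skip = 0L, concatenator = "_") {
  n <- length(tokens)
  out <- character(0)
  for (k in skip) {
    gap <- k + 1L                        # distance between the paired positions
    if (n > gap) {
      left  <- tokens[seq_len(n - gap)]
      right <- tokens[seq_len(n - gap) + gap]
      out <- c(out, paste(left, right, sep = concatenator))
    }
  }
  out
}

tokens <- c("insurgents", "killed", "in", "ongoing", "fighting")
skip_bigrams(tokens, skip = 0L)    # adjacent bigrams only
skip_bigrams(tokens, skip = 0:2)   # adjacent pairs plus pairs skipping 1 or 2 tokens
```

With `skip = 0:2` the five-token sentence yields more features than the plain bigram call while the input text stays the same, which is the coverage gain the description refers to.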

create_q_matrix(textColumn, stateNum = NULL, timeNum = NULL,
  docvaragg = c("null", "state", "time", "statetime", "timestate"),
  use_stopwords = TRUE, stopwords_language = "english",
  add_stopwords = NULL, remove_stopwords = NULL, verbose = TRUE,
  toLower = FALSE, stem = FALSE, removeFeatures = TRUE,
  language = "english", valuetype = c("glob"), removeNumbers = TRUE,
  removePunct = TRUE, removeSeparators = TRUE, removeHyphens = TRUE,
  removeTwitter = FALSE, ngrams = 1L, skip = 0L, concatenator = "_",
  simplify = FALSE, useSentences = FALSE, convert_to_tm = TRUE, ...)

`textColumn`	character vector containing the text to be analysed; mandatory
`stateNum`	numeric vector containing identifiers for the condition or state in which each document or note was recorded, so that it is allocated correctly when more than one note or record falls in the same state; default NULL
`timeNum`	numeric vector containing an index that identifies when each document was recorded, giving a temporal record. This normalises progress and case note recording as progress through a system. Typically days or minutes after the system commenced; default NULL
`docvaragg`	specifies how documents are aggregated on docvars: by state ID only, by timestamp only, by state and timestamp, or by timestamp and state. The usage signature lists the values "null", "state", "time", "statetime" and "timestate"; the examples pass the short forms s and t (short forms: s, t, st, ts); default is no aggregation
`use_stopwords`	specifies whether stopwords are to be removed from the corpus (TRUE) or not removed, (FALSE). Users are reminded that system (language-specific) stopwords may need additions or removals to tailor for a specific need; default TRUE
`add_stopwords`	a character vector of words to be added to the stopwords vector (if any); default is NULL.
`remove_stopwords`	a character vector of words to be removed from the stopwords vector (if any); default is NULL.
`verbose`	to see useful progress information; default is TRUE
`toLower`	to convert all inbound text into lower case. Notably this will degrade the sentence splitting function if applied; default is FALSE; see: `tokenize`
`stem`	reduce word length to root; default is FALSE; see: `tokenize`
`removeFeatures`	remove particular features from inbound text as specified in a list; default is TRUE; see: `quanteda`
`language`	to define local language; default is "english" see: `quanteda`
`valuetype`	to define patterning; default is `glob`; see: `quanteda`
`removeNumbers`	remove individual numbers from inbound text, (note: numbers already aggregated with characters such as 1st or 2nd are unaffected); default is TRUE; see: `quanteda`
`removePunct`	remove punctuation from inbound text; default is TRUE; see: `quanteda`
`removeSeparators`	remove separators from inbound text; default is TRUE; see: `quanteda`
`removeHyphens`	remove hyphen characters from inbound text; default is TRUE; see: `quanteda`
`removeTwitter`	remove twitter api characters from inbound text; default is FALSE (as given in the usage signature); see: `quanteda`
`ngrams`	integer vector specifying the number of elements to be concatenated in each ngram; default is 1L; see: `ngrams`
`skip`	integer vector specifying the adjacency skip size for tokens forming the ngrams; default is 0L; see: `ngrams`
`concatenator`	character for combining words, default is `_`; see: `ngrams`
`simplify`	if TRUE, return a single character vector of tokens rather than a list with one element per text; default is FALSE; see: `tokenize`
`convert_to_tm`	logical specifying whether the matrix is returned in tm format (TRUE) or quanteda format (FALSE); default is TRUE
`...`	Extra arguments, not used
`choice`	language used to determine the content of the basic stopword list; default `english`. See `quanteda` for further information.
`termNum`	integer specifying the minimum frequency a word must have in the matrix
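The interaction of `use_stopwords`, `add_stopwords` and `remove_stopwords` can be sketched in base R as below. This is an illustration only: the toy `base_stopwords` vector stands in for a language-specific list, and the order in which the package applies additions and removals is an assumption.

```r
# Toy base list standing in for a language-specific stopword list (assumption).
base_stopwords   <- c("the", "a", "of", "not", "and")
add_stopwords    <- c("patient", "noted")   # domain words to treat as stopwords
remove_stopwords <- c("not")                # words to keep in the text after all

# Assumed order: add the extra words first, then drop the ones to be retained.
stopword_list <- setdiff(union(base_stopwords, add_stopwords), remove_stopwords)

tokens <- c("the", "patient", "did", "not", "attend", "and", "was", "noted")
kept <- tokens[!tokens %in% stopword_list]
kept
```

Keeping a negation such as "not" out of the stopword list is a typical reason to tailor the defaults for case notes.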

Either a quanteda-type Document Feature Matrix or a tm-type Document Term Matrix object (depending on `convert_to_tm`) containing word frequencies, optionally with a time-series index and a state ID identifier, as a data frame.
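As a rough picture of what aggregation by state produces, the sketch below sums a toy count matrix by a `stateNum` vector using base R's `rowsum`. The matrix values and feature names are made up for illustration, and whether the package sums counts or merges texts before counting is not stated here, so treat the summing step as an assumption.

```r
# Toy document-by-feature counts for five notes (illustrative values only).
counts <- matrix(
  c(1, 0, 2,
    0, 1, 1,
    3, 0, 0,
    0, 2, 1,
    1, 1, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("doc", 1:5), c("pain", "sleep", "walk"))
)
stateNum <- c(1, 1, 2, 3, 3)   # one state ID per document

# Sum the rows that share a state ID; the result has one row per state.
by_state <- rowsum(counts, group = stateNum)
by_state
```

The result has one row per state ID, with the state IDs as row names, in place of one row per document.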

Chris Kirk

Guthrie, D., B. Allison, W. Liu, and L. Guthrie. 2006. "A Closer Look at Skip-Gram Modelling."

# create dfm using character vector of CaseNotes and aggregated by state for use in a classifier
## LOAD ##
text_df <- read.csv("data/jtr_docs.csv", header = TRUE, sep = ";")
textColumn <- as.character(text_df$Notes)
## CREATE MATRIX ##
# stateNum needs one entry per document, so this call assumes the file holds five notes
create_q_matrix(textColumn, stateNum = c(1, 1, 2, 3, 3), verbose = TRUE, use_stopwords = TRUE, docvaragg = "s")

# create dfm using character vector of CaseNotes, states and datetimestamps for use as a time series for nnet or MARSS
## LOAD ##
text_df <- read.csv("data/militant_suffragette_extract.csv", header = TRUE, sep = ";")
textColumn <- as.character(text_df$Notes) # typically textual interview or clinical notes
stateNum <- as.numeric(text_df$stateID) # typically identification of parts of journey/episode
timeNum <- as.character(text_df$datetimestamp) # typically days since start of journey/episode
## CREATE MATRIX ##
q_tm_dfm <- create_q_matrix(textColumn, stateNum, timeNum, verbose = TRUE, use_stopwords = TRUE, docvaragg = "t")
# for MARSS
q_matrix <- data.matrix(q_tm_dfm) # MARSS requires a standard data matrix; note that timeNum is held as rownames
## CONVERT FOR MARSS ##
dat <- t(q_matrix) # transpose to MARSS form (features in rows, time points in columns)
colnames(dat) <- rownames(q_matrix) # set column names to timeNum from docvars (rownames)
## dat is now available as MARSS DATA ##