etea_features: Create a list of the Document Frequency Matrix according to...
In chriskirkhub/etea: This package enables the classification of unstructured textual data into a structured, segmented, temporal, document frequency matrix for use as input into predictive modelling systems such as neural networks or state-space models.

Description Usage Arguments Value Author(s) Examples

This function is a wrapper around the Document Frequency Matrix produced by the function create_q_matrix. It is typically used to obtain terms for addition to a lexicon. Terms to be added to a lexicon must first be categorised in a format that matches the example in the data directory. The parameters listed below match those of create_q_matrix.

etea_features(textColumn, stateNum = NULL, timeNum = NULL,
  docvaragg = "null", use_stopwords = TRUE,
  stopwords_language = "english", add_stopwords = NULL,
  remove_stopwords = NULL, verbose = TRUE, toLower = FALSE,
  stem = FALSE, keptFeatures = NULL, removeFeatures = TRUE,
  language = "english", valuetype = c("glob"), thesaurus = NULL,
  dictionary = NULL, removeNumbers = TRUE, removePunct = TRUE,
  removeSeparators = TRUE, removeHyphens = TRUE, removeTwitter = TRUE,
  ngrams = 1L, skip = 0L, concatenator = "_", simplify = FALSE,
  convert_to_tm = TRUE, termNum = 1, ...)

`textColumn`	character vector containing the text to be analysed; mandatory
`stateNum`	numeric vector of identifiers for the condition or state in which each document or note was recorded or written, so that it is correctly allocated when more than one note or record falls within a state; default NULL
`timeNum`	numeric vector containing an index that identifies when each document was recorded or noted, giving a temporal record. This normalises progress and case note recording as progress through a system. Typically days or minutes after the system commenced; default NULL
`docvaragg`	specifies how aggregation on docvars is to occur: `s` state ID only, `t` timestamp only, `st` state and timestamp, or `ts` timestamp and state; default is "null", as shown in the usage above
`use_stopwords`	specifies whether stopwords are to be removed from the corpus (TRUE) or not removed (FALSE). Users are reminded that system (language-specific) stopwords may need additions or removals to suit a specific need; default TRUE
`add_stopwords`	a character vector of words to be added to the stopwords vector (if any); default is NULL.
`remove_stopwords`	a character vector of words to be removed from the stopwords vector (if any); default is NULL.
`verbose`	to see useful progress information; default is TRUE
`toLower`	to convert all inbound text into lower case. Notably this will degrade the sentence splitting function if applied; default is FALSE; see: `tokenize`
`stem`	reduce word length to root; default is FALSE; see: `tokenize`
`removeFeatures`	remove particular features from inbound text as specified in a list; default is TRUE; see: `quanteda`
`language`	to define local language; default is "english"; see: `quanteda`
`valuetype`	to define patterning; default is `glob`; see: `quanteda`
`removeNumbers`	remove individual numbers from inbound text, (note: numbers already aggregated with characters such as 1st or 2nd are unaffected); default is TRUE; see: `quanteda`
`removePunct`	remove punctuation from inbound text; default is TRUE; see: `quanteda`
`removeSeparators`	remove separators from inbound text; default is TRUE; see: `quanteda`
`removeHyphens`	remove hyphen characters from inbound text; default is TRUE; see: `quanteda`
`removeTwitter`	remove twitter api characters from inbound text; default is TRUE; see: `quanteda`
`ngrams`	integer vector specifying the number of elements to be concatenated in each ngram; default is 1L; see: `ngrams`
`skip`	integer vector specifying the adjacency skip size for tokens forming the ngrams; default is 0L; see: `ngrams`
`concatenator`	character for combining words, default is `_`; see: `ngrams`
`simplify`	if TRUE, return a character vector of tokens rather than a list with one element per text; default is FALSE; see: `tokenize`
`convert_to_tm`	logical specifying whether the matrix is to be returned in tm format (TRUE) or quanteda format (FALSE); default is TRUE
`termNum`	integer specifying the minimum frequency with which a word must be found in the matrix; default is 1
`...`	Extra arguments, not used
`choice`	choice of language that determines the content of the basic stopword list; default `english`. See `quanteda` for further information.
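
The way `add_stopwords`, `remove_stopwords` and `termNum` interact can be illustrated with a minimal base R sketch. This is not the package's implementation: the stopword list, the tokens and the variable names `base_stopwords`, `stopword_list`, `kept` and `features` are all hypothetical, and it assumes `termNum` acts as an "at least this many occurrences" threshold.

```r
# Hypothetical base stopword list; the package takes its own list from quanteda.
base_stopwords   <- c("the", "a", "of", "and", "for")
add_stopwords    <- c("mrs")   # extra words to treat as stopwords
remove_stopwords <- c("for")   # words to keep even though they are in the base list

# Add the extra words first, then drop the words that should not be stopwords.
stopword_list <- setdiff(union(base_stopwords, add_stopwords), remove_stopwords)

# Hypothetical, already tokenised and lower-cased text.
tokens <- c("the", "votes", "for", "women", "and",
            "votes", "for", "mrs", "pankhurst")
kept <- tokens[!tokens %in% stopword_list]

# Count each remaining term and keep those found at least termNum times
# (assumed meaning of termNum for this sketch).
termNum <- 2
term_counts <- table(kept)
features <- names(term_counts)[as.vector(term_counts) >= termNum]
features
```

With `termNum = 1` every remaining term would be kept, which corresponds to the default shown in the usage.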

a word vector.

Chris Kirk

## LOAD ##
text_df <- read.csv("data/militant_suffragette_extract.csv", header = TRUE, sep = ";")
# typically textual interview or clinical notes
textColumn <- as.character(text_df$Notes)
## CREATE FEATURES LIST ##
features_vec <- etea_features(textColumn, termNum = 1, verbose = TRUE, use_stopwords = TRUE)
features_vec
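
The `ngrams`, `skip` and `concatenator` arguments described above can be pictured with a small base R sketch. The helper `make_ngrams` is hypothetical and is not part of etea or quanteda; it only illustrates how tokens might be joined into n-grams with a given skip size and concatenator, assuming a single value for each argument.

```r
# Hypothetical helper: join every run of n tokens, taking each next token
# (skip + 1) positions after the previous one, using the concatenator.
make_ngrams <- function(tokens, n = 1L, skip = 0L, concatenator = "_") {
  step <- skip + 1L
  span <- (n - 1L) * step              # distance from first to last token used
  if (length(tokens) <= span) return(character(0))
  starts <- seq_len(length(tokens) - span)
  vapply(starts, function(i) {
    paste(tokens[i + (0:(n - 1L)) * step], collapse = concatenator)
  }, character(1))
}

tokens <- c("votes", "for", "women", "now")

# Adjacent bigrams (skip = 0).
bigrams <- make_ngrams(tokens, n = 2L)

# Bigrams that skip one token between their two elements.
skip_bigrams <- make_ngrams(tokens, n = 2L, skip = 1L)

bigrams
skip_bigrams
```

With the defaults `n = 1L` and `skip = 0L` the tokens are returned unchanged, mirroring the defaults `ngrams = 1L` and `skip = 0L` in the usage.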