Description Usage Arguments Value Author(s) Examples
This function is a wrapper for the Document Frequency Matrix provided by the function create_q_matrix, so that it can typically be used to add terms to a lexicon. The terms to be added to a lexicon must first be categorised in a format that matches the example in the data directory. The parameters listed below match those of the function create_q_matrix.
etea_features(textColumn, stateNum = NULL, timeNum = NULL,
  docvaragg = "null", use_stopwords = TRUE,
  stopwords_language = "english", add_stopwords = NULL,
  remove_stopwords = NULL, verbose = TRUE, toLower = FALSE,
  stem = FALSE, keptFeatures = NULL, removeFeatures = TRUE,
  language = "english", valuetype = c("glob"), thesaurus = NULL,
  dictionary = NULL, removeNumbers = TRUE, removePunct = TRUE,
  removeSeparators = TRUE, removeHyphens = TRUE, removeTwitter = TRUE,
  ngrams = 1L, skip = 0L, concatenator = "_", simplify = FALSE,
  convert_to_tm = TRUE, termNum = 1, ...)
textColumn |
character vector containing the text to be analysed; mandatory |
stateNum |
numeric vector of identifiers for the condition or state in which each document or note was recorded or written, so that it is correctly allocated when more than one note or record falls within a state; default NULL |
timeNum |
numeric vector containing an index that identifies when the document was recorded or noted, giving a temporal record. This normalises progress and case-note recording as progress through a system. Typically days or minutes after the system commenced; default NULL |
docvaragg |
specifies how aggregation on docvars is to occur: s (state ID only), t (timestamp only), st (state and timestamp) or ts (timestamp and state); default NULL (given as the string "null" in the usage signature); options s, t, st, ts |
use_stopwords |
specifies whether stopwords are to be removed from the corpus (TRUE) or retained (FALSE). Users are reminded that system (language-specific) stopwords may need additions or removals to suit a specific need; default TRUE |
add_stopwords |
a character vector of words to be added to the stopwords vector (if any); default is NULL. |
remove_stopwords |
a character vector of words to be removed from the stopwords vector (if any); default is NULL. |
verbose |
whether to print useful progress information; default is TRUE |
toLower |
convert all inbound text to lower case. Note that this will degrade the sentence-splitting function if applied; default is FALSE |
stem |
reduce words to their root (stem); default is FALSE |
removeFeatures |
remove particular features from inbound text as specified in a list; default is TRUE |
language |
defines the local language; default is "english" |
valuetype |
defines the pattern-matching type; default is "glob" |
removeNumbers |
remove individual numbers from inbound text (note: numbers already combined with characters, such as 1st or 2nd, are unaffected); default is TRUE |
removePunct |
remove punctuation from inbound text; default is TRUE |
removeSeparators |
remove separators from inbound text; default is TRUE |
removeHyphens |
remove hyphen characters from inbound text; default is TRUE |
removeTwitter |
remove Twitter API characters from inbound text; default is TRUE |
ngrams |
integer vector specifying the number of elements to be concatenated in each ngram; default is 1L |
skip |
integer vector specifying the adjacency skip size for tokens forming the ngrams; default is 0L |
concatenator |
character used to combine words into ngrams; default is "_" |
simplify |
if TRUE, return a single character vector of tokens rather than a list with one element per text; default is FALSE |
convert_to_tm |
logical specifying whether the matrix is returned in tm (TRUE) or quanteda (FALSE) format; default is TRUE |
termNum |
integer specifying the minimum number of times a word must have been found in the matrix; default is 1 |
... |
Extra arguments, not used |
stopwords_language |
choice of language to determine the content of the basic stopword list; default is "english" |
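The interaction between the base stopword list, add_stopwords and remove_stopwords can be pictured with a small base R sketch. This is only an illustration of the described behaviour, not the package's implementation: the helper adjust_stopwords and the short base list are hypothetical, and the order of operations (additions first, then removals) is an assumption.

```r
# Hypothetical stand-in for a language-specific base stopword list.
base_stopwords <- c("the", "and", "of", "not", "no")

# Hypothetical helper (not part of the package): applies additions first,
# then removals, so a removal wins if a word appears in both vectors.
adjust_stopwords <- function(base, add = NULL, remove = NULL) {
  setdiff(union(base, add), remove)
}

# Add domain words that carry no meaning here; keep negations as content.
stop_vec <- adjust_stopwords(base_stopwords,
                             add = c("patient", "noted"),
                             remove = c("not", "no"))

# Drop the adjusted stopwords from a tokenised note.
tokens <- c("the", "patient", "did", "not", "attend")
kept <- tokens[!tokens %in% stop_vec]
print(kept)
```

Keeping negations such as "not" out of the stopword list is a common reason to use remove_stopwords with clinical or interview notes, where they change the meaning of a sentence.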
a word vector.
Chris Kirk
## LOAD ##
text_df <- read.csv("data/militant_suffragette_extract.csv", header = TRUE, sep = ";")
textColumn <- as.character(text_df$Notes) # typically textual interview or clinical notes
## CREATE FEATURES LIST ##
features_vec <- etea_features(textColumn, termNum = 1, verbose = TRUE, use_stopwords = TRUE)
features_vec
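The effect of termNum as a minimum-frequency threshold can be sketched in base R without the package. The notes below are made-up sample text, min_freq is a hypothetical stand-in for termNum, and the counting shown here is an assumption about the behaviour described in the argument list rather than the function's actual code.

```r
# Made-up sample notes standing in for textColumn.
notes <- c("votes for women", "deeds not words", "votes not promises")

# Naive whitespace tokenisation; the real function offers many more options.
tokens <- unlist(strsplit(tolower(notes), "\\s+"))

# Count how often each term occurs across all notes.
term_counts <- table(tokens)

# Keep only terms found at least min_freq times (the role played by termNum).
min_freq <- 2
frequent_terms <- names(term_counts[term_counts >= min_freq])
print(frequent_terms)
```

With a threshold of 1, as in the example call, every term that occurs at all would be kept.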