Usage Arguments Value Author(s) Examples
classify_etea(textColumn, stateNum = NULL, timeNum = NULL,
docvaragg = "null", use_stopwords = TRUE,
stopwords_language = "english", add_stopwords = NULL,
remove_stopwords = NULL, verbose = TRUE, toLower = FALSE,
stem = FALSE, keptFeatures = NULL, removeFeatures = TRUE,
language = "english", valuetype = c("glob"), thesaurus = NULL,
dictionary = NULL, removeNumbers = TRUE, removePunct = TRUE,
removeSeparators = TRUE, removeHyphens = TRUE, removeTwitter = TRUE,
ngrams = 1L, skip = 0L, concatenator = "_", simplify = FALSE,
useSentences = TRUE, convert_to_tm = TRUE, pstrong = 1, pweak = 0.5,
termNum = 1, ...)
textColumn |
character vector containing the text to be analysed; mandatory |
stateNum |
numeric vector containing identifiers for the condition or state in which each document or note was recorded, so that it is correctly allocated when more than one note or record falls in the same state; default NULL |
timeNum |
numeric vector containing an index that identifies when each document was recorded, giving a temporal record. This normalises progress and case note recording as progress through a system. Typically days or minutes after the system commenced; default NULL |
docvaragg |
specifies how aggregation on docvars is to occur: 'state' (state ID only), 'time' (timestamp only), 'statetime' (state and timestamp) or 'timestate' (timestamp and state); default "null", as shown in the usage |
use_stopwords |
specifies whether stopwords are to be removed from the corpus (TRUE) or not (FALSE). Note that the system (language-specific) stopword list may need additions or removals to suit a specific need; default TRUE |
add_stopwords |
a character vector of words to be added to the stopwords vector (if any); default is NULL. |
remove_stopwords |
a character vector of words to be removed from the stopwords vector (if any); default is NULL. |
verbose |
logical; whether to print progress information; default is TRUE |
toLower |
to convert all inbound text into lower case. Notably this will degrade the sentence splitting function if applied; default is FALSE; see: |
stem |
reduce word length to root; default is FALSE; see: |
removeFeatures |
remove particular features from inbound text as specified in a list; default is TRUE; see: |
language |
to define local language; default is "english" see: |
valuetype |
to define the pattern-matching type; default is "glob" |
removeNumbers |
remove individual numbers from inbound text, (note: numbers already aggregated with characters such as 1st or 2nd are unaffected); default is TRUE; see: |
removePunct |
remove punctuation from inbound text; default is TRUE; see: |
removeSeparators |
remove separators from inbound text; default is TRUE; see: |
removeHyphens |
remove hyphen characters from inbound text; default is TRUE; see: |
removeTwitter |
remove twitter api characters from inbound text; default is TRUE; see: |
ngrams |
integer vector specifying the number of elements to be concatenated in each ngram; default is 1L; see: |
skip |
integer vector specifying the adjacency skip size for tokens forming the ngrams; default is 0L |
concatenator |
character for combining words; default is "_" |
simplify |
if TRUE, return a single character vector of tokens rather than a list with one element per text; default is FALSE; see: |
convert_to_tm |
logical specifying whether the matrix is to be returned in tm format (TRUE) or quanteda format (FALSE); default is TRUE |
termNum |
integer specifying the minimum frequency a word is to have been found in the matrix |
... |
Extra arguments, not used |
choice |
of language to determine the content of the basic stopword list; default |
prior |
integer specifying the prior bayesian weighting value |
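The interaction of use_stopwords, add_stopwords and remove_stopwords can be pictured with a small base R sketch. This is not the package's internal code: the base list below is a hypothetical stand-in for the language-specific stopwords, and the order of operations (add first, then remove) is an assumption made for illustration.

```r
# Hypothetical base list standing in for the language-specific stopwords
base_stopwords <- c("the", "a", "and", "of", "not")

add_stopwords    <- c("patient", "pt")  # illustrative domain-specific additions
remove_stopwords <- c("not")            # keep negation in the analysed text

# Assumed order: add first, then remove, so a word listed in both is not a stopword
final_stopwords <- setdiff(union(base_stopwords, add_stopwords), remove_stopwords)

tokens <- c("the", "patient", "did", "not", "attend")
kept <- tokens[!tokens %in% final_stopwords]
print(kept)
```

Here "not" survives because it was taken out of the stopword list, while "patient" is dropped because it was added to it.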
a scored, classified matrix of document/note words as categories to provide input into other analytical systems.
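As a rough picture of what aggregation by state could look like, the sketch below sums per-note category scores within each state using base R. The scores, the category names cat_a and cat_b, and the use of summation are all assumptions for illustration; the function's actual aggregation rule is not stated on this page.

```r
# Invented per-note scores for two hypothetical categories
scores <- matrix(c(1.0, 0.0,
                   0.5, 0.5,
                   0.0, 1.0,
                   1.0, 1.0,
                   0.0, 0.5),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(NULL, c("cat_a", "cat_b")))

# One state identifier per note, as in the stateNum argument
stateNum <- c(1, 1, 2, 3, 3)

# Sum the rows that share a state (assumed aggregation rule)
by_state <- rowsum(scores, group = stateNum)
print(by_state)
```

The result has one row per distinct state, with the state identifiers as row names.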
Chris Kirk
# create a scored, classified matrix of CaseNotes, aggregated by state, for use in further modelling
## LOAD ##
text_df <- read.csv("data/jtr_docs.csv",header=TRUE, sep=";")
textColumn<-as.character(text_df$Notes)
## CLASSIFY ##
classify_etea(textColumn, stateNum=c(1,1,2,3,3), verbose=TRUE, use_stopwords=TRUE, docvaragg="state")
# create dfm using character vector of CaseNotes, states and datetimestamps for use as a time series for a neural network or MARSS
## LOAD ##
text_df <- read.csv("data/militant_suffragette_extract.csv",header=TRUE, sep=";")
textColumn<-as.character(text_df$Notes) # typically textual interview or clinical notes
statecol<-as.numeric(text_df$stateNum) # typically identification of parts of journey/episode
timecol<-as.numeric(text_df$timeNum) # typically days since start of journey/episode; timeNum is documented as numeric
## CLASSIFY ##
etea_df_time <- classify_etea(textColumn, statecol, timecol, verbose=TRUE, use_stopwords=TRUE, docvaragg="time")
# for MARSS
## CONVERT FOR MARSS ##
etea_matrix <- data.matrix(etea_df_time) # MARSS requires a standard data matrix; note that timeNum values are the rownames
dat <- t(etea_matrix) # transpose to MARSS form
colnames(dat) <- rownames(etea_matrix) # set column names to timeNum from docvars (rownames)
## dat is now available as MARSS DATA ##
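The reshaping step above can be tried without the package or the CSV files. The sketch below uses a toy data frame as a stand-in for the object returned with docvaragg="time"; its column names and values are invented, and it only shows the base R conversion, not a MARSS fit.

```r
# Toy stand-in for the classified output; names and values are invented
etea_df_toy <- data.frame(
  anxious = c(1.0, 0.5, 0.0),
  calm    = c(0.0, 0.5, 1.0),
  row.names = c("1", "5", "9")  # timeNum values carried as row names
)

etea_matrix <- data.matrix(etea_df_toy)  # plain numeric matrix
dat <- t(etea_matrix)                    # rows = series, columns = time steps

print(dim(dat))
print(colnames(dat))
```

Because t() carries the dimnames across, the time index already appears as the column names of dat after the transpose.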