TextMiner-class | R Documentation |
Reference Class TextMiner is a combination of properties and methods for running various text-mining algorithms.
text
vector of character containing raw text documents; the contents of the argument text_vect passed to the class constructor.
n.text
a single integer indicating the number of text documents.
stop.words
vector of character specifying words to be removed from the text corpus.
dictionary
data.frame of two columns containing words to be replaced with their synonyms. Words in the first column are replaced by the words in the second.
data$words
vector of character containing the words in all the documents.
time
vector of POSIXlt containing the time at which each text document was issued. Must be given to the class constructor as an argument.
settings
list of various parameters containing the settings of the text miner object:
data$DTM
a matrix of numerics representing the document-term matrix of the text corpus. Use method get.dtm() to access the matrix.
data$W.tfidf
matrix of numerics containing the tf-idf weights of the document-term matrix. Use method get.tfidf() to access the matrix.
data$W.bin
matrix of numerics containing the binary weights of the document-term matrix.
data$D.bin
matrix of numerics, Nd x Nd, where Nd is the number of documents. Contains the distances of all pairs of documents based on the binary metric.
data$D.freq.euc
matrix of the same size as data$D.bin. Contains the distances of all pairs of documents based on the euclidean metric, using raw frequencies as word weights.
data$D.freq.max
matrix of the same size as data$D.freq.euc containing the distances of documents based on the maximum metric, using raw frequencies as word weights.
data$D.freq.man
matrix of the same size as data$D.freq.euc containing the distances of documents based on the manhattan metric, using raw frequencies as word weights.
data$D.freq.can
matrix of the same size as data$D.freq.euc containing the distances of documents based on the canberra metric, using raw frequencies as word weights.
data$D.freq.min
matrix of the same size as data$D.freq.euc containing the distances of documents based on the Minkowski metric, using raw frequencies as word weights.
data$D.freq.sph
matrix of the same size as data$D.freq.euc containing the distances of documents based on the spherical metric (cosine dissimilarity), using raw frequencies as word weights.
data$D.tfidf.euc
matrix similar to data$D.freq.euc; contains euclidean distances of documents using tf-idf as word weights.
data$D.tfidf.max
matrix similar to data$D.freq.max; contains maximum distances of documents using tf-idf as word weights.
data$D.tfidf.man
matrix similar to data$D.freq.man; contains manhattan distances of documents using tf-idf as word weights.
data$D.tfidf.can
matrix similar to data$D.freq.can; contains canberra distances of documents using tf-idf as word weights.
data$D.tfidf.min
matrix similar to data$D.freq.min; contains Minkowski distances of documents using tf-idf as word weights.
data$D.tfidf.sph
matrix similar to data$D.freq.sph; contains spherical distances of documents using tf-idf as word weights.
data$CLS
integer vector of size Nd. Contains the cluster number assigned to each text document after clustering has been performed.
data$CRS
matrix Nc x Nt, where Nc is the number of clusters and Nt is the number of terms (words). Contains the centers of each cluster after clustering has been performed.
data$CRS.dist
matrix Nd x Nc. Contains the distances of each document from the center of each cluster, based on the metric passed to method centers.dist() in its last call.
data$CNTR
matrix Nc x Nt, where Nt is the number of terms (words). Contains the centers of each cluster after clustering has been performed.
data$CNTR.dist
vector of numerics of size Nd. Contains the distances of each document from the center of all documents, using the metric passed to method center.dist() in its last call.
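The distance fields above are standard metrics applied to pairs of rows of a weighted document-term matrix. As a minimal illustration (in Python rather than R, with hypothetical frequency vectors), the metrics named here can be computed for one pair of documents as follows:

```python
import math

# Word-frequency vectors for a hypothetical pair of documents.
a = [2.0, 1.0, 0.0]
b = [1.0, 0.0, 2.0]

euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
maximum   = max(abs(x - y) for x, y in zip(a, b))
manhattan = sum(abs(x - y) for x, y in zip(a, b))
# Canberra: sum of |x - y| / (|x| + |y|), skipping coordinates where both are zero.
canberra  = sum(abs(x - y) / (abs(x) + abs(y))
                for x, y in zip(a, b) if x != 0 or y != 0)
# Minkowski of order p generalizes euclidean (p = 2) and manhattan (p = 1).
p = 3
minkowski = sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)
# Spherical metric: cosine dissimilarity, 1 minus the cosine of the angle.
dot  = sum(x * y for x, y in zip(a, b))
norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
cosine_dissim = 1 - dot / norm
```

Each D.* field stores one of these quantities for every pair of documents, differing only in which weighting (raw frequency, tf-idf, or binary) supplies the vectors.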
clust(nc = settings$num_clust, weighting = settings$weighting,
metric = settings$metric)
Clusters the text documents using the given metric and weighting.
Arguments:
nc: a single integer specifying the number of clusters.
weighting: a single character. Must be within valid.weightings.
metric: a single character. Must be within valid.metrics.
Returns: integer vector containing the cluster numbers assigned to the text documents.
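The documentation does not state which clustering algorithm clust() uses internally; a common choice for this kind of task is k-means (Lloyd's algorithm) over the weighted document vectors. A minimal Python sketch under that assumption, with hypothetical two-term document vectors:

```python
import math

# Rows of a (hypothetical) weighted document-term matrix.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
k = 2
centers = [docs[0][:], docs[2][:]]  # naive initialization for the sketch

def dist(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

for _ in range(10):  # fixed number of Lloyd iterations
    # Assignment step: each document goes to its nearest center.
    labels = [min(range(k), key=lambda c: dist(d, centers[c])) for d in docs]
    # Update step: each center becomes the mean of its members.
    for c in range(k):
        members = [d for d, lab in zip(docs, labels) if lab == c]
        if members:
            centers[c] = [sum(col) / len(members) for col in zip(*members)]
```

The labels list corresponds to what clust() returns and data$CLS stores, and centers corresponds to data$CRS.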
clusterObject(cn)
Returns all documents of a given cluster, as a new TextMiner object.
Arguments:
cn: a single integer specifying the cluster number.
Returns: a fresh object of class TextMiner containing only the text documents within the given cluster number.
clusterObjects()
Returns each cluster as a new TextMiner object.
Arguments:
No arguments.
Returns: a list of objects of class TextMiner. Each element contains the text documents within one cluster.
get.dtm(cn = NULL)
Use this method to get the document-term matrix containing the raw frequency of each word in each document.
Arguments:
cn: a single integer specifying the cluster number. If NULL (default), the whole text corpus is included.
Returns: a numeric matrix containing the frequency of each term in each document.
get.mds(n.dim = 2, weighting = settings$weighting,
metric = settings$metric)
Multi-Dimensional Scaling (MDS) is a dimensionality-reduction method: coordinates of the text documents as vectors in a low-dimensional space are computed such that the sum of squared differences between the pairwise document distances in the original and the low-dimensional space is minimized. This method returns the equivalent vectors of the text documents in a low-dimensional space using multi-dimensional scaling.
Arguments:
n.dim: a single integer specifying the number of dimensions of the lower-dimensional space.
weighting: a single character within valid.weightings specifying the weighting.
metric: a single character within valid.metrics specifying the metric used for computing distances between text documents.
Returns: a matrix of numerics containing the coordinates of the equivalent vectors in the lower-dimensional space.
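The objective that MDS minimizes, often called stress, can be written out directly. A small Python sketch with hypothetical pairwise distances, showing a 1-D embedding that reproduces them exactly (stress of zero):

```python
# Original pairwise document distances (hypothetical values).
D = {(0, 1): 1.0, (0, 2): 2.0, (1, 2): 1.0}

# Candidate one-dimensional coordinates for the three documents.
x = [0.0, 1.0, 2.0]

# Stress: sum of squared differences between original and embedded distances.
stress = sum((D[i, j] - abs(x[i] - x[j])) ** 2 for i, j in D)
```

get.mds() searches for the coordinates that make this quantity as small as possible in the requested number of dimensions.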
get.tfidf(cn = NULL)
Returns the tf-idf-weighted document-term matrix, containing the tf-idf weight of each term in each document.
Arguments:
cn: a single integer specifying the cluster number. If NULL (default), the whole text corpus is included.
Returns: a numeric matrix containing the weight of each term in each document.
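The exact tf-idf variant used by TextMiner is not specified in this page; one common formulation weights each raw frequency by the log of the inverse document frequency. A Python sketch on a toy document-term matrix (hypothetical data):

```python
import math

# Toy document-term matrix: rows are documents, columns are terms.
terms = ["text", "mining", "cluster"]
dtm = [
    [2, 1, 0],
    [0, 1, 1],
    [1, 0, 2],
]

n_docs = len(dtm)
# Document frequency: how many documents contain each term.
df = [sum(1 for row in dtm if row[j] > 0) for j in range(len(terms))]

# One common tf-idf variant: tf * log(N / df).
tfidf = [
    [row[j] * math.log(n_docs / df[j]) for j in range(len(terms))]
    for row in dtm
]
```

The resulting matrix plays the role of data$W.tfidf, with dtm playing the role of data$DTM.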
get.weights(weighting = settings$weighting, cn = NULL)
Returns the vector of total term weights, depending on the given weighting.
Arguments:
weighting: a single character within valid.weightings specifying the weighting.
cn: a single integer specifying the cluster number. If NULL (default), the whole text corpus is included.
Returns: a named vector of numerics containing the total tf-idf or frequency weights of the terms.
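A total term weight is presumably the column sum of the chosen weight matrix over all documents. A one-line Python sketch with the frequency weighting and hypothetical data:

```python
# Toy document-term matrix: rows are documents, columns are terms.
terms = ["text", "mining", "cluster"]
dtm = [[2, 1, 0], [0, 1, 1], [1, 0, 2]]

# Total frequency weight of each term: sum its column over all documents.
totals = {t: sum(row[j] for row in dtm) for j, t in enumerate(terms)}
```

With the tf-idf weighting, the same column sums would be taken over the tf-idf matrix instead.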
initialize(dataset, text_col = "text", id_col = NULL, time_col = NULL,
label_col = NULL, settings = genDefaultSettings())
Class constructor function.
Arguments:
text_vect: vector of character containing raw text documents.
arr_time: vector of POSIXlt containing the time at which each text document was issued.
stop_words: vector of character specifying words to be removed from the text corpus. Default is tm::stopwords('english').
dictionary: data.frame of two columns containing words to be replaced by their synonyms. Words in the first column are replaced by the words in the second.
settings: list of various parameters containing the settings of the object. Refer to the class documentation for all setting parameters.
set.metric(m)
Changes the metric in the settings and clears all clusters.
Arguments:
m: a single character specifying the metric. Must be within valid.metrics.
Returns: Nothing. Changes the metric in the settings to the given metric and clears all clusters.
term.weights()
Returns term weights as a data.frame of two columns: the first column contains raw frequencies and the second contains tf-idf weights of each word in the corpus. Words appear as the row names of the data.frame.