
TextMiner-class    R Documentation

Reference Class TextMiner is a combination of properties and methods for running various text mining algorithms

Description

Reference Class TextMiner is a combination of properties and methods for running various text mining algorithms

Fields

text

vector of character containing the raw text documents; these are the contents of the argument text_vect passed to the class constructor.

n.text

a single integer indicating the count of text documents.

stop.words

vector of character specifying words to be removed from the text corpus.

dictionary

data.frame of two columns containing words to be replaced with their synonyms. Words in the first column are replaced by the words in the second.

data$words

vector of character containing the words in all the documents.

time

vector of POSIXlt containing the time at which each text document was issued. Must be given to the class constructor as an argument.

settings

list of parameters holding the settings of the text miner object.

data$DTM

a matrix of numerics representing the document-term matrix of the text corpus. Prefer the method get.dtm() for accessing this matrix.

data$W.tfidf

matrix of numerics containing the tf-idf weights of the document-term matrix. Prefer the method get.tfidf() for accessing this matrix.
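As an illustration of how such weights are commonly derived (a minimal base-R sketch on toy data; the exact weighting scheme used by texer may differ), tf-idf weights can be computed from a raw document-term matrix like this:

```r
# Toy document-term matrix: 3 documents x 4 terms (raw counts).
dtm <- matrix(c(2, 0, 1, 0,
                0, 1, 1, 1,
                1, 1, 0, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("doc", 1:3),
                              c("apple", "berry", "cherry", "date")))

# Term frequency: counts normalised by document length.
tf <- dtm / rowSums(dtm)

# Inverse document frequency: log of (N / number of docs containing the term).
idf <- log(nrow(dtm) / colSums(dtm > 0))

# tf-idf weight matrix (one common formulation).
w_tfidf <- sweep(tf, 2, idf, `*`)
```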

data$W.bin

matrix of numerics containing the binary weights of the document-term matrix.

D.bin

matrix Nd x Nd of numerics, where Nd is the number of documents. Contains the distances of all pairs of documents based on the binary metric.

data$D.freq.euc

matrix the same size as D.bin. Contains the distances of all pairs of documents based on the Euclidean metric, using raw frequencies as word weights.

data$D.freq.max

matrix the same size as data$D.freq.euc containing the distances of documents based on the maximum metric, using raw frequencies as word weights.

data$D.freq.man

matrix the same size as data$D.freq.euc containing the distances of documents based on the Manhattan metric, using raw frequencies as word weights.

data$D.freq.can

matrix the same size as data$D.freq.euc containing the distances of documents based on the Canberra metric, using raw frequencies as word weights.

data$D.freq.min

matrix the same size as data$D.freq.euc containing the distances of documents based on the Minkowski metric, using raw frequencies as word weights.

data$D.freq.sph

matrix the same size as data$D.freq.euc containing the distances of documents based on the spherical metric (cosine dissimilarities), using raw frequencies as word weights.

data$D.tfidf.euc

matrix similar to data$D.freq.euc containing the Euclidean distances of documents, using tf-idf as word weights.

data$D.tfidf.max

matrix similar to data$D.freq.max containing the maximum distances of documents, using tf-idf as word weights.

data$D.tfidf.man

matrix similar to data$D.freq.man containing the Manhattan distances of documents, using tf-idf as word weights.

data$D.tfidf.can

matrix similar to data$D.freq.can containing the Canberra distances of documents, using tf-idf as word weights.

data$D.tfidf.min

matrix similar to data$D.freq.min containing the Minkowski distances of documents, using tf-idf as word weights.

data$D.tfidf.sph

matrix similar to data$D.freq.sph containing the spherical distances of documents, using tf-idf as word weights.
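The metrics listed above correspond to the methods of base R's dist(), plus a spherical (cosine) dissimilarity. A minimal sketch with a toy weight matrix, assuming the usual definitions of these metrics:

```r
# Toy weight matrix (rows = documents, cols = terms); could hold raw
# frequencies (for the D.freq.* fields) or tf-idf weights (D.tfidf.*).
W <- matrix(c(1, 0, 2,
              0, 1, 1,
              2, 1, 0),
            nrow = 3, byrow = TRUE)

# Metrics handled directly by base R's dist():
D_euc <- as.matrix(dist(W,     method = "euclidean"))
D_max <- as.matrix(dist(W,     method = "maximum"))
D_man <- as.matrix(dist(W,     method = "manhattan"))
D_can <- as.matrix(dist(W,     method = "canberra"))
D_min <- as.matrix(dist(W,     method = "minkowski", p = 3))
D_bin <- as.matrix(dist(W > 0, method = "binary"))

# Spherical metric: cosine dissimilarity, 1 - cosine of the angle
# between the two document vectors.
cos_sim <- tcrossprod(W) / tcrossprod(sqrt(rowSums(W^2)))
D_sph   <- 1 - cos_sim
```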

data$CLS

integer vector of size Nd. Contains the cluster number associated with each text document after clustering has been performed.

data$CRS

matrix Nc x Nt, where Nc is the number of clusters and Nt is the number of terms (words). Contains the center of each cluster after clustering has been performed.

data$CRS.dist

matrix Nd x Nc. Contains distances of each document from centers of each cluster based on the metric passed to method centers.dist() in its last call.

data$CNTR

matrix Nc x Nt, where Nt is the number of terms (words). Contains the center of each cluster after clustering has been performed.

data$CNTR.dist

vector of numerics of size Nd. Contains the distances of each document from the center of all documents, using the metric passed to method center.dist() in its last call.

Methods

clust(nc = settings$num_clust, weighting = settings$weighting, metric = settings$metric)

Clusters the text documents using the given metric and weighting.

Arguments:

nc: a single integer specifying the number of clusters.

weighting: a single character. Must be within valid.weightings.

metric: a single character. Must be within valid.metrics.

Returns: integer vector containing cluster numbers associated with text documents.
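A rough sketch of what such a clustering step can look like, using base R's kmeans() on a toy weight matrix (the package's actual clustering algorithm and metric handling may differ; W, CLS and CRS here are illustrative stand-ins for the corresponding fields):

```r
set.seed(1)

# Toy weight matrix: 6 documents x 4 terms, with two obvious groups.
W <- rbind(matrix(runif(12, 0.8, 1.0), nrow = 3),  # docs 1-3: high weights
           matrix(runif(12, 0.0, 0.2), nrow = 3))  # docs 4-6: low weights

nc <- 2                   # number of clusters (cf. the `nc` argument)
km <- kmeans(W, centers = nc)

CLS <- km$cluster         # per-document cluster labels (cf. data$CLS)
CRS <- km$centers         # nc x n-terms cluster centers (cf. data$CRS)
```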

clusterObject(cn)

Returns all documents of a given cluster, as a new TextMiner object.

Arguments:

cn: a single integer specifying the cluster number.

Returns: a fresh object of class TextMiner containing only the text documents within the given cluster number.

clusterObjects()

Returns each cluster as a new TextMiner object.

Arguments:

No arguments.

Returns: a list of objects of class TextMiner. Each element contains the text documents within one cluster.

get.dtm(cn = NULL)

Use this method to get the document-term matrix containing the raw frequency of each word in each document.

Arguments:

cn: a single integer specifying the cluster number. If NULL (default), the whole text corpus is included.

Returns: a numeric matrix containing the frequency of each term in each document.
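For illustration, a raw-frequency document-term matrix of this kind can be built from tokenised text in base R (a simplified sketch; the package itself likely performs additional preprocessing such as stop-word removal and dictionary replacement):

```r
docs <- c("the cat sat", "the dog sat", "the cat ran")

# Tokenise on whitespace and collect the vocabulary.
tokens <- strsplit(docs, "\\s+")
terms  <- sort(unique(unlist(tokens)))

# Count the frequency of every term in every document.
dtm <- t(sapply(tokens, function(tk) table(factor(tk, levels = terms))))
rownames(dtm) <- paste0("doc", seq_along(docs))
```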

get.mds(n.dim = 2, weighting = settings$weighting, metric = settings$metric)

Multi-Dimensional Scaling (MDS) is a dimensionality reduction method: coordinates of the text documents as vectors in a low-dimensional space are computed such that the sum of squared differences between the pairwise document distances in the two spaces is minimized. This method returns the equivalent vectors of the text documents in a low-dimensional space using multi-dimensional scaling.

Arguments:

n.dim: a single integer specifying the number of dimensions of the lower-dimensional space.

weighting: a single character within valid.weightings specifying the weighting.

metric: a single character within valid.metrics specifying the metric used for computing distances between text documents.

Returns: a matrix of numerics containing the coordinates of the equivalent vectors in the lower-dimensional space.
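Classical MDS as described is available in base R as cmdscale(), which can serve as a mental model for this method (a sketch on toy data; the actual implementation may differ):

```r
# Pairwise document distances from a toy weight matrix (4 docs x 3 terms).
W <- matrix(c(1, 0, 2,
              0, 1, 1,
              2, 1, 0,
              1, 1, 1),
            nrow = 4, byrow = TRUE)
D <- dist(W, method = "euclidean")

# Classical multi-dimensional scaling into n.dim = 2 dimensions.
n.dim  <- 2
coords <- cmdscale(D, k = n.dim)   # 4 x 2 matrix of document coordinates
```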

get.tfidf(cn = NULL)

Returns the tf-idf weighted document-term matrix, containing the tf-idf weight of each term in each document.

Arguments:

cn: a single integer specifying the cluster number. If NULL (default), the whole text corpus is included.

Returns: a numeric matrix containing the weight of each term in each document.

get.weights(weighting = settings$weighting, cn = NULL)

Returns the vector of total term weights depending on the given weighting.

Arguments:

weighting: a single character within valid.weightings specifying the weighting.

cn: a single integer specifying the cluster number. If NULL (default), the whole text corpus is included.

Returns: a named vector of numerics containing the total tf-idf or frequency weights of the terms.

initialize(dataset, text_col = "text", id_col = NULL, time_col = NULL, label_col = NULL, settings = genDefaultSettings())

Class Constructor function.

Arguments:

text_vect: vector of character containing raw text documents.

arr_time: vector of POSIXlt containing the time at which each text document was issued.

stop_words: vector of character specifying words to be removed from the text corpus. Default is tm::stopwords('english').

dictionary: data.frame of two columns containing words to be replaced by their synonyms. Words in the first column are replaced by the words in the second.

settings: list of various parameters containing the settings of the object. Refer to the class documentation for all setting parameters.

set.metric(m)

Changes the metric in the settings and clears all clusters.

Arguments:

m: a single character specifying the metric. Must be within valid.metrics.

Returns: Nothing. Changes the metric in the settings to the given metric and clears all clusters.

term.weights()

Returns term weights as a data.frame of two columns. The first column contains raw frequencies and the second contains tf-idf weights of each word in the corpus. Words appear as row names of the data.frame.
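The shape of the returned data.frame can be sketched in base R (toy data; the tf-idf formulation shown is one common variant and may differ from the package's):

```r
# Toy document-term matrix: 3 documents x 3 terms (raw counts).
dtm <- matrix(c(2, 0, 1,
                0, 1, 1,
                1, 1, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(NULL, c("apple", "berry", "cherry")))

# tf-idf weights (one common formulation).
tf      <- dtm / rowSums(dtm)
idf     <- log(nrow(dtm) / colSums(dtm > 0))
w_tfidf <- sweep(tf, 2, idf, `*`)

# Two-column summary with terms as row names, as term.weights() describes.
tw <- data.frame(freq  = colSums(dtm),
                 tfidf = colSums(w_tfidf))
```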


genpack/texer documentation built on March 23, 2022, 2:14 p.m.