Reference class TEXT.MINER combines properties and methods for running various text-mining algorithms.

`text`

vector of character containing the raw text documents, i.e. the contents of the argument `text_vect` passed to the class constructor.

`n.text`

a single integer indicating the number of text documents.

`stop.words`

vector of character specifying words to be removed from the text corpus.

`dictionary`

data.frame of two columns containing words to be replaced with their synonyms. Words in the first column are replaced by the words in the second.

`data$words`

vector of character containing the words in all the documents.

`time`

vector of POSIXlt containing the time at which each text document was issued. Must be given to the class constructor as an argument.

`settings`

list of parameters containing the settings of the text-miner object.

`data$DTM`

numeric matrix representing the document-term matrix of the text corpus. Prefer the method `get.dtm()` to retrieve this matrix.

`data$W.tfidf`

numeric matrix containing the tf-idf weights of the document-term matrix. Prefer the method `get.tfidf()` to retrieve this matrix.

`data$W.bin`

numeric matrix containing the binary weights of the document-term matrix.

`D.bin`

matrix `Nd x Nd` of numerics, where `Nd` is the number of documents. Contains the distances between all pairs of documents based on the *binary* metric.

`data$D.freq.euc`

matrix of the same size as `D.bin`. Contains the distances between all pairs of documents based on the *euclidean* metric, using raw frequencies as word weights.

`data$D.freq.max`

matrix of the same size as `data$D.freq.euc`. Contains the distances between documents based on the *maximum* metric, using raw frequencies as word weights.

`data$D.freq.man`

matrix of the same size as `data$D.freq.euc`. Contains the distances between documents based on the *manhattan* metric, using raw frequencies as word weights.

`data$D.freq.can`

matrix of the same size as `data$D.freq.euc`. Contains the distances between documents based on the *canberra* metric, using raw frequencies as word weights.

`data$D.freq.min`

matrix of the same size as `data$D.freq.euc`. Contains the distances between documents based on the *minkowski* metric, using raw frequencies as word weights.

`data$D.freq.sph`

matrix of the same size as `data$D.freq.euc`. Contains the distances between documents based on the *spherical* metric (cosine dissimilarity), using raw frequencies as word weights.

`data$D.tfidf.euc`

matrix similar to `data$D.freq.euc`. Contains the *euclidean* distances between documents using *tf-idf* word weights.

`data$D.tfidf.max`

matrix similar to `data$D.freq.max`. Contains the *maximum* distances between documents using *tf-idf* word weights.

`data$D.tfidf.man`

matrix similar to `data$D.freq.man`. Contains the *manhattan* distances between documents using *tf-idf* word weights.

`data$D.tfidf.can`

matrix similar to `data$D.freq.can`. Contains the *canberra* distances between documents using *tf-idf* word weights.

`data$D.tfidf.min`

matrix similar to `data$D.freq.min`. Contains the *minkowski* distances between documents using *tf-idf* word weights.

`data$D.tfidf.sph`

matrix similar to `data$D.freq.sph`. Contains the *spherical* distances between documents using *tf-idf* word weights.

`data$CLS`

integer vector of size `Nd`. Contains the cluster number assigned to each text document after clustering has been performed.

`data$CRS`

matrix `Nc x Nt`, where `Nc` is the number of clusters and `Nt` is the number of terms (words). Contains the center of each cluster after clustering has been performed.

`data$CRS.dist`

matrix `Nd x Nc`. Contains the distance of each document from each cluster center, based on the metric passed to the method `centers.dist()` in its last call.

`data$CNTR`

matrix `Nc x Nt`, where `Nt` is the number of terms (words). Contains the center of each cluster after clustering has been performed.

`data$CNTR.dist`

vector of numerics of size `Nd`. Contains the distance of each document from the center of all documents, using the metric passed to the method `center.dist()` in its last call.
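The weight fields above can be illustrated with a minimal base-R sketch. This is an assumption about the weighting scheme (the page does not state the package's exact tf-idf formula); the toy matrix and variable names are invented for illustration.

```r
# Toy document-term matrix (rows = documents, columns = terms)
dtm <- matrix(c(2, 0, 1,
                0, 3, 1), nrow = 2, byrow = TRUE,
              dimnames = list(c("doc1", "doc2"), c("apple", "ball", "cat")))

tf      <- dtm / rowSums(dtm)                # term frequency per document
idf     <- log(nrow(dtm) / colSums(dtm > 0)) # inverse document frequency
W.tfidf <- sweep(tf, 2, idf, `*`)            # tf-idf weights (cf. data$W.tfidf)
W.bin   <- (dtm > 0) * 1                     # binary weights (cf. data$W.bin)
```

The pairwise distance fields (`data$D.freq.*`, `data$D.tfidf.*`) correspond to applying a distance metric such as euclidean, maximum, manhattan, canberra, or minkowski to the rows of the weighted matrix.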

`clust(nc = settings$num_clust, weighting = settings$weighting, metric = settings$metric)`

Clusters the text documents using the given metric and weighting.

Arguments:

nc: a single integer specifying the number of clusters.

weighting: a single character. Must be within `valid.weightings`.

metric: a single character. Must be within `valid.metrics`.

Returns: an integer vector containing the cluster number assigned to each text document.
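A hypothetical usage sketch; `tm` is assumed to be an existing TEXT.MINER object, and the weighting/metric strings are assumptions, since the valid values are not listed on this page.

```r
cls <- tm$clust(nc = 3, weighting = "tfidf", metric = "euclidean")
table(cls)   # number of documents in each cluster
```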

`clusterObject(cn)`

Returns all documents of a given cluster, as a new TEXT.MINER object.

Arguments:

cn: a single integer specifying the cluster number.

Returns: a fresh object of class TEXT.MINER containing only the text documents within the given cluster number.

`clusterObjects()`

Returns each cluster as a new TEXT.MINER object.

Arguments:

No arguments.

Returns: a list of objects of class TEXT.MINER. Each element contains the text documents within one cluster.
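A hypothetical sketch of splitting a clustered object into per-cluster objects and inspecting the document count of each (`tm` is assumed to be a TEXT.MINER object on which clustering has already been run).

```r
parts <- tm$clusterObjects()
sapply(parts, function(p) p$n.text)   # documents per cluster object
```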

`get.dtm(cn = NULL)`

Use this method to get the document-term matrix containing the raw frequency of each word in each document.

Arguments:

cn: a single integer specifying the cluster number. If NULL (default), the whole text corpus is included.

Returns: a numeric matrix containing the frequency of each term in each document.
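A hypothetical sketch of retrieving the document-term matrix for the whole corpus and for one cluster (`tm` is assumed to be a TEXT.MINER object).

```r
dtm.all <- tm$get.dtm()        # whole corpus
dtm.c1  <- tm$get.dtm(cn = 1)  # cluster 1 only
dim(dtm.all)                   # documents x terms
```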

`get.mds(n.dim = 2, weighting = settings$weighting, metric = settings$metric)`

Multi-dimensional scaling (MDS) is a dimensionality-reduction method: it computes coordinates for the text documents as vectors in a low-dimensional space such that the sum of squared differences between the pairwise document distances in the original and reduced spaces is minimized. This method returns those equivalent low-dimensional vectors.

Arguments:

n.dim: a single integer specifying the number of dimensions of the lower-dimensional space.

weighting: a single character within `valid.weightings` specifying the weighting.

metric: a single character within `valid.metrics` specifying the metric used for computing distances between text documents.

Returns: a numeric matrix containing the coordinates of the equivalent vectors in the lower-dimensional space.
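A hypothetical sketch of projecting the documents to two dimensions and plotting them coloured by cluster (`tm` is assumed to be a clustered TEXT.MINER object; the `data$CLS` field is described in the properties above).

```r
xy <- tm$get.mds(n.dim = 2)
plot(xy[, 1], xy[, 2], col = tm$data$CLS, pch = 19,
     xlab = "dim 1", ylab = "dim 2")
```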

`get.tfidf(cn = NULL)`

Returns the tf-idf-weighted document-term matrix, containing the tf-idf weight of each term in each document.

Arguments:

cn: a single integer specifying the cluster number. If NULL (default), the whole text corpus is included.

Returns: a numeric matrix containing the weight of each term in each document.

`get.weights(weighting = settings$weighting, cn = NULL)`

Returns the vector of total term weights depending on the given weighting.

Arguments:

weighting: a single character within `valid.weightings` specifying the weighting.

cn: a single integer specifying the cluster number. If NULL (default), the whole text corpus is included.

Returns: a named numeric vector containing the total tf-idf or frequency weights of the terms.
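A hypothetical sketch of listing the heaviest terms in the corpus (`tm` is assumed to be a TEXT.MINER object; the weighting string is an assumption).

```r
w <- tm$get.weights(weighting = "tfidf")
head(sort(w, decreasing = TRUE), 10)   # ten terms with the largest total weight
```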

`initialize(dataset, text_col = "text", id_col = NULL, time_col = NULL, label_col = NULL, settings = genDefaultSettings())`

Class Constructor function.

Arguments:

text_vect: vector of character containing the raw text documents.

arr_time: vector of POSIXlt containing the time at which each text document was issued.

stop_words: vector of character specifying words to be removed from the text corpus. Default is `tm::stopwords('english')`.

dictionary: data.frame of two columns containing words to be replaced by their synonyms. Words in the first column are replaced by the words in the second.

settings: list of parameters containing the settings of the object. Refer to the class documentation for all setting parameters.
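A hypothetical construction sketch following the signature above, assuming TEXT.MINER is a Reference Class generator (so instances are created with `$new()`); the data.frame and its column values are invented for illustration.

```r
docs <- data.frame(text = c("the cat sat", "dogs bark loudly", "cats and dogs"),
                   stringsAsFactors = FALSE)
tm <- TEXT.MINER$new(docs, text_col = "text")
tm$n.text   # number of documents in the corpus
```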

`set.metric(m)`

Changes the metric in the settings and clears all clusters.

Arguments:

m: a single character specifying the metric. Must be within `valid.metrics`.

Returns: nothing. The metric in the settings is changed to the given metric and all clusters are cleared.

`term.weights()`

Returns the term weights as a data.frame of two columns: the first column contains raw frequencies and the second contains the tf-idf weights of each word in the corpus. Words appear as row names of the data.frame.
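A hypothetical sketch of comparing raw-frequency and tf-idf weights per word (`tm` is assumed to be a TEXT.MINER object; the column index used for sorting is an assumption, since the column names are not documented here).

```r
tw <- tm$term.weights()
head(tw[order(-tw[[2]]), ])   # words with the highest tf-idf weight
```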
