TEXT_DOC_DISSIM: Dissimilarity calculation of text documents
In textTinyR: Text Processing for Small or Big Data Files

Description

Computes the row-wise dissimilarity between two numeric matrices whose rows represent text documents.

Usage

TEXT_DOC_DISSIM(
  first_matr = NULL,
  second_matr = NULL,
  method = "euclidean",
  batches = NULL,
  threads = 1,
  verbose = FALSE
)

Arguments

`first_matr`	a numeric matrix in which each row represents a text document (must have the same dimensions as `second_matr`)
`second_matr`	a numeric matrix in which each row represents a text document (must have the same dimensions as `first_matr`)
`method`	a dissimilarity metric in the form of a character string. One of: euclidean, manhattan, chebyshev, canberra, braycurtis, pearson_correlation, cosine, simple_matching_coefficient, hamming, jaccard_coefficient, Rao_coefficient
`batches`	a numeric value specifying the number of batches into which the data is split
`threads`	a numeric value specifying the number of cores to use in parallel
`verbose`	either TRUE or FALSE. If TRUE then information will be printed in the console

Details

Row-wise dissimilarity calculation of text documents. The text document sequences should first be converted to numeric matrices using, for instance, LSI (Latent Semantic Indexing). If the numeric matrices are too big to be processed in one pass, use the batches parameter to split the data into batches before one of the dissimilarity metrics is applied. Parallelization (threads) relies on OpenMP.
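
To make "row-wise" concrete, here is a minimal base-R sketch, not the package's implementation. It assumes that the euclidean method is the standard Euclidean distance between row i of the first matrix and row i of the second. The helpers `rowwise_euclidean` and `rowwise_in_batches` are hypothetical names introduced only for illustration; the second shows the general idea of processing the rows in chunks, which is the idea behind the batches parameter.

```r
# Hypothetical helper: Euclidean distance between corresponding rows
# (assumption: this is what "row-wise euclidean dissimilarity" means).
rowwise_euclidean = function(m1, m2) {
  stopifnot(identical(dim(m1), dim(m2)))      # both inputs need equal dimensions
  sqrt(rowSums((m1 - m2)^2))
}

# Hypothetical helper: compute the same values chunk by chunk, so that only
# a subset of rows is handled at any one time.
rowwise_in_batches = function(m1, m2, n_batches) {
  n = nrow(m1)
  batch_id = ceiling(seq_len(n) * n_batches / n)   # batch label (1..n_batches) per row
  idx = split(seq_len(n), batch_id)
  res = lapply(idx, function(i) {
    rowwise_euclidean(m1[i, , drop = FALSE], m2[i, , drop = FALSE])
  })
  unlist(res, use.names = FALSE)
}

set.seed(1)
a = matrix(runif(20), 5, 4)
b = matrix(runif(20), 5, 4)

d_full = rowwise_euclidean(a, b)                     # one value per row pair
d_batched = rowwise_in_batches(a, b, n_batches = 2)  # same values, computed in 2 chunks
```

Both calls return a numeric vector with one entry per row, and batching changes only how much data is handled at once, not the result.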

Value

a numeric vector of row-wise dissimilarities

Examples


## Not run: 

library(textTinyR)


# example input LSI matrices (see details section)
#-------------------------------------------------

set.seed(1)
LSI_matrix1 = matrix(runif(10000), 100, 100)

set.seed(2)
LSI_matrix2 = matrix(runif(10000), 100, 100)


txt_out = TEXT_DOC_DISSIM(first_matr = LSI_matrix1,
                          second_matr = LSI_matrix2,
                          method = 'euclidean')

## End(Not run)

textTinyR documentation built on June 24, 2024, 5:16 p.m.