simDic: Document Similarity using Dictionary
In smdc: Document Similarity

Description Usage Arguments Value Author(s) Examples

This function calculates the similarity between documents and documents by using dictionary.

1	simDic(docMatrix1, docMatrix2, scoreDict, breaks = seq(-1, 1, length = 11), norm = FALSE, method = "cosine", scoreFunc = mean)

`docMatrix1`	Document matrix whose rows represent feature vector of one document. This matrix must satisfy the following: colnames(docMatrix1) denote feature names, rownames(docMatrix1) denote document names, every element is numerical.
`docMatrix2`	Document matrix whose rows represent feature vector of one document. This matrix must satisfy the following: colnames(docMatrix2) denote feature names, rownames(docMatrix2) denote document names, every element is numerical.
`scoreDict`	Dictionary matrix which converts features to numbers. This matrix must k * 2 matrix: 1st colmn represents features and 2nd column represents corresponding number. Similarity is calculated according to the number.
`breaks`	Range vector of frequency distribution. Each element must be ascending order.
`norm`	Whether normalize similarity matrix or not.
`method`	Method to caluculate similarity.
`scoreFunc`	Function of scoring from dictionary.

Similarity Matrix whose rows represent documents of docMatrix1 and whose columns represent documents of docMatrix2. This matrix is n * m matrix where n=ncol(docMatrix1) and m=ncol(docMatrix2), and satisfy the following: rownames(returnValue)=colnames(docMatrix1), colnames(returnValue)=colnames(docMatrix2).

Masaaki TAKADA

## The function is currently defined as
function (docMatrix1, docMatrix2, scoreDict, breaks = seq(-1, 
    1, length = 11), norm = FALSE, method = "cosine", scoreFunc = mean) 
{
    library("proxy")
    words <- unique(rbind(matrix(rownames(docMatrix1)), matrix(rownames(docMatrix2))))
    words <- words[order(words)]
    wordScores <- rep(NA, length(words))
    for (i in 1:length(words)) {
        cond <- (scoreDict[, 1] == words[i])
        value <- scoreDict[cond, 2]
        if (length(value) != 0) {
            wordScores[i] <- scoreFunc(value, na.rm = TRUE)
        }
    }
    names(breaks) <- cut(breaks, breaks)
    wordClass <- cut(wordScores, breaks)
    names(wordClass) <- words
    docFreq1 <- conv2Freq(docMatrix1, wordClass, breaks)
    docFreq2 <- conv2Freq(docMatrix2, wordClass, breaks)
    colnames(docFreq1) <- paste("r_", colnames(docMatrix1), sep = "")
    colnames(docFreq2) <- paste("c_", colnames(docMatrix2), sep = "")
    sim <- as.matrix(simil(t(cbind(docFreq1, docFreq2)), method = method))[colnames(docFreq1), 
        colnames(docFreq2)]
    rownames(sim) <- colnames(docMatrix1)
    colnames(sim) <- colnames(docMatrix2)
    if (norm) {
        sim <- normalize(sim)
    }
    return(sim)
  }