R/tf-idf.R

Defines functions h2o.tf_idf

Documented in h2o.tf_idf

#' Computes TF-IDF values for each word in given documents.
#'
#' @param frame             documents or words frame for which TF-IDF values should be computed.
#' @param document_id_col   index or name of a column containing document IDs.
#' @param text_col          index or name of a column containing documents if `preprocess = TRUE`
#'                          or words if `preprocess = FALSE`.
#' @param preprocess        whether input text data should be pre-processed. Defaults to `TRUE`.
#' @param case_sensitive    whether input data should be treated as case sensitive. Defaults to `TRUE`.
#'
#' @return  resulting frame with TF-IDF values.
#'          Row format: documentID, word, TF, IDF, TF-IDF
#' @export
h2o.tf_idf <- function(frame, document_id_col, text_col, preprocess=TRUE, case_sensitive=TRUE) {
    col_indices <- c()
    for (col in c(document_id_col, text_col))
        if(is.numeric(col) && all.equal(col, as.integer(col)))
            col_indices <- c(col_indices, col)
        else if (is.character(col))
            col_indices <- c(col_indices, match(col, colnames(frame)))
        else {
            warning(paste0("Invalid type to specify a column ('", class(col), "'). Name or index of a column is required."))
            return(NULL)
        }
    
    if(is(frame, 'H2OFrame')) {
        .newExpr('tf-idf', frame, col_indices[1], col_indices[2], preprocess, case_sensitive)
    } else {
        warning(paste0("TF-IDF cannot be computed for class ", class(frame), ". H2OFrame input is required."))
        return(NULL)
    }
}

Try the h2o package in your browser

Any scripts or data that you put into this service are public.

h2o documentation built on May 29, 2024, 4:26 a.m.