computeTfIdf: Compute Term Frequency - Inverse Document Frequency on a...
In toaster: Big Data in-Database Analytics that Scales with Teradata Aster Distributed Platform


Compute Term Frequency - Inverse Document Frequency on a corpus.
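
For orientation, the following base R sketch works through the textbook tf-idf weighting on a toy corpus. It assumes that tf is a term's count divided by the document length and that idf is log(N / df); the exact normalization applied by the Aster functions behind computeTfIdf may differ, and the corpus and all variable names here are hypothetical.

```r
# Toy corpus: three tiny "documents" (hypothetical data, not from Aster).
docs <- c(d1 = "theft of vehicle",
          d2 = "theft of property",
          d3 = "vehicle burglary")
tokens <- strsplit(docs, " ", fixed = TRUE)

# One row per (document, term) pair with the raw term count.
counts <- do.call(rbind, lapply(names(tokens), function(d) {
  tab <- table(tokens[[d]])
  data.frame(docid = d, term = names(tab), count = as.vector(tab),
             stringsAsFactors = FALSE)
}))

# tf: term count normalized by the number of terms in the document (assumption).
counts$tf <- counts$count / ave(counts$count, counts$docid, FUN = sum)

# idf: log of (number of documents / number of documents containing the term).
doc_freq <- table(counts$term)
n_docs <- length(docs)
counts$idf <- log(n_docs / as.vector(doc_freq[counts$term]))

counts$tf_idf <- counts$tf * counts$idf

# Order terms by tf-idf within each document, highest first.
counts <- counts[order(counts$docid, -counts$tf_idf), ]
print(counts)
```

Terms that occur in only one document, such as "burglary", receive the largest idf, while terms shared by several documents are weighted down.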

computeTfIdf(channel, tableName, docId, textColumns, parser, top = NULL,
  rankFunction = "rank", idSep = "-", idNull = "(null)",
  adjustDocumentCount = FALSE, where = NULL, stopwords = NULL,
  test = FALSE)

`channel`	connection object as returned by `odbcConnect`
`tableName`	Aster table name
`docId`	vector with one or more column names that together form a unique document id. Values are concatenated with `idSep`. Database NULLs are replaced with the `idNull` string.
`textColumns`	one or more names of columns with text. Multiple columns are concatenated into a single text field first.
`parser`	type of parser to use on text. For example, the `nGram(2)` parser generates 2-grams (ngrams of length 2), and the `token(2)` parser generates 2-word combinations of terms within documents.
`top`	specifies the threshold for cutting off terms ranked below the `top` value. If the value is greater than 0 then only the top-ranking terms are included; otherwise all terms are returned (also see parameter `rankFunction`). Terms are always ordered by their term frequency - inverse document frequency (tf-idf) within each document. Filtered-out terms have a rank arithmetically greater than the threshold `top` (see details): the smaller a term's rank value, the more important the term.
`rankFunction`	one of `rownumber`, `rank`, `denserank`, `percentrank`. A rank is computed and returned for each term within each document. The function determines which SQL window function computes the term rank value (the default `rank` corresponds to the SQL `RANK()` window function). When the threshold `top` is greater than 0, the ranking function is used to limit the number of terms returned (see details).
`idSep`	separator when concatenating 2 or more document id columns (see `docId`).
`idNull`	string to replace NULL value in document id columns.
`adjustDocumentCount`	logical: if TRUE then the number of documents will be increased by 1.
`where`	specifies criteria that table rows must satisfy before the computation is applied. The criteria are expressed as SQL predicates (inside a `WHERE` clause).
`stopwords`	character vector of stop words. Stop words are removed in R after the results are computed and returned from Aster.
`test`	logical: if TRUE, only show what would be done (similar to the parameter `test` in the RODBC functions `sqlQuery` and `sqlSave`).
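
To make the interplay of `docId`, `idSep` and `idNull` concrete, here is a small base R sketch of how a composite document id could be assembled from two id columns. It only mimics the documented behavior locally; toaster builds the id in SQL inside Aster, and the data frame and column names below are hypothetical.

```r
# Hypothetical id columns; NA stands in for a database NULL.
ids <- data.frame(zip    = c("7520", NA, "7521"),
                  status = c("A", "B", NA),
                  stringsAsFactors = FALSE)

idSep  <- "-"        # default separator from the usage section
idNull <- "(null)"   # default NULL replacement from the usage section

# Replace missing values first, then concatenate the columns row by row.
ids[is.na(ids)] <- idNull
docid <- do.call(paste, c(unname(as.list(ids)), sep = idSep))

print(docid)
# "7520-A" "(null)-B" "7521-(null)"
```

Replacing NULLs before concatenation keeps rows with a missing id component as documents of their own rather than dropping them.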

By default the function computes and returns all terms. When a large number of terms is expected, use the parameter top to limit the number of terms returned by keeping only the top-ranked terms for each document. Thus, with top=1000 and 100 documents, up to 100,000 terms (rows) will be returned when rankFunction is rownumber (exactly 100,000 if every document contains at least 1000 terms). The result size can exceed this number when a rankFunction other than rownumber is used, because tied terms share a rank:

rownumber applies a sequential row number, starting at 1, to each term in a document. Tie-breaker behavior: terms that compare as equal in the sort order are sorted arbitrarily within the scope of the tie, and all terms are given unique row numbers.
rank assigns the current row-count number as the term's rank, provided the term does not sort as equal (tie) with another term. Tie-breaker behavior: terms that compare as equal in the sort order are sorted arbitrarily within the scope of the tie, and the sorted-as-equal terms get the same rank number.
denserank behaves like rank, except that it never leaves gaps in the rank sequence. The tie-breaker behavior is the same as that of RANK(), in that the sorted-as-equal terms receive the same rank. With denserank, however, the next term after a set of equally ranked terms gets a rank 1 higher than the preceding tied terms.
percentrank assigns a relative rank to each term, using the formula (rank - 1) / (total rows - 1). Tie-breaker behavior: terms that compare as equal are sorted arbitrarily within the scope of the tie, and the sorted-as-equal terms get the same percent rank number.

The ordering of the rows is always by their tf-idf value within each document.
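
As a rough illustration of how the choice of ranking function changes the number of terms that survive a given `top`, the base R sketch below mimics the four SQL window functions on the terms of a single document. The tf-idf values are made up, and base R stands in for Aster SQL here; toaster itself computes the ranks in the database.

```r
# Hypothetical tf-idf values for the terms of one document (note the tie at 0.7).
tfidf <- c(theft = 0.9, vehicle = 0.7, burglary = 0.7, property = 0.4, of = 0.1)

# A higher tf-idf means a smaller (better) rank, so rank the negated values.
rownumber   <- rank(-tfidf, ties.method = "first")   # like ROW_NUMBER(); ties broken by input order here
rnk         <- rank(-tfidf, ties.method = "min")     # like RANK(); ties share a rank, gaps follow
denserank   <- match(-tfidf, sort(unique(-tfidf)))   # like DENSE_RANK(); ties share a rank, no gaps
percentrank <- (rnk - 1) / (length(tfidf) - 1)       # like PERCENT_RANK()

ranks <- data.frame(term        = names(tfidf),
                    tfidf       = unname(tfidf),
                    rownumber   = unname(rownumber),
                    rank        = unname(rnk),
                    denserank   = denserank,
                    percentrank = unname(percentrank))
print(ranks)

# Number of terms kept when filtering on rank <= top for each ranking function.
kept_for <- function(top) {
  sapply(ranks[c("rownumber", "rank", "denserank")], function(r) sum(r <= top))
}

print(kept_for(2))  # rownumber 2, rank 3, denserank 3
print(kept_for(3))  # rownumber 3, rank 3, denserank 4
```

Only rownumber guarantees at most `top` terms per document; rank and denserank can return more whenever ties occur at or above the cutoff.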

computeTf, nGram, token

if(interactive()){
# initialize connection to Dallas database in Aster
# (paste0 keeps the connection string free of an embedded line break)
conn = odbcDriverConnect(connection=paste0("driver={Aster ODBC Driver};",
                         "server=<dbhost>;port=2406;database=<dbname>;uid=<user>;pwd=<pw>"))

# compute term-document-matrix of all 2-word Ngrams of Dallas police crime reports
# for each 4-digit zip
tdm1 = computeTfIdf(channel=conn, tableName="public.dallaspoliceall", 
                    docId="substr(offensezip, 1, 4)", 
                    textColumns=c("offensedescription", "offensenarrative"),
                    parser=nGram(2, ignoreCase=TRUE, 
                                 punctuation="[-.,?\\!:;~()]+"))
                    
# compute term-document-matrix of all 2-word combinations of Dallas police crime reports
# for each type of offense status
tdm2 = computeTfIdf(channel=conn, tableName="public.dallaspoliceall", docId="offensestatus", 
                    textColumns=c("offensedescription", "offensenarrative", "offenseweather"),
                    parser=token(2), 
                    where="offensestatus NOT IN ('System.Xml.XmlElement', 'C')")
                    
# include only top 100 ranked 2-word ngrams for each 4-digit zip into resulting 
# term-document-matrix using rank function  
tdm3 = computeTfIdf(channel=conn, tableName="public.dallaspoliceall", 
                    docId="substr(offensezip, 1, 4)", 
                    textColumns=c("offensedescription", "offensenarrative"),
                    parser=nGram(2), top=100)
                    
# same but get top 10% ranked terms using percent rank function
tdm4 = computeTfIdf(channel=conn, tableName="public.dallaspoliceall", 
                    docId="substr(offensezip, 1, 4)", 
                    textColumns=c("offensedescription", "offensenarrative"),
                    parser=nGram(1), top=0.10, rankFunction="percentrank")

}

toaster documentation built on May 30, 2017, 3:51 a.m.
