computeTfIdf: Compute Term Frequency - Inverse Document Frequency on a...


Description

Compute Term Frequency - Inverse Document Frequency on a corpus.

Usage

computeTfIdf(channel, tableName, docId, textColumns, parser, top = NULL,
  rankFunction = "rank", idSep = "-", idNull = "(null)",
  adjustDocumentCount = FALSE, where = NULL, stopwords = NULL,
  test = FALSE)

Arguments

channel

connection object as returned by odbcConnect

tableName

Aster table name

docId

vector of one or more column names that together form the unique document id. Values are concatenated with idSep. Database NULLs are replaced with the idNull string.

textColumns

one or more names of columns with text. Multiple columns are first concatenated into a single text field.

parser

type of parser to use on text. For example, the nGram(2) parser generates 2-grams (ngrams of length 2), and the token(2) parser generates 2-word combinations of terms within documents.

top

specifies the threshold for cutting off terms ranked below the top value. If the value is greater than 0, only the top-ranking terms are included; otherwise all terms are returned (see also parameter rankFunction). Terms are always ordered by their term frequency - inverse document frequency (tf-idf) within each document. Filtered-out terms have a rank arithmetically greater than the threshold top (see Details): the smaller a term's rank, the more important the term.

rankFunction

one of rownumber, rank, denserank, percentrank. A rank is computed and returned for each term within each document. This argument determines which SQL window function computes the term rank (the default rank corresponds to the SQL RANK() window function). When the threshold top is greater than 0, the ranking function is used to limit the number of terms returned (see Details).

idSep

separator when concatenating 2 or more document id columns (see docId).

idNull

string to replace NULL value in document id columns.

adjustDocumentCount

logical: if TRUE then the number of documents will be increased by 1.

where

specifies criteria that the table rows must satisfy before the computation is applied. The criteria are expressed as SQL predicates (as inside a WHERE clause).

stopwords

character vector with stop words. Removing stop words takes place in R after results are computed and returned from Aster.

test

logical: if TRUE, only show what would be done (similar to parameter test in RODBC functions sqlQuery and sqlSave).
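The way docId, idSep and idNull interact can be sketched in plain R. This is only an illustration of the described behavior, not the package's implementation (which builds the document id in SQL inside Aster): the data frame `ids`, its column names, and the helper `makeDocId` are hypothetical, and NA stands in for a database NULL.

```r
# Hypothetical document-id columns; NA plays the role of a database NULL.
ids <- data.frame(zip  = c("7520", NA, "7521"),
                  beat = c("A1", "B2", NA),
                  stringsAsFactors = FALSE)

# Illustrative helper (not part of toaster): replace NULLs with idNull,
# then concatenate the id columns row by row using idSep.
makeDocId <- function(df, idSep = "-", idNull = "(null)") {
  cols <- lapply(df, function(x) ifelse(is.na(x), idNull, as.character(x)))
  do.call(paste, c(unname(cols), sep = idSep))
}

docIds <- makeDocId(ids)
print(docIds)
```

With the default idSep and idNull, each row yields one document id, and a missing component appears as the literal string (null) rather than collapsing the id.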

Details

By default the function computes and returns all terms. When a large number of terms is expected, use parameter top to limit the number of terms returned by keeping only the top-ranked terms for each document. Thus, if top=1000 and there are 100 documents, about 100,000 terms (rows) will be returned, provided each document contains at least 1,000 terms. The result size can exceed this number when a rankFunction other than rownumber is used, because the other ranking functions assign the same rank to tied terms, so all terms tied at the cutoff are included.

The ordering of the rows is always by their tf-idf value within each document.
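The effect of the ranking function on result size can be sketched in base R. This is a rough emulation of the standard SQL window functions ROW_NUMBER(), RANK(), DENSE_RANK() and PERCENT_RANK() on one document, not code from the package; the tf-idf scores and term names are made up for illustration.

```r
# Hypothetical tf-idf scores for the terms of a single document.
tfidf <- c(alpha = 0.9, beta = 0.7, gamma = 0.7, delta = 0.7, omega = 0.2)

# Rank in descending tf-idf order, so rank 1 is the most important term.
# ROW_NUMBER(): unique ranks; SQL breaks ties arbitrarily, here by position.
rownumber <- rank(-tfidf, ties.method = "first")
# RANK(): tied terms share the smallest rank, leaving gaps afterwards.
rnk <- rank(-tfidf, ties.method = "min")
# DENSE_RANK(): tied terms share a rank, with no gaps.
denserank <- as.integer(factor(rnk))
# PERCENT_RANK(): (rank - 1) / (number of rows - 1).
percentrank <- (rnk - 1) / (length(tfidf) - 1)

# Keep terms whose rank does not exceed the threshold.
top <- 2
topRowNumber <- names(tfidf)[rownumber <= top]
topRank      <- names(tfidf)[rnk <= top]
print(topRowNumber)
print(topRank)
```

With these scores, the rownumber variant keeps exactly two terms, while the rank variant keeps all three terms tied at rank 2 as well, which is why the result can be larger than top times the number of documents.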

See Also

computeTf, nGram, token

Examples

if (interactive()) {
  library(RODBC)
  library(toaster)

  # initialize connection to the Dallas database in Aster
  # (replace the <...> placeholders with real connection details)
  conn = odbcDriverConnect(connection = paste0(
    "driver={Aster ODBC Driver};server=<dbhost>;port=2406;",
    "database=<dbname>;uid=<user>;pwd=<pw>"))

  # compute term-document-matrix of all 2-word ngrams of Dallas police crime reports
  # for each 4-digit zip
  tdm1 = computeTfIdf(channel=conn, tableName="public.dallaspoliceall",
                      docId="substr(offensezip, 1, 4)",
                      textColumns=c("offensedescription", "offensenarrative"),
                      parser=nGram(2, ignoreCase=TRUE,
                                   punctuation="[-.,?\\!:;~()]+"))

  # compute term-document-matrix of all 2-word combinations of Dallas police crime reports
  # for each type of offense status
  tdm2 = computeTfIdf(channel=conn, tableName="public.dallaspoliceall", docId="offensestatus",
                      textColumns=c("offensedescription", "offensenarrative", "offenseweather"),
                      parser=token(2),
                      where="offensestatus NOT IN ('System.Xml.XmlElement', 'C')")

  # include only the top 100 ranked 2-word ngrams for each 4-digit zip in the resulting
  # term-document-matrix, using the default rank function
  tdm3 = computeTfIdf(channel=conn, tableName="public.dallaspoliceall",
                      docId="substr(offensezip, 1, 4)",
                      textColumns=c("offensedescription", "offensenarrative"),
                      parser=nGram(2), top=100)

  # get the top 10% ranked 1-word terms for each 4-digit zip using the percent rank function
  tdm4 = computeTfIdf(channel=conn, tableName="public.dallaspoliceall",
                      docId="substr(offensezip, 1, 4)",
                      textColumns=c("offensedescription", "offensenarrative"),
                      parser=nGram(1), top=0.10, rankFunction="percentrank")

  # release the connection when done
  odbcClose(conn)
}

teradata-aster-field/toaster documentation built on May 31, 2019, 8:36 a.m.