Compute term frequencies on a corpus.
channel |
connection object as returned by |
tableName |
Aster table name |
docId |
vector with one or more column names comprising unique document id.
Values are concatenated with |
textColumns |
one or more names of columns with text. Multiple columns are concatenated into a single text field first. |
parser |
type of parser to use on text. For example, |
weighting |
term frequency formula used to compute the tf value. One of the following:
|
top |
specifies threshold to cut off terms ranked below |
rankFunction |
one of |
where |
specifies criteria that table rows must satisfy before the
computation is applied. The criteria are expressed in the form of SQL predicates (inside
|
idSep |
separator when concatenating 2 or more document id columns (see |
idNull |
string to replace NULL value in document id columns. |
stopwords |
character vector with stop words. Removing stop words takes place in R after results are computed and returned from Aster. |
test |
logical: if TRUE show what would be done, only (similar to parameter |
By default the function computes and returns all terms. When a large number of terms is
expected, use the parameter top to limit the number of terms returned by keeping only the
top-ranked terms for each document. For example, with top=1000 and 100 documents, up to
100,000 terms (rows) are returned when rankFunction is rownumber. The result can exceed
this number when a rankFunction other than rownumber is used, because tied terms share a rank:
rownumber applies a sequential row number, starting at 1, to each term in a document.
Tie-breaking: terms that compare as equal in the sort order are sorted arbitrarily within
the scope of the tie, and every term still receives a unique row number.
rank assigns the current row count as the term's rank, provided the term does not sort as
equal (tie) with another term. Tie-breaking: terms that compare as equal in the sort order
are sorted arbitrarily within the scope of the tie and receive the same rank number, which
leaves a gap in the rank sequence after the tie.
denserank behaves like rank, except that it never places gaps in the rank sequence.
Tie-breaking is the same as for rank: terms that sort as equal receive the same rank. With
denserank, however, the next term after a set of equally ranked terms gets a rank 1 higher
than the preceding tied terms.
percentrank assigns a relative rank to each term using the formula
(rank - 1) / (total rows - 1). Tie-breaking: terms that compare as equal are sorted
arbitrarily within the scope of the tie and receive the same percent rank number.
The ordering of the rows is always by their tf value within each document.
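The effect of the four rank functions on the size of the result can be sketched in base R. This is only an illustration, not the SQL that the function runs in Aster: the tf values are made up, descending tf order is assumed, and the variable names are hypothetical.

```r
# Hypothetical tf values for the terms of a single document (not real output).
tf <- c(a = 5, b = 3, c = 3, d = 3, e = 2, f = 1)

# Emulate the window functions, ordering by descending tf (an assumption).
ord <- order(-tf)                        # ties keep an arbitrary (here: input) order
rownumber <- integer(length(tf))
rownumber[ord] <- seq_along(tf)          # unique sequential numbers
rnk <- rank(-tf, ties.method = "min")    # ties share a rank, gaps follow a tie
denserank <- match(tf, sort(unique(tf), decreasing = TRUE))  # ties share a rank, no gaps
percentrank <- (rnk - 1) / (length(tf) - 1)

ranks <- data.frame(term = names(tf), tf = tf, rownumber,
                    rank = rnk, denserank, percentrank)
print(ranks)

# Number of terms that survive a cutoff of top = 3 under each function.
top <- 3
kept <- c(rownumber = sum(rownumber <= top),
          rank      = sum(rnk <= top),
          denserank = sum(denserank <= top))
print(kept)
```

With these values the cutoff keeps 3 terms under rownumber, 4 under rank, and 5 under denserank, which is why only rownumber bounds the result at top rows per document.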
if(interactive()){
# initialize connection to Dallas database in Aster
conn = odbcDriverConnect(connection="driver={Aster ODBC Driver};
server=<dbhost>;port=2406;database=<dbname>;uid=<user>;pwd=<pw>")
# compute term-document-matrix of all 2-word Ngrams of Dallas police open crime reports
tdm1 = computeTf(channel=conn, tableName="public.dallaspoliceall", docId="offensestatus",
textColumns=c("offensedescription", "offensenarrative"),
parser=nGram(2),
where="offensestatus NOT IN ('System.Xml.XmlElement', 'C')")
# compute term-document-matrix of all 2-word combinations of Dallas police crime reports
# by time of day (4 documents corresponding to 4 parts of day)
tdm2 = computeTf(channel=conn, tableName="public.dallaspoliceall",
docId="(extract('hour' from offensestarttime)/6)::int%4",
textColumns=c("offensedescription", "offensenarrative"),
parser=token(2, punctuation="[-.,?\\!:;~()]+", stopWords=TRUE),
where="offensenarrative IS NOT NULL")
# include only top 100 ranked 2-word ngrams for each offense status
# into resulting term-document-matrix using dense rank function
tdm3 = computeTf(channel=conn, tableName="public.dallaspoliceall", docId="offensestatus",
textColumns=c("offensedescription", "offensenarrative"),
parser=nGram(2), top=100, rankFunction="denserank",
where="offensestatus NOT IN ('System.Xml.XmlElement', 'C')")
}