get_collocations: get_collocations function

View source: R/get_collocations.R

Description

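Counts collocations (n-grams) in a corpus of texts stored in dated subfolders, and returns both the full counts and a subset filtered by tf-idf.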

Usage

get_collocations(corpus_dates, path_name, ntrms, ngrams_number, min_freq,
  language)

Arguments

corpus_dates

a character vector naming the subfolders where the texts are located.

path_name

the path to the folder that contains the dated subfolders.

ntrms

maximum number of collocations kept by the tf-idf filter. Collocations are ranked by tf-idf in decreasing order, and the ntrms collocations with the highest tf-idf are selected.

ngrams_number

integer indicating the size of the collocations. Defaults to 2, which computes bigrams. If set to 3, finds collocations of both bigrams and trigrams.

min_freq

integer indicating the minimum number of times a collocation must occur in the data in order to be returned.

language

the language of the texts. Default is english.
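
To make the roles of ngrams_number and min_freq concrete, here is a minimal base-R sketch. It is not the package's implementation: count_ngrams() is a hypothetical helper, and the whitespace-joined n-gram labels are an assumption made for illustration. It counts every n-gram of size 2 up to ngrams_number and keeps those that occur at least min_freq times.

```r
# Illustrative sketch only; count_ngrams() is a hypothetical helper,
# not a function of TextForecast.
count_ngrams <- function(tokens, ngrams_number = 2, min_freq = 1) {
  stopifnot(ngrams_number >= 2)
  counts <- integer(0)
  for (n in 2:ngrams_number) {
    if (length(tokens) < n) next
    # Slide a window of width n over the token sequence.
    starts <- seq_len(length(tokens) - n + 1)
    grams <- vapply(starts,
                    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                    character(1))
    tab <- table(grams)
    counts <- c(counts, setNames(as.integer(tab), names(tab)))
  }
  # Keep only n-grams that occur at least min_freq times.
  counts[counts >= min_freq]
}

tokens <- c("interest", "rate", "hike", "interest", "rate", "cut")
coll <- count_ngrams(tokens, ngrams_number = 3, min_freq = 2)
print(coll)
```

With these toy tokens, only the bigram "interest rate" appears twice, so it is the only entry that survives min_freq = 2; every trigram occurs once and is dropped.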

Value

a list containing a sparse matrix with the counts of all collocations and another with the counts of the collocations retained by the tf-idf filter, according to ntrms.
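
The tf-idf filter can be pictured with a small base-R sketch. Everything here is an assumption for illustration: filter_by_tfidf() is a hypothetical helper, the weighting (within-document share times log of inverse document frequency, summed over documents) may differ from what the package uses, and a dense matrix stands in for the sparse one that is returned.

```r
# Toy count matrix: rows are dated subfolders, columns are collocations.
counts <- matrix(
  c(5, 0, 1,
    4, 3, 0,
    6, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("2017m1", "2017m2", "2017m3"),
                  c("interest rate", "trade war", "oil price"))
)

# Hypothetical helper: rank columns by a tf-idf score, keep the top ntrms.
filter_by_tfidf <- function(counts, ntrms) {
  tf <- counts / pmax(rowSums(counts), 1)                  # share within each document
  idf <- log(nrow(counts) / pmax(colSums(counts > 0), 1))  # rarity across documents
  score <- colSums(sweep(tf, 2, idf, `*`))
  keep <- order(score, decreasing = TRUE)[seq_len(min(ntrms, ncol(counts)))]
  counts[, keep, drop = FALSE]
}

filtered <- filter_by_tfidf(counts, ntrms = 2)
print(colnames(filtered))
```

Under this weighting, "interest rate" appears in every document, so its idf is zero and it is ranked last even though it has the largest raw counts.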

Examples

library(TextForecast)

# Monthly example: subfolders are named "<year>m<month>", e.g. "2017m1".
st_year <- 2017
end_year <- 2018
path_name <- system.file("news", package = "TextForecast")
qt <- paste0(sort(rep(seq(from = st_year, to = end_year, by = 1), 12)),
             paste0("m", 1:12))
# Use the first 23 months as the corpus.
z_coll <- get_collocations(corpus_dates = qt[1:23], path_name = path_name,
                           ntrms = 500, ngrams_number = 3, min_freq = 10)

# Daily example: a single dated subfolder.
path_name <- system.file("news", package = "TextForecast")
days <- c("2019-30-01", "2019-31-01")
z_coll <- get_collocations(corpus_dates = days[1], path_name = path_name,
                           ntrms = 500, ngrams_number = 3, min_freq = 1)
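
Because corpus_dates must name subfolders that exist under path_name, it can help to check the layout before calling the function. The following is a sketch using base R only; missing_subfolders() is a hypothetical helper and the temporary directory stands in for a real corpus folder.

```r
# Hypothetical helper: return the entries of corpus_dates that have no
# matching subfolder under path_name.
missing_subfolders <- function(corpus_dates, path_name) {
  corpus_dates[!dir.exists(file.path(path_name, corpus_dates))]
}

# Demo layout in a temporary directory with a single dated subfolder.
root <- file.path(tempdir(), "news_demo")
dir.create(file.path(root, "2017m1"), recursive = TRUE, showWarnings = FALSE)

absent <- missing_subfolders(c("2017m1", "2017m2"), root)
print(absent)
```

An empty result means every requested date has a subfolder to read from.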

lucasgodeiro/TextForecast documentation built on Sept. 19, 2019, 3:41 a.m.