term_frequencies: Compute the frequency of terms within a chosen time period


View source: R/term-analysis.R

Description

term_frequencies summarizes the counts and relative frequencies of terms per timebin, given a collection of terms with their dates of occurrence.

Usage

term_frequencies(termsByDate, timeBinUnit = "week",
  minTermTimeBins = 0.5, minTermOccurences = 10)

Arguments

termsByDate

a dataframe as returned by terms_by_date

timeBinUnit

a character string specifying the time period used as the bin unit when computing term frequencies. Valid values are "day", "week", "month", "quarter" and "year"; for the text sources processed in this package, "week" is recommended and is the default. NOTE: weeks are assigned with Monday as the first day of the week.

minTermTimeBins

a double in the range [0,1] specifying the minimum share of all unique timebins in which a term must occur. For example, a value of 0.5 (the default) requires a term to occur in at least 50% of all unique timebins covered by the dataset; terms that do not meet this threshold are not included in the returned results.

minTermOccurences

an integer specifying the minimum number of total occurrences a term must have to be included in the results; terms that do not meet this threshold are not included in the returned results.
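The binning and the two thresholds described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the toy data, the column names term and occur, and the helper floor_to_monday are hypothetical, and minTermOccurences is set to 2 instead of the default 10 so that the small example retains a term.

```r
# Toy input; the column names `term` and `occur` are assumptions for this sketch.
terms <- data.frame(
  term  = c("a", "a", "a", "b", "b", "c"),
  occur = as.Date(c("2021-12-06", "2021-12-15", "2021-12-22",
                    "2021-12-07", "2021-12-08", "2021-12-26"))
)

# Hypothetical helper: floor a date to the Monday of its week.
# "%u" is the ISO weekday number, where Monday = 1 and Sunday = 7.
floor_to_monday <- function(d) d - (as.integer(format(d, "%u")) - 1L)
terms$timebin <- floor_to_monday(terms$occur)

# Thresholds mirroring the two arguments (toy values, not the defaults).
minTermTimeBins   <- 0.5
minTermOccurences <- 2L

# Per term: share of all unique timebins covered, and total occurrences.
n_bins_total <- length(unique(terms$timebin))
by_term      <- split(terms, terms$term)
bin_share    <- vapply(by_term,
                       function(x) length(unique(x$timebin)) / n_bins_total,
                       numeric(1))
n_total      <- vapply(by_term, nrow, integer(1))

# A term is kept only if it meets both thresholds.
keep <- names(by_term)[bin_share >= minTermTimeBins &
                         n_total >= minTermOccurences]
```

In this toy data only term "a" is kept: it occurs in all three weekly timebins, whereas "b" and "c" each occur in one of three.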

Details

Timebins between the first and the last occurrence of a term in which that term does not occur are added with an explicit value of zero. Empty timebins before the first occurrence of a term or after its last occurrence are not added.
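The zero-filling described above can be illustrated for a single term with weekly bins. This is a sketch in base R under assumed toy data, not the package's own code; only the column names timebin and n_term_per_timebin are taken from the Value section.

```r
# Weekly counts for one term; the week of 2021-12-13 has no recorded occurrence.
counts <- data.frame(
  timebin            = as.Date(c("2021-12-06", "2021-12-20", "2021-12-27")),
  n_term_per_timebin = c(2L, 1L, 4L)
)

# All weekly bins from the term's first to its last observed timebin,
# so nothing is added before the first or after the last occurrence.
all_bins <- data.frame(
  timebin = seq(min(counts$timebin), max(counts$timebin), by = "week")
)

# Left-join the observed counts and set the gaps to an explicit zero.
filled <- merge(all_bins, counts, by = "timebin", all.x = TRUE)
filled$n_term_per_timebin[is.na(filled$n_term_per_timebin)] <- 0L
```

The result has four rows, with a count of zero for the week starting 2021-12-13.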

Value

a dataframe with term frequencies by chosen timebin, where:

term

a term as provided as an input in termsByDate

timebin

the first day of a timebin; if timeBinUnit was set to "week", this date will always be a Monday

n_term_per_timebin

the number of occurrences of term in timebin

first_occur

the exact date of the first occurrence of term across the whole time range covered by timebins

latest_occur

the exact date of the latest occurrence of term across the whole time range covered by timebins; note that this date can be later than the maximum timebin, as timebin specifies the floor date of a time unit

term_share_per_timebin

the share of term in a given timebin with respect to all other term occurrences in a timebin; NOTE that this is computed in consideration of all terms, including those that may be filtered out of the results

n_term_total

the total number of occurrences of term in the dataset, i.e. across all timebins

n_term_timebins

the number of unique timebins in which an occurrence of term was recorded
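To make the returned columns concrete, the following base R sketch derives several of them from toy data. It is an illustration under assumptions, not the package's implementation: the input column occur is hypothetical, and the share is interpreted here as a term's count divided by all term occurrences in the same timebin, computed before any filtering.

```r
# Toy input with weekly timebins already assigned (each timebin is a Monday).
terms <- data.frame(
  term    = c("a", "a", "b", "a", "b", "b"),
  occur   = as.Date(c("2021-12-06", "2021-12-08", "2021-12-09",
                      "2021-12-14", "2021-12-15", "2021-12-19")),
  timebin = as.Date(c("2021-12-06", "2021-12-06", "2021-12-06",
                      "2021-12-13", "2021-12-13", "2021-12-13"))
)

# n_term_per_timebin: occurrences of each term in each timebin.
freq <- aggregate(list(n_term_per_timebin = terms$occur),
                  by  = list(term = terms$term, timebin = terms$timebin),
                  FUN = length)

# term_share_per_timebin: count divided by all occurrences in that timebin.
bin_total <- ave(freq$n_term_per_timebin, as.character(freq$timebin), FUN = sum)
freq$term_share_per_timebin <- freq$n_term_per_timebin / bin_total

# Per-term summaries: exact first and latest dates and total occurrences.
dates_by_term <- split(terms$occur, terms$term)
per_term <- data.frame(
  term         = names(dates_by_term),
  first_occur  = as.Date(vapply(dates_by_term,
                                function(d) as.numeric(min(d)), numeric(1)),
                         origin = "1970-01-01"),
  latest_occur = as.Date(vapply(dates_by_term,
                                function(d) as.numeric(max(d)), numeric(1)),
                         origin = "1970-01-01"),
  n_term_total = lengths(dates_by_term)
)

# Attach the per-term summaries to every term/timebin row.
freq <- merge(freq, per_term, by = "term")
```

For term "b" the latest_occur is 2021-12-19, a Sunday, which is later than its maximum timebin of 2021-12-13 because timebin holds the floor date of the week.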


sdaume/topicsplorrr documentation built on Dec. 22, 2021, 11:11 p.m.