topic_frequencies: Compute topic shares for a given time bin

View source: R/topic-analysis.R

Description

topic_frequencies summarizes the shares of topics per time bin, based on the provided topic shares by document and date.

Usage

topic_frequencies(topicsByDocDate, timeBinUnit = "week",
  minGamma = 0.01, minTopicTimeBins = 0.5)

Arguments

topicsByDocDate

a dataframe as returned by topics_by_doc_date

timeBinUnit

a character string specifying the time period used as the bin unit when computing topic share frequencies. Valid values are "day", "week", "month", "quarter" and "year"; the default is "week". NOTE: when assigning weeks, Monday is considered the first day of the week.
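
The Monday-based week assignment can be illustrated in base R. This is only a sketch of the convention described above, not the package's own code; the variable names are made up for the example.

```r
# Example dates: 2021-12-22 is a Wednesday, 2021-12-26 a Sunday,
# 2021-12-27 a Monday.
dates <- as.Date(c("2021-12-22", "2021-12-26", "2021-12-27"))

# "%u" gives the ISO weekday number (1 = Monday, ..., 7 = Sunday).
weekday <- as.integer(format(dates, "%u"))

# Subtracting the days elapsed since Monday yields the floor date of the week.
week_bins <- dates - (weekday - 1)

print(week_bins)
```

With this convention a Sunday belongs to the week that started on the preceding Monday, so the first two dates fall into the same timebin.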

minGamma

the minimum share of a topic per document to be considered when summarizing topic frequencies; topics with smaller shares in an individual document are ignored when computing topic frequencies. (In an stm topic model the likelihood that a document is generated from a topic is expressed by the value gamma.) The default is 0.01, but it should be adjusted in view of the number of topics and the average length of a document.

minTopicTimeBins

a double in the range [0,1] specifying the minimum share of all unique timebins in which a topic must occur with a share of at least minGamma. For example, a value of 0.5 (the default) requires that a topic occurs in at least 50% of all unique timebins covered by the dataset. Topics that do not meet this threshold are not included in the returned results.
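
How the two thresholds might interact can be sketched in base R. The data frame, its column names (topic_id, timebin, gamma) and the variable names are hypothetical, and the sketch assumes that the total number of timebins is counted before the minGamma filter is applied; the package's actual implementation may differ.

```r
# Hypothetical per-document topic shares, already assigned to weekly timebins.
doc_topics <- data.frame(
  topic_id = c(1, 1, 1, 2, 2, 3),
  timebin  = as.Date(c("2021-12-06", "2021-12-13", "2021-12-20",
                       "2021-12-06", "2021-12-13", "2021-12-20")),
  gamma    = c(0.40, 0.30, 0.20, 0.50, 0.005, 0.60)
)
min_gamma <- 0.01           # plays the role of minGamma
min_topic_time_bins <- 0.5  # plays the role of minTopicTimeBins

# Number of unique timebins covered by the whole dataset (assumption:
# counted before any filtering).
n_bins <- length(unique(doc_topics$timebin))

# Step 1: ignore topic shares below the gamma threshold.
kept <- doc_topics[doc_topics$gamma >= min_gamma, ]

# Step 2: per topic, count the unique timebins with a remaining record.
bins_per_topic <- tapply(kept$timebin, kept$topic_id,
                         function(x) length(unique(x)))

# Step 3: keep topics present in at least the required share of timebins.
topics_kept <- names(bins_per_topic)[bins_per_topic / n_bins >= min_topic_time_bins]

print(topics_kept)
```

In this toy data only topic 1 occurs in at least half of the three timebins: topic 2 loses one record to the gamma threshold, and topic 3 appears in a single timebin.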

Details

An stm topic model provides for each document the likelihood (gamma) that it is generated from a specific topic; here we interpret these likelihoods as the share of a document attributed to this topic and then summarize these shares per timebin to obtain the share of a topic across all documents over time.

The topic share or likelihood per document has to be above a threshold specified by minGamma. A suitable threshold might consider the number of topics and the average document size. An additional filtering option is provided with minTopicTimeBins.

Timebins in which a given topic does not occur are added with an explicit value of zero, except for empty timebins before the first or after the last occurrence of that topic.
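
The zero-filling rule can be sketched for a single topic in base R. The data and variable names are hypothetical, and this is an illustration of the rule as described, not the package's implementation.

```r
# All weekly timebins covered by a hypothetical dataset.
all_bins <- seq(as.Date("2021-11-29"), as.Date("2021-12-27"), by = "week")

# Recorded shares of one topic; the week of 2021-12-13 is missing.
topic <- data.frame(
  timebin    = as.Date(c("2021-12-06", "2021-12-20")),
  topicshare = c(0.25, 0.10)
)

# Only timebins between the first and last occurrence are candidates for
# filling; earlier and later empty timebins are left out.
in_range <- all_bins[all_bins >= min(topic$timebin) &
                       all_bins <= max(topic$timebin)]

# Left join against the candidate timebins, then replace the gaps with zero.
filled <- merge(data.frame(timebin = in_range), topic,
                by = "timebin", all.x = TRUE)
filled$topicshare[is.na(filled$topicshare)] <- 0

print(filled)
```

The result has three rows: the two recorded weeks plus an explicit zero for 2021-12-13, while the empty weeks of 2021-11-29 and 2021-12-27 are not added.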

Value

a dataframe with topic frequencies by chosen timebin, where:

topic_id

a topic ID as provided as an input in topicsByDocDate

timebin

the floor date of a timebin; if timeBinUnit was set to "week", this date will always be a Monday

median_gamma

the median of likelihoods of the topic with topic_id in timebin

mean_gamma

the mean of likelihoods of the topic with topic_id in timebin

topicshare

the share of topic with topic_id relative to all topic shares recorded and included in a given timebin. NOTE: strictly speaking these are the likelihoods that a document is generated from a topic, which we here interpret as the share of a document attributed to a topic.

n_docs_topic

the total number of documents in a dataset in which a topic with topic_id occurs with a likelihood of at least minGamma

first_occur

the exact date of the first occurrence of a topic with topic_id across the whole time range covered by timebins

latest_occur

the exact date of the latest occurrence of a topic with topic_id across the whole time range covered by timebins; note that this date can be later than the maximum timebin, as timebin specifies the floor date of a time unit

n_topic_timebins

the number of unique timebins in which a topic with topic_id occurs with a likelihood of at least minGamma
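
As a rough illustration of the topicshare column, the following base R sketch sums gamma per topic and timebin and divides by the total of all included gamma values in that timebin. The input data and intermediate names are hypothetical, and the exact normalization used by the package is an assumption here.

```r
# Hypothetical per-document topic shares that already passed the minGamma
# filter, assigned to weekly timebins.
doc_topics <- data.frame(
  topic_id = c(1, 1, 2, 2, 1),
  timebin  = as.Date(c("2021-12-06", "2021-12-06", "2021-12-06",
                       "2021-12-06", "2021-12-13")),
  gamma    = c(0.2, 0.4, 0.1, 0.3, 0.5)
)

# Sum of gamma per topic and timebin.
sums <- aggregate(gamma ~ topic_id + timebin, data = doc_topics, FUN = sum)

# Total of all included gamma values per timebin, repeated for each row.
bin_totals <- ave(sums$gamma, as.character(sums$timebin), FUN = sum)

# Share of each topic relative to all topic shares in the same timebin.
sums$topicshare <- sums$gamma / bin_totals

print(sums)
```

By construction the topicshare values within a timebin add up to 1; in the week of 2021-12-13 topic 1 is the only recorded topic and therefore gets a share of 1.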


sdaume/topicsplorrr documentation built on Dec. 22, 2021, 11:11 p.m.