get_aggregates: Getting already aggregated time series produced by...
In epitweetr: Early Detection of Public Health Threats from 'Twitter' Data

get_aggregates

R Documentation

Getting already aggregated time series produced by `detect_loop`

Description

Reads and returns the requested aggregated dataset for the period and topics selected by the filter.

Usage

get_aggregates(
  dataset = "country_counts",
  cache = TRUE,
  filter = list(),
  top_field = NULL,
  top_freq = NULL
)

Arguments

`dataset`	A character(1) vector with the name of the series to request. It must be one of 'country_counts', 'geolocated', 'topwords', 'hashtags', 'entities', 'urls', 'contexts', default: 'country_counts'
`cache`	Whether to use the cache for lookup and storing the returned dataframe, default: TRUE
`filter`	A named list defining the filter to apply to the requested series, e.g. list(tweet_geo_country_code=list('FR', 'DE')), default: list()
`top_field`	character, name of the top field used with top_freq to enable an optimisation that returns only the most frequent elements. It will only keep the top 500 items after the first 50k rows in reverse index order
`top_freq`	character, name of the frequency field used with top_field to enable an optimisation that returns only the most frequent elements. It will only keep the top 500 items after the first 50k rows in reverse index order
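The "keep only the most frequent elements" step described for `top_field` and `top_freq` can be pictured with a small base R sketch. This is an illustration under assumptions, not the package's implementation: `keep_top` is a hypothetical helper, the column names `hashtag` and `frequency` are invented for the toy data, and the 50k-row threshold and index ordering are not modelled.

```r
# Toy data; the column names 'hashtag' and 'frequency' are assumptions.
toy <- data.frame(
  hashtag = c("a", "b", "a", "c", "b", "a"),
  frequency = c(5, 1, 4, 2, 1, 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sums 'top_freq' per distinct 'top_field' value and
# keeps only the rows belonging to the 'keep' most frequent values.
keep_top <- function(df, top_field, top_freq, keep = 500) {
  totals <- tapply(df[[top_freq]], df[[top_field]], sum)
  ranked <- names(totals)[order(totals, decreasing = TRUE)]
  df[df[[top_field]] %in% head(ranked, keep), , drop = FALSE]
}

# Keep only the single most frequent hashtag of the toy data.
top <- keep_top(toy, top_field = "hashtag", top_freq = "frequency", keep = 1)
```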

Details

This function returns data aggregated by epitweetr. The data is found in the 'series' folder, which contains Rds files per weekday and type of series. Starting with v 1.0.x, it also looks in the Lucene indexes located in the 'fs' folder. Names of files and folders are parsed to limit the files to be read. When using Lucene indexes, filters are applied directly on read. This is an improvement over the 'series' folder, where filters are applied after reading. All returned rows are joined in a single data frame.

If no filter is provided, the whole data series is returned, which can amount to millions of rows depending on the time series. To limit by period, the filter list must have an element 'period' containing a date vector or list with two dates representing the start and end of the request.

To limit by topic, the filter list must have an element 'topic' containing a non-empty character vector or list with the names of the topics to return.
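As a rough illustration of how the 'period' and 'topic' elements narrow a series, the sketch below applies such a filter list to a toy data frame in base R. It is not the package's code: `apply_filter` is a hypothetical helper, the column names `topic`, `created_date` and `tweets` are assumptions, and treating both period bounds as inclusive is also an assumption.

```r
# Toy stand-in for an aggregated series; the column names are assumptions.
toy <- data.frame(
  topic = c("dengue", "dengue", "ebola", "dengue"),
  created_date = as.Date(c("2020-01-05", "2020-01-12", "2020-01-20", "2020-02-02")),
  tweets = c(3L, 5L, 2L, 7L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keeps the rows matching the 'topic' and 'period'
# elements of a filter list. Elements that are absent impose no restriction.
apply_filter <- function(df, filter = list()) {
  keep <- rep(TRUE, nrow(df))
  topic <- filter[["topic"]]
  if (!is.null(topic)) {
    # Accepts a character vector or a list of topic names.
    keep <- keep & df$topic %in% unlist(topic)
  }
  period <- filter[["period"]]
  if (!is.null(period)) {
    # Accepts a vector or list of two dates (Date or "YYYY-MM-DD" strings).
    bounds <- as.Date(vapply(period, as.character, character(1)))
    # Both bounds are treated as inclusive (an assumption).
    keep <- keep & df$created_date >= bounds[1] & df$created_date <= bounds[2]
  }
  df[keep, , drop = FALSE]
}

# Rows for the topic dengue between 2020-01-10 and 2020-01-31.
res <- apply_filter(
  toy,
  list(topic = "dengue", period = c("2020-01-10", "2020-01-31"))
)
```

Calling `apply_filter(toy)` with the default empty list returns every row, which mirrors the warning above about unfiltered requests returning the whole series.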

The available time series are:

"country_counts" counting tweets and retweets by posted date, hour and country
"geolocated" counting tweets and retweets by posted date and the smallest possible geolocated unit (city, administrative level or country)
"topwords" counting tweets and retweets by posted date, country and the most popular words, (this excludes words used in the topic search)

The returned dataset can be cached for further calls if requested. Only one dataset per series is cached.
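One way to picture "one dataset per series" is a cache keyed by series name, where a new request for the same series replaces the previous entry. The sketch below is a minimal illustration under assumptions: `series_cache`, `cached_series` and the `loader` argument are hypothetical names, and reusing the entry only when the filter is identical is an assumption rather than documented behaviour.

```r
# Cache keyed by series name, so each series holds at most one data frame.
series_cache <- new.env()

# Hypothetical wrapper: 'loader' stands in for the expensive read.
cached_series <- function(dataset, filter, loader) {
  entry <- series_cache[[dataset]]
  # Reuse the stored data frame only when the same filter was used before.
  if (!is.null(entry) && identical(entry$filter, filter)) {
    return(entry$data)
  }
  data <- loader(dataset, filter)
  # Overwrites any earlier entry stored for this series.
  series_cache[[dataset]] <- list(filter = filter, data = data)
  data
}

# Toy loader that counts how many times it is actually called.
counter <- new.env()
counter$n <- 0
toy_loader <- function(dataset, filter) {
  counter$n <- counter$n + 1
  data.frame(dataset = dataset, load = counter$n, stringsAsFactors = FALSE)
}

first <- cached_series("country_counts", list(topic = "dengue"), toy_loader)
second <- cached_series("country_counts", list(topic = "dengue"), toy_loader)
```

In this sketch the second call returns the stored data frame without invoking the loader again, while a call with a different filter would reload and overwrite the entry.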

Value

A data frame containing the requested series for the requested period

Examples

if(FALSE){
  message('Please choose the epitweetr data directory')
  setup_config(file.choose())

  # Getting all country tweets between 2020-jan-10 and 2020-jan-31 for all topics
  df <- get_aggregates(
    dataset = "country_counts",
    filter = list(period = c("2020-01-10", "2020-01-31"))
  )

  # Getting all country tweets for the topic dengue
  df <- get_aggregates(dataset = "country_counts", filter = list(topic = "dengue"))

  # Getting all country tweets between 2020-jan-10 and 2020-jan-31 for the topic dengue
  df <- get_aggregates(
    dataset = "country_counts",
    filter = list(topic = "dengue", period = c("2020-01-10", "2020-01-31"))
  )
}

epitweetr documentation built on Nov. 16, 2023, 5:07 p.m.