readme.md

gdeltr2

gdeltr2R’s modern GDELT Project interface

What is the GDELT Project?

The Global Database of Events, Language, and Tone [GDELT] is a non profit whose initiative is to:

construct a catalog of human societal-scale behavior and beliefs across all countries of the world, connecting every person, organization, location, count, theme, news source, and event across the planet into a single massive network that captures what’s happening around the world, what its context is and who’s involved, and how the world is feeling about it, every single day.

GDELT was founded in 1994 and it’s data commences in 1979. Over the last two years the GDELT’s functionality and abilities have grown exponentially, for example in May 2014 GDELT processed 3,928,926 where as in May 2016 it processed 6,198,461. GDELT continues to evolve and integrate advanced machine learning tools including Google Cloud Vision, a data store that became available in February 2016.

This package wraps GDELT’s four primary data stores

Why gdeltr2?

My main motivation for this building package is simple, GDELT IS INCREDIBLE!!

Accessing GDELT’s data gold is doable but either difficult or costly.

Currently, anyone proficient in SQL can access the data via Google Big Query. The problem is that even if you want to use SQL, users have to pay above a certain API call threshold and then you still need another layer of connectivity to explore the data in R.

Although R has two existing packages that allow users to interact with portions of GDELT’s data outside of Big Query:

These packages are old, incomplete and difficult to use. It is my hope that gdelt2r allows the R user easy access to GDELT’s data allowing for faster, more exhilarating data visualizations and analysis!

PRIOR TO INSTALL

This package may require the development versions of devtools and dplyr so, to be safe, before installation run the following code:

devtools::install_github("hadley/devtools")
devtools::install_github("hadley/dplyr")
devtools::install_github("hafen/trelliscopejs")

Installation

devtools::install_github("abresler/gdeltr2")

Function Ontology

The package currently consists of two function families, data acquisition and data tidying.

The package’s data acquisition functions begin with get_urls_ for acquiring data store log information, get_codes_ for acquiring code books and get_data_ for downloading and reading data.

The data tidying functions begin with parse_ and they apply to a number of the features in the gkg and vgkg data stores that will get described in further detail farther below.

CAUTION

Primary Functions

Tidying Functions

Many of the columns in the GKG output are concatenated and require further parsing for proper analysis. These function tidy those concatenated columns, note given file sizes the functions may be time consuming.

V2 Full Text API

You can refer to this blog post that discusses how to use this functionality.

Global Knowledge Graph

  • parse_gkg_mentioned_names() - parses mentioned names
  • parse_gkg_mentioned_people() - parses mentioned people
  • parse_gkg_mentioned_organizations() - parses mentioned organizations
  • parse_gkg_mentioned_numerics() - parses mentioned numeric figures
  • parse_gkg_mentioned_themes() - parses mentioned themes, ties to CAMEO Theme Codes
  • parse_gkg_mentioned_gcams() - parses resolved GCAMs ties GCAM code book.
  • parse_gkg_mentioned_dates() - parses mentioned dates according to the GKG scheme
  • parse_xml_extras() - parses XML metadata from GKG table
Visual Global Knowledge Graph
  • parse_vgkg_labels() - parses and labels learned items
  • parse_vgkg_landmarks() - parses and geocodes learned landmarks
  • parse_vgkg_logos() - parses learned logos
  • parse_vgkg_safe_search() - parses safe search likelihoods
  • parse_vgkg_faces() - parses learned faces
  • parse_vgkg_ocr() - parses OCR’d items
  • parse_vgkg_languages() - parses languages

Code Books

All these the GDELT and GKG datasets contain a whole host of codes that need resolution to be human readable. The package contains easy access to these code books to allow for that resolution. These functions provide access to the code books:

  • get_codes_gcam() - retrieves Global Content Analysis Measurement [GCAM] codes
  • get_codes_cameo_country() - retrieves Conflict and Mediation Event Observations [CAMEO] country codes
  • get_codes_cameo_ethnic() - retrieves cameo ethnic codes
  • get_codes_cameo_events() - retrieves cameo event codes
  • get_codes_gkg_themes() - retrieves gkg theme codes
  • get_codes_cameo_type() - retrieves cameo type codes
  • get_codes_cameo_religion() - retrieves cameo religion codes
  • get_codes_cameo_known_groups() - retrieves cameo known group codes

Coming Soon

  • Vignettes
  • Generic data visualization functions
  • Generic machine learning and data analysis functions
  • bigrquery integration
  • Third party database mirror

EXAMPLES

library(gdeltr2)
load_needed_packages(c('dplyr', 'magrittr'))

GDELT Event Data

events_1989 <-
  get_data_gdelt_periods_event(
    periods = 1989,
    return_message = T
  )

GKG Data

gkg_summary_count_may_15_16_2014 <-
  get_data_gkg_days_summary(
    dates = c('2014-05-15', '2014-05-16'),
    is_count_file = T,
    return_message = T
  )

gkg_full_june_2_2016 <-
  get_data_gkg_days_detailed(
    dates = c("2016-06-02"),
    table_name = 'gkg',
    return_message = T
  )

gkg_mentions_may_12_2016 <-
  get_data_gkg_days_detailed(
    dates = c("2016-05-12"),
    table_name = 'mentions',
    return_message = T
  )

GKG Television Data

gkg_tv_test <- 
  get_data_gkg_tv_days(dates = c("2016-06-17", "2016-06-16"))

GKG Tidying

load_needed_packages(c('magrittr'))

gkg_test <- 
  get_data_gkg_days_detailed(only_most_recent = T, table_name = 'gkg')

gkg_sample_df <- 
  gkg_test %>% 
  sample_n(1000)

xml_extra_df <- 
  gkg_sample_df %>% 
  parse_gkg_xml_extras(filter_na = T, return_wide = F)

article_tone <- 
  gkg_sample_df %>% 
  parse_gkg_mentioned_article_tone(filter_na = T, return_wide = T)

gkg_dates <- 
  gkg_sample_df %>% 
  parse_gkg_mentioned_dates(filter_na = T, return_wide = T)

gkg_gcams <- 
  gkg_sample_df %>% 
  parse_gkg_mentioned_gcams(filter_na = T, return_wide = T)

gkg_event_counts <- 
  gkg_sample_df %>% 
  parse_gkg_mentioned_event_counts(filter_na = T, return_wide = T)

gkg_locations <- 
  gkg_sample_df %>% 
  parse_gkg_mentioned_locations(filter_na = T, return_wide = T)

gkg_names <- 
  gkg_sample_df %>% 
  parse_gkg_mentioned_names(filter_na = T, return_wide = T)

gkg_themes <- 
  gkg_sample_df %>% 
  parse_gkg_mentioned_themes(theme_column = 'charLoc',
                                      filter_na = T, return_wide = T)

gkg_numerics <- 
  gkg_sample_df %>% 
  parse_gkg_mentioned_numerics(filter_na = T, return_wide = T)

gkg_orgs <-
  gkg_sample_df %>% 
  parse_gkg_mentioned_organizations(organization_column = 'charLoc', 
                                             filter_na = T, return_wide = T)

gkg_quotes <-
  gkg_sample_df %>% 
  parse_gkg_mentioned_quotes(filter_na = T, return_wide = T)

gkg_people <- 
  gkg_sample_df %>% 
  parse_gkg_mentioned_people(people_column = 'charLoc', filter_na = T, return_wide = T)

VGKG Tidying

vgkg_test <- 
  get_data_vgkg_dates(only_most_recent = T)

vgkg_sample <- 
  vgkg_test %>% 
  sample_n(1000)

vgkg_labels <- 
  vgkg_sample %>% 
  parse_vgkg_labels(return_wide = T)

faces_test <- 
  vgkg_sample %>% 
  parse_vgkg_faces(return_wide = T)

landmarks_test <- 
  vgkg_sample %>% 
  parse_vgkg_landmarks(return_wide = F)

logos_test <- 
  vgkg_sample %>% 
  parse_vgkg_logos(return_wide = T)

ocr_test <- 
  vgkg_sample %>% 
  parse_vgkg_ocr(return_wide = F)

search_test <- 
  vgkg_sample %>% 
  parse_vgkg_safe_search(return_wide = F)

Sentiment API

location_codes <-
  dictionary_stability_locations()
location_test <-
  instability_api_locations(
    location_ids = c("US", "IS", "CA", "TU", "CH", "UK", "IR"),
    use_multi_locations = c(T, F),
    variable_names = c('instability', 'tone', 'protest', 'conflict'),
    time_periods = c('daily'),
    nest_data = F,
    days_moving_average = NA,
    return_wide = T,
    return_message = T
  )

location_test %>%
  dplyr::filter(codeLocation %>% is.na()) %>%
  group_by(nameLocation) %>%
  summarise_at(.vars = c('instability', 'tone', 'protest', 'conflict'),
               funs(mean)) %>%
  arrange(desc(instability))location_codes <-
  dictionary_stability_locations()
location_test <-
  instability_api_locations(
    location_ids = c("US", "IS", "CA", "TU", "CH", "UK", "IR"),
    use_multi_locations = c(T, F),
    variable_names = c('instability', 'tone', 'protest', 'conflict'),
    time_periods = c('daily'),
    nest_data = F,
    days_moving_average = NA,
    return_wide = T,
    return_message = T
  )

location_test %>%
  dplyr::filter(codeLocation %>% is.na()) %>%
  group_by(nameLocation) %>%
  summarise_at(.vars = c('instability', 'tone', 'protest', 'conflict'),
               funs(mean)) %>%
  arrange(desc(instability))


abresler/gdeltr2 documentation built on July 26, 2023, 11:06 p.m.