v2_doc_api: GDELT V2 Doc API

View source: R/v2.R

v2_doc_api    R Documentation

GDELT V2 Doc API

Description

Interact with the GDELT V2 Document (DOC) API.

Usage

v2_doc_api(
  terms = NULL,
  term_domains = NULL,
  term_exact_domains = NULL,
  use_exact_term = FALSE,
  domains = NULL,
  domains_exact = NULL,
  images_face_tone = NULL,
  images_number_faces = NULL,
  images_ocr_meta = NULL,
  image_tags = NULL,
  image_web_counts = NULL,
  image_web_tags = NULL,
  themes_gkg = NULL,
  near_terms = NULL,
  near_length = 20,
  repeat_terms = NULL,
  repeat_count = 3,
  source_languages = "english",
  source_countries = "United States",
  tone = NULL,
  tone_absolute = NULL,
  modes = "ArtList",
  formats = "json",
  timespans = NULL,
  date_resolution = NULL,
  maximum_records = 250,
  sort_variable = "DateDesc",
  timeline_smooth = NULL,
  start_date = NULL,
  end_date = NULL,
  timezone_adjust = NULL,
  time_zoom = NULL,
  parse_data = TRUE,
  widen_url_parameters = FALSE,
  widen_variables = c("mode", "timespan", "format"),
  nest_data = FALSE,
  return_message = TRUE
)

Arguments

terms

This contains your search query and supports keyword and keyphrase searches, OR statements and a variety of advanced operators.

term_domains

Vector of domains isolated to search.

term_exact_domains

Vector of 'exact' domains to search

use_exact_term

If 'TRUE', wraps the terms in quotes so that they are searched as exact phrases.

domains

Vector of domains. Returns all coverage from the specified domain. Follow by a colon and the domain name of interest. Search for "domain:cnn.com" to return all coverage from CNN

domains_exact

Vector of exact domains

images_face_tone

Vector of tones. Searches the average "tone" of human facial emotions in each image. Only human faces that appear large enough in the image to accurately gauge their facial emotion are considered, so large crowd photos where it is difficult to see the emotion of peoples' faces may not be scored accurately. The tone score of an average photograph typically ranges from +2 to -2. To search for photos where visible people appear to be sad, search "imagefacetone<-1.5". Only available in any of the "image" modes

images_number_faces

This searches the total number of foreground human faces in the image.

images_ocr_meta

This searches a combination of the results of OCR performed on the image in 80+ languages (to extract any text found in the image, including background text like storefronts and signage), all metadata embedded in the image file itself (EXIF, etc) and the textual caption provided for the image. To search for images of a specific event, such as "mobile congress" you would use this field, since that information would most likely either be found in signage in the background of the image, provided in the EXIF metadata in the image or listed in the caption under the image. The search parameter for this field must always be enclosed in quote marks, even when searching for a single word like "imageocrmeta:"zika"". Only available in any of the "image" modes.
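
As a minimal sketch of the quoting rule described above, the hypothetical helper below (not part of gdeltr2) builds an operator fragment whose value is always enclosed in quote marks. How gdeltr2 assembles the final query string internally is not documented on this page, so this only illustrates the operator syntax.

```r
# Hypothetical helper: prefix an operator and always quote its value,
# as required for operators such as imageocrmeta.
quote_operator <- function(operator, value) {
  sprintf('%s:"%s"', operator, value)
}

quote_operator("imageocrmeta", "zika")
quote_operator("imageocrmeta", "mobile congress")
```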

image_tags

Every image processed by GDELT is assigned one or more topical tags from a universe of more than 10,000 objects and activities recognized by Google's algorithms. This is the primary and most accurate way of searching global news imagery monitored by GDELT, as these tags represent the ground truth of what is actually depicted in the image itself.

image_web_counts

Every image processed by GDELT is run through the equivalent of a reverse Google Images search that searches the web to see if the image has ever appeared anywhere else on the web that Google has seen. Up to the first 200 web pages where the image has been seen are returned. This operator allows you to screen for popular versus novel images

image_web_tags

Every image processed by GDELT is run through the equivalent of a reverse Google Images search that searches the web to see if the image has ever appeared anywhere else on the web that Google has seen. The system then takes every one of those appearances from across the web and looks at all of the textual captions appearing beside the image and compiles a list of the major topics used to describe the image across the web. This offers tremendous descriptive advantage in that you are essentially "crowdsourcing" the key topics of the image by looking at how it has been described across the web. Values must be enclosed in quote marks. Only available in any of the "image" modes. You can access a list of all tags appearing in at least 100 images (Image WebTag Lookup).

themes_gkg

Searches for any of the GDELT Global Knowledge Graph (GKG) Themes. GKG Themes offer a more powerful way of searching for complex topics, since they can include hundreds or even thousands of different phrases or names under a single heading. To search for coverage of terrorism, use "theme:terror". You can find a list of all themes that have appeared in at least 100 articles over the past two years (GKG Theme Lookup).

near_terms

Allows you to specify a set of keywords that must appear within a given number of words of each other. To use this operator, you specify the word "near", followed by the maximum distance all of the words can appear apart in a given document and still be considered a match, a colon, and then the list of words in quote marks. Phrase matching is not supported at this time, so the list of words is treated as a list of individual words that must all appear together within the given proximity. Note that if the words appear in a document in a different order than specified in the "near" operator, each ordering difference increments the word distance counted by the "near" operator. (Thus, near10:"donald trump" will return documents where "trump" appears within 10 words after "donald", but will also return documents in which "donald" appears within 9 words after "trump".) The distance measure is not precise and can count punctuation and other tokens as "words" as well. It is also important to remember that proximity in a document does not necessarily imply that two words are semantically connected to each other.

near_length

Vector of lengths to isolate near
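
The "near" syntax described for near_terms can be sketched as below. The helper name is hypothetical and is not part of gdeltr2; its default distance of 20 simply mirrors the near_length default shown in Usage.

```r
# Hypothetical helper: build a proximity ("near") query fragment.
# `words` is a vector of individual words; the operator does not support phrases.
build_near_term <- function(words, distance = 20) {
  stopifnot(is.numeric(distance), length(distance) == 1, distance > 0)
  sprintf('near%d:"%s"', as.integer(distance), paste(words, collapse = " "))
}

build_near_term(c("donald", "trump"), distance = 10)
```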

repeat_terms

Allows you to specify that a given word must appear at least a certain number of times in a document to be considered a match.

repeat_count

Vector of repeat counts
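
A rough sketch of a repeat fragment follows. This page does not spell out the operator syntax, so the form used here (the word "repeat", the minimum count, a colon, then the quoted word) is an assumption modeled on the "near" operator, and the helper name is hypothetical.

```r
# Hypothetical helper: build a "repeat" query fragment.
# Assumed syntax: repeat<count>:"<word>" (not confirmed by this page).
build_repeat_term <- function(word, count = 3) {
  stopifnot(is.numeric(count), length(count) == 1, count >= 1)
  sprintf('repeat%d:"%s"', as.integer(count), word)
}

build_repeat_term("trump", count = 3)
```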

source_languages

Vector of languages. Searches for articles originally published in the given language. The GEO API currently only allows you to search the English translations of all coverage, but you can specify that you want to limit your search to articles published in a particular language. Using this operator by itself you can map all of the locations mentioned in a particular language across all topics to see the geographic focus of a given language. Search for "sourcelang:spanish" to return only Spanish language coverage. You can also specify its three-character language code. All 65 machine translated languages are supported.

source_countries

Vector of source countries. Searches for articles published in outlets located in a particular country. This allows you to narrow your scope to the press of a single country. For countries with spaces in their names, type the full name without the spaces (like "sourcecountry:unitedarabemirates" or "sourcecountry:saudiarabia"). You can also use their 2-character FIPS country code
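
The space-removal rule for country names can be illustrated with a small hypothetical helper (not part of gdeltr2). Lowercasing follows the examples above; whether the API is case-sensitive is not stated here.

```r
# Hypothetical helper: strip spaces from country names and prefix the operator.
build_source_country_term <- function(countries) {
  paste0("sourcecountry:", tolower(gsub("\\s+", "", countries)))
}

build_source_country_term(c("United Arab Emirates", "Saudi Arabia", "United States"))
```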

tone

Allows you to filter for only articles above or below a particular tone score (ie more positive or more negative than a certain threshold). To use, specify either a greater than or less than sign and a positive or negative number (either an integer or floating point number). To find fairly positive articles, search for "tone>5" or to search for fairly negative articles, search for "tone<-5".

tone_absolute

The same as "Tone" but ignores the positive/negative sign and lets you simply search for high emotion or low emotion articles, regardless of whether they were happy or sad in tone. Thus, search for "toneabs<1" for fairly neutral articles or search for "toneabs>10" for fairly emotional articles.
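
The tone and toneabs filters share the same shape: an operator name, a comparison sign and a number. The sketch below builds such fragments; the helper and its argument names are hypothetical and are not part of gdeltr2.

```r
# Hypothetical helper: build a tone filter fragment such as "tone>5" or "toneabs<1".
build_tone_term <- function(threshold, direction = c(">", "<"), absolute = FALSE) {
  direction <- match.arg(direction)
  stopifnot(is.numeric(threshold), length(threshold) == 1)
  operator <- if (absolute) "toneabs" else "tone"
  paste0(operator, direction, threshold)
}

build_tone_term(5, ">")                   # fairly positive articles
build_tone_term(-5, "<")                  # fairly negative articles
build_tone_term(1, "<", absolute = TRUE)  # fairly neutral articles
```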

modes

This specifies the specific output you would like from the API, ranging from timelines to word clouds to article lists.

  • 'ArtList' This is the most basic output mode and generates a simple list of news articles that matched the query. In HTML mode articles are displayed in a table with its social sharing image (if available) to its left, the article title, its source country, language and publication date all shown. RSS output format is only available in this mode.

  • 'ArtGallery' This displays the same information as the ArtList mode, but does so using a high design visual layout suitable for creating magazine-style collages of matching coverage. Only articles containing a social sharing image are included.

  • 'ImageCollageInfo' This yields identical output as the ImageCollage option, but adds four additional pieces of information to each image: 1) the number of times (up to 200) it has been seen before on the open web (via a reverse Google Images search), 2) a list of up to 6 of those web pages elsewhere on the web where the image was found in the past, 3) the date the photograph was captured, taken from the image’s internal metadata (EXIF/etc), and 4) a warning if the image's embedded date metadata suggests the photograph was taken more than 72 hours prior to it appearing in the given article. Using this information you can rapidly triage which of the returned images are heavily-used images and which are novel images that have never been found anywhere on the web before by Google's crawlers. (You can also use the imagewebcount query term above to restrict your search to just images which have appeared a certain number of times.) Only a relatively small percent of news images contain an

  • 'ImageGallery' This displays most of the same information as the 'ImageCollageInfo' mode (though it does not include the embedded date warning)

  • 'ImageCollageShare' Instead of returning VGKG-processed images, this mode returns a list of the social sharing images found in the matching news articles. Social sharing images are those specified by an article to be shown as its image when shared via social media sites like Facebook and Twitter. Not all articles include social sharing images and the images may sometimes only be the logo of the news outlet or not representative of the article contents, but in general they offer a reasonable visual summary of the core focus of the article and especially how it will appear when shared across social media platforms.

  • 'TimelineVol' This is the most basic timeline mode and returns the volume of news coverage that matched your query by day/hour/15 minutes over the search period. Since the total number of news articles published globally varies so much through the course of a day and through the weekend and holiday periods, the API does not return a raw count of matched articles, but instead divides the number of matching articles by the total number of all articles monitored by GDELT in each time step. Thus, the timeline reports volume as a percentage of all global coverage monitored by GDELT. For time spans of less than 72 hours, the timeline uses a time step of 15 minutes to provide maximum temporal resolution, while for time spans from 72 hours to one week it uses an hourly resolution and for time spans of greater than a week it uses a daily resolution. In HTML mode the timeline is displayed as an interactive browser-based visualization.

  • 'TimelineVolRaw' This is identical to the standard TimelineVol mode, but instead of reporting results as a percent of all online coverage monitored by GDELT, it returns the actual number of distinct articles that matched your query.

  • 'TimelineVolInfo' This is identical to the main TimelineVol mode, but for each time step it displays the top 10 most relevant articles that were published during that time interval. Thus, if you see a sudden spike in coverage of your topic, you can instantly see what was driving that coverage. In HTML mode a popup is displayed over the timeline as you mouse over it and you can click on any of the articles to view them, while in JSON and CSV mode the article list is output as part of the file

  • 'TimelineTone' Similar to the main TimelineVol mode, but instead of coverage volume it displays the average tone of all matching coverage, from extremely negative to extremely positive.

  • 'TimelineLang' Similar to the TimelineVol mode, but instead of showing total coverage volume, it breaks coverage volume down by language so you can see which languages are focusing the most on a topic. Note that the GDELT APIs currently only search the 65 machine translated languages supported by GDELT, so stories trending in unsupported languages will not be displayed in this graph, but will likely be captured by GDELT as they are cross-covered in other languages. With the launch of GDELT3 later this summer, the resolution and utility of this graph will increase dramatically.

  • 'TimelineSourceCountry' Similar to the TimelineVol mode, but instead of showing total coverage volume, it breaks coverage volume down by source country so you can see which countries are focusing the most on a topic. Note that GDELT attempts to monitor as much media as possible in each country, but smaller countries with less developed media systems will necessarily be less represented than larger countries with massive local press output. With the launch of GDELT3 later this summer, the resolution and utility of this graph will increase dramatically.

  • 'ToneChart' This is an extremely powerful visualization that creates an emotional histogram showing the tonal distribution of coverage of your query. All coverage matching your query over the search time period is tallied up and binned by tone, from -100 (extremely negative) to +100 (extremely positive), though typically the actual range will be from -20 to 20 or less. Articles in the -1 to +1 bin tend to be more neutral or factually-focused, while those on either extreme tend to be emotionally-laden diatribes. Typically most sentiment dashboards display a single number representing the average of all coverage matching the query, à la "The average tone of Donald Trump coverage in the last week is -7". Such displays are not very informative since it's unclear what precisely -7 means in terms of tone and whether that means that most coverage clustered around -7 or whether it means there was a lot of extremely negative and extremely positive coverage that averaged out to -7, but no act

  • 'WordCloudImageTags' This is identical to the WordCloudEnglish mode, but instead of the article text words, this mode takes all of the VGKG-processed images found in the matching articles (or which matched any image query operators) and constructs a histogram of the top topics assigned by Google’s deep learning neural network algorithms as part of the Google Cloud Vision API.

  • 'WordCloudImageWebTags' This is identical to the WordCloudImageTags mode, but instead of using the tags assigned by Google’s deep learning algorithms, it uses the Google knowledge graph topical taxonomy tags assigned by the Google Cloud Vision API's Web Annotations engine. This engine performs a reverse Google Images search on each image to locate all instances where it has been seen on the open web, examines the captions of all of those instances of the image and compiles a list of topical tags that capture the contents of those captions. In this way this field offers a far more powerful and higher resolution understanding of the primary topics and activities depicted in the image, including context that is not visible in the image, but relies on the captions assigned by others, whereas the WordCloudImageTags field displays the output of deep learning algorithms considering the visual contents of the image.
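
The time-step rule and the volume normalization described for 'TimelineVol' above can be sketched as follows. Both functions are hypothetical illustrations of the documented behavior, not code from gdeltr2 or GDELT.

```r
# Hypothetical helper: pick the timeline resolution for a search span in hours.
# Under 72 hours -> 15-minute steps; 72 hours to one week -> hourly; longer -> daily.
timeline_step <- function(span_hours) {
  week_hours <- 7 * 24
  ifelse(span_hours < 72, "15 minutes",
         ifelse(span_hours <= week_hours, "hourly", "daily"))
}

# Hypothetical helper: express matches as a percentage of all monitored articles
# in each time step, as 'TimelineVol' does instead of returning raw counts.
volume_share <- function(matched, total) {
  100 * matched / total
}

timeline_step(c(24, 72, 168, 720))
volume_share(matched = c(5, 20), total = c(1000, 2000))
```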

formats

This controls what file format the results are displayed in. Not all formats are available for all modes. To assist with website embedding, the CORS ACAO header for all output of the API is set to the wildcard "*", permitting universal embedding

  • 'HTML' This is the API's default format (this function defaults to "json") and returns a browser-based visualization or display. Some displays, such as word clouds, are static images, some, like the timeline modes, result in interactive clickable visualizations, and some result in simple HTML lists of images or articles. The specific output varies by mode, but all are intended to be displayed directly in the browser in a user-friendly intuitive display and are designed to be easily embedded in any page via an iframe.

  • 'CSV' This returns the requested data in comma-delimited (CSV) format. The specific set of columns varies based on the requested output mode. Note that since some modes return multilingual content, the CSV is encoded as UTF8 and includes the UTF8 BOM to work around Microsoft Excel limitations handling UTF8 CSV files.

  • 'RSS' This output format is only available in 'ArtList' mode and returns the list of matching article URLs and titles in RSS 2.0 format. This makes it possible to display the results using any standard RSS reader. It also makes it seamless for web archives to create tailored archival feeds to preserve news coverage on certain topics or meeting certain criteria.

  • 'JSON' This returns the requested data in UTF8 encoded JSON. The specific fields vary by output mode.

  • 'JSONP' This mode is identical to "JSON" mode, but accepts an additional parameter in the API URL "callback=XYZ" (if not present defaults to "callback") and wraps the JSON in that callback to return JSONP compliant JavaScript code.

  • 'JSONFeed' This output format is only available in 'ArtList' mode and returns the list of matching article URLs and titles in JSONFeed 1.0 format.

timespans

By default the DOC API searches the last 3 months of coverage monitored by GDELT. You can narrow this range by using this option to specify the number of months, weeks, days, hours or minutes (minimum of 15 minutes). The API then only searches documents published within the specified timespan backwards from the present time. If you would instead like to specify the precise start/end time of the search instead of an offset from the present time, you should use the STARTDATETIME/ENDDATETIME parameters

date_resolution

These parameters allow you to specify the precise start and end date/times to search, instead of using an offset like with TIMESPAN.

maximum_records

Maximum number of records to return. Default 250.

sort_variable

By default results are sorted by relevance to your query. Sometimes you may wish to sort by date or tone instead.

  • 'DateDesc' Sorts results by publication date, displaying the most recent articles first

  • 'DateAsc' Sorts results by publication date, displaying the oldest articles first.

  • 'ToneDesc' Sorts results by tone, displaying the most positive articles first.

  • 'ToneAsc' Sorts results by tone, displaying the most negative articles first.

  • 'HybridRel' This is the default new relevance sorting mode for all searches of content published after 12:01AM September 16, 2018. It uses a combination of the textual relevance of the article and other signals, including the "popularity" of the outlet to rank highly relevant content from well known outlets at top, rather than ranking content exclusively based on its textual relevance, which tends to surface obscure coverage. We will be constantly refining the underlying scoring models over time to yield the best possible results and once we have a final model that performs well in all scenarios we will retroactively apply it to our entire backfile and make it available for all searches. This mode is not currently available for image searches, only textual article searches

timeline_smooth

This option is only available in the various Timeline modes and performs moving window smoothing over the specified number of time steps, up to a maximum of 30. Due to GDELT's high temporal resolution, timeline displays can sometimes capture too much of the chaotic noisy information environment that is the global news landscape, resulting in jagged displays. Use this option to enable moving average smoothing up to 30 days. Note that since this is a moving window average, peaks will be shifted to the right, up to several days or weeks at the heaviest smoothing levels.
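
The rightward shift of peaks can be seen with a small base R sketch. It assumes the smoothing is a trailing (right-aligned) moving average, which matches the shift described above but is not confirmed by this page; the function name is hypothetical.

```r
# Hypothetical helper: trailing moving average over `window` time steps.
# The first (window - 1) values are NA because the window is incomplete there.
smooth_timeline <- function(x, window) {
  stopifnot(window >= 1, window <= 30)
  as.numeric(stats::filter(x, rep(1 / window, window), sides = 1))
}

# A single spike at position 4 ...
volume <- c(0, 0, 0, 9, 0, 0, 0, 0)

# ... is spread over positions 4 to 6 after smoothing, so its center moves right.
smoothed <- smooth_timeline(volume, window = 3)
smoothed
```

Larger windows spread the spike over more steps, which is why heavy smoothing can displace a peak by days or weeks on a daily timeline.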

start_date

Start of the search period, in YYYYMMDDHHMMSS format.

end_date

End of the search period, in YYYYMMDDHHMMSS format.
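
A value in the YYYYMMDDHHMMSS layout can be produced from an R date-time with base R alone. The helper below is hypothetical, and treating the times as UTC is an assumption that this page does not state.

```r
# Hypothetical helper: format a date-time as YYYYMMDDHHMMSS.
# Assumption: times are interpreted as UTC.
format_gdelt_datetime <- function(x) {
  format(as.POSIXct(x, tz = "UTC"), "%Y%m%d%H%M%S", tz = "UTC")
}

format_gdelt_datetime("2024-01-15 08:30:00")
```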

timezone_adjust

Timezone adjustment.

time_zoom

This option is only available for timeline modes in HTML format output and enables interactive zooming of the timeline using the browser-based visualization. Set to "yes" to enable and set to "no" or do not include the parameter, to disable. By default, the browser-based timeline display allows interactive examination and export of the timeline data, but does not allow the user to rezoom the display to a more narrow time span. If enabled, the user can click-drag horizontally in the graph to select a specific time period. If the visualization is being displayed directly by itself (it is the "parent" page), it will automatically refresh the page to display the revised time span. If the visualization is being embedded in another page via iframe, it will use postMessage to send the new timespan to the parent page with parameters "startdate" and "enddate" in the format needed by the STARTDATETIME and ENDDATETIME API parameters. The parent page can then use these parameters to rewrite the URLs of any API visualizations embedded in the page and reload each of them. This allows the creation of dashboard-like displays that contain multiple DOC API visualizations where the user can zoom the timeline graph at the top and have all of the other displays automatically refresh to narrow their coverage to that revised time frame.

parse_data

If 'TRUE', parses the returned data.

widen_url_parameters

If 'TRUE', widens the URL parameters.

widen_variables

Vector of variables to unite for the API URLs. Default 'c("mode", "timespan", "format")'.

nest_data

If 'TRUE', nests the parsed data.

return_message

If 'TRUE', returns a message.

Examples

library(gdeltr2)
v2_doc_api(terms = c("Donald Trump"))

abresler/gdeltr2 documentation built on Sept. 23, 2024, 5:36 a.m.