rzeit2

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

Purpose / Description

Client for the ZEIT ONLINE Content API - Interface to gather newspaper articles from DIE ZEIT and ZEIT ONLINE, based on a multilevel query.

This package is a lightweight successor to the rzeit package. The main functions have been completely rewritten using httr. Additionally, the package provides new functionality to directly download article texts, comments and even images using web scraping. The old grouping and visualisation functions have been removed and will probably be rewritten in the future.

Status

The package is under continuous development and will be extended with additional features in the future.

Installation

Stable Version

# install package from CRAN
install.packages("rzeit2")

# load package
library(rzeit2)

Current Development Version

# install the devtools package if it's not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# install package from GitHub
devtools::install_github("jandix/rzeit2")

# load package
library(rzeit2)

Introduction

Authentication

You need an API key to access the content endpoint. You can apply for a key at http://developer.zeit.de/quickstart/. The function below appends the key to your R environment file, so the key is loaded every time you start R. get_content and get_content_all access your key automatically by executing Sys.getenv("ZEIT_ONLINE_KEY"). Replace api_key with the key you receive from ZEIT and path with the location of your R environment file.

# save the api key in the .Renviron file
set_api_key(api_key = "xxx", 
            path = "~/.Renviron")

Download meta data

Important: These functions require an API key. See Authentication.

You can query the whole ZEIT and ZEIT ONLINE archive using the content endpoint. Define your query using the query argument. The API supports query syntax such as + and AND/OR; you can find more information in the ZEIT API documentation. As stated above, get_content and get_content_all access your key automatically by executing Sys.getenv("ZEIT_ONLINE_KEY"), but you can also provide your own key using the api_key argument.
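
As a sketch of the query syntax (assuming the boolean operators described in the ZEIT API documentation, and using an illustrative query), a call with an explicitly supplied key could look like this:

# combine search terms and pass the key explicitly
tatort_krimi <- get_content(query = "Tatort AND Krimi",
                            api_key = Sys.getenv("ZEIT_ONLINE_KEY"))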

Why are there two functions?

rzeit2 provides two functions to query the content endpoint: get_content and get_content_all. get_content returns at most 1000 rows per call because of the API specifications. You can use get_content_all to retrieve all articles matching your query. Internally, get_content_all executes get_content and fills in all the arguments automatically. Please set a timeout to be polite.

# fetch articles up to 1000 rows
tatort_articles <- get_content(query = "Tatort",
                               begin_date = "20180101",
                               end_date = "20180131")

# fetch ALL articles
tatort_articles <- get_content_all(query = "Tatort",
                                   timeout = 2)
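
The examples below use the article URLs returned by these calls. Assuming the result stores them under $content$href (as used throughout this README), you can take a quick look:

# inspect the first few article URLs
head(tatort_articles$content$href)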

Download text (vectorized)

get_article_text allows you to download the article text for a given URL. The function is vectorized; hence, you can pass multiple URLs. The function automatically scrapes all pages if the article has multiple pages. Please set a timeout to be polite.

tatort_content <- get_article_text(url = tatort_articles$content$href, 
                                   timeout = 1)
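
To get a quick overview of what came back, you can inspect the structure of the result; str() works regardless of the exact return type:

# inspect the structure of the downloaded texts
str(tatort_content, max.level = 1)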

Download comments (not vectorized)

get_article_comments allows you to download the article comments for a given URL. The function is not vectorized. It may take quite a long time due to the ZEIT ONLINE website structure: since ZEIT does not provide an API for comments, the function downloads all comments just as your browser would. Please set a timeout to be polite.

tatort_comments <- get_article_comments(url = tatort_articles$content$href[1], 
                                        timeout = 1)
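
Because the function is not vectorized, you can loop over several URLs yourself; a minimal sketch using lapply:

# fetch comments for the first three articles one by one
urls <- tatort_articles$content$href[1:3]
tatort_comments_list <- lapply(urls, get_article_comments, timeout = 1)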

Download images (vectorized)

get_article_images allows you to download the article images for a given URL. The function is vectorized; hence, you can pass multiple URLs. The function automatically scrapes all pages if the article has multiple pages. You can directly download the images by defining a path in download. Please ensure that the folder you define exists. The file name is derived as the MD5 hash of the image URL; hence, it should be unique. Please set a timeout to be polite.
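
Since the download folder must already exist, you can create it from R first using base dir.create:

# create the target folder if it does not exist yet
dir.create("~/Documents/tatort-img", recursive = TRUE, showWarnings = FALSE)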

tatort_images <- get_article_images(url = tatort_articles$content$href, 
                                    timeout = 1, 
                                    download = "~/Documents/tatort-img/")

Authors

Jan Dix jan.dix@uni-konstanz.de

Acknowledgements

Special thanks to Simon Munzert, who helped me enter the world of R and GitHub. Additionally, I would like to thank Peter Meißner and Christian Graul, who helped with the first version of this package. Lastly, I would like to thank Jana Blahak, who wrote the documentation for the first package, much of which is reused here.


