rzeit2

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

Purpose / Description

Client for the ZEIT ONLINE Content API - Interface to gather newspaper articles from DIE ZEIT and ZEIT ONLINE, based on a multilevel query.

This package is a lightweight successor to the rzeit package. The main functions have been completely rewritten using httr. Additionally, the package provides new functionality to directly download article texts, comments and even images using web scraping. The old grouping and visualisation functions have been removed and will probably be rewritten in the future.

Status

The package is under continuous development and will be extended with additional features in the future.

Installation

Stable Version

# install package from CRAN
install.packages("rzeit2")

# load package
library(rzeit2)

Current Development Version

# install the devtools package if it's not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# install package from GitHub
devtools::install_github("jandix/rzeit2")

# load package
library(rzeit2)

Introduction

Authentication

You need an API key to access the content endpoint. You can apply for a key at http://developer.zeit.de/quickstart/. The function below appends the key to your R environment file, so the key is loaded every time you start R. get_content and get_content_all access your key automatically by executing Sys.getenv("ZEIT_ONLINE_KEY"). Replace api_key with the key you receive from ZEIT and path with the location of your R environment file.

# save the api key in the .Renviron file
set_api_key(api_key = "xxx", 
            path = "~/.Renviron")

Download meta data

Important: These functions require an API key. See Authentication.

You can query the whole ZEIT and ZEIT ONLINE archive using the content endpoint. Define your query using the query argument. The API supports query syntax such as + and AND/OR; you can find more information in the ZEIT API documentation. As stated above, get_content and get_content_all access your key automatically by executing Sys.getenv("ZEIT_ONLINE_KEY"), but you can also provide your own key using the api_key argument.
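
As a sketch of the query syntax (assuming the boolean operators described in the ZEIT API documentation, and using an illustrative query), a call with an explicitly supplied key could look like this:

# combine search terms and pass the key explicitly
tatort_krimi <- get_content(query = "Tatort AND Krimi",
                            api_key = Sys.getenv("ZEIT_ONLINE_KEY"))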

Why are there two functions?

rzeit2 provides two functions to query the content endpoint: get_content and get_content_all. get_content returns at most 1000 rows per call because of the API specifications. You can use get_content_all to retrieve all articles matching your query. Internally, get_content_all executes get_content and fills in all the arguments automatically. Please set a timeout to be polite.

# fetch articles up to 1000 rows
tatort_articles <- get_content(query = "Tatort",
                               begin_date = "20180101",
                               end_date = "20180131")

# fetch ALL articles
tatort_articles <- get_content_all(query = "Tatort",
                                   timeout = 2)
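
The examples below use the article URLs returned by these calls. Assuming the result stores them under $content$href (as used throughout this README), you can take a quick look:

# inspect the first few article URLs
head(tatort_articles$content$href)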

Download text (vectorized)

get_article_text allows you to download the article text for a given URL. The function is vectorized; hence, you can pass multiple URLs. The function automatically scrapes all pages if the article has multiple pages. Please set a timeout to be polite.

tatort_content <- get_article_text(url = tatort_articles$content$href, 
                                   timeout = 1)
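
To get a quick overview of what came back, you can inspect the structure of the result; str() works regardless of the exact return type:

# inspect the structure of the downloaded texts
str(tatort_content, max.level = 1)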

Download comments (not vectorized)

get_article_comments allows you to download the article comments for a given URL. The function is not vectorized. It may take quite a long time due to the ZEIT ONLINE website structure: since ZEIT does not provide an API for comments, the function downloads all comments just as your browser would. Please set a timeout to be polite.

tatort_comments <- get_article_comments(url = tatort_articles$content$href[1], 
                                        timeout = 1)
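
Because the function is not vectorized, you can loop over several URLs yourself; a minimal sketch using lapply:

# fetch comments for the first three articles one by one
urls <- tatort_articles$content$href[1:3]
tatort_comments_list <- lapply(urls, get_article_comments, timeout = 1)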

Download images (vectorized)

get_article_images allows you to download the article images for a given URL. The function is vectorized; hence, you can pass multiple URLs. The function automatically scrapes all pages if the article has multiple pages. You can directly download the images by defining a path in download. Please ensure that the folder you define exists. The file name is derived as the MD5 hash of the image URL; hence, it should be unique. Please set a timeout to be polite.
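
Since the download folder must already exist, you can create it from R first using base dir.create:

# create the target folder if it does not exist yet
dir.create("~/Documents/tatort-img", recursive = TRUE, showWarnings = FALSE)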

tatort_images <- get_article_images(url = tatort_articles$content$href, 
                                    timeout = 1, 
                                    download = "~/Documents/tatort-img/")

Authors

Jan Dix jan.dix@uni-konstanz.de

Acknowledgements

Special thanks to Simon Munzert, who helped me enter the world of R and GitHub. Additionally, I would like to thank Peter Meißner and Christian Graul, who helped with the first version of this package. Lastly, I would like to thank Jana Blahak, who wrote the documentation for the first package, much of which is reused here.


