inspirehep - facilitating access to high-energy physics literature with R

knitr::opts_chunk$set(
  comment = "#>",
  collapse = TRUE,
  warning = FALSE,
  message = FALSE
)

Travis-CI Build Status AppVeyor Build Status Coverage Status

This package gives access to INSPIRE HEP, a comprehensive source for High-Energy Physics Literature.

API Documentation: https://inspirehep.net/info/hep/api

No API registration needed, no limits in place.

If you need to gather many records or in case you have many queries, please be nice and consider the following options for bulk downloads:

See this example how to work with the json dump

Installation

Get the development version from GitHub

install.packages("devtools")
devtools::install_github("njahn82/inspirehep")

Load inspirehep

library('inspirehep')

Basic usage

Use hep_search to search INSPIRE HEP, and hep_details to get detailed information on the record-level.

Searching INSPIRE HEP

If you are familiar with the INSPIRE HEP web search, discovering literature with hep_search is easy because the API supports all well-known search features. Through the SPIRES syntax not only searching metadata is possible, but also structured full-text queries are supported.

The INSPIRE HEP team gives search tips with a particular focus on the SPIRES syntax: https://inspirehep.net/info/hep/search-tips

hep_search parses the resulting MARC XML and returns key metadata as data.frame with the following columns:

| Variable | Description |:----------------|:-----------------------------------------| |id |INSPIRE HEP ID | |title |title of work | |author |first author | |affiliation |first author affiliation (collapsed ";") | |doi |Digital Object Identifier (DOI) | |report_number |eprint id from the arXiv (collapsed ";") | |date_reported |date of initial appearance of preprint | |journal |journal title | |volume |journal volume | |issue |journal issue | |keywords |controlled keywords (collapsed ";") | |collection |collection information (collapsed ";") | |license_url |re-use terms for full texts (e.g. CC) |

Keyword search

library(dplyr) # for pipes and tbl_df class
hep_search("witten black hole", limit = 5) %>%
  tbl_df()

Exact Author search

INSPIRE HEP disambiguates author names. To search for an exact author, e.g. Dominik Schwarz and get the most frequent journals:

hep_search('exactauthor:D.J.Schwarz.1', limit = 250) %>%
  group_by(journal) %>% 
  group_by(journal) %>% 
  summarise(counts = n()) %>% 
  arrange(desc(counts))

Full-text search

Search in arXiv eprints:

hep_search('find ft "faster than light"', limit = 5) %>%
  tbl_df()

Navigate through the search results

By default, 100 records are returned for each query. The parameter limit can be used to control the number of records that you wish to retrieve.

To jump to a record, use the jrec parameter. For example, you want records 20 to 29, tell hep_search that you would like to start from record 20 and limit your search results to 10 records.

hep_search('witten black hole', jrec = 20, limit = 10) %>%
  tbl_df()

Last but not least, you can use batch_size to control the size of your result pages. By default, batch_size groups 10 records into a single page. The maximum number is 250. Please note that large values per page could cause longer response times. Consider the bulk download options via OAI-PMH or the json data dump if you need to get to work with a large set of INSPIRE records.

Get record details

Working with the json data dump

INSPIRE HEP offers data dumps to support working with large amounts of INSPIRE HEP records. To prevent memory problems, load the json dump incrementally with the jsonlite::stream_in function. The function supports custom handlers, so you can apply your own function on the incoming stream.

Suppose we want to retrieve all cited HEP publications in 2015. In the example of the INSPIRE HEP dump, load 500 records per iteration from the connection, select the columns recid, citations and creation_data, and filter out records published in 2015. The resulting data.frame is saved as temporary file, which can be loaded into R again.

library(dplyr)
library(curl)
library(jsonlite)
con <- gzcon(curl("https://inspirehep.net/hep_records.json.gz")) 
output <- file(tmp <- tempfile(), open = "wb")
stream_in(con, function(df){
  df <- select(df, recid, citations, creation_date)
  df <- filter(df, grepl('2015', creation_date))
  stream_out(df, output, verbose = FALSE)
})
close(output)
mydata <- stream_in(file(tmp))

tbl_df(mydata)

This strategy originates from this Jeroen Ooms talk on jsonlite and using Mongo DB

to be added

Meta

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Licence: MIT (c) Najko Jahn

For bug reports or feature requests please use the issue tracker.



njahn82/inspirehep documentation built on May 23, 2019, 7:06 p.m.