In ropensci/refimpact: API Wrapper for the UK REF 2014 Impact Case Studies Database

library(refimpact)

Introduction

This package is an API wrapper around the REF Impact Case Studies database API. Chances are that if you're looking at this package, you already know what this dataset is, and you probably know roughly what you're looking for.

If you have stumbled upon this package however, and you want to know more about the dataset, you can head here to find out more. If you are thinking of using this dataset as a toy dataset for learning, then you might find this dataset useful for text mining, amongst other things.

Core functions

The core function for this package is ref_get(), which takes an API method as the first argument, and some optional arguments depending on the method.

The API methods available are detailed below, but presented here for quick reference:

SearchCaseStudies
ListUnitsOfAssessment
ListTagTypes
ListTagValues
ListInstitutions

SearchCaseStudies

This is the core method of the API, and the most important for users of this package. The search method requires a compulsory argument to the ref_get() function: query. This argument takes a list of query parameters, which can be as simple as a single Case Study ID, which returns a single record. A query returning a single record is shown below to demonstrate the syntax and the returned data structure; more complex queries will be shown later in the vignette.

results <- ref_get("SearchCaseStudies", query=list(ID=941))
print(results)

You will note that the function returns a nested tibble - that is a tibble with other data frames inside it. This means that you can interrogate the tibble as per usual:

cat(results[[1, "CaseStudyId"]])
cat(results[[1, "Title"]])
cat(strtrim(results[[1, "ImpactSummary"]], width = 200), "<truncated>")
cat(strtrim(results[[1, "ImpactDetails"]], width = 200), "<truncated>")
cat(results[[1, "Institution"]])

You can also interrogate the nested fields the same way, and even subset them:

print(results[[1, "Country"]])
print(results[[1, "Institutions"]])
print(results[[1, "Institutions"]][,c("UKPRN", "InstitutionName")])

In the opinion of the package author, the nested tibble offers many advantages over other data representations - it is a relatively straight-forward exercise to transform the data into a set of wide or narrow tables if required.

Returning a single case study based on the ID is obviously a niche use-case, so there are some other ways to search the database. But before getting to those, it is worth pointing out that you can select multiple case studies in a single query:

results <- ref_get("SearchCaseStudies", query=list(ID=c(941, 942, 1014)))
print(results)

The ID parameter above is an exclusive parameter - if you provide one or more IDs then the function will print a warning to the console, and remove all parameters except for the IDs. This is based on the API's documented limitations.

The other parameters can all be combined for searching. Those parameters are:

UKPRN - This is a code referencing an institution, and comes from the ListInstitutions method below. Takes a single UKPRN.
UoA - This is a code referencing a Unit of Assessment, and comes from the ListUnitsOfAssessment method below. Takes a single ID.
tags - This is one or more codes referencing tags from the ListTagValues method. The tags are separated into 13 different TagTypes, which are detailed below. When multiple tags are provided to the search method, it will only return rows which contain both tags.
phrase - You can search the database using a text query. The query must conform to Lucene search query syntax.

Some examples are shown below.

results <- ref_get("SearchCaseStudies", query=list(UKPRN = 10007777))
dim(results)
results <- ref_get("SearchCaseStudies", query=list(UoA = 5))
dim(results)
results <- ref_get("SearchCaseStudies", query=list(tags = c(11280, 5085)))
dim(results)
results <- ref_get("SearchCaseStudies", query=list(phrase = "hello"))
dim(results)
results <- ref_get("SearchCaseStudies", query=list(UKPRN = 10007146,
                                                   UoA   = 3))
dim(results)

Unfortunately, the API method requires at least one search parameter, which makes it more difficult to download the entire dataset. A short script for this purpose is included at the end of this vignette.

Useful values for the UKPRN, UoA and tags parameters can be found by querying the other 4 API methods - the phrase parameter is the only parameter which can be used in isolation. Each of the 4 other API methods are outlined below.

ListInstitutions

This method lists all of the institutions which are included in the REF Impact Case Studies database, and the UKPRN column in the resuling tibble can be used as a query parameter

institutions <- ref_get("ListInstitutions")
print(institutions)

ListTagTypes and ListTagValues

These methods provide tags which can be used as search parameters in the SearchCaseStudies method. The ListTagTypes method returns the types of tags available:

tag_types <- ref_get("ListTagTypes")
print(tag_types)

These tag types can then be used as an argument to the ListTagValues method, to get all tags for each type:

tag_values_5 <- ref_get("ListTagValues", tag_type = 5)
print(tag_values_5)

This can take some time to iterate through, so the full table is bundled with this package. You can access it via ref_tags:

print(ref_tags)

ListUnitsOfAssessment

This method lists all of the units of assessment which the Impact Case Studies can be assessed against. The tibble also includes an ID column which can be used when querying the SearchCaseStudies method.

UoAs <- ref_get("ListUnitsOfAssessment")
print(UoAs)

Extracting the entire dataset

As alluded to above, the API cannot be searched without parameters, which means that downloading the entire dataset is not a simple task. The code below can be used to extract all records from the database.

uoa_table <- ref_get("ListUnitsOfAssessment")
uoa_list <- uoa_table$ID

ref_corpus <- vector(length = length(uoa_list), mode = "list")

for (i in seq_along(uoa_list)) {
  message("Retrieving data for UoA ", uoa_list[i])
  ref_corpus[[i]] <- ref_get("SearchCaseStudies", query = list(UoA = uoa_list[i]))
}

output <- do.call(rbind, ref_corpus)