In leonawicz/rtrek: Data Analysis Relating to Star Trek

set.seed(1)
knitr::opts_chunk$set(
  collapse = TRUE, comment = "#>", message = FALSE, warning = FALSE, error = FALSE
)
library(rtrek)

The rtrek package includes some Star Trek datasets, but much more data is available outside the package. You can access other Star Trek data through various APIs.

Technically, there is only one formal API: the Star Trek API (STAPI). rtrek has functions to assist with making calls to this API in order to access specific data. This is the focus of this vignette.

rtrek also interfaces with and extracts information from the Memory Alpha and Memory Beta websites. Neither of these sites actually expose an API, but functions in rtrek with querying these websites in an API-like manner. See the respective vignettes for details.

Accessing information from STAPI is covered to some degree in the package introduction vignette. There is some duplication here, followed by additional examples.

Star Trek API

The Star Trek API (STAPI) is a particularly useful data source.

Keep in mind that STAPI focuses more on providing real world data associated with Star Trek (e.g., when did episode X first air on television?) than on fictional universe data, but it contains both and the database holdings will grow with time.

To use the words of the developers, the STAPI is

the first public Star Trek API, accessible via REST and SOAP. It's an open source project, that anyone can contribute to.

The API is highly functional. Please do not abuse the API with constant requests. Their pages suggest no more than one request per second, but I would suggest ten seconds between successive requests. The default anti-DDOS measures in rtrek limit requests to one per second. You can update this global rtrek setting with options(), e.g. options(rtrek_antidos = 10) for a minimum ten second wait between API calls to be an even better neighbor. rtrek will not permit faster requests. If set below one second, the option is ignored and a warning thrown when making any API call.

STAPI entities

There a many fields, or entities, available in the API. The available IDs can be found in this table:

stapiEntities

These ID values are passed to stapi() to perform a search using the API. The other columns provide some information about the object returned from a search. All entity searches return tibble data frames. You can inspect or unnest the column names of each table returned from every available entity search so you can see beforehand what variables are associated with each entity.

Accessing the API

Using stapi should be thought of as a three part process:

Determine how many pages of results exist for a particular entity search.
Only after taking care to do the previous step, perform the search to return search results.
If satellite data is needed on a unique observation in the search results, call stapi() one more time referencing the specific observation.

To determine how many pages of results exist for a given search, set page_count = TRUE. The impact on the API will be equivalent to only searching a single page of results. One page contains metadata including the total number of pages. Nothing is returned in this "safe mode", but the total number of search results available is printed to the console.

Searching movies only returns one page of results. However, there are a lot of characters in the Star Trek universe. Check the total pages available for character search.

stapi("character", page_count = TRUE)

And that is with 100 results per page!

The default page = 1 only returns the first page. page can be a vector, e.g. page = 1:62. Results from multi-page searches are automatically combined into a single, constant data frame output. For the second call to stapi(), return only page two here, which contains the character, Q (currently, pending future character database updates that may shift the indexing). In case that does change and Q is not always near the top of page two of the search results, the example further below hard-codes his unique/universal ID.

library(dplyr)
stapi("character", page = 1) %>% select(uid, name)

Character tables can be sparse. There are a lot of variables, many of which will contain missing data for rare, esoteric characters. Even for more popular characters about whom much more universe lore has been uncovered, it still takes dedicated nerds to enter all the data in a database.

When a dataset contains a uid column, this can be used subsequently to extract a satellite dataset about that particular observation that was returned in the original search. First you used safe mode, then search mode, and now switch from search mode to extraction mode to obtain data about Q, specifically. All that is required to do this is pass Q's uid to stapi() and call the function one last time. When uid is no longer NULL, stapi() knows not to bother with a search and makes a different type of API call requesting information about the uniquely identified entry.

Q <- "CHMA0000025118"
Q <- stapi("character", uid = Q)

q_eps <- Q$episodes %>% select(uid, title, stardateFrom, stardateTo)
q_eps

The data returned on Q is actually a large list, including multiple data frames. For simplicity only a piece of it is shown above.

Find out which TNG characters other than Q appear in both the Encounter at Farpoint series premier and later in the All Good Things... series finale. To do this, usestapi() to extract data from other endpoints by following a breadcrumb trail of uid values.

Engage.

eps <- c("Encounter at Farpoint", "All Good Things...")
q_eps <- filter(q_eps, title %in% eps)
q_eps
eaf <- stapi("episode", uid = q_eps$uid[q_eps$title == eps[1]])
agt <- stapi("episode", uid = q_eps$uid[q_eps$title == eps[2]])
characters <- setdiff(intersect(eaf$characters$name, agt$characters$name), "Q")
characters

This returns key crew members who remained a part of the show from beginning to end, disregarding any interim absences. Below, inspect how many episodes each character appeared in. uid is again needed, this time for each character.

Note that this requires making one API call for each character. The anti-DOS measures in rtrek will force a one-second minimum wait between each call in the event that the individual calls actually return results faster than this, so the code below will take at least seven seconds to complete.

Add a fun Star Trek-themed plot? Make it so.

characters <- eaf$characters %>% select(uid, name) %>% filter(name %in% characters)
characters
eps_count <- sapply(characters$uid, function(i){
  stapi("character", uid = i)$episodes$series |>
     summarize(sum(title == "Star Trek: The Next Generation")) |>
    unlist()
})
eps_count <- select(characters, name) |> mutate(n = eps_count)

library(ggplot2)
library(showtext)
font_add("StarNext", system.file(paste0("fonts/StarNext.ttf"), package = "trekfont"))
showtext_auto()

uniforms <- c("#5B1414", "#AD722C", "#1A6384")[c(3, 1, 2, 2, 2, 2, 2, 1, 3)]
eb <- element_blank()

ggplot(eps_count, aes(factor(name, levels = name[order(n)]), n)) + 
  geom_col(fill = uniforms, color = "gray20") + coord_flip() + 
  theme_minimal(base_size = 22) + 
  theme(plot.title = element_text(family = "StarNext"), line = eb, axis.text.x = eb) +
  scale_x_discrete(expand = c(0, 0)) + scale_y_continuous(expand = c(0, 0)) +
  labs(x = NULL, y = NULL, title = "TNG EPISODE APPEARANCES") +
  geom_text(aes(label = n), color = "white", size = 8, hjust = 1.5)

This looks as expected. Inspect the structure of the list objects returned by stapi() to become more familiar with what kind of information is available.

Minor weirdness

Sometimes you may receive errors when trying filter rows for one of the data frames while certain problematic columns are still selected. This is likely because the data frame contains a nested data frame, but one which is not nested in the typical way (e.g., tidyr::unnest() will also fail to resolve the issue).

In the code immediately above, this occurs with the series data frame, which is why the episodes parent data frame is subset using $series before calling dplyr::summarise().

leonawicz/rtrek documentation built on Sept. 18, 2024, 7:21 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com