set.seed(1)
knitr::opts_chunk$set(
  collapse = TRUE, comment = "#>", message = FALSE, warning = FALSE, error = FALSE
)

This introductory vignette provides a brief, example-driven overview of rtrek.

Local datasets

The rtrek package provides datasets related to the Star Trek fictional universe and functions for working with those datasets. It interfaces with the Star Trek API (STAPI), Memory Alpha and Memory Beta to retrieve data, metadata and other information relating to Star Trek.

The package also contains several local datasets covering a variety of topics such as Star Trek timeline data, universe species data and geopolitical data. Some of these are more information rich, while others are toy examples useful for simple demonstrations. The bulk of Star Trek data is accessed from external sources by API. A future version of rtrek will also include summary datasets resulting from text mining analyses of Star Trek novels.

library(rtrek)
st_datasets()

At this time, several of the datasets are very small and are only included in the package in order to demonstrate some very basic examples and they are not particularly useful or interesting beyond this purpose. However, rtek now includes more sizable curated datasets relating the compendium of licensed, published Star Trek literature and multiple versions of Star Trek fictional universe historical timelines.

Package datasets in rtrek are somewhat eclectic and currently limited. They will expand with further package development. To list all available package datasets with a short description, call st_datasets.

Star Trek novels

A largely comprehensive Star Trek book metadata table is available as stBooks, which is informed and curated from directly parsing Star Trek e-book metadata rather than parsing third party website content.

stBooks

This dataset is discussed further in the section below on e-book text mining.

Before moving on, it is worth mentioning a helpful table for mapping between series names and their abbreviations used throughout this package (and in the Star Trek community in general).

stSeries

Spatial maps

The stTiles data frame shows all available Star Trek-themed map tile sets along with metadata and attribution information. These map tiles can be used with the leaflet and shiny packages to make interactive maps situated in the Star Trek universe.

stTiles

The list is scant at the moment, but more will come. One thing to keep in mind is these tile sets use a simple/non-geographical coordinate reference system (CRS). Clearly, they are not Earth-based, though they are spatial in more ways than one!

Similar to game maps, there is a sense of space, but it is a simple Cartesian coordinate system and does not use geographic projections like you may be used to working with when analyzing spatial data or making Leaflet maps. This system is much simpler, but simple does not necessarily mean easy!

Inspect stGeo:

stGeo

This is another small dataset containing locations of key planets in the Star Trek universe. Notice the coordinates do not appear meaningful. There is no latitude and longitude. Instead there are row and column entries defining cells in a matrix. The matrix dimensions are defined by the pixel dimensions of source map that was used to create each tile set.

The coordinates are also not consistent. Source maps differ significantly. Even if they had identical pixel dimensions, which they do not, each artist's visual rendering of the fictional universe will place locations differently in space. In this sense, every tile set has a unique coordinate reference system. For each new tile set produced, all locations of interest must be georeferenced again.

This is not ideal, but it gets worse. Once you have locations' coordinates defined that map onto a particular tile set, the leaflet package does not work in these row and column grids. The (col, row) pairs need to be transformed or projected into Leaflet space. Fortunately, rtrek does this part for you with tile_coords. It takes a data frame like one returned by st_tiles_data with columns named col and row, as well as the name of an available Star Trek map tile set. It returns a data frame with new columns x and y that will map properly in a leaflet map built on that tile set.

id <- "galaxy1"
(d <- st_tiles_data(id))
(d <- tile_coords(d, id))

Here is an example using the galaxy1 map with leaflet. The st_tiles function is used to link to the tile provider.

library(leaflet)
tiles <- st_tiles("galaxy1")
leaflet(d, options = leafletOptions(crs = leafletCRS("L.CRS.Simple"))) %>%
  addTiles(tiles) %>% setView(108, -75, 2) %>%
  addCircleMarkers(lng = ~x, lat = ~y, label = ~loc, color = "white", radius = 20)


The stSpecies dataset is just a small table that pairs species named with representative thumbnail avatars, mostly pulled from the Memory Alpha website. There is nothing map-related here, but these are used in this Stellar Cartography example. It is similar to the Leaflet example above, but a bit more interesting, with markers to click on and information displays.

In the course of the above map-related examples, a few functions have also been introduced. st_tiles takes an id argument that is mapped to the available tile sets in stTiles and returns the relevant URL. st_tiles_data takes the same id argument and returns a simple example data frame containing ancillary data related to the available locations from stGeo. The result is always the same except that the grid cells for locations change with respect to the chosen tile set. Finally, tile_coords can be applied to one of these data frames to add x and y columns for a CRS that Leaflet will understand.

Historical timeline

Fictional universe historical timeline data is an exciting type of in-universe Star Trek data to have at your fingertips to play around with and explore.

It is also difficult to compile. Many people have labored away intensely over the years compiling various attempts at integrated, internally consistent, accurate timelines of Star Trek universe lore. Some have turned out more successful than others.

As of rtrek v0.2.0 the rudimentary beginnings of what will ideally eventually become an up to date and comprehensive timeline dataset are now underway in the form of two different flavors of timeline datasets.

Novel-driven timeline

One is based on published works, mostly consisting of novels, as well as television series and movies, all placed in chronological order.

tlBooks

Event-driven timeline

The other is an event-driven timeline that consists of textual entries referencing historically significant events, situated chronologically in the timeline.

tlEvents

The two datasets are quite different in their focus and compliment one another.

Footnotes

One column these two timeline data frames share in common is the footnote column, which you can see only contains ID values for entries which have a footnote. The tlFootnotes dataset can be referenced or joined by footnote to one of the other tables. Footnotes tend to be long strings of text and not associated with most timeline entries, so they are kept in a separate table.

tlFootnotes

Timeline data details and caveats

tlBooks is novel-driven, meaning that the timeline entries (rows) provide a chronologically ordered list of licensed Star Trek novels. This timeline is helpful for figuring out when stories are set and the relative order in which they occur, but it does not provide any description of events transpiring in the universe.

While this data is very informative, it is many years out of date, being last updated in October of 2006. It is also necessarily speculative. Settings are determined based in part on what is interpreted to be the intention of a given author for a given story.

Nevertheless, it still represents possibly the highest quality representation of the chronological ordering of Star Trek fiction that combines episodes and movies with written works. The concurrent timeline of Star Trek TV episodes and movies are interleaved with the novels and other stories, anthologies and other written fiction. This provides fuller context resulting in a much richer timeline.

tlEvents is event-driven, meaning that the timeline entries (rows) provide chronologically ordered historical events from the Star Trek universe. As with tlBooks, this timeline is quite out of date. In fact it is at least somewhat more out of date than tlBooks, its last content update appearing to be no later than 2005. This timeline is also more problematic than the other, and less relevant moving forward. Its updating essentially ceased as the other began.

However, it is included because unlike tlBooks, which is a timeline of production titles, this timeline dataset is event-driven. While it may now be erroneous in places even independent from being out of date, it is useful for its informative textual entries referencing historically significant events in Star Trek lore.

In summary, these datasets have much value, but they should be used with the awareness that they are necessarily imperfect ans speculative, notably outdated, and tlEvents in particular is less able to stand the test of time as the Star Trek universe moves forward with new publications and productions.

It should also be noted that while it may be tempting to merge these two data frames, this is not advisable if it is important to maintain chronological order. It is generally safe to assume that multiple entries within a single year are listed in a sensible order in cases where it may matter, within-year entries do not have specific, unique within-year dates. They are ordinal only. It is not possible to merge entries from both tables for a specific year and know how the combined set of entries should be ordered- unless you already know everything about Star Trek, in which case please craft the ultimate timeline in a universal file format that can be easily digested by a computer.

Star Trek API

Now that you have seen an overview of available rtrek datasets and some associated functions, it is time to turn attention to external datasets. The Star Trek API (STAPI) is a particularly useful data source.

Keep in mind that STAPI focuses more on providing real world data associated with Star Trek (e.g., when did episode X first air on television?) than on fictional universe data, but it contains both and the database holdings will grow with time.

To use the words of the developers, the STAPI is

the first public Star Trek API, accessible via REST and SOAP. It's an open source project, that anyone can contribute to.

The API is highly functional. Please do not abuse the API with constant requests. Their pages suggest no more than one request per second, but I would suggest ten seconds between successive requests. The default anti-DOS measures in rtrek limit requests to one per second. You can update this global rtrek setting with options, e.g. options(rtrek_antidos = 10) for a minimum ten second wait between API calls to be an even better neighbor. rtrek will not permit faster requests. If set below one second, the option is ignored and a warning thrown when making any API call.

STAPI entities

There a many fields, or entities, available in the API. The available IDs can be found in this table:

stapiEntities

These ID values are passed to stapi to perform a search using the API. The other columns provide some information about the object returned from a search. All entity searches return tibble data frames. You can inspect or unnest the column names of each table returned from every available entity search so you can see beforehand what variables are associated with each entity.

Accessing the API

Using stapi should be thought of as a three part process:

To determine how many pages of results exist for a given search, set page_count = TRUE. The impact on the API will be equivalent to only searching a single page of results. One page contains metadata including the total number of pages. Nothing is returned in this "safe mode", but the total number of search results available is printed to the console.

Searching movies only returns one page of results. However, there are a lot of characters in the Star Trek universe. Check the total pages available for character search.

stapi("character", page_count = TRUE)

And that is with 100 results per page!

The default page = 1 only returns the first page. page can be a vector, e.g. page = 1:62. Results from multi-page searches are automatically combined into a single, constant data frame output. For the second call to stapi, return only page two here, which contains the character, Q (currently, pending future character database updates that may shift the indexing). In case that does change and Q is not always near the top of page two of the search results, the example further below hard-codes his unique/universal ID.

stapi("character", page = 2)

Character tables can be sparse. There are a lot of variables, many of which will contain missing data for rare, esoteric characters. Even for more popular characters about whom much more universe lore has been uncovered, it still takes dedicated nerds to enter all the data in a database.

When a dataset contains a uid column, this can be used subsequently to extract a satellite dataset about that particular observation that was returned in the original search. First you used safe mode, then search mode, and now switch from search mode to extraction mode to obtain data about Q, specifically. All that is required to do this is pass Q's uid to stapi and call the function one last time. When uid is no longer NULL, stapi knows not to bother with a search and makes a different type of API call requesting information about the uniquely identified entry.

Q <- "CHMA0000025118"
Q <- stapi("character", uid = Q)

library(dplyr)
Q$episodes %>% select(uid, title, stardateFrom, stardateTo)

The data returned on Q is actually a large list, including multiple data frames. For simplicity only a piece of it is shown above. For more examples, see the STAPI vignette.

Pseudo-APIs

Some functions in rtrek provide an API-like interface to online Star Trek-related data. Specifically, parsing data from the Memory Alpha and Memory Beta websites. These sites do not provide APIs. Therefore the only option is to read pages into R and parse the html. Behind the scenes this is done using the xml2 and rvest packages, but from the user perspective it is presented as passing an API endpoint string to a function.

memory_alpha and memory_beta, as well as several other related functions, are available in rtrek. These functions access data from Memory Alpha and Memory Beta. For details and examples on these functions, see the Memory Alpha vignette and the Memory Beta vignette.

Star Trek novel text mining

This section will be continued in a future version of rtrek. For now what is available is a dataset stBooks. This dataset represents metadata parsed, imperfectly but painstakingly and thoroughly, from actual Star Trek books. stBooks contains several different fields, including useful fields for analysts such as the number of words and chapters in a book.

stBooks

Obviously, verbatim licensed book content itself cannot be shared, so it is not possible to provide capability in rtrek to enable analysts to perform their own unique text mining analyses on Star Trek novel corpora. However, future versions of rtrek will include more summary datasets that will aim to represent more interesting variables.

A few examples could be:

or any other set of interesting metrics that could be used, for example, to inform suggested reading lists of various titles, or books by particular authors with a favored style or focus.



leonawicz/rtrek documentation built on Sept. 18, 2023, 11:29 p.m.