knitr::opts_chunk$set(echo = TRUE)

This README contains info about how to access the file, disclaimer, some use cases.

TLDR

This dataset contains all episodes of star trek TNG and has seperate rows for every speech or description that I found in the moviescripts. Install using devtools::install_github("RMHogervorst/TNG") or download the compressed csv file from raw-data folder. uncompressed the file is approx. 95.2 Mb.

Best results happen when you search using grep. because sometimes names are followed or preceded by spaces. for instance PICARD OR PICARD OR PICARD V.O..

Licence public domain although the original scripts might not be.

Short intro

This repo is r package and dataset of the sci-fi series Star Trek The Next Generation.

The dataset has 17 variables/columns and 110176 rows. variable names are here:

[1] "episode"           "productionnumber"  "setnames"          "characters"       
[5] "act"               "scenenumber"       "scenedetails"      "partnumber"       
[9] "type"              "who"               "text"              "speechdescription"
[13] "Released"          "Episode"           "imdbRating"        "imdbID"           
[17] "Season"

Episode contains the name of the episode, productionnumber, setnames, and characters were scraped from the toppart of the moviescript. All scripts are divided up into partnumbers. A part can be a description or speech (as told by the TYPE variable). speech and descriptions over multiple lines is put together. ACT, SCENENUMBER, PARTNUMBER tell you what follows what and where in the episode this happened.

The variables from Released to the Season are imports from my IMDB package.

for example in the episode New Ground somewhere in the episode a certain grubby crewmember confirms something...

all_episodes_TNG[65305,]

has episode New Ground, production number #40275-210 a bunch of sets and the following people in the cast:

PICARD,HELENA ROZHENKO,RIKER,ALEXANDER,DATA,MS. LOWRY,BEVERLY,ENSIGN FELTON,TROI,DOCTOR JA'DAR,GEORDI,WORF,Non-Speaking,SUPERNUMERARIES,SEVERAL BOYS,SEVERAL FATHERS,A SKULL-FACED ALIEN,WAITER

As you can see Non-Speaking is not really a castmember. but describes the next people That happens when you scrape text.

   act scenenumber scenedetails partnumber   type   who   text speechdescription
1: ONE         6A                       95 speech  WORF  Good.             FALSE

And as you can see, WORF says "Good." in act one, scene 6a, partnumber 95. There is no description how Worf says this.

installation

Install using devtools::install_github("RMHogervorst/TNG") or download the compressed csv file from raw-data folder. uncompressed the file is approx. 95.2 Mb.

Examples of explorations in this data set {#usecases}

Let's start with some basic explorations.

number of speaking roles and ratings

How many people are speaking in a episode?

Since I'm using dplyr the endresult will be a tbl_df which prints nicer.

suppressMessages(library(dplyr))  
library(TNG)
TNG %>% group_by(episode) %>% distinct(who) %>% 
        summarize(n_people = n(), rating = mean(imdbRating)) %>% 
        arrange(desc(n_people), desc(rating) ) 

What is the relation between rating and number of speaking people? I will also add bit of color for season.

library(ggplot2)
TNG %>% group_by(episode) %>% distinct(who) %>% 
        summarize(n_people = n(), rating = mean(imdbRating), season = mean(Season)) %>% 
        arrange(desc(n_people), desc(rating) ) %>%
       ggplot(aes(n_people, rating, colour = Season)) + geom_point(aes(color = as.factor(season)) , na.rm = TRUE)

The number of distinct speakers and rating all center around the same point, around 30 people and with ratings around 7.5.

I'm intrigued with the lowest rating.

TNG %>% group_by(episode) %>% distinct(who) %>% summarize( rating = mean(imdbRating)) %>% arrange( rating)

It is episode shades of gray.

according to wikipedia

It was the only clip show filmed during the series and was created due to a lack of funds left over from other episodes during the season.

"Shades of Gray" is widely regarded as the worst episode of the series, with critics calling it "god-awful" and a "travesty"; even Hurley referred to it negatively. It can be compared to "Spock's Brain" in The Original Series.

Right.

One character I found really annoying was Q.

In how many episodes is he really. Let's look at the character list in the dataset. Those episodes must by terrible.

TNG %>% group_by(episode) %>% filter(grepl(",Q,", characters)) %>% 
        summarize(rating = mean(imdbRating)) %>% knitr::kable(format = "html")

Well they're not. They belong to the best episodes of TNG.

Descriptions

While I created this dataset I found that descriptions in the script are very nice

This is the first one:

r TNG$text[[1]]

Which made me think, how many times is this description used? It feels as if the scene is used very often.

TNG %>% filter(type == "description") %>%
        filter(grepl("enterprise", text, ignore.case = TRUE) , grepl("warp speed", text, ignore.case = TRUE)) %>% select(text, Season) %>% knitr::kable(format = "html")

Not that often it seems.

How often does picard drink tea....

picard drinking tea like a boss Found at: https://www.heatherbuchanan.ca/products/captain-picard-tea-earl-grey-hot-greeting-card

Picard seems to drink a lot of earl grey tea.

in fact someone did a montage of all the time he orders it

TNG %>% filter(grepl("PICARD", who), grepl(" tea ", text)) %>% select(who, text, Season, act) %>% knitr::kable(format = "html")

That's weird. In the original scripts there is little to no mentioning of earl grey tea. In fact when I search for the exact phrase it only happens seven times.

grep("Tea. Earl Grey. Hot", TNG$text, value = TRUE, ignore.case = TRUE)

disclaimer {#disclaimer}

I haven't checked everything and I had some errors during the construction, so some scripts are not complete and some parts are perhaps wrongly classified as speech or description.

The creation of the dataset took me 15 hours and linking it to the IMDB database and creating this package took me another 4 hours.

Resources

I've dowloaded all the files from http://www.st-minutiae.com/resources/scripts/

And discovered that the scripts (mostly...) follow a convention of

I have used the packages dplyr and readr.

Licence {#licence}

My dataset is CC0 PUBLIC domain.

I'm very curious to see your analyses of TNG. Enjoy

Roel M. Hogervorst

2016-3-27



RMHogervorst/TNG documentation built on May 8, 2019, 7:32 a.m.