This README explains how to access the data, gives a disclaimer, and shows some use cases.
This dataset contains all episodes of Star Trek: The Next Generation (TNG) and has a separate row
for every speech or description that I found in the scripts.
Install using devtools::install_github("RMHogervorst/TNG")
or download the
compressed csv file from the raw-data folder. Uncompressed, the file is approx. 95.2 MB.
You get the best results when you search using grep, because names are sometimes followed
or preceded by spaces or extra text, for instance "PICARD ", " PICARD" or "PICARD V.O.".
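To make the point about grep concrete, here is a minimal sketch in base R. The `who` vector is a made-up stand-in for the dataset's `who` column, not real data from the package.

```r
# Toy stand-in for the `who` column; these values are invented for illustration.
who <- c("PICARD", " PICARD", "PICARD ", "PICARD V.O.", "RIKER", "DATA")

# An exact comparison only finds the cleanly formatted name.
exact_hits <- who == "PICARD"

# grepl() finds every entry that contains the name.
grep_hits <- grepl("PICARD", who)

sum(exact_hits)
sum(grep_hits)

# trimws() strips the padding if you need tidy names afterwards.
unique(trimws(who[grep_hits]))
```

Whether to treat "PICARD V.O." as the same speaker is a judgment call that depends on the analysis.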
Licence: public domain, although the original scripts might not be.
This repo is an R package and dataset of the sci-fi series Star Trek: The Next Generation.
The dataset has 17 variables/columns and 110176 rows. The variable names are:
 [1] "episode"           "productionnumber"  "setnames"          "characters"
 [5] "act"               "scenenumber"       "scenedetails"      "partnumber"
 [9] "type"              "who"               "text"              "speechdescription"
[13] "Released"          "Episode"           "imdbRating"        "imdbID"
[17] "Season"
The episode variable contains the name of the episode; productionnumber, setnames, and characters were scraped from the top part of the script. All scripts are divided into parts, numbered by partnumber. A part is either a description or a speech (as indicated by the type variable). Speech and descriptions that span multiple lines are joined together. The act, scenenumber, and partnumber variables tell you what follows what and where in the episode it happened.
The variables from Released to Season are imported from my IMDB package.
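As a rough illustration of how the part numbering can be used, the sketch below puts the parts of a miniature episode back in script order and keeps only the spoken lines. The data frame is invented; it borrows the dataset's column names and assumes that partnumber is numeric and increases through an episode.

```r
# Made-up miniature episode; column names follow the dataset, values do not.
parts <- data.frame(
  partnumber = c(3, 1, 2, 4),
  act        = c("ONE", "TEASER", "TEASER", "ONE"),
  type       = c("speech", "description", "speech", "description"),
  who        = c("WORF", NA, "PICARD", NA),
  text       = c("Good.", "The ship at warp.", "Report.", "Worf nods."),
  stringsAsFactors = FALSE
)

# Assumption: sorting on partnumber restores the order of the script.
in_order <- parts[order(parts$partnumber), ]

# Keep only the spoken lines, in order.
speech_only <- in_order[in_order$type == "speech", c("partnumber", "who", "text")]
speech_only
```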
For example, somewhere in the episode New Ground a certain grubby crewmember confirms something...
all_episodes_TNG[65305,]
This row has episode New Ground, production number #40275-210, a bunch of sets, and the following people in the cast:
PICARD,HELENA ROZHENKO,RIKER,ALEXANDER,DATA,MS. LOWRY,BEVERLY,ENSIGN FELTON,TROI,DOCTOR JA'DAR,GEORDI,WORF,Non-Speaking,SUPERNUMERARIES,SEVERAL BOYS,SEVERAL FATHERS,A SKULL-FACED ALIEN,WAITER
As you can see, Non-Speaking is not really a cast member, but describes the people that follow. That happens when you scrape text.
   act scenenumber scenedetails partnumber   type  who  text speechdescription
1: ONE          6A                      95 speech WORF Good.             FALSE
And as you can see, WORF says "Good." in act one, scene 6A, part number 95. There is no description of how Worf says this.
Let's start with some basic explorations.
How many people are speaking in an episode?
Since I'm using dplyr, the end result will be a tbl_df, which prints more nicely.
suppressMessages(library(dplyr))
library(TNG)
# .keep_all = TRUE keeps imdbRating available after distinct() in current dplyr
TNG %>%
  group_by(episode) %>%
  distinct(who, .keep_all = TRUE) %>%
  summarize(n_people = n(), rating = mean(imdbRating)) %>%
  arrange(desc(n_people), desc(rating))
What is the relation between rating and the number of speaking people? I will also add a bit of color for season.
library(ggplot2)
# colour by the summarized lower-case `season`; `Season` no longer exists after summarize()
TNG %>%
  group_by(episode) %>%
  distinct(who, .keep_all = TRUE) %>%
  summarize(n_people = n(), rating = mean(imdbRating), season = mean(Season)) %>%
  arrange(desc(n_people), desc(rating)) %>%
  ggplot(aes(n_people, rating, colour = as.factor(season))) +
  geom_point(na.rm = TRUE) +
  labs(colour = "Season")
The number of distinct speakers and the rating cluster around one point: around 30 people, with ratings around 7.5.
I'm intrigued by the lowest rating.
TNG %>%
  group_by(episode) %>%
  distinct(who, .keep_all = TRUE) %>%
  summarize(rating = mean(imdbRating)) %>%
  arrange(rating)
It is the episode "Shades of Gray".
According to Wikipedia:
It was the only clip show filmed during the series and was created due to a lack of funds left over from other episodes during the season.
"Shades of Gray" is widely regarded as the worst episode of the series, with critics calling it "god-awful" and a "travesty"; even Hurley referred to it negatively. It can be compared to "Spock's Brain" in The Original Series.
Right.
One character I found really annoying was Q.
In how many episodes does he really appear? Let's look at the character list in the dataset. Those episodes must be terrible.
TNG %>%
  group_by(episode) %>%
  filter(grepl(",Q,", characters)) %>%
  summarize(rating = mean(imdbRating)) %>%
  knitr::kable(format = "html")
Well, they're not. They are among the best episodes of TNG.
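One caveat about the ",Q," pattern: it only matches Q in the middle of the comma-separated character list. I have not checked whether Q is ever the first or last name in the real `characters` column, so the sketch below uses invented strings to show what an anchored pattern would pick up in addition.

```r
# Invented character lists; the real `characters` column may look different.
characters <- c(
  "PICARD,Q,RIKER",   # Q in the middle
  "Q,PICARD,DATA",    # Q first
  "WORF,TROI,Q",      # Q last
  "PICARD,Q2,RIKER",  # a different name that merely starts with Q
  "PICARD,DATA"       # no Q at all
)

# The pattern used above: Q must have a comma on both sides.
middle_only <- grepl(",Q,", characters)

# Alternative: Q bounded by a comma or by the start/end of the string.
anywhere <- grepl("(^|,)Q(,|$)", characters)

data.frame(characters, middle_only, anywhere)
```

If both patterns return the same episodes on the real data, the simpler one is fine.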
While I created this dataset I found that the descriptions in the scripts are very nice.
This is the first one:
TNG$text[[1]]
Which made me think: how many times is this description used? It feels as if the scene is used very often.
TNG %>%
  filter(type == "description") %>%
  filter(grepl("enterprise", text, ignore.case = TRUE),
         grepl("warp speed", text, ignore.case = TRUE)) %>%
  select(text, Season) %>%
  knitr::kable(format = "html")
Not that often it seems.
Found at: https://www.heatherbuchanan.ca/products/captain-picard-tea-earl-grey-hot-greeting-card
Picard seems to drink a lot of Earl Grey tea.
In fact, someone made a montage of all the times he orders it.
TNG %>%
  filter(grepl("PICARD", who), grepl(" tea ", text)) %>%
  select(who, text, Season, act) %>%
  knitr::kable(format = "html")
That's weird. In the original scripts there is little to no mention of Earl Grey tea. In fact, when I search for the exact phrase, it only occurs seven times.
grep("Tea. Earl Grey. Hot", TNG$text, value = TRUE, ignore.case = TRUE)
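A small note on the pattern in that search: in a regular expression an unescaped dot matches any character, so the search is slightly looser than an exact phrase match. The sketch below uses invented lines (not taken from the scripts) to show the difference when the dots are escaped.

```r
# Invented lines for illustration; they are not taken from the scripts.
lines <- c(
  "Tea. Earl Grey. Hot.",
  "tea, earl grey, hot",
  "Tea. Earl Grey. Hot!",
  "Earl Grey tea"
)

# Unescaped dots match any character, so the comma variant also counts.
loose <- grepl("Tea. Earl Grey. Hot", lines, ignore.case = TRUE)

# Escaped dots only match a literal full stop.
strict <- grepl("Tea\\. Earl Grey\\. Hot", lines, ignore.case = TRUE)

sum(loose)
sum(strict)
```

For counting how often Picard orders tea the looser match may well be what you want; the escaped version is only needed when punctuation matters.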
I haven't checked everything and I had some errors during the construction, so some scripts are not complete and some parts are perhaps wrongly classified as speech or description.
The creation of the dataset took me 15 hours and linking it to the IMDB database and creating this package took me another 4 hours.
I've downloaded all the files from http://www.st-minutiae.com/resources/scripts/
And discovered that the scripts (mostly...) follow a convention of
I have used the packages dplyr and readr.
My dataset is CC0 (public domain).
I'm very curious to see your analyses of TNG. Enjoy
Roel M. Hogervorst
2016-3-27