dubbR is an R package of Hungarian script translations by Patrik Szigeti, formatted to be convenient for text analysis.
This package was created as part of an assignment for Data Science 4 - Unstructured Text Analysis at CEU, Budapest.
Show descriptions are from IMDb. All of these shows aired on the Discovery Channel in Hungary.
dubbR is not on CRAN yet; please install it from GitHub:

```r
remotes::install_github('szigony/dubbR')
```
### dub_data

This function creates the tibbles that serve as the basis for the exported functions. It is stored in the `data_raw` folder so that it can be rerun whenever necessary without looping through all the files every time `library(dubbR)` is called. Its output is stored in the `R` folder, in the `sysdata.rda` file.

- Reads the Word documents (using the `read_docx` function from the `docxtractr` package), which contain three columns: timestamp, character and text.
- Extracts and cleans values using the `str_extract` function with RegEx, as well as the `str_replace` function from the `stringr` package.
- Adds a `dub_id` column, based on the index of the iteration, to each of the scripts for ease of identification.
- Creates the tibbles (`dubbr_metadata`, `dubbr_text` and `dubbr_characters`), each carrying `dub_id`, that later serve as inputs for the other functions.

### dub_metadata

Metadata about the audiovisual translations. Returns a tibble with the metadata, following the structure of `dubbr_metadata` created by `dub_data`.
Optional input parameter: shows - the show or shows that are of interest.
| Column | Description |
| --- | --- |
| dub_id | Unique identifier of the scripts. |
| production_code | The production code that was used by the production company. |
| show | The name of the TV show. |
| season | The season of the TV show for which the translation was requested. |
| episode | The episode of the TV show within the season for which the translation was requested. |
### dub_text

The text of the audiovisual translations, formatted to be convenient for text analysis. Returns a tibble with the text of the scripts, following the structure of `dubbr_text` created by `dub_data`.
Optional input parameter: shows - the show or shows that are of interest.
| Column | Description |
| --- | --- |
| dub_id | Unique identifier of the scripts. |
| text | The text from the scripts line by line. |
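
Because both `dub_text` and `dub_metadata` carry `dub_id`, their results can be combined on that column. The following is a minimal sketch, assuming dubbR and dplyr are installed; the name `fg_lines` is only illustrative and not part of the package.

```r
library(dplyr)
library(dubbR)

# Attach show, season and episode to every script line via the shared dub_id
fg_lines <- dub_text("Fifth Gear") %>%
  left_join(dub_metadata("Fifth Gear"), by = "dub_id")

# Number of script lines per episode
fg_lines %>%
  count(season, episode, name = "lines")
```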
### dub_characters

The characters from the audiovisual translations. This can be used in an `anti_join` so that character names don't skew the analysis. Returns a tibble with the characters from the scripts, following the structure of `dubbr_characters` created by `dub_data`.
Optional input parameter: shows - the show or shows that are of interest.
| Column | Description |
| --- | --- |
| dub_id | Unique identifier of the scripts. |
| character | The characters that appear in the scripts. |
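
To illustrate how such an `anti_join` removes character names, here is a small sketch on made-up data. The `tokens` and `characters` tibbles below are hypothetical stand-ins that only mimic the shape of tokenised `dub_text()` output and of `dub_characters()` output; they are not taken from the package.

```r
library(dplyr)

# Made-up stand-in for tokenised script text (one word per row)
tokens <- tibble(
  dub_id = c(1, 1, 1, 1),
  word   = c("anna", "gyors", "motor", "peter")
)

# Made-up stand-in for the output of dub_characters()
characters <- tibble(
  dub_id    = c(1, 1),
  character = c("Anna", "Peter")
)

# Lower-case the names so they match the lower-cased tokens
character_words <- characters %>%
  transmute(word = tolower(character)) %>%
  distinct()

# Keep only the tokens that are not character names
cleaned <- tokens %>%
  anti_join(character_words, by = "word")
```

Multi-word names would need to be tokenised as well before they can match single-word tokens.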
### dub_shows

A unique list of the shows that appear in the package. It can be used to explore the package and filter the contents. Returns a tibble with the unique list of shows.
| Column | Description |
| --- | --- |
| show | The unique TV shows in the package. |
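This is a typical workflow for leveraging the package's capabilities:

1. List the shows available in the package.

   ```r
   dub_shows()
   ```

2. Look up the metadata for a show.

   ```r
   dub_metadata("Fifth Gear")
   ```

   This returns 18 rows, one for each of the 18 Fifth Gear scripts available in the package.

3. Select a show and...
   - ...filter for the scripts, which are stored line by line.

     ```r
     dub_text("Fifth Gear")
     ```

   - ...filter for its characters.

     ```r
     dub_characters("Fifth Gear")
     ```

4. Use the `tidytext` package to perform text analysis.

   ```r
   library(dplyr)
   library(tidytext)
   library(dubbR)

   # Split the Fifth Gear scripts into one word per row
   fg_scripts <- dub_text("Fifth Gear") %>%
     unnest_tokens(word, text)

   # Lower-cased, de-duplicated character names to filter out
   fg_characters <- dub_characters("Fifth Gear") %>%
     rename(word = character) %>%
     select(word) %>%
     mutate(word = tolower(word)) %>%
     distinct()

   # Drop character names and Hungarian stop words
   fg_scripts %>%
     anti_join(fg_characters, by = "word") %>%
     anti_join(get_stopwords("hu"), by = "word")
   ```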
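
As a possible next step, the cleaned tokens can be counted to find the most frequent words. The sketch below repeats the pipeline above so that it runs on its own, assuming dubbR, dplyr and tidytext are installed; `character_words` and `fg_word_counts` are illustrative names, not part of the package.

```r
library(dplyr)
library(tidytext)
library(dubbR)

# Lower-cased, de-duplicated character names
character_words <- dub_characters("Fifth Gear") %>%
  transmute(word = tolower(character)) %>%
  distinct()

# Tokenise, drop character names and Hungarian stop words, then count
fg_word_counts <- dub_text("Fifth Gear") %>%
  unnest_tokens(word, text) %>%
  anti_join(character_words, by = "word") %>%
  anti_join(get_stopwords("hu"), by = "word") %>%
  count(word, sort = TRUE)

# The ten most frequent remaining words
head(fg_word_counts, 10)
```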
### NOTE: header=FALSE but table has a marked header row in the Word document

In some cases, the first column of the tables containing the text is detected by the `read_docx` function as a likely header.

Solution: These scripts have been removed from the package for now.
```r
Warning messages:
1: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
2: In bind_rows_(x, .id) :
  binding character and factor vector, coercing into character vector
```

Solution: Pending.