An R package of Hungarian audiovisual script translations by Patrik Szigeti, formatted to be convenient for text analysis.
This package was created as part of an assignment for Data Science 4 - Unstructured Text Analysis at CEU, Budapest.
Descriptions from IMDb. These shows all aired on the Discovery Channel in Hungary.
`dubbR` is not on CRAN yet; please install it from GitHub:

```r
remotes::install_github('szigony/dubbR')
```
dub_data
This function creates the tibbles that serve as the basis for the exported functions. It is stored in the `data_raw` folder so that it can be run whenever necessary without looping through all the files every time `library(dubbR)` is called. Its output is stored in the `R` folder, in the `sysdata.rda` file.
- Reads the script files (with the `read_docx` function from the `docxtractr` package) that contain three columns: timestamp, character and text.
- Uses the `str_extract` function with RegEx, as well as the `str_replace` function, from the `stringr` package.
- Adds a `dub_id` column based on the index of the iteration to each of the scripts for ease of identification.
- […] `dub_id`.
- […] `dub_id`.
- Creates the tibbles (`dubbr_metadata`, `dubbr_text` and `dubbr_characters`) that later serve as inputs for the other functions.

dub_metadata
Metadata about the audiovisual translations. Returns a tibble with the metadata. Follows the structure of `dubbr_metadata`, which is created by `dub_data`.
Optional input parameter: `shows`, the show or shows that are of interest.
| Column | Description |
| --- | --- |
| `dub_id` | Unique identifier of the scripts. |
| `production_code` | The production code that was used by the production company. |
| `show` | The name of the TV show. |
| `season` | The season of the TV show for which the translation was requested. |
| `episode` | The episode of the TV show within the season for which the translation was requested. |
dub_text
The text of the audiovisual translations, formatted to be convenient for text analysis. Returns a tibble with the text of the scripts. Follows the structure of `dubbr_text`, which is created by `dub_data`.
Optional input parameter: `shows`, the show or shows that are of interest.
| Column | Description |
| --- | --- |
| `dub_id` | Unique identifier of the scripts. |
| `text` | The text from the scripts line by line. |
dub_characters
The characters from the audiovisual translations. This can be used in `anti_join`s so that the character names don't skew the analysis. Returns a tibble with the characters from the scripts. Follows the structure of `dubbr_characters`, which is created by `dub_data`.
Optional input parameter: `shows`, the show or shows that are of interest.
| Column | Description |
| --- | --- |
| `dub_id` | Unique identifier of the scripts. |
| `character` | The characters that appear in the scripts. |
dub_shows
A unique list of the shows that appear in the package. It can be used to explore the package and filter the contents. Returns a tibble with the unique list of shows that appear in the package.
| Column | Description |
| --- | --- |
| `show` | The unique TV shows in the package. |
This is a typical workflow of leveraging the package's capabilities:
```r
dub_shows()
```

```r
dub_metadata("Fifth Gear")
```
This returns 18 rows for the 18 scripts of Fifth Gear that are available in the package.
Select a show and...
...filter for the scripts that are stored line by line.

```r
dub_text("Fifth Gear")
```

...list the characters that appear in them.

```r
dub_characters("Fifth Gear")
```
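Use the `tidytext` package to perform text analysis.

```r
library(dplyr)
library(tidytext)
library(dubbR)

# Tokenize the scripts into one word per row
fg_scripts <- dub_text("Fifth Gear") %>%
  unnest_tokens(word, text)

# Lowercase the character names so they match the tokenized words
fg_characters <- dub_characters("Fifth Gear") %>%
  rename(word = character) %>%
  select(word) %>%
  mutate(word = tolower(word)) %>%
  distinct()

# Drop character names and Hungarian stop words
fg_scripts %>%
  anti_join(fg_characters, by = "word") %>%
  anti_join(get_stopwords("hu"), by = "word")
```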
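To see why the character names are lowercased before the `anti_join`, here is a minimal, self-contained sketch that runs without the package. The `toy_text` and `toy_characters` tibbles are made-up stand-ins for the output of `dub_text()` and `dub_characters()`. They are not real package data, and the upper-case names are an assumption about how characters might be stored.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-ins for dub_text() and dub_characters() output
toy_text <- tibble(
  dub_id = c(1L, 1L, 2L),
  text = c(
    "Tiff: Ez egy gyors kocsi.",
    "Jason szerint a kocsi nem gyors.",
    "Tiff nevet."
  )
)
toy_characters <- tibble(
  dub_id = c(1L, 1L, 2L),
  character = c("TIFF", "JASON", "TIFF")
)

# unnest_tokens() lowercases tokens and strips punctuation by default
toy_words <- toy_text %>%
  unnest_tokens(word, text)

# Lowercase the names so they can match the lowercased tokens
toy_names <- toy_characters %>%
  transmute(word = tolower(character)) %>%
  distinct()

# Keep only the words that are not character names
cleaned <- toy_words %>%
  anti_join(toy_names, by = "word")
```

Without the `tolower()` step, `"TIFF"` would not match the token `"tiff"`, and the names would stay in the word counts.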
`NOTE: header=FALSE but table has a marked header row in the Word document`
In some cases, the first column of the tables containing the text is detected by the `read_docx` function as a likely header.
Solution: These scripts were removed from the package for now.
```r
Warning messages:
1: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
2: In bind_rows_(x, .id) :
  binding character and factor vector, coercing into character vector
```
Solution: Pending.
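The warnings indicate that a column is a factor in some tables and a character vector in others when they are bound together. One possible workaround, offered as a sketch and not as the package's fix, is to convert factor columns to character before calling `bind_rows`. The `script_a` and `script_b` data frames below are hypothetical examples, not real script tables.

```r
library(dplyr)

# Hypothetical script tables: `character` is a factor in one and a character vector in the other
script_a <- data.frame(
  character = factor(c("TIFF", "JASON")),
  text = c("first line", "second line"),
  stringsAsFactors = FALSE
)
script_b <- data.frame(
  character = "VICKI",
  text = "third line",
  stringsAsFactors = FALSE
)

# Convert every factor column of a data frame to character
factors_to_character <- function(df) {
  df[] <- lapply(df, function(col) if (is.factor(col)) as.character(col) else col)
  df
}

# Harmonize the column types first, then bind with an id per script
combined <- list(script_a, script_b) %>%
  lapply(factors_to_character) %>%
  bind_rows(.id = "dub_id")
```

Because every table then has the same column types, `bind_rows` has nothing to coerce.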