dubbR

An R package of Hungarian script translations by Patrik Szigeti, formatted to be convenient for text analysis.

This package was created as part of an assignment for Data Science 4 - Unstructured Text Analysis at CEU, Budapest.

Available shows

Descriptions from IMDb. These shows all aired on the Discovery Channel in Hungary.

Installation

dubbR is not on CRAN yet; please install it from GitHub:

remotes::install_github('szigony/dubbR')

Functions

dub_data

This function creates the tibbles that serve as the basis for the exported functions. It is stored in the data_raw folder so that it can be run whenever necessary without looping through all the files every time library(dubbR) is called. Its output is stored in the R folder, in the sysdata.rda file.

dub_metadata

Metadata about the audiovisual translations. Returns a tibble with the metadata. Follows the structure of dubbr_metadata that is created by dub_data.

Optional input parameter: shows - the show or shows that are of interest.

| Column | Description |
| --- | --- |
| dub_id | Unique identifier of the scripts. |
| production_code | The production code that was used by the production company. |
| show | The name of the TV show. |
| season | The season of the TV show for which the translation was requested. |
| episode | The episode of the TV show within the season for which the translation was requested. |

dub_text

The text of the audiovisual translations, formatted to be convenient for text analysis. Returns a tibble with the text of the scripts. Follows the structure of dubbr_text that is created by dub_data.

Optional input parameter: shows - the show or shows that are of interest.

| Column | Description |
| --- | --- |
| dub_id | Unique identifier of the scripts. |
| text | The text from the scripts line by line. |
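Since both dub_metadata and dub_text carry a dub_id column, the two tibbles can presumably be joined on it. The following is a minimal sketch under that assumption. It uses made-up stand-in tibbles with the documented columns rather than the package's real data, and the column types shown (numeric IDs, character production codes) are guesses.

```r
library(dplyr)
library(tibble)

# Toy stand-ins with the documented columns; real data would come from
# dub_metadata() and dub_text(). All values below are made up.
toy_metadata <- tibble(
  dub_id = c(1, 2),
  production_code = c("A001", "A002"),
  show = c("Fifth Gear", "Fifth Gear"),
  season = c(1, 1),
  episode = c(1, 2)
)
toy_text <- tibble(
  dub_id = c(1, 1, 2),
  text = c("first line", "second line", "another script")
)

# Attach show, season and episode to every script line via the shared dub_id
text_with_meta <- toy_text %>%
  left_join(toy_metadata, by = "dub_id")

# Count the number of script lines per episode
lines_per_episode <- text_with_meta %>%
  count(show, season, episode, name = "n_lines")

lines_per_episode
```

With the package installed, replacing the toy tibbles with dub_metadata() and dub_text() should give the same shape of result, provided dub_id has the same type in both.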

dub_characters

The characters from the audiovisual translations. This can be used in an anti_join so that character names don't skew the analysis. Returns a tibble with the characters from the scripts. Follows the structure of dubbr_characters that is created by dub_data.

Optional input parameter: shows - the show or shows that are of interest.

| Column | Description |
| --- | --- |
| dub_id | Unique identifier of the scripts. |
| character | The characters that appear in the scripts. |

dub_shows

A unique list of the shows that appear in the package. It can be used to explore the package and filter the contents. Returns a tibble with the unique list of shows that appear in the package.

| Column | Description |
| --- | --- |
| show | The unique TV shows in the package. |

Example

A typical workflow for using the package:

  1. See what shows are available in the package.
dub_shows()
  2. Look at the metadata, text, and characters for the show(s) you're interested in.
dub_metadata("Fifth Gear")

dub_text("Fifth Gear")

dub_characters("Fifth Gear")

  3. Use the tidytext package to perform text analysis.
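library(dplyr)
library(tidytext)
library(dubbR)

# Split the script text into one word per row (unnest_tokens lowercases by default)
fg_scripts <- dub_text("Fifth Gear") %>%
  unnest_tokens(word, text)

# Build a lowercase, de-duplicated list of character names to exclude
fg_characters <- dub_characters("Fifth Gear") %>%
  rename(word = character) %>%
  select(word) %>%
  mutate(word = tolower(word)) %>%
  distinct()

# Drop character names and Hungarian stop words
fg_scripts %>%
  anti_join(fg_characters, by = "word") %>%
  anti_join(get_stopwords("hu"), by = "word")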
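A natural next step after filtering is to count the remaining words. The sketch below shows one way to do that. The helper top_words is hypothetical and not part of dubbR, and it is demonstrated on a tiny made-up tibble so that it runs without the package. It assumes a one-token-per-row tibble with a word column, as produced by unnest_tokens in the workflow above.

```r
library(dplyr)
library(tibble)

# Hypothetical helper (not part of dubbR): count the most frequent words
# in a one-token-per-row tibble after dropping unwanted words.
top_words <- function(tokens, exclude, n = 10) {
  tokens %>%
    anti_join(exclude, by = "word") %>%  # drop character names, stop words, etc.
    count(word, sort = TRUE) %>%         # frequency table, most frequent first
    head(n)                              # keep the first n rows
}

# Made-up stand-in data; "anna" plays the role of a character name
toy_tokens  <- tibble(word = c("motor", "benzin", "motor", "anna", "gumi", "motor"))
toy_exclude <- tibble(word = "anna")

top_words(toy_tokens, toy_exclude, n = 2)
```

Applied to the real data, the same helper could be called as top_words(fg_scripts, fg_characters), optionally after also removing stop words.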

Known issues

In some cases, the first column of the tables containing the text is detected by the read_docx function as a likely header.

Solution: These scripts were removed from the package for now.

Solution: Pending.



szigony/dubbR documentation built on June 4, 2019, 9:09 a.m.