split_transcript: Split a Transcript Style Vector on Delimiter & Coerce to...

View source: R/split_transcript.R

split_transcript    R Documentation

Split a Transcript Style Vector on Delimiter & Coerce to Dataframe

Description

Split a transcript style vector (e.g., c("greg: Who me", "sarah: yes you!")) into a name vector and a dialogue vector, which are coerced to a data.table. Leading and trailing white space in both columns is stripped.

Usage

split_transcript(
  x,
  delim = ":",
  colnames = c("person", "dialogue"),
  max.delim = 15,
  ...
)

Arguments

x

A transcript style vector (e.g., c("greg: Who me", "sarah: yes you!")).

delim

The delimiter to split on.

colnames

The column names to use for the data.table output.

max.delim

An integer giving the maximum number of characters that may precede the delimiter. This is useful when a colon is the delimiter but time stamps, which also contain colons, appear in the text.

...

Ignored.
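
The idea behind max.delim can be illustrated with a minimal base R sketch. It is not textshape's implementation: split_first_delim and max_chars are hypothetical names, and returning NA for the speaker when no delimiter appears early enough is an assumption made only for this illustration.

```r
# Hypothetical helper (not part of textshape): split each element at its
# first delimiter, but only when at most `max_chars` characters precede it.
split_first_delim <- function(x, delim = ":", max_chars = 15) {
    # Position of the first literal delimiter in each element (-1 if absent)
    pos <- as.integer(regexpr(delim, x, fixed = TRUE))
    # A delimiter that shows up too late is not treated as a speaker label
    ok <- pos > 0 & pos <= max_chars + 1
    # Assumption for this sketch: no usable delimiter means an NA speaker
    person <- ifelse(ok, substr(x, 1, pos - 1), NA_character_)
    dialogue <- ifelse(ok, substr(x, pos + nchar(delim), nchar(x)), x)
    data.frame(
        person = trimws(person),
        dialogue = trimws(dialogue),
        stringsAsFactors = FALSE
    )
}

x <- c(
    "greg: Who me",
    "sarah: we meet at 10:30",
    "The meeting starts at 10:30"
)
split_first_delim(x)
```

In the second element only the first colon is used, so the time stamp stays in the dialogue. In the third element the only colon comes after more than 15 characters, so the whole string is kept as dialogue.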

Value

Returns a two-column data.table whose columns are named by colnames.
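
As a small illustration of the return value, the sketch below passes custom column names and inspects the result. The names speaker and utterance are arbitrary choices for this example, and the call is guarded so that it only runs when textshape is installed.

```r
# Only run when the textshape package is available
if (requireNamespace("textshape", quietly = TRUE)) {
    x <- c("greg: Who me", "sarah: yes you!")

    # `speaker` and `utterance` are illustrative column names
    out <- textshape::split_transcript(
        x,
        colnames = c("speaker", "utterance")
    )

    print(names(out))
    print(out)
}
```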

Examples

split_transcript(c("greg: Who me", "sarah: yes you!"))

## Not run: 
## 2015 Vice-Presidential Debates Example
if (!require("pacman")) install.packages("pacman")
pacman::p_load(rvest, magrittr, xml2)

debates <- c(
    wisconsin = "110908",
    boulder = "110906",
    california = "110756",
    ohio = "110489"
)

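lapply(debates, function(x){
    # Download the debate page and extract the text of every paragraph
    xml2::read_html(paste0("http://www.presidency.ucsb.edu/ws/index.php?pid=", x)) %>%
        rvest::html_nodes("p") %>%
        rvest::html_text() %>%
        # Start a new chunk at each paragraph that opens with an upper-case speaker label
        textshape::split_index(grep("^[A-Z]+:", .)) %>%
        # Collapse each chunk into a single string per speaker turn
        textshape::combine() %>%
        # Separate speaker from dialogue, then split the dialogue into sentences
        textshape::split_transcript() %>%
        textshape::split_sentence()
})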

## End(Not run)

trinker/textshape documentation built on April 5, 2024, 11:39 a.m.