as_transcript: Coerce Text to Transcripts Into R


View source: R/as_transcript.R

Description

Coerce text into a transcript.

Usage

as_transcript(
  text,
  person.regex = NULL,
  col.names = c("Person", "Dialogue"),
  text.var = NULL,
  merge.broke.tot = TRUE,
  header = FALSE,
  dash = "",
  ellipsis = "...",
  quote2bracket = FALSE,
  rm.empty.rows = TRUE,
  na = "",
  sep = NULL,
  skip = 0,
  comment.char = "",
  max.person.nchar = 20,
  ...
)

Arguments

text

Character string: the text to coerce into a transcript. Because as_transcript() takes no file, data are read from the value of text. Note that a literal string can be used to include (small) data sets within R code.

person.regex

A capturing regex describing which portion of a string is the person.

col.names

A character vector specifying the column names of the transcript columns.

text.var

A character string specifying the name of the text variable; supplying it ensures that the variable is classed as character. If NULL, read_transcript() attempts to guess the text variable (dialogue).

merge.broke.tot

logical. If TRUE and the file being read in is a .docx with broken space within a single turn of talk, read_transcript() will attempt to merge the pieces into a single turn of talk.

header

logical. If TRUE, the file contains the names of the variables as its first line.

dash

A character string to replace the en dash and em dash special characters (the default, "", removes them).

ellipsis

A character string to replace the ellipsis special character (the default is "...").

quote2bracket

logical. If TRUE, curly quotes are replaced with curly braces. If FALSE (the default), curly quotes are removed.

rm.empty.rows

logical. If TRUE, read_transcript() attempts to remove empty rows.

na

A character string to be interpreted as an NA value.

sep

The field separator character. Values on each line of the file are separated by this character. The default of NULL instructs read_transcript() to use a separator suitable for the file type being read in.

skip

Integer; the number of lines of the data file to skip before beginning to read data.

comment.char

A character vector of length one containing a single character or an empty string. Use "" to turn off the interpretation of comments altogether.

max.person.nchar

The maximum number of characters that person names are expected to contain. This value is used to warn the user if a separator appears beyond this length in the text.

...

Further arguments to be passed to utils::read.table(), readxl::read_excel(), or read_doc().
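
The dash, ellipsis, and quote2bracket arguments all describe replacements of special characters. The base-R sketch below approximates that behaviour as the descriptions above state it; it is not textreadr's implementation. The helper name clean_special() is hypothetical, and treating only double curly quotes (U+201C, U+201D) as "curly quotes" is an assumption.

```r
# Illustrative only: approximates the documented special-character handling.
# clean_special() is a hypothetical helper, not part of textreadr.
clean_special <- function(x, dash = "", ellipsis = "...", quote2bracket = FALSE) {
  # Replace en dash (U+2013) and em dash (U+2014) with `dash`
  x <- gsub("[\u2013\u2014]", dash, x)
  # Replace the single-character ellipsis (U+2026) with `ellipsis`
  x <- gsub("\u2026", ellipsis, x, fixed = TRUE)
  if (quote2bracket) {
    # Assumption: left/right double curly quotes map to { and }
    x <- gsub("\u201c", "{", x, fixed = TRUE)
    x <- gsub("\u201d", "}", x, fixed = TRUE)
  } else {
    # Otherwise the curly quotes are dropped
    x <- gsub("[\u201c\u201d]", "", x)
  }
  x
}

raw <- "Well\u2026 \u201cyes\u201d \u2014 maybe"
default_clean <- clean_special(raw)
bracket_clean <- clean_special(raw, dash = "-", quote2bracket = TRUE)
```

With the defaults the dash is removed outright, so the spaces that surrounded it remain side by side in the result.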

Value

Returns a data frame of dialogue and people.

Examples

library(textreadr)

## EXAMPLE 1
as_transcript("34    The New York Times reports a lot of words here.
12    Greenwire reports a lot of words.
31    Only three words.
 2    The Financial Times reports a lot of words.
 9    Greenwire short.
13    The New York Times reports a lot of words again.",
    col.names = c("NO", "ARTICLE"), sep = "   ")

## EXAMPLE 2
as_transcript("34..    The New York Times reports a lot of words here.
12..    Greenwire reports a lot of words.
31..    Only three words.
 2..    The Financial Times reports a lot of words.
 9..    Greenwire short.
13..    The New York Times reports a lot of words again.",
    col.names = c("NO", "ARTICLE"), sep = "\\.\\.")

## EXAMPLE 3
as_transcript("JAKE The New York Times reports a lot of words here.
JIM Greenwire reports a lot of words.
JILL Only three words.
GRACE The Financial Times reports a lot of words.
JIM Greenwire short.
JILL The New York Times reports a lot of words again.",
   person.regex = '(^[A-Z]{3,})'
)

Example output

Table: [6 x 2]

  NO ARTICLE                                 
1 34 The New York Times reports a lot of word
2 12 Greenwire reports a lot of words.       
3 31 Only three words.                       
4 2  The Financial Times reports a lot of wor
5 9  Greenwire short.                        
6 13 The New York Times reports a lot of word
. .. ...

Table: [6 x 2]

  NO ARTICLE                                 
1 34 The New York Times reports a lot of word
2 12 Greenwire reports a lot of words.       
3 31 Only three words.                       
4 2  The Financial Times reports a lot of wor
5 9  Greenwire short.                        
6 13 The New York Times reports a lot of word
. .. ...

Table: [6 x 2]

  Person Dialogue                                
1 JAKE   The New York Times reports a lot of word
2 JIM    Greenwire reports a lot of words.       
3 JILL   Only three words.                       
4 GRACE  The Financial Times reports a lot of wor
5 JIM    Greenwire short.                        
6 JILL   The New York Times reports a lot of word
. ...    ...                                      
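
Example 3 depends on person.regex capturing the speaker at the start of each line. As a rough illustration of that idea, the base-R sketch below applies the same pattern to two of the example lines. It is a sketch under stated assumptions, not how as_transcript() is implemented; the variable names are hypothetical.

```r
# Illustrative only: split each line into person and dialogue with base R.
lines <- c(
  "JAKE The New York Times reports a lot of words here.",
  "JIM Greenwire reports a lot of words."
)
person_regex <- "(^[A-Z]{3,})"

# regexpr() finds the first match per line; regmatches() extracts it.
# Lines with no match are dropped, so this assumes every line has a person.
person <- regmatches(lines, regexpr(person_regex, lines))

# Remove the matched person, then trim the leftover leading whitespace
dialogue <- trimws(sub(person_regex, "", lines))

transcript <- data.frame(Person = person, Dialogue = dialogue,
                         stringsAsFactors = FALSE)
```

The pattern requires three or more capital letters, so a shorter speaker label such as "AL" would not be matched by it.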

textreadr documentation built on Oct. 9, 2021, 5:06 p.m.