qe_data: Apply quanteda to text data


Description

Apply quanteda to text data

Usage

qe_data(
  data,
  text_columns = NULL,
  id_column = NULL,
  join_data = T,
  method = "word",
  remove_words = c("\\.", "\\-", "\\#", "\\'", "\\,", "\\;", "\\_",
    "\\DESCRIPTION:", "\\:", "\\SBIR", "\\I ", "\\II ", "\\III ", "PHASE"),
  dfm_dictionary = NULL,
  n_top_features = 10,
  stem = F,
  remove_numbers = T,
  remove_punct = T,
  remove_symbols = T,
  exclude_features = F,
  remove_separators = TRUE,
  remove_twitter = T,
  remove_hyphens = T,
  collocation_size = 2,
  include_textstat = F,
  remove_url = T,
  stop_sources = c("smart", "snowball", "stopwords-iso"),
  n_gram_tokens = 2:3,
  include_dfm = F,
  verbose = T
)

Arguments

data

tibble

text_columns

vector of text column names

id_column

if not NULL, name of the id column

join_data

if TRUE joins to original data

method

the unit for splitting the text; available alternatives are:

"word"

(recommended default) smartest, but slowest, word tokenization method; see stringi-search-boundaries for details.

"fasterword"

dumber, but faster, word tokenization method; uses stri_split_charclass(x, "[\\p{Z}\\p{C}]+")

"fastestword"

dumbest, but fastest, word tokenization method, calls stri_split_fixed(x, " ")

"character"

tokenization into individual characters

"sentence"

sentence segmenter, smart enough to handle some exceptions in English such as "Prof. Plum killed Mrs. Peacock." (but far from perfect).
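As a rough illustration of how the word-level methods differ, the sketch below calls stringi directly on a string containing a tab. It is not the package's internal code: the "fasterword" line uses stri_split_regex with the character class quoted above (a substitution for illustration), and the "word" line assumes boundary analysis via stri_split_boundaries.

```r
library(stringi)

# Sample text with a space and a tab between words
x <- "web based\tnutrition"

# "word"-style: Unicode word-boundary analysis (assumed equivalent, for illustration)
smart <- stri_split_boundaries(x, type = "word", skip_word_none = TRUE)[[1]]

# "fasterword"-style: split on runs of separator (Z) or control (C) characters
faster <- stri_split_regex(x, "[\\p{Z}\\p{C}]+")[[1]]

# "fastestword"-style: split on a literal single space only
fastest <- stri_split_fixed(x, " ")[[1]]

print(smart)
print(faster)
print(fastest)
```

The fixed-space split leaves the tab inside a token, which is the kind of shortcut that makes "fastestword" fast but crude.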

dfm_dictionary

if not NULL dictionary of word meanings

n_top_features

if not NULL number of top features for feature count

stem

if TRUE stem words

remove_numbers

logical; if TRUE remove tokens that consist only of numbers, but not words that start with digits, e.g. 2day
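The rule can be approximated in base R as follows. This is a minimal sketch that assumes "consist only of numbers" means ASCII digits only; the actual tokenizer may use a broader Unicode number class.

```r
# Example tokens: two are purely numeric, one starts with a digit
tokens <- c("2day", "2020", "nutrition", "42")

# Drop tokens made up entirely of digits; keep everything else
kept <- tokens[!grepl("^[0-9]+$", tokens)]

print(kept)
```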

exclude_features

if TRUE exclude feature columns

remove_separators

logical; if TRUE remove separators and separator characters (Unicode "Separator" [Z] and "Control" [C] categories). Only applicable for method = "character" (when you probably want it to be FALSE) and for method = "word" (when you probably want it to be TRUE).

remove_twitter

logical; if TRUE remove Twitter characters @ and #; set to TRUE if you wish to eliminate these. Note that this will always be set to FALSE if remove_punct = FALSE.

remove_hyphens

logical; if TRUE split words that are connected by hyphenation and hyphenation-like characters in between words, e.g. "self-storage" becomes c("self", "storage"). Set to FALSE to preserve such words as is, with the hyphens. Only applies if method = "word" or method = "fasterword".

collocation_size

integer collocation size for textstat parameter

include_textstat

if TRUE applies textstat algorithm to text vector

remove_url

logical; if TRUE find and eliminate URLs beginning with http(s) – see section "Dealing with URLs".

stop_sources

stopword source

  • smart

  • snowball

  • stopwords-iso
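One plausible way to combine these sources is sketched below with the stopwords package. It assumes English ("en") as the language and a simple union of the three lists; how qe_data merges them internally is not stated here.

```r
library(stopwords)

# The three sources named in the default argument
stop_sources <- c("smart", "snowball", "stopwords-iso")

# Union of the English stopword lists from each source (assumed language: "en")
stop_words <- unique(unlist(lapply(
  stop_sources,
  function(src) stopwords("en", source = src)
)))

# Filter a lowercased token vector against the combined list
tokens <- c("the", "negative", "health", "consequences", "of", "poor", "nutrition")
kept <- tokens[!tokens %in% stop_words]

print(kept)
```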

n_gram_tokens

integer vector of n-gram sizes to generate; default 2:3
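To show what passing 2:3 means, here is a small base R sketch that builds both bigrams and trigrams from one token vector. The helper make_ngrams is hypothetical (not part of govtrackR), and the "_" concatenator is an assumption for illustration.

```r
# Hypothetical helper: build all n-grams of size n from a token vector
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
    character(1)
  )
}

tokens <- c("web", "based", "nutrition", "education")

# Sizes 2 and 3 together: three bigrams followed by two trigrams
ngrams <- unlist(lapply(2:3, make_ngrams, tokens = tokens))

print(ngrams)
```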

include_dfm

if TRUE includes document feature matrix

verbose

if TRUE verbose output

Examples

library(tidyverse)
data <- tibble(idSBIR = 190558, nameAward = "WEB-BASED NUTRITION EDUCATION FOR COLLEGE STUDENTS", descriptionAward = "THE NEGATIVE HEALTH AND SOCIAL CONSEQUENCES OF POOR NUTRITION ARE WELL DOCUMENTED IN COLLEGE STUDENTS AND THE OUTCOME OF POOR EATING HABITS IS MANIFEST IN BOTH OBESITY AND A VARIETY OFHEALTH CONCERNS. THIS APPLICATION PROPOSES THE DEVELOPMENT OF A COLLEGE STUDENTWEBSITE CALLED MYSTUDENTBODY.COM (NUTRITION), TO BE BASED AT A COLLEGE PERSONALHEALTH INTERNET PORTAL CALLED MYSTUDENTBODY.COM. THE PROGRAM WILL BE OFFEREDTHROUGH COLLEGES AND UNIVERSITIES TO EDUCATE STUDENTS ABOUT HEALTHY NUTRITIONAND LEARN EFFECTIVE, TAILORED HEALTHY EATING STRATEGIES. USING INTERACTIVE,WEB-BASED TECHNOLOGY, THIS PSYCHOEDUCATIONAL PROGRAM WILL BE SUPPORTED BY ANUMBER OF UNIQUE FEATURES THAT WILL MAKE IT A TRUE INNOVATION IN THE AREA OFNUTRITION EDUCATION FOR COLLEGE STUDENTS. MYSTUDENTBODY.COM (NUTRITION) WILLGUIDE STUDENTS THROUGH AN INTERACTIVE PROGRAM DESIGNED TO TEACH EFFECTIVENUTRITION EDUCATION IN AN INTERNET CONTEXT THAT IS INFORMATIVE, ENGAGING, ANDDRAMATIC. THE CURRENT APPLICATION COMBINES STATE-OF-THE-ART KNOWLEDGE ABOUTTAILORING STRATEGIES WITH ADVANCES IN INTERNET-BASED TECHNOLOGIES.MYSTUDENTBODY.COM (NUTRITION) OFFERS AN ONLINE PERSONALIZED NUTRITION EDUCATIONPROGRAM ALLOWING STUDENTS TO RECEIVE EMPIRICALLY-BASED INFORMATION AND FEEDBACKIN A CONFIDENTIAL MANNER.")

qe_data(data = data, id_column = "idSBIR", text_columns = c("nameAward", "descriptionAward"), n_gram_tokens = 2:3, n_top_features = 5) %>% glimpse()

abresler/govtrackR documentation built on July 11, 2020, 12:30 a.m.