aggregate_rsyntax | R Documentation |
A method for aggregating rsyntax annotations. The intended purpose is to compute aggregate values for a given label in an annotation column.
For example, suppose you used annotate_rsyntax to add a column with subject-predicate labels, and now you want to concatenate the tokens with these labels. With aggregate_rsyntax you would first aggregate the subject tokens, then aggregate the predicate tokens. With txt = TRUE, columns with the concatenated tokens are added; note that the default shown in the signature is txt = F.
You can specify any aggregation function using any column in tc$tokens. Say you want to perform a sentiment analysis on the quotes of politicians. You first used annotate_rsyntax to create an annotation column 'quote' that has the labels 'source', 'verb', and 'quote'. You also used code_dictionary to add a column with unique politician IDs and a column with sentiment scores. Now you can aggregate the source tokens to get a single unique ID, and aggregate the quote tokens to get a single sentiment score.
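The idea of aggregating per label can be illustrated without a tCorpus. The following is a minimal base R sketch, not the implementation of aggregate_rsyntax: the token table is made-up toy data, and the clause_id column is an assumption standing in for whatever ID column identifies each annotation in tc$tokens.

```r
# Toy token table (hypothetical data; clause_id is an assumed annotation ID column)
tokens <- data.frame(
  doc_id    = c(1, 1, 1, 1, 1),
  clause_id = c(1, 1, 1, 1, 1),
  token     = c("The", "senator", "praised", "the", "bill"),
  clause    = c("subject", "subject", "predicate", "predicate", "predicate"),
  sentiment = c(NA, NA, 1, NA, NA),
  stringsAsFactors = FALSE
)

# Concatenate the tokens per annotation and label
# (roughly the information that txt = TRUE is described as adding)
txt <- aggregate(token ~ clause_id + clause, data = tokens,
                 FUN = paste, collapse = " ")

# Custom aggregation for a single label: mean sentiment of the predicate tokens
pred <- tokens[tokens$clause == "predicate", ]
mean_sentiment <- tapply(pred$sentiment, pred$clause_id, mean, na.rm = TRUE)
```

Here txt holds one row per annotation and label, and mean_sentiment holds one value per annotation. aggregate_rsyntax is described as returning a data.table with the aggregated values, so the actual output shape will differ from this sketch.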
aggregate_rsyntax(
  tc,
  annotation,
  ...,
  by_col = NULL,
  txt = F,
  labels = NULL,
  rm_na = T
)
tc | A tCorpus |
annotation | The name of the rsyntax annotation column |
... | To aggregate columns for specific labels, use agg_label (see the examples) |
by_col | A character vector with other column names in tc$tokens to aggregate by |
txt | If TRUE, add columns with concatenated tokens for each label. Can also be a character vector specifying the labels for which to create this column |
labels | A character vector of labels to use instead of all labels |
rm_na | If TRUE, remove rows with only NA values |
A data.table
## Not run:
tc = tc_sotu_udpipe$copy()
tc$udpipe_clauses()
subject_verb_predicate = aggregate_rsyntax(tc, 'clause', txt=TRUE)
head(subject_verb_predicate)
## We can also add specific aggregation functions
## count number of tokens in predicate
aggregate_rsyntax(tc, 'clause',
                  agg_label('predicate', n = length(token_id)))
## same, but with txt for only the subject label
aggregate_rsyntax(tc, 'clause', txt='subject',
                  agg_label('predicate', n = length(token_id)))
## example application: sentiment scores for specific subjects
# first use queries to code subjects
tc$code_features(column = 'who',
                 query = c('I# I~s <this president>',
                           'we# we americans <american people>'))
# then use dictionary to get sentiment scores
dict = melt_quanteda_dict(quanteda::data_dictionary_LSD2015)
dict$sentiment = ifelse(dict$code %in% c('negative','neg_positive'), -1, 1)
tc$code_dictionary(dict)
sent = aggregate_rsyntax(tc, 'clause', txt='predicate',
                         agg_label('subject', subject = na.omit(who)[1]),
                         agg_label('predicate', sentiment = mean(sentiment, na.rm=TRUE)))
head(sent)
sent[,list(sentiment=mean(sentiment, na.rm=TRUE), n=.N), by='subject']
## End(Not run)