View source: R/udpipe_tcorpus.r
udpipe_tcorpus (R Documentation)
This is shorthand for calling create_tcorpus with the udpipe_* arguments and certain specific settings. It is the recommended way to create a tCorpus if you want to use the syntax analysis functionalities.
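For illustration, the sketch below shows the rough correspondence to create_tcorpus; the exact argument mapping and settings may differ, so treat it as an approximation rather than the definitive implementation.

## Roughly equivalent calls (a sketch; see create_tcorpus for the udpipe_* arguments)
# udpipe_tcorpus(x, model = 'english-ewt')
# create_tcorpus(x, udpipe_model = 'english-ewt')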
udpipe_tcorpus(x, ...)

## S3 method for class 'character'
udpipe_tcorpus(
  x,
  model = "english-ewt",
  doc_id = 1:length(x),
  meta = NULL,
  max_sentences = NULL,
  model_path = getwd(),
  cache = 3,
  cores = NULL,
  batchsize = 50,
  use_parser = TRUE,
  start_end = FALSE,
  verbose = TRUE,
  ...
)

## S3 method for class 'data.frame'
udpipe_tcorpus(
  x,
  model = "english-ewt",
  text_columns = "text",
  doc_column = "doc_id",
  max_sentences = NULL,
  model_path = getwd(),
  cache = 3,
  cores = 1,
  batchsize = 50,
  use_parser = TRUE,
  start_end = FALSE,
  verbose = TRUE,
  ...
)

## S3 method for class 'factor'
udpipe_tcorpus(x, ...)

## S3 method for class 'corpus'
udpipe_tcorpus(x, ...)
x: Main input. Can be a character (or factor) vector where each value is a full text, or a data.frame that has a column containing full texts.

...: Arguments passed to create_tcorpus.character.

model: The name of a Universal Dependencies language model (e.g., "english-ewt", "dutch-alpino") to be used by the udpipe package.

doc_id: If x is a character/factor vector, doc_id can be used to specify document ids. This has to be a vector of the same length as x.

meta: A data.frame with document meta information (e.g., date, source). The rows of the data.frame need to match the values of x.

max_sentences: An integer. Limits the number of sentences per document to the specified number.

model_path: The path used to look for the model; if the model does not yet exist, it will be downloaded to this location. Defaults to the working directory.

cache: The number of persistent caches to keep for inputs of udpipe. The caches store tokens in batches. This way, if a lot of data has to be parsed, or if R crashes, udpipe can continue from the latest batch instead of starting over. The caches are stored in the corpustools_data folder (in model_path). Only the most recent caches will be kept.

cores: The number of parallel cores. If not specified, uses the same number of cores as data.table (or limited by OMP_THREAD_LIMIT).

batchsize: In order to report progress and cache results, texts are parsed with udpipe in batches of 50. The price is some overhead for each batch, so for very large jobs it can be faster to increase the batchsize. If the number of texts divided by the number of parallel cores is lower than the batchsize, the texts are evenly distributed over the cores.

use_parser: If TRUE, use the dependency parser.

start_end: If TRUE, include the start and end positions of tokens.

verbose: If TRUE, report progress. Only applies if x is large enough to require multiple sequential batches.

text_columns: If x is a data.frame, this specifies the column(s) that contain the text. The texts are pasted together in the order specified here.

doc_column: If x is a data.frame, this specifies the column with the document ids.
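As an illustration of doc_id, meta, text_columns and doc_column, here is a minimal sketch; the column names and values are made up for the example.

if (interactive()) {
  ## character method: supply document ids and matching meta information
  txt = c('First text', 'Second text')
  m = data.frame(source = c('A', 'B'))   ## one row per value of x
  tc = udpipe_tcorpus(txt, model = 'english-ewt',
                      doc_id = c('doc_1', 'doc_2'), meta = m)

  ## data.frame method: multiple text columns are pasted together per document
  d = data.frame(doc_id = c('doc_1', 'doc_2'),
                 headline = c('Headline one', 'Headline two'),
                 body = c('Body one.', 'Body two.'))
  tc = udpipe_tcorpus(d, model = 'english-ewt',
                      text_columns = c('headline', 'body'),
                      doc_column = 'doc_id')
}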
## ...
if (interactive()) {
  tc = udpipe_tcorpus(c('Text one first sentence. Text one second sentence', 'Text two'),
                      model = 'english-ewt')
  tc$tokens
}
if (interactive()) {
  tc = udpipe_tcorpus(sotu_texts[1:5,], doc_column = 'id', model = 'english-ewt')
  tc$tokens
}
## It makes little sense to have full texts as factors, but it tends to happen.
## The udpipe_tcorpus S3 method for factors is essentially identical to the
## method for a character vector.
text = factor(c('Text one first sentence', 'Text one second sentence'))
if (interactive()) {
  tc = udpipe_tcorpus(text, 'english-ewt')
  tc$tokens
}
# library(quanteda)
# udpipe_tcorpus(data_corpus_inaugural, 'english-ewt')
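## The commented call above requires quanteda. A guarded sketch, assuming
## quanteda is installed (data_corpus_inaugural ships with quanteda):
if (interactive() && requireNamespace('quanteda', quietly = TRUE)) {
  tc = udpipe_tcorpus(quanteda::data_corpus_inaugural, model = 'english-ewt')
  tc$tokens
}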