create_tcorpus: Create a tCorpus


View source: R/create_tcorpus.r

Description

Create a tCorpus from raw text input

Usage

create_tcorpus(x, ...)

## S3 method for class 'character'
create_tcorpus(x, doc_id = 1:length(x), meta = NULL,
  udpipe_model = NULL, split_sentences = F, max_sentences = NULL,
  max_tokens = NULL, udpipe_model_path = getOption("corpustools_resources",
  NULL), use_parser = F, remember_spaces = FALSE, verbose = T, ...)

## S3 method for class 'factor'
create_tcorpus(x, doc_id = 1:length(x), meta = NULL,
  udpipe_model = NULL, split_sentences = F, max_sentences = NULL,
  max_tokens = NULL, udpipe_model_path = getOption("corpustools_resources",
  NULL), use_parser = F, remember_spaces = FALSE, verbose = T, ...)

## S3 method for class 'data.frame'
create_tcorpus(x, text_columns = "text",
  doc_column = "doc_id", udpipe_model = NULL, split_sentences = F,
  max_sentences = NULL, max_tokens = NULL,
  udpipe_model_path = getOption("corpustools_resources", NULL),
  use_parser = F, remember_spaces = FALSE, verbose = T, ...)

Arguments

x

Main input. Can be a character (or factor) vector in which each value is a full text, or a data.frame with a column that contains full texts.

...

not used

doc_id

If x is a character/factor vector, doc_id can be used to specify document ids. This has to be a vector of the same length as x.

meta

A data.frame with document meta information (e.g., date, source). The rows of the data.frame need to match the values of x

udpipe_model

Optionally, the name of a udpipe language model (e.g., "english", "dutch", "german"), to use the udpipe package to perform natural language processing. On first use, the model will be downloaded to the location specified in the udpipe_model_path argument. By default, dependency parsing (see use_parser argument) is turned off.

split_sentences

Logical. If TRUE, the sentence number of each token is also computed (only applies if udpipe_model is not used).

max_sentences

An integer. Limits the number of sentences per document to the specified number. If set when split_sentences == FALSE, split_sentences will be set to TRUE.

max_tokens

An integer. Limits the number of tokens per document to the specified number

udpipe_model_path

If udpipe_model is used, this path will be used to look for the model. If the model does not yet exist there, it will be downloaded to this location. If no path is given, the directory in which corpustools was installed will be used (see resources_path). You can also change the default with set_resources_path.

use_parser

If TRUE, use dependency parser (only if udpipe_model is used)

remember_spaces

If TRUE, a column with spaces after each token is included. Enables correct reconstruction of original text and keeps annotations at the level of character positions (e.g., brat) intact.

verbose

If TRUE, report progress

text_columns

If x is a data.frame, this specifies the column(s) that contain text. The texts are pasted together in the order specified here.

doc_column

If x is a data.frame, this specifies the column with the document ids.
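
The effect of max_sentences and max_tokens can be seen in a small sketch like the one below. It uses only arguments listed above. The object names tc_sent and tc_tok are made up for illustration, and it assumes the corpustools package is installed.

```r
library(corpustools)

texts = c('Text one first sentence. Text one second sentence.', 'Text two')

## keep only the first sentence of each document
## (per the max_sentences description, split_sentences is switched on implicitly)
tc_sent = create_tcorpus(texts, max_sentences = 1, verbose = FALSE)
tc_sent$get()

## keep at most three tokens per document
tc_tok = create_tcorpus(texts, max_tokens = 3, verbose = FALSE)
tc_tok$get()
```

Both limits apply per document, so short documents such as 'Text two' are left unchanged.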

Examples

tc = create_tcorpus(c('Text one first sentence. Text one second sentence', 'Text two'))
tc$get()

tc = create_tcorpus(c('Text one first sentence. Text one second sentence', 'Text two'),
                    split_sentences = TRUE)
tc$get()

## with meta (easier with the S3 method for data.frame)
meta = data.frame(doc_id = c(1,2), source = c('a','b'))
tc = create_tcorpus(c('Text one first sentence. Text one second sentence', 'Text two'),
                    split_sentences = TRUE,
                    doc_id = c(1,2),
                    meta = meta)
tc

## It makes little sense to have full texts as factors, but it tends to happen.
## The create_tcorpus S3 method for factors is essentially identical to the
##  method for a character vector.
text = factor(c('Text one first sentence', 'Text one second sentence'))
tc = create_tcorpus(text)
tc$get()

d = data.frame(text = c('Text one first sentence. Text one second sentence.',
               'Text two', 'Text three'),
               date = c('2010-01-01','2010-01-01','2012-01-01'),
               source = c('A','B','B'))

tc = create_tcorpus(d, split_sentences = TRUE)
tc
tc$get()

## use multiple text columns
d$headline = c('Head one', 'Head two', 'Head three')
## use custom doc_id
d$doc_id = c('#1', '#2', '#3')

tc = create_tcorpus(d, text_columns = c('headline','text'), doc_column = 'doc_id',
                    split_sentences = TRUE)
tc
tc$get()

## (note that text from different columns is pasted together with a double newline in between)
tc$read_text(doc_id = '#1')
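
To make the note about multiple text columns concrete, the following base R sketch reproduces the described joining step by hand. It is only a rough equivalent of the documented behaviour, not the package's internal code. The data.frame and the variable combined are hypothetical.

```r
## two text columns, as in the example above (shortened)
d = data.frame(headline = c('Head one', 'Head two'),
               text = c('Text one.', 'Text two.'),
               stringsAsFactors = FALSE)

## join the columns in the order given, separated by a double newline
text_columns = c('headline', 'text')
combined = do.call(paste, c(unname(as.list(d[text_columns])), sep = '\n\n'))

## show the first combined text, with the blank line between headline and body
cat(combined[1])
```

Changing the order of text_columns changes the order in which the parts appear in the combined text.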

kasperwelbers/corpustools documentation built on Dec. 5, 2018, 9:11 a.m.