corpus_reshape: Recast the document units of a corpus

View source: R/corpus_reshape.R

Description

For a corpus, reshape (or recast) the documents to a different level of aggregation. Units of aggregation can be defined as documents, paragraphs, or sentences. Because the corpus object records its current "units" status, it is possible to move from recast units back to original units, for example from documents, to sentences, and then back to documents (possibly after modifying the sentences).
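
The round trip described above can be sketched as follows. This is a minimal illustration that assumes quanteda is installed; the two-document corpus and its document names (d1, d2) are invented for the example and are not part of the package.

```r
library(quanteda)

# Hypothetical two-document corpus, invented for illustration
corp <- corpus(c(d1 = "This is the first sentence. This is the second sentence.",
                 d2 = "This document has one sentence."))

# Recast documents to sentences: each sentence becomes its own document
corp_sent <- corpus_reshape(corp, to = "sentences")
ndoc(corp_sent)

# Recast back: sentences are regrouped into their original documents
corp_back <- corpus_reshape(corp_sent, to = "documents")
ndoc(corp_back)
docnames(corp_back)
```

Any modification of the sentence-level corpus would go between the two corpus_reshape() calls.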

Usage

corpus_reshape(x, to = c("sentences", "paragraphs", "documents"),
  use_docvars = TRUE, ...)

Arguments

x

corpus whose document units will be reshaped

to

new document units in which the corpus will be recast

use_docvars

if TRUE, repeat the docvar values for each segmented text; if FALSE, drop the docvars in the segmented corpus. Dropping the docvars might be useful in order to conserve space or if these are not desired for the segmented corpus.

...

additional arguments passed to tokens, since the syntactic segmenter uses this function
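
As a rough illustration of the use_docvars argument, the sketch below reshapes the same corpus twice and inspects the document variables each time. The corpus, its country variable, and the object names with_vars and without_vars are assumptions made for the example. Which columns docvars() reports, and how strictly use_docvars = FALSE is honoured, may differ between quanteda versions, so no output is shown.

```r
library(quanteda)

# Hypothetical corpus with one document variable, invented for illustration
corp <- corpus(c(a = "The first sentence is here. The second sentence follows.",
                 b = "A single sentence stands alone."),
               docvars = data.frame(country = c("UK", "USA")))

# Repeat each document's docvar values for every sentence it contains
with_vars <- corpus_reshape(corp, to = "sentences", use_docvars = TRUE)
as.character(docvars(with_vars)$country)

# Request that docvars be dropped from the segmented corpus
without_vars <- corpus_reshape(corp, to = "sentences", use_docvars = FALSE)
names(docvars(without_vars))
```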

Value

A corpus object with the documents defined as the new units, including document-level meta-data identifying the original documents.
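
One way to trace each new unit back to its original document is sketched below. It relies only on the "name.N" naming of reshaped documents that appears in the example output, because the metadata columns themselves are named differently across quanteda versions. The variable name origin is an assumption made for the example.

```r
library(quanteda)

corp <- corpus(c(textone = "This is a sentence.  Another sentence.  Yet another.",
                 textwo = "Premiere phrase.  Deuxieme phrase."))
corp_sent <- corpus_reshape(corp, to = "sentences")

# Strip the trailing ".<segment number>" to recover the original document name
origin <- sub("\\.[0-9]+$", "", docnames(corp_sent))

# Count how many sentence-level documents each original document produced
table(origin)
```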

Examples

# simple example
corp <- corpus(c(textone = "This is a sentence.  Another sentence.  Yet another.",
                 textwo = "Premiere phrase.  Deuxieme phrase."),
               docvars = data.frame(country = c("UK", "USA"), year = c(1990, 2000)),
               metacorpus = list(notes = "Example showing how corpus_reshape() works."))
summary(corp)
summary(corpus_reshape(corp, to = "sentences"), showmeta = TRUE)

# example with inaugural corpus speeches
(corp2 <- corpus_subset(data_corpus_inaugural, Year > 2004))
corp2_para <- corpus_reshape(corp2, to = "paragraphs")
corp2_para
summary(corp2_para, 100, showmeta = TRUE)
## Note that Bush 2005 is recorded as a single paragraph because that text 
## used a single \n to mark the end of a paragraph.

Example output

quanteda version 0.99
Using 2 of 1 threads for parallel computing

Attaching package: 'quanteda'

The following object is masked from 'package:utils':

    View

Corpus consisting of 2 documents.

    Text Types Tokens Sentences country year
 textone     8     11         3      UK 1990
  textwo     4      6         2     USA 2000

Source:  /work/tmp/* on x86_64 by unknown
Created: Thu Feb 15 00:59:56 2018
Notes:   Example showing how corpus_reshape() works.

Corpus consisting of 5 documents.

      Text Types Tokens Sentences country year _document _docid _segid
 textone.1     5      5         1      UK 1990   textone      1      1
 textone.2     3      3         1      UK 1990   textone      1      2
 textone.3     3      3         1      UK 1990   textone      1      3
  textwo.1     3      3         1     USA 2000    textwo      2      1
  textwo.2     3      3         1     USA 2000    textwo      2      2

Source:  /work/tmp/* on x86_64 by unknown
Created: Thu Feb 15 00:59:56 2018
Notes:   corpus_segment.corpus(x, what = to, ...)

Corpus consisting of 4 documents and 3 docvars.
Corpus consisting of 138 documents and 3 docvars.
Corpus consisting of 138 documents, showing 100 documents.

          Text Types Tokens Sentences Year President FirstName  _document _docid _segid
   2005-Bush.1   773   2319       100 2005      Bush George W.  2005-Bush      1      1
  2009-Obama.1     4      4         1 2009     Obama    Barack 2009-Obama      2      1
  2009-Obama.2    42     53         2 2009     Obama    Barack 2009-Obama      2      2
  2009-Obama.3    62     86         4 2009     Obama    Barack 2009-Obama      2      3
  2009-Obama.4    12     15         2 2009     Obama    Barack 2009-Obama      2      4
  2009-Obama.5    76    108         5 2009     Obama    Barack 2009-Obama      2      5
  2009-Obama.6    39     47         2 2009     Obama    Barack 2009-Obama      2      6
  2009-Obama.7    36     47         4 2009     Obama    Barack 2009-Obama      2      7
  2009-Obama.8    19     22         1 2009     Obama    Barack 2009-Obama      2      8
  2009-Obama.9    29     33         1 2009     Obama    Barack 2009-Obama      2      9
 2009-Obama.10    56     82         2 2009     Obama    Barack 2009-Obama      2     10
 2009-Obama.11    71    106         5 2009     Obama    Barack 2009-Obama      2     11
 2009-Obama.12    21     21         1 2009     Obama    Barack 2009-Obama      2     12
 2009-Obama.13    20     24         1 2009     Obama    Barack 2009-Obama      2     13
 2009-Obama.14    17     20         1 2009     Obama    Barack 2009-Obama      2     14
 2009-Obama.15    43     51         2 2009     Obama    Barack 2009-Obama      2     15
 2009-Obama.16    73    108         7 2009     Obama    Barack 2009-Obama      2     16
 2009-Obama.17    85    144         8 2009     Obama    Barack 2009-Obama      2     17
 2009-Obama.18    52     62         3 2009     Obama    Barack 2009-Obama      2     18
 2009-Obama.19    97    150         5 2009     Obama    Barack 2009-Obama      2     19
 2009-Obama.20    76    117         3 2009     Obama    Barack 2009-Obama      2     20
 2009-Obama.21    90    137         4 2009     Obama    Barack 2009-Obama      2     21
 2009-Obama.22    60     83         3 2009     Obama    Barack 2009-Obama      2     22
 2009-Obama.23    92    142         5 2009     Obama    Barack 2009-Obama      2     23
 2009-Obama.24    82    126         3 2009     Obama    Barack 2009-Obama      2     24
 2009-Obama.25    65    103         3 2009     Obama    Barack 2009-Obama      2     25
 2009-Obama.26    63     84         3 2009     Obama    Barack 2009-Obama      2     26
 2009-Obama.27    81    115         4 2009     Obama    Barack 2009-Obama      2     27
 2009-Obama.28    69     96         3 2009     Obama    Barack 2009-Obama      2     28
 2009-Obama.29    95    158         7 2009     Obama    Barack 2009-Obama      2     29
 2009-Obama.30     9     10         1 2009     Obama    Barack 2009-Obama      2     30
 2009-Obama.31    20     22         1 2009     Obama    Barack 2009-Obama      2     31
 2009-Obama.32    53     65         1 2009     Obama    Barack 2009-Obama      2     32
 2009-Obama.33    67     95         6 2009     Obama    Barack 2009-Obama      2     33
 2009-Obama.34    34     53         1 2009     Obama    Barack 2009-Obama      2     34
 2009-Obama.35    75    106         4 2009     Obama    Barack 2009-Obama      2     35
 2009-Obama.36    11     16         3 2009     Obama    Barack 2009-Obama      2     36
  2013-Obama.1    20     23         1 2013     Obama    Barack 2013-Obama      3      1
  2013-Obama.2    56     82         4 2013     Obama    Barack 2013-Obama      3      2
  2013-Obama.3    33     41         1 2013     Obama    Barack 2013-Obama      3      3
  2013-Obama.4    76    111         4 2013     Obama    Barack 2013-Obama      3      4
  2013-Obama.5    10     10         1 2013     Obama    Barack 2013-Obama      3      5
  2013-Obama.6    34     42         2 2013     Obama    Barack 2013-Obama      3      6
  2013-Obama.7    22     26         1 2013     Obama    Barack 2013-Obama      3      7
  2013-Obama.8    21     21         1 2013     Obama    Barack 2013-Obama      3      8
  2013-Obama.9    23     24         1 2013     Obama    Barack 2013-Obama      3      9
 2013-Obama.10    44     54         2 2013     Obama    Barack 2013-Obama      3     10
 2013-Obama.11    89    130         4 2013     Obama    Barack 2013-Obama      3     11
 2013-Obama.12    68     93         5 2013     Obama    Barack 2013-Obama      3     12
 2013-Obama.13    90    131         4 2013     Obama    Barack 2013-Obama      3     13
 2013-Obama.14    68     97         5 2013     Obama    Barack 2013-Obama      3     14
 2013-Obama.15    66     97         4 2013     Obama    Barack 2013-Obama      3     15
 2013-Obama.16    72    109         4 2013     Obama    Barack 2013-Obama      3     16
 2013-Obama.17    54     76         3 2013     Obama    Barack 2013-Obama      3     17
 2013-Obama.18    69    102         6 2013     Obama    Barack 2013-Obama      3     18
 2013-Obama.19    81    122         5 2013     Obama    Barack 2013-Obama      3     19
 2013-Obama.20    44     56         2 2013     Obama    Barack 2013-Obama      3     20
 2013-Obama.21    92    140         4 2013     Obama    Barack 2013-Obama      3     21
 2013-Obama.22    66     98         1 2013     Obama    Barack 2013-Obama      3     22
 2013-Obama.23   110    189         6 2013     Obama    Barack 2013-Obama      3     23
 2013-Obama.24    61     99         4 2013     Obama    Barack 2013-Obama      3     24
 2013-Obama.25    63     93         4 2013     Obama    Barack 2013-Obama      3     25
 2013-Obama.26    75    109         4 2013     Obama    Barack 2013-Obama      3     26
 2013-Obama.27    45     72         3 2013     Obama    Barack 2013-Obama      3     27
 2013-Obama.28    39     52         2 2013     Obama    Barack 2013-Obama      3     28
 2013-Obama.29    15     18         2 2013     Obama    Barack 2013-Obama      3     29
  2017-Trump.1    20     28         1 2017     Trump Donald J. 2017-Trump      4      1
  2017-Trump.2    26     29         1 2017     Trump Donald J. 2017-Trump      4      2
  2017-Trump.3    17     20         1 2017     Trump Donald J. 2017-Trump      4      3
  2017-Trump.4    13     18         3 2017     Trump Donald J. 2017-Trump      4      4
  2017-Trump.5    40     48         3 2017     Trump Donald J. 2017-Trump      4      5
  2017-Trump.6    35     49         2 2017     Trump Donald J. 2017-Trump      4      6
  2017-Trump.7    23     25         1 2017     Trump Donald J. 2017-Trump      4      7
  2017-Trump.8    13     13         1 2017     Trump Donald J. 2017-Trump      4      8
  2017-Trump.9    12     13         1 2017     Trump Donald J. 2017-Trump      4      9
 2017-Trump.10    13     13         1 2017     Trump Donald J. 2017-Trump      4     10
 2017-Trump.11    30     38         1 2017     Trump Donald J. 2017-Trump      4     11
 2017-Trump.12    21     24         1 2017     Trump Donald J. 2017-Trump      4     12
 2017-Trump.13    13     14         1 2017     Trump Donald J. 2017-Trump      4     13
 2017-Trump.14     6     10         2 2017     Trump Donald J. 2017-Trump      4     14
 2017-Trump.15    12     13         1 2017     Trump Donald J. 2017-Trump      4     15
 2017-Trump.16    18     21         1 2017     Trump Donald J. 2017-Trump      4     16
 2017-Trump.17    18     21         1 2017     Trump Donald J. 2017-Trump      4     17
 2017-Trump.18    13     14         1 2017     Trump Donald J. 2017-Trump      4     18
 2017-Trump.19     7      7         1 2017     Trump Donald J. 2017-Trump      4     19
 2017-Trump.20    21     25         1 2017     Trump Donald J. 2017-Trump      4     20
 2017-Trump.21    19     20         1 2017     Trump Donald J. 2017-Trump      4     21
 2017-Trump.22    16     20         1 2017     Trump Donald J. 2017-Trump      4     22
 2017-Trump.23    12     14         1 2017     Trump Donald J. 2017-Trump      4     23
 2017-Trump.24    60     82         1 2017     Trump Donald J. 2017-Trump      4     24
 2017-Trump.25     9     11         1 2017     Trump Donald J. 2017-Trump      4     25
 2017-Trump.26    23     39         3 2017     Trump Donald J. 2017-Trump      4     26
 2017-Trump.27    14     16         1 2017     Trump Donald J. 2017-Trump      4     27
 2017-Trump.28    47     63         1 2017     Trump Donald J. 2017-Trump      4     28
 2017-Trump.29    20     22         1 2017     Trump Donald J. 2017-Trump      4     29
 2017-Trump.30    25     30         1 2017     Trump Donald J. 2017-Trump      4     30
 2017-Trump.31    20     20         1 2017     Trump Donald J. 2017-Trump      4     31
 2017-Trump.32    14     16         2 2017     Trump Donald J. 2017-Trump      4     32
 2017-Trump.33    23     28         1 2017     Trump Donald J. 2017-Trump      4     33
 2017-Trump.34    13     13         1 2017     Trump Donald J. 2017-Trump      4     34

Source:  Gerhard Peters and John T. Woolley. The American Presidency Project.
Created: Thu Feb 15 00:59:56 2018
Notes:   corpus_segment.corpus(x, what = to, ...)

quanteda documentation built on Nov. 20, 2018, 1:04 a.m.