make.samples: Split text to samples

Description

Function that either splits an input text (a vector of linguistic items, such as words, word n-grams, character n-grams, etc.) into equal-sized samples of a desired length (expressed in words), or randomly excerpts a given number of words from the original text.

Usage

make.samples(tokenized.text, sample.size = 10000, 
             sampling = "no.sampling", sample.overlap = 0,
             number.of.samples = 1, sampling.with.replacement = FALSE)

Arguments

tokenized.text

input textual data stored either as a vector (a single text) or as a list of vectors (a whole corpus); each vector should contain tokenized data, i.e. words, word n-grams, or other features, as its elements.

sample.size

desired size of sample expressed in number of words; default value is 10,000.

sampling

one of three values: no.sampling (default), normal.sampling, random.sampling.

sample.overlap

if this option is used, a text is segmented into consecutive, equal-sized samples that are allowed to partially overlap; the value is the number of words shared by two adjacent samples (default: 0, i.e. no overlap). For example, with sample.size set to 5,000 and sample.overlap set to 1,000, the first sample of a text contains words 1–5,000, the second words 4,001–9,000, the third words 8,001–13,000, and so forth.

number.of.samples

optional argument, used only if random.sampling was chosen; the number of random samples to be drawn from each text (default: 1).

sampling.with.replacement

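optional argument, used only if random.sampling was chosen; it specifies how words are randomly harvested from texts: if TRUE, words are drawn with replacement, so the same token may be picked more than once; if FALSE (default), each token is drawn at most once per sample.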
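
The interplay of sample.size and sample.overlap described above can be checked with a few lines of base R. This is only a sketch of the arithmetic, not the code stylo runs internally: the helper overlap.windows and its argument n.words are hypothetical names introduced for illustration, and it assumes that a trailing stretch of text too short to fill a whole sample is dropped (as the "words dropped at the end of the text" line in the example output suggests).

```r
# Hypothetical helper (not part of stylo): compute the first and last word
# index of each consecutive, possibly overlapping, sample.
overlap.windows <- function(n.words, sample.size, sample.overlap = 0) {
  stopifnot(sample.size > 0, sample.overlap >= 0, sample.overlap < sample.size)
  # each new sample starts this many words after the previous one
  step <- sample.size - sample.overlap
  # the last admissible start still leaves room for a full-sized sample
  last.start <- n.words - sample.size + 1
  if (last.start < 1) {
    # text shorter than one sample: no windows at all
    return(data.frame(start = numeric(0), end = numeric(0)))
  }
  start <- seq(1, last.start, by = step)
  data.frame(start = start, end = start + sample.size - 1)
}

# the case described above: 5,000-word samples overlapping by 1,000 words
overlap.windows(13000, sample.size = 5000, sample.overlap = 1000)
```

For a 13,000-word text this yields the windows 1–5,000, 4,001–9,000 and 8,001–13,000. With sample.overlap = 0 the step equals sample.size, so the samples simply follow one another without sharing any words.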

Details

Normal sampling is a good choice when the input texts are long: it yields a larger number of samples, which serve as a kind of validation of the results (several independent samples excerpted from one text should cluster together). When the analyzed texts differ considerably in length, consider preparing samples as randomly chosen "bags of words" instead: set the sampling argument to random.sampling and specify the desired sample size via the sample.size argument. Sampling both with and without replacement is available (see sampling.with.replacement). Eder (2015) showed that harvesting random samples from original texts improves the performance of authorship attribution methods.
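
The "bag of words" idea can be illustrated in base R with sample(). The sketch below is a stand-in under stated assumptions, not stylo's implementation: random.bags is a hypothetical helper, toy.text is made-up data, and whether make.samples relies on sample() internally is not confirmed by this page.

```r
# Hypothetical helper (not part of stylo): draw random "bags of words".
random.bags <- function(words, sample.size, number.of.samples = 1,
                        replace = FALSE) {
  if (!replace && sample.size > length(words)) {
    stop("without replacement, sample.size cannot exceed the text length")
  }
  # each bag is an independent random draw from the whole text
  replicate(number.of.samples,
            sample(words, size = sample.size, replace = replace),
            simplify = FALSE)
}

set.seed(1)  # only to make the sketch reproducible
toy.text <- paste0("w", 1:73)  # stand-in for a tokenized 73-word text
bags <- random.bags(toy.text, sample.size = 50, number.of.samples = 5)
length(bags)
```

Drawing without replacement caps the sample size at the length of the text, whereas drawing with replacement lets a short text supply a sample longer than itself, at the cost of repeated tokens.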

Author(s)

Mike Kestemont, Maciej Eder

References

Eder, M. (2015). Does size matter? Authorship attribution, small samples, big problem. "Digital Scholarship in the Humanities", 30(2): 167-182.

See Also

txt.to.words, txt.to.words.ext, txt.to.features, make.ngrams

Examples

my.text = "Arma virumque cano, Troiae qui primus ab oris
           Italiam fato profugus Laviniaque venit
           litora, multum ille et terris iactatus et alto
           vi superum, saevae memorem Iunonis ob iram,
           multa quoque et bello passus, dum conderet urbem
           inferretque deos Latio; genus unde Latinum
           Albanique patres atque altae moenia Romae.
           Musa, mihi causas memora, quo numine laeso
           quidve dolens regina deum tot volvere casus
           insignem pietate virum, tot adire labores
           impulerit. tantaene animis caelestibus irae?"
my.words = txt.to.words(my.text)

# split the above text into samples of 20 words:
make.samples(my.words, sampling = "normal.sampling", sample.size = 20)

# excerpt randomly 50 words from the above text:
make.samples(my.words, sampling = "random.sampling", sample.size = 50)

# excerpt 5 random samples from the above text:
make.samples(my.words, sampling = "random.sampling", sample.size = 50,
             number.of.samples = 5)

Example output

### stylo version: 0.6.9 ###

If you plan to cite this software (please do!), use the following reference:
    Eder, M., Rybicki, J. and Kestemont, M. (2016). Stylometry with R:
    a package for computational text analysis. R Journal 8(1): 107-121.
    <https://journal.r-project.org/archive/2016/RJ-2016-007/index.html>

To get full BibTeX entry, type: citation("stylo")
paste_1
	- text length (in words): 73
	- nr. of samples: 3
	- nr. of words dropped at the end of the text: 13

sample 1 
"paste_1_1"
   [1] arma
   [2] virumque
   [3] cano
   [4] troiae
   [5] qui
   [6] primus
   [7] ab
   [8] oris
   [9] italiam
  [10] fato
   ... ... 

sample 2 
"paste_1_2"
   [1] alto
   [2] vi
   [3] superum
   [4] saevae
   [5] memorem
   [6] iunonis
   [7] ob
   [8] iram
   [9] multa
  [10] quoque
   ... ... 

sample 3 
"paste_1_3"
   [1] unde
   [2] latinum
   [3] albanique
   [4] patres
   [5] atque
   [6] altae
   [7] moenia
   [8] romae
   [9] musa
  [10] mihi
   ... ... 

(total number of samples:  3)

paste_1
	- text length (in words): 73
	- nr. of random samples: 1
	- sample length: 50

sample 1 
"paste_1_1"
   [1] urbem
   [2] casus
   [3] oris
   [4] ab
   [5] iram
   [6] alto
   [7] patres
   [8] et
   [9] adire
  [10] multa
   ... ... 

(total number of samples:  1)

paste_1
	- text length (in words): 73
	- nr. of random samples: 5
	- sample length: 50

sample 1 
"paste_1_1"
   [1] laviniaque
   [2] quoque
   [3] iactatus
   [4] volvere
   [5] cano
   [6] tot
   [7] iram
   [8] alto
   [9] tot
  [10] irae
   ... ... 

sample 2 
"paste_1_2"
   [1] mihi
   [2] laeso
   [3] tot
   [4] multa
   [5] latio
   [6] bello
   [7] regina
   [8] altae
   [9] quoque
  [10] superum
   ... ... 

sample 3 
"paste_1_3"
   [1] genus
   [2] conderet
   [3] latinum
   [4] litora
   [5] memora
   [6] laviniaque
   [7] ob
   [8] labores
   [9] deos
  [10] quo
   ... ... 

sample 4 
"paste_1_4"
   [1] memora
   [2] virumque
   [3] causas
   [4] quo
   [5] albanique
   [6] italiam
   [7] quidve
   [8] pietate
   [9] tantaene
  [10] unde
   ... ... 

sample 5 
"paste_1_5"
   [1] memora
   [2] latio
   [3] inferretque
   [4] bello
   [5] ob
   [6] iactatus
   [7] albanique
   [8] qui
   [9] numine
  [10] iunonis
   ... ... 

(total number of samples:  5)

stylo documentation built on Dec. 6, 2020, 5:06 p.m.
