Function that either splits an input text (a vector of linguistic items, such as words, word n-grams, character n-grams, etc.) into equal-sized samples of a desired length (expressed in words), or randomly excerpts a number of words from the original text.
make.samples(tokenized.text, sample.size = 10000,
             sampling = "no.sampling", sample.overlap = 0,
             number.of.samples = 1, sampling.with.replacement = FALSE)

tokenized.text 
input textual data stored either as a vector (a single text) or as a list of vectors (a whole corpus); the individual vectors should contain tokenized data, i.e. words, word n-grams, or other features, as elements. 
sample.size 
desired size of sample expressed in number of words; default value is 10,000. 
sampling 
one of three values: "no.sampling" (default), "normal.sampling", "random.sampling". 
sample.overlap 
if this option is used, a reference text is segmented
into consecutive, equal-sized samples that are allowed to partially
overlap. If one specifies the 
number.of.samples 
optional argument which will be used only if

sampling.with.replacement 
optional argument which will be used only
if 
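The idea of partially overlapping samples behind sample.overlap can be sketched in base R. This is an illustration only, not stylo's implementation: split_with_overlap() is a hypothetical helper, and it assumes that the overlap is counted in words and that each new sample starts sample.size - sample.overlap words after the previous one. The package may handle edge cases differently.

```r
# Hypothetical helper (not part of stylo): consecutive samples that
# share `sample.overlap` words with their predecessor.
# Assumption: the window advances by sample.size - sample.overlap words.
split_with_overlap <- function(tokens, sample.size, sample.overlap = 0) {
  stopifnot(sample.overlap >= 0, sample.overlap < sample.size)
  if (length(tokens) < sample.size) return(list())
  step <- sample.size - sample.overlap
  # first word of every window that still fits entirely in the text
  starts <- seq(1, length(tokens) - sample.size + 1, by = step)
  lapply(starts, function(s) tokens[s:(s + sample.size - 1)])
}

tokens <- paste0("w", 1:73)   # stand-in for a tokenized text of 73 words
overlapping <- split_with_overlap(tokens, sample.size = 20, sample.overlap = 10)
length(overlapping)           # number of windows that fit
```

With an overlap of 0 the sketch reduces to plain consecutive sampling.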
Normal sampling is probably a good choice when the input texts are
long: the advantage is that one gets a bigger number of samples which,
in a way, validate the results (when several independent samples excerpted
from one text are clustered together).
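What "normal sampling" amounts to can be sketched in a few lines of base R. The helper split_into_samples() below is hypothetical and is not how stylo necessarily implements it; it only assumes that the text is cut into consecutive blocks of sample.size words and that leftover words at the end are dropped.

```r
# Hypothetical helper (not part of stylo): cut a token vector into
# consecutive, non-overlapping samples and drop the incomplete tail.
split_into_samples <- function(tokens, sample.size) {
  n.samples <- length(tokens) %/% sample.size        # whole samples that fit
  if (n.samples == 0) return(list())
  kept <- tokens[seq_len(n.samples * sample.size)]   # discard leftover words
  # label every kept token with its sample index, then split on that label
  unname(split(kept, rep(seq_len(n.samples), each = sample.size)))
}

tokens <- paste0("w", 1:73)   # stand-in for a tokenized text of 73 words
samples <- split_into_samples(tokens, sample.size = 20)
length(samples)                               # whole samples obtained
length(tokens) - length(samples) * 20         # words dropped at the end
```

For a 73-word text and a sample size of 20, this gives three samples and leaves 13 words unused.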
When the analyzed texts are significantly unequal in length, it is not
a bad idea to prepare samples as randomly chosen "bags of words". For this,
set the sampling variable to "random.sampling". The desired size of the
sample should be specified via the sample.size variable.
Sampling with and without replacement is also available. It has been shown
by Eder (2015) that harvesting random samples from original texts improves
the performance of authorship attribution methods.
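The "bag of words" idea, with or without replacement, can be sketched with base R's sample(). The helper draw_random_samples() and its argument names are hypothetical, chosen only for illustration; stylo's own code may differ in details.

```r
# Hypothetical helper (not part of stylo): draw random "bags of words".
draw_random_samples <- function(tokens, sample.size, number.of.samples = 1,
                                with.replacement = FALSE) {
  # without replacement, a sample cannot be longer than the text itself
  if (!with.replacement && sample.size > length(tokens)) {
    stop("sample.size exceeds text length; sample with replacement instead")
  }
  lapply(seq_len(number.of.samples), function(i) {
    sample(tokens, size = sample.size, replace = with.replacement)
  })
}

set.seed(1)                   # only to make the illustration reproducible
tokens <- paste0("w", 1:73)   # stand-in for a tokenized text of 73 words
bags <- draw_random_samples(tokens, sample.size = 50, number.of.samples = 5)
long.bag <- draw_random_samples(tokens, sample.size = 100,
                                with.replacement = TRUE)
```

Sampling with replacement is what makes it possible to draw a sample longer than the source text, at the cost of repeating some tokens.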
Mike Kestemont, Maciej Eder
Eder, M. (2015). Does size matter? Authorship attribution, small samples, big problem. "Digital Scholarship in the Humanities", 30(2): 167-182.
txt.to.words, txt.to.words.ext, txt.to.features, make.ngrams
my.text = "Arma virumque cano, Troiae qui primus ab oris
Italiam fato profugus Laviniaque venit
litora, multum ille et terris iactatus et alto
vi superum, saevae memorem Iunonis ob iram,
multa quoque et bello passus, dum conderet urbem
inferretque deos Latio; genus unde Latinum
Albanique patres atque altae moenia Romae.
Musa, mihi causas memora, quo numine laeso
quidve dolens regina deum tot volvere casus
insignem pietate virum, tot adire labores
impulerit. tantaene animis caelestibus irae?"
my.words = txt.to.words(my.text)
# split the above text into samples of 20 words:
make.samples(my.words, sampling = "normal.sampling", sample.size = 20)
# excerpt randomly 50 words from the above text:
make.samples(my.words, sampling = "random.sampling", sample.size = 50)
# excerpt 5 random samples from the above text:
make.samples(my.words, sampling = "random.sampling", sample.size = 50,
number.of.samples = 5)

### stylo version: 0.6.9 ###
If you plan to cite this software (please do!), use the following reference:
Eder, M., Rybicki, J. and Kestemont, M. (2016). Stylometry with R:
a package for computational text analysis. R Journal 8(1): 107-121.
<https://journal.r-project.org/archive/2016/RJ-2016-007/index.html>
To get full BibTeX entry, type: citation("stylo")
Warning message:
no DISPLAY variable so Tk is not available
paste_1
 text length (in words): 73
 nr. of samples: 3
 nr. of words dropped at the end of the text: 13
sample 1
"paste_1_1"
[1] arma
[2] virumque
[3] cano
[4] troiae
[5] qui
[6] primus
[7] ab
[8] oris
[9] italiam
[10] fato
... ...
sample 2
"paste_1_2"
[1] alto
[2] vi
[3] superum
[4] saevae
[5] memorem
[6] iunonis
[7] ob
[8] iram
[9] multa
[10] quoque
... ...
sample 3
"paste_1_3"
[1] unde
[2] latinum
[3] albanique
[4] patres
[5] atque
[6] altae
[7] moenia
[8] romae
[9] musa
[10] mihi
... ...
(total number of samples: 3)
paste_1
 text length (in words): 73
 nr. of random samples: 1
 sample length: 50
sample 1
"paste_1_1"
[1] urbem
[2] casus
[3] oris
[4] ab
[5] iram
[6] alto
[7] patres
[8] et
[9] adire
[10] multa
... ...
(total number of samples: 1)
paste_1
 text length (in words): 73
 nr. of random samples: 5
 sample length: 50
sample 1
"paste_1_1"
[1] laviniaque
[2] quoque
[3] iactatus
[4] volvere
[5] cano
[6] tot
[7] iram
[8] alto
[9] tot
[10] irae
... ...
sample 2
"paste_1_2"
[1] mihi
[2] laeso
[3] tot
[4] multa
[5] latio
[6] bello
[7] regina
[8] altae
[9] quoque
[10] superum
... ...
sample 3
"paste_1_3"
[1] genus
[2] conderet
[3] latinum
[4] litora
[5] memora
[6] laviniaque
[7] ob
[8] labores
[9] deos
[10] quo
... ...
sample 4
"paste_1_4"
[1] memora
[2] virumque
[3] causas
[4] quo
[5] albanique
[6] italiam
[7] quidve
[8] pietate
[9] tantaene
[10] unde
... ...
sample 5
"paste_1_5"
[1] memora
[2] latio
[3] inferretque
[4] bello
[5] ob
[6] iactatus
[7] albanique
[8] qui
[9] numine
[10] iunonis
... ...
(total number of samples: 5)