make.ngrams: Make text n-grams



Description

Function that combines a vector of text units (words, characters, POS-tags, other features) into pairs, triplets, or longer sequences, commonly referred to as n-grams.

Usage

make.ngrams(input.text, ngram.size = 1)

Arguments

input.text

a vector containing words or characters to be parsed into n-grams.

ngram.size

an optional integer giving the value of n, i.e. the size of the n-grams to be produced. If this argument is missing, the default value of 1 is used.

Details

The function combines a series of items (e.g. words or characters) into n-grams, i.e. sequences of n consecutive elements. For example, the character 2-grams of the sentence "This is a sentence" (lowercased) are: "th", "hi", "is", "s ", " i", "is", "s ", " a", "a ", " s", "se", "en", "nt", "te", "en", "nc", "ce". The character 4-grams are "this", "his ", "is i", "s is", " is ", "is a", "s a ", " a s", etc., and the word 2-grams are "this is", "is a", "a sentence".

Whether using n-grams of items increases the accuracy of stylometric procedures has been heavily debated in the secondary literature (see the References section for further reading). Eder (2013), for example, shows that character n-grams are surprisingly robust when dealing with noisy corpora, i.e. corpora with a high number of misspelled characters.
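The sliding-window idea described above can be sketched in base R. This is only an illustration: `sketch_ngrams` and its `sep` argument are hypothetical names introduced here, not part of stylo, and the code makes no claim about how make.ngrams is implemented internally.

```r
# Hypothetical helper (NOT part of stylo): build n-grams with a sliding window.
sketch_ngrams <- function(items, n = 1, sep = " ") {
  # With fewer than n items, no complete n-gram exists.
  if (length(items) < n) return(character(0))
  # One n-gram starts at each position 1 .. (length - n + 1).
  starts <- seq_len(length(items) - n + 1)
  vapply(starts,
         function(i) paste(items[i:(i + n - 1)], collapse = sep),
         character(1))
}

# Word 2-grams of the sample sentence
words <- c("this", "is", "a", "sentence")
word_bigrams <- sketch_ngrams(words, n = 2)
# "this is" "is a" "a sentence"

# Character n-grams: split the sentence into single characters first
chars <- strsplit("this is a sentence", "")[[1]]
char_bigrams <- sketch_ngrams(chars, n = 2, sep = "")
char_fourgrams <- sketch_ngrams(chars, n = 4, sep = "")
```

A sequence of k items yields k - n + 1 n-grams, so the 18 characters of the sample sentence give 17 character 2-grams and 15 character 4-grams.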

Author(s)

Maciej Eder

References

Antonia, A., Craig, H., and Elliott, J. (2014). Language chunking, data sparseness, and the value of a long marker list: explorations with word n-grams and authorial attribution. "Literary and Linguistic Computing", 29, advanced access (doi: 10.1093/llc/fqt028).

Eder, M. (2011). Style-markers in authorship attribution: a cross-language study of the authorial fingerprint. "Studies in Polish Linguistics", 6: 99-114. https://www.ejournals.eu/SPL/2011/SPL-vol-6-2011/.

Eder, M. (2013). Mind your corpus: systematic errors in authorship attribution. "Literary and Linguistic Computing", 28(4): 603-14.

Hoover, D. L. (2002). Frequent word sequences and statistical stylistics. "Literary and Linguistic Computing", 17: 157-80.

Hoover, D. L. (2003). Frequent collocations and authorial style. "Literary and Linguistic Computing", 18: 261-86.

Hoover, D. L. (2012). The rarer they are, the more there are, the less they matter. In: Digital Humanities 2012: Conference Abstracts, Hamburg University, Hamburg, pp. 218-21.

Koppel, M., Schler, J. and Argamon, S. (2009). Computational methods in authorship attribution. "Journal of the American Society for Information Science and Technology", 60(1): 9-26.

Stamatatos, E. (2009). A survey of modern authorship attribution methods. "Journal of the American Society for Information Science and Technology", 60(3): 538-56.

See Also

txt.to.words, txt.to.words.ext, txt.to.features

Examples

# load the package that provides make.ngrams and the helper functions:
library(stylo)
# Consider the string my.text:
my.text = "Quousque tandem abutere, Catilina, patientia nostra?"
# which can be split into a vector of consecutive words:
my.vector.of.words = txt.to.words(my.text)
# now, we create a vector of word 2-grams:
make.ngrams(my.vector.of.words, ngram.size = 2)

# similarly, you can produce character n-grams:
my.vector.of.chars = txt.to.features(my.vector.of.words, features = "c")
make.ngrams(my.vector.of.chars, ngram.size = 4)

Example output

stylo version: 0.6.4
Warning message:
no DISPLAY variable so Tk is not available 
[1] "quousque tandem"    "tandem abutere"     "abutere catilina"  
[4] "catilina patientia" "patientia nostra"  
 [1] "q u o u" "u o u s" "o u s q" "u s q u" "s q u e" "q u e  " "u e   t"
 [8] "e   t a" "  t a n" "t a n d" "a n d e" "n d e m" "d e m  " "e m   a"
[15] "m   a b" "  a b u" "a b u t" "b u t e" "u t e r" "t e r e" "e r e  "
[22] "r e   c" "e   c a" "  c a t" "c a t i" "a t i l" "t i l i" "i l i n"
[29] "l i n a" "i n a  " "n a   p" "a   p a" "  p a t" "p a t i" "a t i e"
[36] "t i e n" "i e n t" "e n t i" "n t i a" "t i a  " "i a   n" "a   n o"
[43] "  n o s" "n o s t" "o s t r" "s t r a"
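In the character 4-grams printed above, the elements of each n-gram are separated by a space, so a space taken from the text itself shows up as a wider gap (e.g. "q u e  "). The sketch below shows one way to turn such strings back into the compact notation used in the Details section. It assumes every element is a single character joined by exactly one space, which matches the printed output but is an assumption; `compact_ngram` is a hypothetical helper, not a stylo function.

```r
# A few values copied from the example output above
spaced <- c("q u o u", "q u e  ", "e   t a", "  t a n")

# Hypothetical helper (NOT part of stylo): drop the joining spaces by
# keeping every second character, i.e. positions 1, 3, 5, ...
# Only valid when each n-gram element is exactly one character long.
compact_ngram <- function(x) {
  chars <- strsplit(x, "")[[1]]
  paste(chars[seq(1, length(chars), by = 2)], collapse = "")
}

compact <- vapply(spaced, compact_ngram, character(1), USE.NAMES = FALSE)
# "quou" "que " "e ta" " tan"
```

This does not apply to word n-grams, where the separating space is the only boundary between elements.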

stylo documentation built on Dec. 6, 2020, 5:06 p.m.
