View source: R/make.table.of.frequencies.R
make.table.of.frequencies | R Documentation |
Collects several frequency lists and combines them into a single frequency table. To this end, a number of rearrangements are carried out within the individual lists. The table is built from a reference list of words/features (passed as an argument).
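The idea of counting each text against a shared reference list can be sketched in base R. This is only an illustration and not the package's implementation: the helper name count.features and the toy texts are hypothetical, and the sketch returns raw counts only.

```r
# Hypothetical helper (not part of the package): count how often each
# reference feature occurs in each already-tokenized text.
count.features = function(texts, features) {
  # factor() with fixed levels forces every text onto the same feature
  # order, so features absent from a text are counted as 0
  counts = vapply(texts, function(words) {
    as.integer(table(factor(words, levels = features)))
  }, integer(length(features)))
  # arrange the result with one row per text and one column per feature
  matrix(counts, nrow = length(texts), byrow = TRUE,
         dimnames = list(names(texts), features))
}

toy.texts = list(c("a", "b", "a", "c"), c("b", "b", "d"))
toy.table = count.features(toy.texts, c("a", "b", "z"))
toy.table
```

Fixing the factor levels to the reference list is what keeps the columns aligned across texts, whatever words each text happens to contain.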
make.table.of.frequencies(corpus, features, absent.sensitive = TRUE,
relative = TRUE)
corpus |
textual data: either a corpus (represented as a list), or a single text (represented as a vector); the data have to be split into words (or other features, such as character n-grams or word pairs). |
features |
a vector containing a reference feature list that will be used
to build the table of frequencies (it is assumed that the reference list
contains the same type of features as the |
absent.sensitive |
this optional argument is used to prevent building
tables of words/features that never occur in the corpus. When switched on
(default), variables containing 0 values across all samples will be excluded.
However, in some cases it is important to keep all the variables regardless
of their values. This is e.g. the case when comparing two corpora: even if
a given word did not occur in corpus A, it might be present in corpus B.
In short: whenever you perform any analysis involving two or multiple sets
of texts, switch this option to |
relative |
when this argument is switched to |
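The filtering described for absent.sensitive, excluding variables that are 0 across all samples, can be mimicked on a plain matrix. The sketch below is a base R approximation under that reading of the argument, not the function's actual code; the matrix counts and its values are made up for illustration.

```r
# Hypothetical count table: rows are texts, columns are reference features.
counts = matrix(c(2, 0, 1, 2, 0, 0), nrow = 2,
                dimnames = list(c("text.1", "text.2"), c("a", "b", "z")))

# Keep only the features that occur at least once somewhere in the corpus;
# drop = FALSE keeps the result a matrix even if a single column survives.
kept = counts[, colSums(counts) > 0, drop = FALSE]
colnames(kept)
```

Because the filter depends on the corpus at hand, two tables built from different corpora with the same reference list may end up with different columns, which is why the filtering matters when several sets of texts are compared.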
Maciej Eder
load.corpus, load.corpus.and.parse
# to get frequencies of the words "a", "the" and "of" from a text:
sample.txt = txt.to.words("My father had a small estate
in Nottinghamshire: I was the third of five sons.")
make.table.of.frequencies(sample.txt, c("a", "the", "of"))
# to get a table of frequencies across several texts:
txt.1 = "Gallia est omnis divisa in partes tres, quarum unam incolunt
Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra
Galli appellantur."
txt.2 = "Si quis antea, iudices, mirabatur quid esset quod, pro tantis
opibus rei publicae tantaque dignitate imperi, nequaquam satis multi
cives forti et magno animo invenirentur qui auderent se et salutem
suam in discrimen offerre pro statu civitatis et pro communi
libertate, ex hoc tempore miretur potius si quem bonum et fortem
civem viderit, quam si quem aut timidum aut sibi potius quam rei
publicae consulentem."
txt.3 = "Nam mores et instituta vitae resque domesticas ac familiaris
nos profecto et melius tuemur et lautius, rem vero publicam nostri
maiores certe melioribus temperaverunt et institutis et legibus."
my.corpus.raw = list(txt.1, txt.2, txt.3)
my.corpus.clean = lapply(my.corpus.raw, txt.to.words)
my.favorite.words = c("et", "in", "se", "rara", "avis")
make.table.of.frequencies(my.corpus.clean, my.favorite.words)
# to include all words in the reference list, no matter if they
# occurred in the corpus:
make.table.of.frequencies(my.corpus.clean, my.favorite.words,
absent.sensitive=FALSE)
# to prepare a table of frequencies of all the words represented in
# a corpus, in descending order of occurrence, one needs to make the frequency
# list first, via the function 'make.frequency.list'
complete.word.list = make.frequency.list(my.corpus.clean)
make.table.of.frequencies(my.corpus.clean, complete.word.list)
# to create a table of frequencies of word pairs (word 2-grams):
my.word.pairs = lapply(my.corpus.clean, txt.to.features, ngram.size=2)
make.table.of.frequencies(my.word.pairs, c("et legibus", "hoc tempore"))