txt.to.words

Description
Generic tokenization function for splitting a given input text into single words (chains of characters delimited by spaces or punctuation marks).
Usage

txt.to.words(input.text, splitting.rule = NULL, preserve.case = FALSE)
Arguments

input.text      a string of characters, usually a text.

splitting.rule  an optional argument indicating an alternative splitting
                regexp, e.g. if your corpus contains no punctuation, you
                can use a very simple splitting sequence such as a single
                space " " (see the sketch below).

preserve.case   whether or not to preserve the case of characters; if
                FALSE (the default), all characters in the corpus are
                lowercased.
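For instance, assuming a corpus stripped of punctuation, a single space can serve as the entire splitting rule (a minimal sketch, not part of the package's own examples):

library(stylo)
# with no punctuation in the input, splitting on single spaces is enough
txt.to.words("grau teurer freund ist alle theorie", splitting.rule = " ")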
Details

In obsolete versions of the package stylo, the default splitting sequence
of characters was "[^[:alpha:]]+" on Mac/Linux and "\\W+_" on Windows. Two
different splitting rules were used, because regular expressions are not
entirely platform-independent; type help(regexp) for more details. For the
sake of compatibility, then, since version 0.5.6 a lengthy list of dozens
of letters in a few alphabets (Latin, Cyrillic, Greek, Hebrew, and Arabic
so far) has been indicated explicitly:
paste("[^A-Za-z", # Latin supplement (Western): "\U00C0-\U00FF", # Latin supplement (Eastern): "\U0100-\U01BF", # Latin extended (phonetic): "\U01C4-\U02AF", # modern Greek: "\U0386\U0388-\U03FF", # Cyrillic: "\U0400-\U0481\U048A-\U0527", # Hebrew: "\U05D0-\U05EA\U05F0-\U05F4", # Arabic: "\U0620-\U065F\U066E-\U06D3\U06D5\U06DC", # extended Latin: "\U1E00-\U1EFF", # ancient Greek: "\U1F00-\U1FBC\U1FC2-\U1FCC\U1FD0-\U1FDB\U1FE0-\U1FEC\U1FF2-\U1FFC", # Coptic: "\U03E2-\U03EF\U2C80-\U2CF3", # Georgian: "\U10A0-\U10FF", "]+", sep="")
Alternatively, different tokenization rules can be applied through the
option splitting.rule (see above). ATTENTION: this is the only piece of
code in the library stylo that might depend on the operating system used.
While on Mac/Linux the native encoding is Unicode, on Windows you never
know if your text will be loaded properly. A sensible solution for Windows
users is to convert your texts into Unicode (a variety of freeware
converters are available on the internet) and to use an appropriate
encoding option when loading the files, e.g.
read.table("file.txt", encoding = "UTF-8") or
scan("file.txt", what = "char", encoding = "UTF-8"). If you use the
functions provided by the library stylo, you should pass this option as an
argument to your chosen function, e.g. stylo(encoding = "UTF-8"),
classify(encoding = "UTF-8"), or oppose(encoding = "UTF-8").
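Putting the two steps together, a minimal sketch of loading a UTF-8 file
and tokenizing it might look as follows ("file.txt" stands for any
plain-text file in your working directory):

library(stylo)
# read a UTF-8 encoded text file line by line, then tokenize it
raw.lines = scan("file.txt", what = "char", sep = "\n", encoding = "UTF-8")
tokens = txt.to.words(paste(raw.lines, collapse = " "))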
Value

The function returns a vector whose elements are tokenized words (or other
units).
Author(s)

Maciej Eder, Mike Kestemont
See Also

txt.to.words.ext, txt.to.features, make.ngrams, load.corpus
txt.to.words("And now, Laertes, what's the news with you?")
# retrieving grammatical codes (POS tags) from a tagged text:
tagged.text = "The_DT family_NN of_IN Dashwood_NNP had_VBD long_RB
been_VBN settled_VBN in_IN Sussex_NNP ._."
txt.to.words(tagged.text, splitting.rule = "([A-Za-z,.;!]+_)|[ \n\t]")
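# Since preserve.case defaults to FALSE, the POS tags retrieved above come
# back lowercased; the same call with preserve.case = TRUE keeps them as-is
# (a sketch, not part of the original example set):
txt.to.words(tagged.text, splitting.rule = "([A-Za-z,.;!]+_)|[ \n\t]",
             preserve.case = TRUE)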