This function tries to guess the language a text is written in.
guess.lang(
  txt.file,
  udhr.path,
  comp.length = 300,
  keep.udhr = FALSE,
  quiet = TRUE,
  in.mem = TRUE,
  format = "file"
)
txt.file | A character vector pointing to the file with the text to be analyzed.
udhr.path | A character string, either pointing to the directory where you unzipped the translations of the Universal Declaration of Human Rights, or to the ZIP file containing them.
comp.length | Numeric value, giving the number of characters to be used of
keep.udhr | Logical, whether all the UDHR translations should be kept in the resulting object.
quiet | Logical. If
in.mem | Logical. If
format | Either "file" or "obj". If the latter,
To accomplish the task, the method described by Benedetto, Caglioti & Loreto (2002) is used, utilizing both gzip compression and translations of the Universal Declaration of Human Rights[1]. The latter holds the world record for being translated into the largest number of languages, and is publicly available.
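The idea behind the compression-based approach can be illustrated with a minimal sketch. This is not the code used by guess.lang; it only assumes the general principle from Benedetto et al. (2002): append a snippet of the unknown text to each reference text, and pick the language whose compressed size grows the least. The helper names gz_size and guess_by_compression, the comp_length argument, and the two short reference strings are hypothetical stand-ins chosen for illustration (real use would need the full UDHR translations).

```r
# Compressed size in bytes of a string, using in-memory gzip (base R).
gz_size <- function(txt) {
  length(memCompress(charToRaw(txt), type = "gzip"))
}

# Hypothetical helper: for each reference text, measure how many extra
# compressed bytes the snippet costs when appended to that reference.
# A smaller increase suggests the snippet shares more patterns with it.
guess_by_compression <- function(snippet, references, comp_length = 300) {
  # only use the first comp_length characters of the unknown text
  snippet <- substr(snippet, 1, comp_length)
  extra <- vapply(
    references,
    function(ref) gz_size(paste0(ref, snippet)) - gz_size(ref),
    numeric(1)
  )
  # best candidates (smallest increase) first
  sort(extra)
}

# Very short stand-ins for the UDHR translations (illustration only)
references <- c(
  english = "All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.",
  german = "Alle Menschen sind frei und gleich an Würde und Rechten geboren. Sie sind mit Vernunft und Gewissen begabt und sollen einander im Geist der Brüderlichkeit begegnen."
)

ranking <- guess_by_compression(
  "Everyone has the right to life, liberty and security of person.",
  references
)
print(ranking)
```

With reference texts this short the ranking is not reliable; the sketch only shows the mechanics of comparing compressed sizes.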
An object of class kRp.lang.
For this implementation the documents provided by the "UDHR in Unicode" project[2] have been used.
Their translations are not part of this package and must be downloaded separately to use guess.lang!
You need the ZIP archive containing all the plain text files from https://unicode.org/udhr/downloads.html.
Benedetto, D., Caglioti, E. & Loreto, V. (2002). Language trees and zipping. Physical Review Letters, 88(4), 048702.
[1] https://www.ohchr.org/EN/UDHR/Pages/UDHRIndex.aspx
## Not run:
# using the still zipped bulk file
guess.lang(
  file.path("~","data","some.txt"),
  udhr.path=file.path("~","data","udhr_txt.zip")
)
# using the unzipped UDHR archive
guess.lang(
  file.path("~","data","some.txt"),
  udhr.path=file.path("~","data","udhr_txt")
)
## End(Not run)