token_stats: token statistics


Description

Token statistics for either a character string vector, a path to a folder of text files or a path to a single text file.

Usage

# utl <- token_stats$new(x_vec = NULL, path_2folder = NULL, path_2file = NULL,
#                        file_delimiter = ' ', n_gram_delimiter = "_")

Details

the path_2vector function returns the words of a folder or a file as a vector ( using the file_delimiter to read the data ). A typical use case is reading a vocabulary from a text file.
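
A minimal sketch of this (assuming a hypothetical vocabulary file 'vocab.txt' that holds newline-delimited words) could look as follows:

library(textTinyR)

# 'vocab.txt' is a hypothetical plain-text file with one word per line
init <- token_stats$new(path_2file = 'vocab.txt', file_delimiter = '\n')

vocab <- init$path_2vector()        # character vector with one element per word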

the freq_distribution function returns a named, unsorted frequency-distribution vector for either a folder, a file or a character string vector. A specific subset of the result can be retrieved using the print_frequency function.
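
For illustration, a short sketch with a made-up string vector (not part of the package examples):

library(textTinyR)

toy <- c('apple', 'banana', 'apple', 'cherry', 'banana', 'apple')

tks <- token_stats$new(x_vec = toy)

tks$freq_distribution()                  # named, unsorted frequency vector of the words

tks$print_frequency(subset = 1:2)        # keep only the first two rows of the result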

the count_character function returns the number of characters for each word of the corpus for either a folder, a file or a character string vector. Words with a specific number of characters can be retrieved using the print_count_character function.
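
Continuing the toy object tks from the sketch above:

cnt <- tks$count_character()             # number of characters of every word

tks$print_count_character(number = 5)    # return only the words with 5 characters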

the collocation_words function returns a co-occurrence frequency table for n-grams for either a folder, a file or a character string vector. A collocation is defined as a sequence of two or more consecutive words that has the characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components ( http://nlp.stanford.edu/fsnlp/promo/colloc.pdf, page 172 ). The input to the function should be text n-grams separated by a delimiter ( for instance 3-grams or 4-grams ). A specific frequency table can be retrieved using the print_collocations function.
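
A sketch with made-up 3-grams, joined by the default '_' delimiter:

trigrams <- c('new_york_city', 'new_york_times', 'york_city_hall')

tkc <- token_stats$new(x_vec = trigrams, n_gram_delimiter = '_')

tkc$collocation_words()                  # co-occurrence frequency table of the words

# tkc$print_collocations(word = 'york')  # frequency table for the word 'york'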

the string_dissimilarity_matrix function returns a string-dissimilarity matrix using either the dice, levenshtein or cosine distance. The input can only be a character string vector. If the method is dice, then the dice coefficient (similarity) is calculated between two strings for a specific number of character n-grams ( dice_n_gram ).
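
A sketch of the dice and levenshtein options on a made-up vector (the dice coefficient itself is the similarity 2 * |shared n-grams| / (total n-grams) between the character n-gram sets of two strings):

tkd <- token_stats$new(x_vec = c('apple', 'apples', 'ample'))

# dice-based dissimilarity on character 2-grams
dsm_dice <- tkd$string_dissimilarity_matrix(dice_n_gram = 2, method = 'dice')

# levenshtein (edit) distance between the raw strings
dsm_lev <- tkd$string_dissimilarity_matrix(method = 'levenshtein')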

the look_up_table function returns a look-up list where the list names are the n-grams and the list vectors are the words associated with those n-grams. The words for each n-gram can be retrieved using the print_words_lookup_tbl function. The input can only be a character string vector.
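
Continuing the toy object tks, a sketch of the look-up table (the queried key 'app' is just an assumed character 3-gram occurring in the toy words):

lut <- tks$look_up_table(n_grams = 3)

names(lut)                                    # the n-grams that act as list names

# tks$print_words_lookup_tbl(n_gram = 'app')  # words associated with the 'app' n-gram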

Methods

token_stats$new(x_vec = NULL, path_2folder = NULL, path_2file = NULL, file_delimiter = ' ', n_gram_delimiter = "_")
--------------
path_2vector()
--------------
freq_distribution()
--------------
print_frequency(subset = NULL)
--------------
count_character()
--------------
print_count_character(number = NULL)
--------------
collocation_words()
--------------
print_collocations(word = NULL)
--------------
string_dissimilarity_matrix(dice_n_gram = 2, method = "dice", split_separator = " ", dice_thresh = 1.0, upper = TRUE, diagonal = TRUE, threads = 1)
--------------
look_up_table(n_grams = NULL)
--------------
print_words_lookup_tbl(n_gram = NULL)

Methods

Public methods


Method new()

Usage
token_stats$new(
  x_vec = NULL,
  path_2folder = NULL,
  path_2file = NULL,
  file_delimiter = "\n",
  n_gram_delimiter = "_"
)
Arguments
x_vec

either NULL or a string character vector

path_2folder

either NULL or a valid path to a folder (each file in the folder should include words separated by a delimiter)

path_2file

either NULL or a valid path to a file

file_delimiter

either NULL or a character string specifying the file delimiter

n_gram_delimiter

either NULL or a character string specifying the n-gram delimiter. It is used in the collocation_words function


Method path_2vector()

Usage
token_stats$path_2vector()

Method freq_distribution()

Usage
token_stats$freq_distribution()

Method print_frequency()

Usage
token_stats$print_frequency(subset = NULL)
Arguments
subset

either NULL or a vector specifying the subset of the data to keep (the rows of the frequency distribution to print)


Method count_character()

Usage
token_stats$count_character()

Method print_count_character()

Usage
token_stats$print_count_character(number = NULL)
Arguments
number

a numeric value for the print_count_character function. All words with a number of characters equal to the number parameter will be returned.


Method collocation_words()

Usage
token_stats$collocation_words()

Method print_collocations()

Usage
token_stats$print_collocations(word = NULL)
Arguments
word

a character string for the print_collocations function


Method string_dissimilarity_matrix()

Usage
token_stats$string_dissimilarity_matrix(
  dice_n_gram = 2,
  method = "dice",
  split_separator = " ",
  dice_thresh = 1,
  upper = TRUE,
  diagonal = TRUE,
  threads = 1
)
Arguments
dice_n_gram

a numeric value specifying the n-gram for the dice method of the string_dissimilarity_matrix function

method

a character string specifying the method to use in the string_dissimilarity_matrix function. One of dice, levenshtein or cosine.

split_separator

a character string specifying the string split separator if the method is cosine in the string_dissimilarity_matrix function. The cosine method uses sentences, so for a sentence such as "this_is_a_word_sentence" the split_separator should be "_" (see the sketch after this list of arguments)

dice_thresh

a float number used to threshold the data if the method is dice in the string_dissimilarity_matrix function. It takes values between 0.0 and 1.0. The closer the threshold is to 0.0, the more values of the dissimilarity matrix will take the value 1.0.

upper

either TRUE or FALSE. If TRUE then both lower and upper parts of the dissimilarity matrix of the string_dissimilarity_matrix function will be shown. Otherwise the upper part will be filled with NA's

diagonal

either TRUE or FALSE. If TRUE then the diagonal of the dissimilarity matrix of the string_dissimilarity_matrix function will be shown. Otherwise the diagonal will be filled with NA's

threads

a numeric value specifying the number of cores to use in parallel in the string_dissimilarity_matrix function
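
For the cosine method the strings are split into tokens on the split_separator, so a sketch with made-up '_'-separated sentences might be:

snt <- c('this_is_a_word_sentence', 'this_is_another_word_sentence')

tk_snt <- token_stats$new(x_vec = snt)

dsm_cos <- tk_snt$string_dissimilarity_matrix(method = 'cosine', split_separator = '_')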


Method look_up_table()

Usage
token_stats$look_up_table(n_grams = NULL)
Arguments
n_grams

a numeric value specifying the length of the n-grams in the look_up_table function


Method print_words_lookup_tbl()

Usage
token_stats$print_words_lookup_tbl(n_gram = NULL)
Arguments
n_gram

a character string specifying the n-gram to use in the print_words_lookup_tbl function


Method clone()

The objects of this class are cloneable with this method.

Usage
token_stats$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

Examples

library(textTinyR)

expl = c('one_word_token', 'two_words_token', 'three_words_token', 'four_words_token')

tk <- token_stats$new(x_vec = expl, path_2folder = NULL, path_2file = NULL)

#-------------------------
# frequency distribution:
#-------------------------

tk$freq_distribution()

# tk$print_frequency()


#------------------
# count characters:
#------------------

cnt <- tk$count_character()

# tk$print_count_character(number = 4)


#----------------------
# collocation of words:
#----------------------

col <- tk$collocation_words()

# tk$print_collocations(word = 'five')


#-----------------------------
# string dissimilarity matrix:
#-----------------------------

dism <- tk$string_dissimilarity_matrix(method = 'levenshtein')


#------------------------
# build a look-up-table:
#------------------------

lut <- tk$look_up_table(n_grams = 3)

# tk$print_words_lookup_tbl(n_gram = 'e_w')
