NUSS (Mixed N-Grams and Unigram Sequence Segmentation) is an R package designed to segment and simplify sequences using synthetic n-grams sequences, and dynamic unigram sequence matching. This package is particularly useful for text processing and natural language processing (NLP) tasks.
You can install the development version of NUSS from GitHub using the following commands:
# Install devtools if you haven't already
install.packages("devtools")
# Install NUSS from GitHub
devtools::install_github("theogrost/NUSS")
Here are some basic examples to get you started with NUSS.
You can segment sequences using the ngrams_segmentation
function. First, create an n-grams dictionary, and then use the dictionary to segment sequences.
library(NUSS)
# Segment a sequence using built-in dictionary
unigram_sequencer_segmented <- unigram_sequence_segmentation("thisisscience")
print(unigram_sequencer_segmented)
# Example text data
texts <- c(
"this is science",
"science is fascinating",
"this is a scientific approach",
"science is everywhere",
"the beauty of science"
)
# Segment a sequence using n-grams
ngrams_dict <- ngrams_dictionary(texts)
ngrams_segmented <- ngrams_segmentation("thisisscience", ngrams_dict)
print(ngrams_segmented)
# Segment a sequence using nuss - combined function
nuss_segmented <- nuss("thisisscience", texts)
print(segmented)
You can customize the segmentation process with additional parameters such as simplify
, omit_zero
, and score_formula
.
# Custom segmentation with additional parameters
ngrams_dict <- ngrams_dictionary(texts, clean = TRUE, ngram_min = 1, ngram_max = 5, points_filter = 1)
custom_segmented <- ngrams_segmentation(
"thisisscience",
ngrams_dict,
simplify = TRUE,
omit_zero = TRUE,
score_formula = "points / words.number ^ 2"
)
print(custom_segmented)
NUSS is licensed under the GPL-3.0 License.
For any questions or issues, please open an issue on GitHub or contact the maintainer.
I hope you find NUSS useful for your text processing tasks!
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.