In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.
From Wikipedia
This paragraph gives a nice explanation of what stemming is. Much of academic
work on stemming was focused on English Language and it's somewhat hard to find
stemming algorithms for other languages. ptstem
tries to fix this, by
providing a comprehensive interface for Portuguese Language stemming algorithms.
The implemented algorithms are:
R
in the rslp
package.R
in the hunspell
package.R
it's implemented in the SnowballC
package.ptstem
has only one important function that is called ptstem
. You can easily
stem a text by passing it to ptstem
.
library(ptstem)
text <- "Em morfologia linguística e recuperação de informação a stemização (do inglês, stemming) é o processo de reduzir palavras flexionadas (ou às vezes derivadas) ao seu tronco (stem), base ou raiz, geralmente uma forma da palavra escrita. O tronco não precisa ser idêntico à raiz morfológica da palavra; ele geralmente é suficiente que palavras relacionadas sejam mapeadas para o mesmo tronco, mesmo se este tronco não for ele próprio uma raiz válida. O estudo de algoritmos para stemização tem sido realizado em ciência da computação desde a década de 60. Vários motores de buscas tratam palavras com o mesmo tronco como sinônimos como um tipo de expansão de consulta, em um processo de combinação."
ptstem(text)
By default ptstem
uses the rslp algorithm to stem, and it complete stems with
the most frequent word in the text (This is explained later). Is this example
it's a little hard to see improvements with stemming, because the text doesn't
contain many words with the same root. Let's look at a more simple example.
text <- c("avião", "aviões", "aviação", "viação", "aves", "balão", "balões") ptstem(text)
You can return the suffix stripped words (without completion) by setting the
argument complete = FALSE
.
ptstem(text, complete = FALSE)
You can also change the algorithm used to stem by setting the algorithm
argument.
ptstem(text, algorithm = "hunspell", complete = FALSE)
The hunspell stemmer is not a suffix-stripping algorithm, so it can find related
words that has the same sufffix. It happened here with the word "aviação" that
was related to "viação" instead of "avião" and "aviões". Also you can see that
hunspell is returning valid words, even with complete = FALSE
, but it does not
necessarily returns words that appear in the text, see:
ptstem("aviões", "hunspell", complete = FALSE)
To use the Porter stemmer, simply tweak the algorithm
argument again.
ptstem(text, algorithm = "porter", complete = FALSE)
As Porter stemmer, is a general algorithm, it has some problems when detecting
irregular forms of words. In this example, the stemming didn't relate any words,
if you hadn't used the complete = FALSE
argument, you wouldn't have noticed any
difference between the input and the output vectors.
ptstem(text, algorithm = "porter")
ptstem
has two other arguments that can be used to ignore words in stemming.
n_char
: minimum number of characters of words to be stemmedignore
: vector of words and regex's to igoreSometimes you have some words in a text that you don't want to stem, like
proper names or words in other languages and it's usefull to ignore them.
Sometimes you also have very small words, that if stemmed they loose their
meaning, the rslp
algorithm has some rules about words lenghts, but hunspell
does not. That's why n_char
argument is available.
text <- c("obama", "gostei", "gostou", "gostamos", "é", "e")
ptstem(text, complete = FALSE)
Here rslp
stemmed "obama" to "obam" and "firmware" to "firmw". You can choose
to not stem theese words by setting the ignore
parameter.
ptstem(text, complete = FALSE, ignore = c("obama"))
By default, ptstem
does not stem words with less then three characters. If you
set for at least 1 characters.
ptstem(text, complete = FALSE, n_char = 1)
You can see that "e" and "é" were united.
It's also possible to ignore regex's, using the ignore
argument.
ptstem(text, complete = FALSE, ignore = c("go."))
This doesn't stem words that start with "go".
The goal of stemming algorithms is to group related words and to separate unrelated words. With this in mind, you can talk about two kinds of possible errors when stemming:
To measure these errors the function performance
was implemented. It returns a data.frame
with 3 columns. The name of the stemmer and 2 metrics:
Remember that OI is 0 if you don't stem. So I think the true objective of a stemming algorithm is to reduce UI without augmenting OI too much.
ptstem
package provides a dataset of grouped words for the portuguese language (found in this link). It's in this dataset that performance
function calculates the metrics described above.
See results:
performance()
This is not the only approach for measuring performance of the those algorithms. The article Assessing the impact of Stemming Accuracy on Information Retrieval – A multilingual perspective describes various ways to analyse stemming performance.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.