Improving the word cloud: custom stopwords
In petro.One: Statistics and Text Mining for Oil and Gas Papers from OnePetro Metadata

Background

In the previous article we created our first word cloud. A word cloud helps us quickly find the focus of a document: the more often a word occurs, the larger it appears in the plot.

The problem with that first word cloud was that it showed common words such as using, use, new, approach and case. These words distract us from the technical orientation of the papers we are researching.

In this session, we will eliminate these common usage words with a customized dictionary or list of words.

Load the metadata of the 2918 papers again

library(petro.One)
library(tm)
library(tibble)

use_example(1)

p1 <- onepetro_page_to_dataframe("neural_network-s0000r1000.html")
p2 <- onepetro_page_to_dataframe("neural_network-s1000r1000.html")
p3 <- onepetro_page_to_dataframe("neural_network-s2000r1000.html")
p4 <- onepetro_page_to_dataframe("neural_network-s3000r1000.html")

nn_papers <- rbind(p1, p2, p3, p4)
nn_papers

Convert and clean document for text mining

Note that here we are removing only some basic common words, the ones supplied by the text mining package tm. This is the same function we used in the previous session. It does not eliminate words like using, use, etc.

vdocs <- VCorpus(VectorSource(nn_papers$book_title))
vdocs <- tm_map(vdocs, content_transformer(tolower))      # to lowercase
vdocs <- tm_map(vdocs, removeWords, stopwords("english")) # remove stopwords
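A quick membership test shows which words the default list covers. This is a minimal sketch, assuming the tm package is installed; the `candidates` vector is a hypothetical name introduced only for illustration.

```r
library(tm)

# Hypothetical list of words to look up in tm's default English stopwords
candidates <- c("using", "use", "new", "approach", "case", "the", "of")

# TRUE means the word is already removed by stopwords("english")
in_default <- candidates %in% stopwords("english")
names(in_default) <- candidates
in_default
```

Words reported as FALSE survive the default cleaning and are candidates for a custom list.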

Create our own custom stopwords

We can decide which words to stop by inspecting the data frame tdm.df from the previous article. Here are some:

# our custom vector of stop words

my_custom_stopwords <- c("approach", 
                      "case", 
                      "low",
                      "new",
                      "north",
                      "real",
                      "use", 
                      "using"
                      )

Remove custom stopwords from the document corpus

# this is one way to remove custom stopwords
vdocs <- tm_map(vdocs, removeWords, my_custom_stopwords)
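To see how removeWords treats word boundaries, it can be applied directly to a character vector. The sketch below assumes the tm package is installed; the sample strings and the name `cleaned` are made up for illustration.

```r
library(tm)

# Made-up sample strings, not taken from the OnePetro data
x <- c("using useful tools", "real-time case study")

# removeWords matches whole words only
cleaned <- removeWords(x, c("use", "using", "real", "case"))
cleaned
```

In the first string, "using" is removed but "useful" is kept, because "use" is only matched as a whole word. In the second string, the hyphen counts as a word boundary, so removing "real" from "real-time" leaves the fragment "-time". Leftover whitespace can be collapsed afterwards with stripWhitespace.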

Summary table of word frequencies

tdm <- TermDocumentMatrix(vdocs)

tdm.matrix <- as.matrix(tdm)
tdm.rs <- sort(rowSums(tdm.matrix), decreasing=TRUE)
tdm.df <- data.frame(word = names(tdm.rs), freq = tdm.rs, stringsAsFactors = FALSE)
as_tibble(tdm.df)                          # print as a tibble to avoid long data frame output
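The removal can also be confirmed programmatically rather than by reading the table. The following is a sketch on a toy corpus, assuming the tm package is installed; `titles`, `toy_stopwords`, `toy_freq` and `leftover` are hypothetical names standing in for the real objects.

```r
library(tm)

# Toy titles standing in for nn_papers$book_title (hypothetical data)
titles <- c("A New Approach Using Neural Networks",
            "Neural Network Case Study",
            "Using Neural Networks for Log Prediction")
toy_stopwords <- c("approach", "case", "new", "using")

# Same cleaning steps as in the article, applied to the toy corpus
toy <- VCorpus(VectorSource(titles))
toy <- tm_map(toy, content_transformer(tolower))
toy <- tm_map(toy, removeWords, stopwords("english"))
toy <- tm_map(toy, removeWords, toy_stopwords)

# Term frequencies, most frequent first
toy_freq <- sort(rowSums(as.matrix(TermDocumentMatrix(toy))), decreasing = TRUE)

# Custom stopwords that still appear as terms; empty if removal worked
leftover <- intersect(toy_stopwords, names(toy_freq))
leftover
```

On the real data, the same check would be `intersect(my_custom_stopwords, tdm.df$word)`.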

You see now that using is no longer at the top of the table, as it was before. Let's plot the word cloud.

Word cloud with words that occur at least 50 times

library(wordcloud)

set.seed(1234)
wordcloud(words = tdm.df$word, freq = tdm.df$freq, min.freq = 50,
          max.words=200, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Dark2"))

Now the word cloud looks more technically oriented. Words of common use have been removed, which gives us more clarity.
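Before plotting, it can help to preview which words pass the min.freq threshold. The sketch below uses only base R and a made-up frequency table; `toy_df`, its numbers and `min_freq` are assumptions for illustration, not results from the OnePetro data.

```r
# Made-up frequency table with the same columns as tdm.df
toy_df <- data.frame(word = c("neural", "network", "reservoir", "well", "data"),
                     freq = c(900, 450, 120, 49, 12),
                     stringsAsFactors = FALSE)

min_freq <- 50

# Keep only the words that occur at least min_freq times
plotted <- toy_df[toy_df$freq >= min_freq, ]
nrow(plotted)   # number of words that would reach the plot
```

Applying the same filter to tdm.df shows how many words the cloud will contain and whether the threshold of 50 is too strict or too loose.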

What's next

There are a few things that we will notice in this phase of the text mining: (1) words that share a root (log vs. logs, network vs. networks, system vs. systems, etc.); (2) words that are the same but hyphenated differently (real time vs. real-time, 3D vs. 3-D, etc.); and (3) words with attached punctuation such as commas, periods or exclamation marks (-time, field,).

We will work on them in the next articles.

petro.One documentation built on May 2, 2019, 3:10 p.m.