ptwikiwords

Words used in Portuguese Wikipedia

This data package contains a dataset of the words found in a random sample of about 15,000 pages from the Portuguese Wikipedia.

Installing

The package can be installed from GitHub with devtools:

devtools::install_github("dfalbel/ptwikiwords")

Using

After installing the package, you can load the dataset using:

library(ptwikiwords)
data(ptwikiwords)
head(ptwikiwords)
#> # A tibble: 6 × 3
#>    word  count check
#>   <chr>  <int> <lgl>
#> 1    de 210954  TRUE
#> 2     a 109652  TRUE
#> 3     e 100028  TRUE
#> 4     o  87839  TRUE
#> 5    em  67040  TRUE
#> 6    do  59489  TRUE

The dataset contains 3 columns: word (character), count (integer) and check (logical).
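As a rough illustration of working with these columns, the sketch below rebuilds the six rows printed above by hand, so it runs without the package installed. The names `top_words`, `kept` and `share` are hypothetical and not part of the package. The computed share is relative to these six rows only, not to the full dataset.

```r
# Hypothetical stand-in for head(ptwikiwords): the six rows shown above,
# typed in by hand so the example does not need the package.
top_words <- data.frame(
  word  = c("de", "a", "e", "o", "em", "do"),
  count = c(210954L, 109652L, 100028L, 87839L, 67040L, 59489L),
  check = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Inspect the column types: character, integer, logical.
str(top_words)

# Keep the rows where check is TRUE, then compute each word's share of the
# total count among these six rows (an assumption-free toy calculation,
# not a corpus-wide frequency).
kept <- top_words[top_words$check, ]
kept$share <- kept$count / sum(kept$count)

# Show the result ordered by descending count.
kept[order(-kept$count), c("word", "count", "share")]
```

With the package installed, the same steps should apply to `ptwikiwords` itself in place of `top_words`.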

Here is a wordcloud of the first 300 words for which check is TRUE:

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(wordcloud))
words_filter <- ptwikiwords %>%
  filter(check) %>%  # keep rows where check is TRUE
  slice(1:300)
wordcloud(words_filter$word, words_filter$count)

Here is a wordcloud of the first 100 2-grams in the ngrams dataset:

data(ngrams)
words_filter <- ngrams %>%
  slice(1:100)
wordcloud(words_filter$ngrams, words_filter$count)
#> Warning in wordcloud(words_filter$ngrams, words_filter$count): com o could
#> not be fit on page. It will not be plotted.
#> Warning in wordcloud(words_filter$ngrams, words_filter$count): o primeiro
#> could not be fit on page. It will not be plotted.
#> Warning in wordcloud(words_filter$ngrams, words_filter$count): é um could
#> not be fit on page. It will not be plotted.
#> Warning in wordcloud(words_filter$ngrams, words_filter$count): para a could
#> not be fit on page. It will not be plotted.
#> Warning in wordcloud(words_filter$ngrams, words_filter$count): de um could
#> not be fit on page. It will not be plotted.
#> Warning in wordcloud(words_filter$ngrams, words_filter$count): janeiro de
#> could not be fit on page. It will not be plotted.
#> Warning in wordcloud(words_filter$ngrams, words_filter$count): é uma could
#> not be fit on page. It will not be plotted.
#> Warning in wordcloud(words_filter$ngrams, words_filter$count): setembro de
#> could not be fit on page. It will not be plotted.



dfalbel/ptwikiwords documentation built on May 15, 2019, 5:10 a.m.