Download, Extract, and Transform Wikipedia Pagecount Dumps

Status

lines of R code: 267, lines of test code: 0

Version

0.1.1.90000 (2017-08-18 10:27:06)

Description

This package is a worker to download, extract, and transform Wikipedia pagecount dumps.

License

GPL (>= 2)

Author

Peter Meissner [aut, cre]

Citation

citation("wikipediadumps")

BibTeX for citing:

toBibtex(citation("wikipediadumps"))

Installation

Latest development version from GitHub:

devtools::install_github("petermeissner/wikipediadumps")
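
If the devtools package is not installed yet, it is available from CRAN:

# install devtools first (needed for install_github)
install.packages("devtools")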

Usage

Get dumps

# load package
library(wikipediadumps)

# set directory to use
wpd_options(directory = "~/wikipediadumps")

# download the dumps
get_dumps("200802")

# list the downloaded files for one day (2007-12-31)
flist <-
  list.files(
    wpd_options()$directory,
    pattern = "-20071231.*\\.gz$",
    full.names = TRUE
  )
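
An optional sanity check, not from the original example, that the pattern actually matched some files:

# stop early if no dump files were found for that day
stopifnot(length(flist) > 0)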

# extract data from the gz files and save it to files separated by language
system.time({
  res <-
    do.call(
      rbind,
      filter_dumps_to_file(
        flist = flist[1],
        wiki =
          c("en", "ceb", "sv", "de", "nl",
            "fr", "ru", "it", "es")
          # c("en", "ceb", "sv", "de", "nl",
          #   "fr", "ru", "it", "es",
          #   "pl", "vi", "ja",
          #   "zh", "pt", "ar", "tr",
          #   "id", "fa", "simple",
          #   "ko", "ro", "no", "cs",
          #   "uk", "hu", "fi", "he", "da",
          #   "th", "hi", "ca", "el", "bg",
          #   "sr", "ms", "hr", "sl", "sk", "az",
          #   "eo", "ta", "lt", "sh", "et", "la",
          #   "ka", "nn", "gl", "eu", "be", "kk",
          #   "ur", "hy", "uz", "zh-min-nan",
          #   "vo", "ce", "min"
          #  )
      )
    )
})
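
To get a feel for the result, one can inspect its structure afterwards; a minimal sketch, assuming res is a rectangular object (data.frame or matrix) as the rbind above suggests:

# check dimensions, column names, and the first few rows
str(res)
head(res)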

