```r
knitr::opts_chunk$set(echo = TRUE)
```

The Problem

Providing Wikipedia page view statistics in an accessible and size-minimized format:

- accessible

- minimized

The task at hand is a typical big data problem. The number of data points is so large that key decisions scale enormously, with major effects on data structure, network connection, hardware, hardware costs, time costs and time constraints. Furthermore, many traditional approaches - like simply using more hardware - might simply not be feasible. In addition, developing an execution plan that brings together the goals (accessible data in a 'portable' size), the constraints (cost, time) and the available options is vital. Therefore, a planning phase was needed.

Data

Granularity

There are two types of granularity available: daily aggregates and hourly aggregates.
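To make the hourly format concrete, here is a minimal sketch of fetching and inspecting a single hourly file from the public pageview dumps. The directory layout and file-name pattern follow the listing at https://dumps.wikimedia.org/other/pageviews/ but should be verified before use; the filter on the German Wikipedia is just an example.

```r
# Sketch (not run): download and read one hourly pageviews file.
# The URL pattern is an assumption based on the public dump listing.
url <- paste0(
  "https://dumps.wikimedia.org/other/pageviews/2015/2015-01/",
  "pageviews-20150102-000000.gz"
)
tmp <- tempfile(fileext = ".gz")
download.file(url, tmp, mode = "wb")

# each line: project code, page title, view count, response bytes
pv <- read.table(
  gzfile(tmp),
  col.names  = c("project", "title", "views", "bytes"),
  colClasses = c("character", "character", "integer", "integer"),
  quote = "", comment.char = "", stringsAsFactors = FALSE
)

# example filter: keep only German Wikipedia ("de") rows
pv_de <- pv[pv$project == "de", ]
head(pv_de)
```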

Further data

Size

Based on the daily/hourly aggregates, some extrapolations follow. In terms of network traffic (downloads), downloading, uncompressing and filtering the hourly data is 24 times as much work as for the daily aggregates.
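The following back-of-the-envelope calculation illustrates the kind of extrapolation meant here; the per-file size is a placeholder assumption, not a measured value.

```r
# Rough extrapolation of the yearly hourly-download workload.
# hourly_file_mb is an assumed compressed file size, not a measurement.
hourly_file_mb <- 50
days  <- 365
files <- days * 24            # 24 files per day vs. 365 daily files per year

total_gb <- files * hourly_file_mb / 1024
c(files = files, total_gb = round(total_gb))
```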

Most of the size stems from the request titles, which are repeated for each day (or each hour in the more granular format), and from the fact that the stored titles are those of requests sent by users (e.g. 630 MB for the German Wikipedia on 2015-01-02) instead of the available pages (63 MB for the German Wikipedia on 2018-05-01).

In regard to storage size, hourly aggregates do not take up more space once they have been put into a proper format without unnecessary redundancies.
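One way to remove that redundancy, sketched below with made-up numbers, is to store each title only once and spread the per-day (or per-hour) counts across columns. This is only an illustration of the idea, not necessarily the project's final storage format.

```r
# Illustrative reshaping: repeated titles in long format become
# one row per title with one count column per day (wide format).
library(data.table)

pv_long <- data.table(
  title = c("Main_Page", "Main_Page",
            "R_(programming_language)", "R_(programming_language)"),
  date  = as.Date(c("2015-01-01", "2015-01-02",
                    "2015-01-01", "2015-01-02")),
  views = c(120345L, 118230L, 1534L, 1602L)
)

# titles are no longer repeated once the data is spread by date
pv_wide <- dcast(pv_long, title ~ date, value.var = "views", fill = 0L)
pv_wide
```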

Quality

Actions taken so far

Problems ahead and bottlenecks

The real challenge is processing the hourly data, because of the download and uncompression / filtering times needed. These steps will have to be parallelized across computers with good internet connections; a rough single-machine sketch of this step follows below.
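The sketch fans the download / uncompress / filter step for one day's 24 hourly files out over local cores; across several machines the same idea applies, just distributed. The URLs and the German-Wikipedia filter are illustrative.

```r
# Sketch: parallel download + filter of 24 hourly files on one machine.
library(parallel)

hours <- sprintf("%02d0000", 0:23)
urls  <- paste0(
  "https://dumps.wikimedia.org/other/pageviews/2015/2015-01/",
  "pageviews-20150102-", hours, ".gz"
)

process_one <- function(url) {
  tmp <- tempfile(fileext = ".gz")
  download.file(url, tmp, mode = "wb", quiet = TRUE)
  pv <- read.table(
    gzfile(tmp),
    col.names = c("project", "title", "views", "bytes"),
    quote = "", comment.char = "", stringsAsFactors = FALSE
  )
  unlink(tmp)
  pv[pv$project == "de", c("title", "views")]   # keep German Wikipedia only
}

# one worker per core; on Windows use parLapply() instead of mclapply()
res    <- mclapply(urls, process_one, mc.cores = detectCores())
de_day <- do.call(rbind, res)
```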

For processing the daily files, no further hardware is needed - only for storing and delivering the end results.

For processing the hourly data, a lot more hardware and network bandwidth is needed: 4 to 6 times the test system. Furthermore, processing hourly data will need much more attention to distributing and monitoring the execution. On the upside, a lot of the work on processing hourly data has already been done by GESIS (https://github.com/gesiscss/wiki-download-parse-page-views); still, those procedures have to be adapted to fit into a general framework.

hardware

total: 400 - 1000 €

manpower

total: 67 h (2680 €)

optional


