README.md

jailbreakr

Warning: This project is in the early scoping stages; do not use it for anything other than amusement/frustration purposes.

Data Liberator. Extracts tabular data that people have put into non-tabular structures inside a program designed to hold tables.

Installation

Requires the development version of xml2 (for xml_find_lgl) as well as cellranger and linen. Chances are you'll want rexcel too.

devtools::install_github(c("hadley/xml2",
                           "rsheets/linen",
                           "rsheets/cellranger",
                           "rsheets/rexcel",
                           "rsheets/jailbreakr"))

Goals

There are two large Excel spreadsheet corpora; it would be nice to use these to get a feel for what fraction of spreadsheets we can handle and for the range of non-table-like data out there.

the things people do to data

The first is the EUSES corpus of 4,447 spreadsheets (16,853 worksheets). These are all xls files (rather than xlsx), so they need either an xls -> xlsx conversion or support in jailbreakr for xls files.
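One possible route for the xls -> xlsx conversion is LibreOffice's headless converter. The sketch below is only an illustration: it assumes the `soffice` binary is on the PATH, and `xls_convert_args()` and `convert_xls_dir()` are hypothetical helpers, not part of jailbreakr.

```r
# Sketch only: assumes LibreOffice's `soffice` binary is on the PATH.
# Both helpers are hypothetical and not part of jailbreakr.

# Build the command-line arguments for a headless LibreOffice conversion.
xls_convert_args <- function(files, out_dir) {
  c("--headless", "--convert-to", "xlsx",
    "--outdir", shQuote(out_dir), shQuote(files))
}

# Convert every .xls file in `in_dir`, writing .xlsx copies to `out_dir`.
convert_xls_dir <- function(in_dir, out_dir) {
  files <- list.files(in_dir, pattern = "\\.xls$", full.names = TRUE,
                      ignore.case = TRUE)
  # Nothing to do if the directory holds no .xls files.
  if (length(files) == 0L) {
    return(invisible(character(0)))
  }
  if (!nzchar(Sys.which("soffice"))) {
    stop("`soffice` was not found on the PATH")
  }
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  system2("soffice", xls_convert_args(files, out_dir))
  # Return the input paths that were handed to the converter.
  invisible(files)
}
```

Something like `convert_xls_dir("euses", "euses-xlsx")` would then be run once over the corpus (both directory names are made up here). Whether LibreOffice preserves formatting and merged cells faithfully enough for jailbreakr's purposes would need checking.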

The second, larger one is the Enron corpus of 15,770 spreadsheets (79,983 worksheets).
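To get a feel for what fraction of a corpus can be handled, one rough approach is to try a reader on every file and count how many return without error. This is a minimal sketch: `handled_fraction()` is a hypothetical helper, and `reader` stands in for whatever loading function ends up being used; neither is a jailbreakr API.

```r
# Sketch only: `reader` is a placeholder for a workbook-loading function.
handled_fraction <- function(files, reader) {
  ok <- vapply(files, function(f) {
    # A file counts as "handled" if the reader returns without an error.
    tryCatch({
      reader(f)
      TRUE
    }, error = function(e) FALSE)
  }, logical(1))
  # Proportion of files that loaded; NaN if `files` is empty.
  mean(ok)
}
```

This only measures "did not crash", which is a weak proxy; a file that loads but yields the wrong table would still count as handled.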

Roadmap

Ideas

Can we feed things through openrefine or something?



rsheets/jailbreakr documentation built on May 28, 2019, 3:31 a.m.