split_metadata: Split metadata from a worksheet
In rsheets/jailbreakr: Extract Data From Human-Readable 'Excel' Spreadsheets

Description Usage Arguments Details Value

View source: R/split_metadata.R

Split metadata from a worksheet or region of a worksheet (a worksheet view).

split_metadata(x, include_merged = TRUE, min_jump = 2, min_data_block = 5)

split_metadata_find(x, include_merged = TRUE, min_jump = 2,
  min_data_block = 5)

split_metadata_apply(x, n)

`x`	A worksheet or a worksheet view
`include_merged`	Include merged cells when computing row width. This is generally going to be what you want to do, but if you have a sheet with metadata rows that are entirely merged across, you probably will do better by turning this option off (in that case they'd be counted as a single column wide).
`min_jump`	The minimum number of columns increase that we'll count as moving from the metadata block to the data block. A single column (1) is probably going to be prone to false positives, and in sheets with this pattern the jump is often quite large.
`min_data_block`	The minimum number of rows without decreasing in size before we can conclude that we're in the data block. If we run off the end of the worksheet before reaching this number, we'll conclude no metadata was found.
`n`	A scalar integer representing the number of rows to consider to be metadata (i.e., equivalent to the return value of `split_metadata_find`.

There is a common pattern in spreadsheets where data is stored as:

mmm
mmm
HHHHHHHH
dddddddd
dddddddd

where m is some metadata (perhaps indicating table name, creator, dates, etc), H is the header and d is the actual data. This function will split the metadata (m) part off, leaving a table that is more suitable for further processing. In many ways this is like the skip argument to readxl::read_excel and read.csv, but we will retain the metadata (somewhere!).

The idea here is that the metadata block starts when we get a shift in the number of non-blank cells. There needs to be some heuristics here to help, and things will need to be tuneable: do merged cells count as non-empty (include_merged; default is to include them), how big a jump we look for (min_jump; default is 2 columns), how many rows of the same size do we look for in the data block (min_data_block; default is 5 rows).

Other things that might be useful, but which aren't supported yet, include looking for different colours and fonts in the metadata and the main block; when we switch from one to the other we're likely to see things like a change here.

For split_metadata and split_metadata_apply, a worksheet view; in this view the data$metadata element will be a worksheet view of the metadata. For split_metadata_find, a single integer representing the number of rows of metadata found (with zero indicating no metadata).

rsheets/jailbreakr documentation built on May 28, 2019, 3:31 a.m.