schema-class: The 'schema' class (S4) and its methods
In tabshiftr: Reshape Disorganised Messy Data

The `schema` class (S4) and its methods

Description

A schema records where each piece of information is stored in a table of data.

Slots

cluster: [list(1)]
description of clusters in the table.
format: [list(1)]
description of the table format.
variables: [named list(.)]
description of identifying and observed variables.

Setting up schema descriptions

This section outlines the currently recommended strategy for setting up schema descriptions. For example tables and the respective schemas, see the vignette.

Variables: Clarify which are the identifying variables and which are the observed variables. Make sure not to mistake a listed observed variable for an identifying variable.
Clusters: Determine whether there are clusters and, if so, find the origin (top left cell) of each cluster and provide the required information in setCluster(top = ..., left = ...). It is advised to treat a table that contains meta-data in the top rows as a cluster, as this is often the case with implicit variables. All variables need to be specified in each cluster (in case clusters are all organised in the same arrangement), or relative = TRUE can be used. Data may be organised into clusters a) whenever a set of variables occurs more than once in the same table, nested into another variable, or b) when the data are organised into separate spreadsheets or files according to one of the variables (depending on the context, these issues can also be solved differently). In both cases, the variable responsible for clustering (the cluster ID) can be either an identifying variable or a categorical observed variable:
- in case the cluster ID is an identifying variable, provide its name in setCluster(id = ...) and specify it as an identifying variable (setIDVar)
- in case it is an observed variable, simply provide setCluster(..., id = "observed").
Meta-data: If needed, provide information about the format (setFormat).
Identifying variables: Determine the following:
- is the variable available at all? This is particularly important when the data are split up into tables that sit in separate spreadsheets or files. Often the variable that splits up the data (and thus identifies the clusters) is no longer explicitly available in the table. In such a case, provide the value in setIDVar(..., value = ...).
- all columns in which the variable values sit.
- in case the variable is in several columns, determine additionally the row in which its values sit. In this case, the values will look like they are part of a header.
- in case the variable must be split off from another column, provide a regular expression that results in the target subset via setIDVar(..., split = ...).
- in case the variable is distinct from the main table, provide the explicit (non-relative) position and specify setIDVar(..., distinct = TRUE).
Observed variables: Determine the following:
- all columns in which the values of the variable sit.
- the unit and conversion factor.
- in case the variable is not tidy, go through the following cases one after the other:
  - in case the variable is nested in a wide identifying variable, determine in addition to the columns in which the values sit also the rows in which the variable name sits.
  - in case the names of the variable are given as a value of an identifying variable, give the column name as setObsVar(..., key = ...), together with the name of the respective observed variable (as it appears in the table) in value.
  - in case the name of the variable is the ID of clusters, specify setObsVar(..., key = "cluster", value = ...), where value holds the cluster number the variable refers to.
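
The strategy above can be sketched in code. The following is a minimal, non-authoritative example for two hypothetical tables; the table layouts, variable names and values are invented for illustration. The arguments name, columns and rows, as well as decimal in setFormat, are not named in the text above and are assumed from the package's function signatures, which may differ between versions of tabshiftr. The key and value arguments of setObsVar are left out on purpose, because the form they expect should be checked against the installed version.

```r
library(tabshiftr)

# Hypothetical table 1 (one file per territory, no clusters):
#   column 1: labels such as "2019_maize" (year and commodity combined)
#   column 2: harvested area
#   column 3: production

# Identifying variable that is not available in the table itself:
# provide its value explicitly.
schema <- setIDVar(name = "territories", value = "unit 1")

# Identifying variables that must be split off from column 1; each regular
# expression results in the target subset of the cell content.
schema <- setIDVar(schema = schema, name = "year",
                   columns = 1, split = ".+?(?=_)")
schema <- setIDVar(schema = schema, name = "commodities",
                   columns = 1, split = "(?<=_).*")

# Observed variables: the columns in which their values sit.
schema <- setObsVar(schema = schema, name = "harvested", columns = 2)
schema <- setObsVar(schema = schema, name = "production", columns = 3)

# Meta-data about the format (assumed argument: decimal).
schema <- setFormat(schema = schema, decimal = ".")

# Hypothetical table 2: two clusters stacked vertically, with origins
# (top left cells) in column 1 at rows 1 and 6. The cluster ID is an
# identifying variable, so it is named in setCluster(id = ...) and also
# specified with setIDVar.
clustered <- setCluster(id = "territories", left = 1, top = c(1, 6))
clustered <- setIDVar(schema = clustered, name = "territories",
                      columns = 1, rows = c(1, 6))
clustered <- setObsVar(schema = clustered, name = "harvested", columns = 3)
```

Passing the schema explicitly through the schema argument avoids relying on a pipe operator; the calls can equally be chained. Both objects only describe where the values sit; they do not touch any data until they are applied to a table.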