
Bird's Eye View of Motus Data

What data look like

Here's a segment of data from a receiver (with a single antenna):

Receiver R
                                                 Time ->
          \==========================================================================\
   Tag A: /         1-----1--1----1-----1-----1            4---4-----4--4-------4--4-/
          \. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . \
   Tag B: /                       3-----3--3--3--3--3-------3----3--3                /
          \. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . \
   Tag C: /            2--2---2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2   /
          \==========================================================================\
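The picture can be mirrored in a small data structure. The sketch below is only an illustration, not the motus schema: the `Hit` fields and all timestamps are hypothetical, and the digits in the picture are read as run identifiers.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    tag: str      # tag label from the picture ("A", "B" or "C")
    run_id: int   # digit drawn for the hit in the picture
    ts: float     # detection time in arbitrary units (made-up values)


# A few hits loosely transcribed from the picture; timestamps are invented.
hits = [
    Hit("A", 1, 10.0), Hit("C", 2, 13.0), Hit("A", 1, 16.0),
    Hit("C", 2, 16.5), Hit("A", 1, 19.0), Hit("B", 3, 24.0),
    Hit("B", 3, 30.0), Hit("A", 4, 52.0), Hit("A", 4, 56.0),
]


def group_into_runs(hits):
    """Collect hits by run identifier, ordered by time within each run."""
    runs = defaultdict(list)
    for hit in sorted(hits, key=lambda h: h.ts):
        runs[hit.run_id].append(hit)
    return dict(runs)


runs = group_into_runs(hits)
# One summary row per run: (tag, number of hits, first time, last time).
summary = {rid: (hs[0].tag, len(hs), hs[0].ts, hs[-1].ts)
           for rid, hs in runs.items()}
```

Each row of the picture then corresponds to the runs that share a tag, and each digit sequence to one entry of `runs`.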

Complication: data are processed in batches

The picture above is complicated by several facts:

It follows that:

Batches

A batch is the result of processing one collection of raw data files from a receiver. Batches are not inherent in the data, but instead reflect how data were processed. Batches arise in these ways:

Batches are artificial divisions in the data stream, so runs of hits will often cross batch boundaries. Adding this complication to the picture above gives this:

Receiver R
                                                 Time ->
          \====================|=================================|=====================\
   Tag A: /         1-----1--1-|---1-----1-----1            4---4|-----4--4-------4--4-/
          \. . . . . . . . . . | . . . . . . . . . . . . . . . . | . . . . . . . . . . \
   Tag B: /                    |   3-----3--3--3--3--3-------3---|-3--3                /
          \. . . . . . . . . . | . . . . . . . . . . . . . . . . | . . . . . . . . . . \
   Tag C: /            2--2---2|--2--2--2--2--2--2--2--2--2--2--2|--2--2--2--2--2--2   /
          \====================|=================================|=====================\
                               |                                 |
          <---- Batch N ------>|<------- Batch N+1 ------------->|<----- Batch N+2 ---->
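To make the effect of batch boundaries concrete, the following sketch assigns each hit to a batch by time and lists the runs that end up spread over more than one batch. It is a toy model under stated assumptions: the function names, the boundary times, and the hit times are all hypothetical, and batches are numbered 0, 1, 2 to stand in for N, N+1, N+2.

```python
from bisect import bisect_right
from collections import defaultdict


def batch_index(ts, boundaries):
    """Index of the batch containing time ts, given sorted boundary times.

    A hit exactly on a boundary is assigned to the later batch.
    """
    return bisect_right(boundaries, ts)


def runs_crossing_batches(hits, boundaries):
    """Map run_id -> sorted batch indices, for runs seen in 2+ batches."""
    batches_of_run = defaultdict(set)
    for run_id, ts in hits:
        batches_of_run[run_id].add(batch_index(ts, boundaries))
    return {run_id: sorted(b) for run_id, b in batches_of_run.items()
            if len(b) > 1}


# (run_id, time) pairs; boundaries at t=20 and t=54 split time into
# batches 0, 1 and 2.
hits = [(1, 10), (1, 16), (1, 19), (1, 24), (1, 30),
        (2, 13), (2, 25), (2, 50), (2, 60),
        (3, 24), (3, 40), (3, 56),
        (4, 50), (4, 53), (4, 58)]
crossing = runs_crossing_batches(hits, [20, 54])
```

With these invented times every run crosses at least one boundary, and run 2 spans all three batches, as in the picture.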

Receiver Reboots

A receiver reboots when it is powered down and then (possibly much later) powered back up. Reboots often correspond to a receiver:

so motus treats receiver reboots in a special way:

Incremental Distribution of Data

The motusClient R package allows users to build a local copy of the database of all their tags' (or receivers') hits incrementally. A user can regularly call the tagme() function to obtain any new hits of their tags. Because data are processed in batches, tagme() either does nothing, or downloads one or more new batches of data into the user's local DB.

Each new batch corresponds to a set of files processed from a single receiver. A batch record includes these items:

- receiver device ID
- how many hits of the user's tags occurred in the batch
- first and last timestamps of the raw data processed in this batch

Each new batch downloaded will include hits of one or more of the user's tags (or someone's tags, if the batch is for a "receiver" database).

A new batch might also include some GPS fixes, so that the user knows where the receiver was when the tags were detected.

A new batch will include information about runs. This information comes in three versions:

Although the unique runID identifier for a run doesn't change when the user calls tagme(), the number of hits in that run and its status (done or not) might change.
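One way to picture a stable runID with changing contents is an update keyed on runID. The sketch below is an assumption-laden illustration, not how motusClient stores runs: it supposes that each batch carries a run's cumulative hit count and a done flag, and the names `apply_run_updates`, `len`, and `done` are hypothetical.

```python
def apply_run_updates(local_runs, batch_runs):
    """Insert or refresh run records from a new batch in the local copy.

    Both arguments map runID -> {"len": hit count so far, "done": bool}.
    The key never changes; only the values are refreshed.
    """
    for run_id, info in batch_runs.items():
        local_runs[run_id] = dict(info)
    return local_runs


local_runs = {2: {"len": 7, "done": False}}
# A later batch extends run 2 and finishes it, and also starts run 5.
new_batch = {2: {"len": 20, "done": True}, 5: {"len": 3, "done": False}}
apply_run_updates(local_runs, new_batch)
```

After the update, run 2 keeps its identifier but reports more hits and a finished status, while run 5 appears as a new, still-open run.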

Reprocessing Data

motus will occasionally need to reprocess raw files from receivers. There are several reasons:

The (eventual) Reprocessing Contract

Reprocessing can be very disruptive from the user's point of view ("What happened to my hits?"), so motus reprocessing will be:

  1. optional: users should be able to obtain new data without having to accept reprocessed versions of data they already have.

  2. reversible: users should be able to "go back" to a previous version of any reprocessed data they have accepted.

  3. transparent: users will receive a record of what was reprocessed, why, when, what was done differently, and what changed.

  4. all-or-nothing: for each receiver boot session for which users have data, these data must come entirely from either the original processing, or a subsequent single reprocessing event. The user must not end up with an undefined mix of data from original and reprocessed sources.

  5. in-band: the user's copy of data will be updated to incorporate reprocessed data as part of the normal process of updating to obtain new data, unless they choose otherwise. We expect that most users will want to accept reprocessed data most of the time.

Initially, motus data processing might not adhere to this contract, but it is an eventual goal.
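The all-or-nothing rule (4 above) can be stated as a simple check on a user's local data. The following is a minimal sketch assuming each record can be labelled with its boot session and the processing event that produced it; the pair layout and the labels such as `"reprocess-1"` are hypothetical.

```python
from collections import defaultdict


def mixed_boot_sessions(records):
    """Return boot sessions whose records come from more than one
    processing event, which the all-or-nothing rule forbids.

    records: iterable of (boot_session, processing_id) pairs.
    """
    sources = defaultdict(set)
    for boot_session, processing_id in records:
        sources[boot_session].add(processing_id)
    return sorted(b for b, procs in sources.items() if len(procs) > 1)


records = [
    (1, "original"), (1, "original"),
    (2, "reprocess-1"), (2, "reprocess-1"),
    (3, "original"), (3, "reprocess-1"),   # undefined mix: not allowed
]
violations = mixed_boot_sessions(records)
```

Here boot sessions 1 and 2 each draw on a single source, so only boot session 3 is reported.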

Reprocessing simplified: only by boot session

A general reprocessing scenario would look like this:

Receiver R
                                                 Time ->
          \=================!==|=================================|======!==============\
   Tag A: /         1-----1-!1-|---1-----1-----1            4---4|-----4!-4-------4--4-/
          \. . . . . . . . .!. | . . . . . . . . . . . . . . . . | . . .!. . . . . . . \
   Tag B: /                 !  |   3-----3--3--3--3--3-------3---|-3--3 !              /
          \. . . . . . . . .!. | . . . . . . . . . . . . . . . . | . . .!. . . . . . . \
   Tag C: /            2--2-!-2|--2--2--2--2--2--2--2--2--2--2--2|--2--2!-2--2--2--2   /
          \=================!==|=================================|======!==============\
                            !  |                                 |      !
          <---- Batch N ----!->|<------- Batch N+1 ------------->|<-----!Batch N+2 ---->
                            !                                           !
                            !<- Reprocess this period (no, too hard!) ->!

This is what reprocessing would look like if raw data records from an arbitrary stretch of time could be reprocessed. However, this is complicated because runs like 1, 2, and 4 above might lose or gain hits within the reprocessing period, but not outside of it. This might even break an existing run into distinct new runs.

This situation is challenging (NB: not impossible; might be a TODO) to formalize and represent in the database if we want to maintain a full history of processing. For example, if reprocessing deletes some hits from run 2, how do we represent both the old and the new versions of that run?

The complications arise due to runs crossing the reprocessing period boundaries, so for simplicity we should choose a reprocessing period that no runs cross. Currently, that means a boot session, as discussed above.
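The condition "no runs cross the reprocessing period" can be sketched as an interval test. This is an illustration under assumptions: runs are reduced to hypothetical (first hit time, last hit time) pairs, and a boot session is modelled simply as a period wide enough to contain every run in it.

```python
def run_crosses_period(run_start, run_end, period_start, period_end):
    """True if the run's time span is neither entirely inside nor
    entirely outside the period."""
    inside = period_start <= run_start and run_end <= period_end
    outside = run_end < period_start or run_start > period_end
    return not (inside or outside)


def safe_to_reprocess(runs, period_start, period_end):
    """True if no run straddles either end of the period.

    runs: dict mapping run_id -> (first hit time, last hit time).
    """
    return not any(run_crosses_period(s, e, period_start, period_end)
                   for s, e in runs.values())


runs = {1: (10, 30), 2: (13, 60), 3: (24, 56), 4: (50, 58)}
arbitrary_ok = safe_to_reprocess(runs, 18, 55)   # period cut mid-run
session_ok = safe_to_reprocess(runs, 0, 100)     # period containing every run
```

The arbitrary period fails the test because it cuts through existing runs, whereas the period that encloses all runs passes.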

Distributing reprocessed data

The previous section shows why we only reprocess data by boot session. Given that, how do we get reprocessed data to users while fulfilling the reprocessing contract?

Note that a reprocessed boot session will fully replace one or more existing batches and one or more runs, because batches and runs both nest within boot sessions.

Replacement of data by reprocessed versions should happen in-band (5 above), so one approach is this:



jbrzusto/motusServer documentation built on May 19, 2019, 8:19 a.m.