
# Overview of Motus Data Processing

The hardware server (currently sgdata.motus.org) that processes raw files from receivers and generates runs of tag detections hosts several software servers; these are R applications built on the motusServer package.

## The Jobs Database

This is an SQLite database that tracks (re)processing of uploaded, synced, or archived data. The main table is `jobs`, with this schema:

```sql
CREATE TABLE jobs (
    id INTEGER UNIQUE PRIMARY KEY NOT NULL, -- unique job id
    pid INTEGER REFERENCES jobs (id),       -- id of parent job; null if this is a "top-level" job
    stump INTEGER REFERENCES jobs (id),     -- id of top-level job, i.e. ultimate ancestor of this job;
                                            -- equal to `id` if `pid` is null
    ctime FLOAT(53),                        -- timestamp of job creation
    mtime FLOAT(53),                        -- timestamp of latest change to job information
    type TEXT,                              -- short string giving type of job; if type is 'abcdEfg', the job
                                            -- will be handled by a function called 'handleAbcdEfg'
    done INTEGER,                           -- status code: 0 = not completed (maybe not started);
                                            -- 1 = completed successfully; < 0 = error
    queue TEXT,                             -- usually a small integer; queue in which job resides, indicating
                                            -- which running processServer instance is processing or has processed this job
    path TEXT,                              -- filesystem path to job folder, which holds archives or data files
                                            -- used for this job; null if none
    oldpath TEXT,                           -- filesystem path to previous location of job folder; permits recovery in
                                            -- case of a crash between the attempt to move the job and recording that move in the DB
    data JSON,                              -- parameters, logs, and product pointers for this job, as a JSON-encoded
                                            -- object; names of fields generated by the job end in `_`
    motusUserID,                            -- integer; motus ID of user who launched this job; only non-null in top-level jobs
    motusProjectID                          -- integer; motus ID of project (selected by user at upload time) which will
                                            -- own the outputs from this job; only non-null in top-level jobs
)
```
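As noted in the `type` comment above, handler dispatch works by naming convention: a job of type 'abcdEfg' is handled by a function called 'handleAbcdEfg'. A minimal R sketch of that convention (the helper `handlerFor` is illustrative, not part of motusServer):

```r
## Map a job type to its handler name by capitalizing the first letter
## and prefixing "handle"; e.g. "abcdEfg" -> "handleAbcdEfg".
handlerFor <- function(type) {
  paste0("handle", toupper(substring(type, 1, 1)), substring(type, 2))
}

handlerFor("newFiles")   # "handleNewFiles"
```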

R accesses this database via an S3 class called Copse, a simple database-backed object interface. Jobs are represented as Twigs in the Copse, with a tree structure (subjobs within jobs) and arbitrary data fields for parameters and output products.

Writing to the R objects makes immediate changes to the fields in the jobs table, and reading from the R objects returns the most recently written values.
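As a rough illustration of that write-through behaviour (a hypothetical sketch, not the exact Copse API; the job object `j` is invented for the example):

```r
## Hypothetical sketch: field access on a Twig goes straight to the DB.
j$done <- 1        # immediately UPDATEs this job's row in the `jobs` table
d <- j$done        # re-reads the current value from the DB, not a cached copy
```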

## Top-level Jobs

A top-level job is created by one of these events:

- a user uploads an archive of raw receiver files to be processed
- the data server polls an attached receiver for new data (typically hourly)
- an admin requests a re-run of some portion of archived raw files

Each top-level job creates subjobs that perform chunks of the processing. These chunks were chosen somewhat arbitrarily, but with these goals:

- if a chunk fails, it should leave the DB and filesystem in a state where the chunk can be retried, in case a bug is fixed
- if processing is interrupted during a chunk (e.g. power outage, system crash, fatal bug), retrying it should work
- chunks should be conceptually independent, to the extent possible
- chunks that require locking objects (such as receiver databases) should be as small as possible
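To make the chunking concrete, here is a purely illustrative sketch of a top-level handler splitting its work into subjobs; the handler name, subjob type names, and the `newSubJob` helper are all hypothetical:

```r
## Hypothetical sketch: a top-level handler splits processing into
## independently retryable subjobs (all names are illustrative only).
handleUploadFile <- function(j) {
  newSubJob(j, "unpackArchive")    # each chunk is its own job record,
  newSubJob(j, "findTags")         # so a failed chunk can be retried
  newSubJob(j, "exportProducts")   # without redoing earlier chunks
}
```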

## Processing Queues

Top-level jobs are created in either the regular queue `/sgm/queue/0` or the priority queue `/sgm/priority`. By default, uploads go into the former and sync jobs into the latter. Jobs (re)submitted by admin users can be forced into either queue.

From these two top-level queues, one of the processServers claims the job. We've typically been running four normal processServers that claim jobs from queue 0, and two 'high-priority' processServers that claim jobs from the priority queue. There is nothing different about the high-priority servers except the queue from which they are fed; they are intended to allow low-latency processing of data from attached receivers, which arrive frequently and in small quantities. Upload jobs, which might involve very large amounts of data and so take a long time to process, run on the normal processServers so as not to disrupt that low-latency processing.

Once a top-level job has entered a queue, any subjobs it generates are automatically added to the same queue.

## Filesystem Storage of Jobs

Top-level jobs are represented in the filesystem by a folder whose name is the jobID left-padded with zeroes, e.g. `00000001`. Currently, the numbers are padded to 8 digits, allowing for 100 M jobs; that could be changed if needed.
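For example, the folder name for a given job ID can be produced with a fixed-width zero-padded format (a trivial R illustration, not necessarily the exact call motusServer uses):

```r
sprintf("%08d", 12345)   # "00012345": the folder name for job 12345
```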

Jobs begin life in one of the two input queues described above, `/sgm/queue/0` or `/sgm/priority`.

So, e.g., a new upload might begin with the folder `/sgm/queue/0/00012345` containing the uploaded file.

When a processServer is available, it looks at its input queue and claims the first job it finds there, waiting if there are none. (Really, it does blocking reads from a pipe connected to an instance of inotifywait, which watches a folder for file creation or move events.)
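In outline, the claim loop might look like this (a minimal sketch of the mechanism just described, not the actual motusServer code; the inotifywait options shown are assumptions):

```r
## Blocking read from a pipe attached to inotifywait, which reports
## create/move events in the input queue folder.
evt <- pipe("inotifywait -q -m -e create -e moved_to --format %f /sgm/queue/0", "r")
repeat {
  entry <- readLines(evt, n = 1)   # blocks until a job folder appears
  ## ... claim the job by moving its folder to this server's processing queue ...
}
```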

The job is then moved to the claiming processServer's own processing queue.

When a processServer is started, it first checks its processing queue for any unfinished jobs (`done == 0`) and runs those before looking at its input queue. This allows resumption of jobs interrupted by a server outage.
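Given the schema above, that restart check amounts to a query like the following (a hedged sketch using DBI/RSQLite; the database path and queue number are assumptions, not motusServer's actual code):

```r
library(DBI)

## Open the jobs database (the path here is an assumption for this example).
con <- dbConnect(RSQLite::SQLite(), "/sgm/jobs.sqlite")

## Find this server's unfinished jobs: done == 0 in its own queue.
unfinished <- dbGetQuery(con,
  "SELECT id, type, path FROM jobs WHERE queue = :q AND done = 0 ORDER BY id",
  params = list(q = "4"))   # e.g. for processServer number 4

dbDisconnect(con)
```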

When a job (and all of its subjobs) has completed, its folder is moved to `/sgm/done`. This is currently a flat folder, but it needs to be re-organized hierarchically to properly support huge numbers of jobs.

Any job (including a subjob) that ends in an error has its stack dump recorded in `/sgm/errors` as an `.rds` file, e.g. `/sgm/errors/00001270.rds`. This file can be examined within R by doing:

```
> library(motusServer)
> hackError(1270, topLevel=FALSE)
With:
 Error in pushToMotus(src): invalid motus device ID for receiver with DB at /mnt/usb/new_sgm_recv/SG-0613BB000593.motus

Traceback (also is in variable bt):
bt[[3]]: h(j)

bt[[2]]: pushToMotus(src)

bt[[1]]: stop("invalid motus device ID for receiver with DB at ", attr(src$con, "dbn

## the bt list holds environments with variables at each level of the stack dump
> ls(bt[[2]])
[1] "batches"    "con"        "deviceID"   "motusTX"    "newBatches"
[6] "sql"        "src"
> bt[[2]]$newBatches
# A tibble: 1 x 10
  batchID motusDeviceID monoBN         tsStart      tsEnd numHits
    <int>         <int>  <int>           <dbl>      <dbl>   <int>
1       1            NA      8 1370809071.1776 1372964178       0
# ... with 4 more variables: ts <dbl>, motusUserID <int>, motusProjectID <int>,
#   motusJobID <int>
```

Not all variables in the stack dump environments will be valid; e.g. database and file connections will not be.


