get.process.chunks: Get File Information to Allow Processing of Subsets
In ks905383/quantproj: Project Climate Data Based on Changes in Distribution Shape


Get information about the climate data files in a folder, and split them up by region-by-latitude chunk to allow for processing subsections of the (very large) files at a time.

get.process.chunks(defaults, save.output = FALSE, search.dir = character(), show.messages = TRUE)

`defaults`	the output from `set.defaults`. The elements used are `filevar` and, if `search.dir` is left empty (`character()`, the default), `mod.data.dir`. The function also supports subsetting by latitude or longitude if `defaults$lat.clip` / `defaults$lon.clip` exist (as `c(min,max)`, see below).
`save.output`	whether to save the file information as a file in the search directory (as "`process_inputs.RData`"), by default `FALSE`
`search.dir`	by default, `get.process.chunks` searches in the `mod.data.dir` in `defaults`. If a different directory should be examined, use this to set the path.
`show.messages`	whether to show some useful descriptions of the search procedure (by default `TRUE`)

A list giving, for each region-by-latitude chunk: the subset suffix (reg), if available; the filename (fn); the latitude and longitude coordinates of the pixels in the subset (lat, a single element, and lon); the global pixel id (taken from the variable global_loc if it exists in the NetCDF file; otherwise a new id is created by counting pixels up across files in alphabetical order); the local id (local_idxs), i.e. the linear index of each pixel within the file; the number of experiment runs in the file (taken from the variable run in the NetCDF file; 1 otherwise); and the within-file indices along each location dimension (dim_idxs; either just location or lon x lat).
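The relationship between a linear pixel index and per-dimension indices on a lon x lat grid can be illustrated with a small base R sketch. This is not the package's code: the grid values are made up, and it assumes R's usual column-major ordering (lon varying fastest), which may differ from how a given NetCDF file is laid out.

```r
# Hypothetical 3 x 2 lon x lat grid (values made up for illustration)
lon <- c(0, 2.5, 5)
lat <- c(40, 42.5)
grid_dims <- c(length(lon), length(lat))

# Linear index of every pixel, in R's column-major order (assumption)
linear_idx <- seq_len(prod(grid_dims))

# Convert each linear index to a (lon index, lat index) pair
dim_idx <- arrayInd(linear_idx, .dim = grid_dims)
colnames(dim_idx) <- c("lon", "lat")

# Look up the coordinates of one pixel from its linear index
pixel <- 5
pixel_lon <- lon[dim_idx[pixel, "lon"]]
pixel_lat <- lat[dim_idx[pixel, "lat"]]
```

Here the linear index plays the role described for local_idxs and the index pairs the role described for dim_idxs, under the stated ordering assumption.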

In general, most common forms of climate file structures are supported, especially the CMIP5 structure (for best results, filenames should still be in CMIP5 format with an optional "_[ ]" suffix for regional subsets, etc. - see the set.defaults documentation for more info). Variables can either be on a lon x lat grid or stored by linear location. Files can either contain all runs of a model or can be saved by run. Files can either contain the whole timeframe of a model run or be split up in consecutive temporal chunks. Furthermore:

filename: the code searches for NetCDF files using the search string "[defaults$filevar]_day_.*nc" (by default; this can be changed by setting defaults$search.str). Make sure no other NetCDF files with that pattern are present in the search directory (by default defaults$mod.data.dir).
variable setup: The code currently expects the primary variable to have either a location dimension (giving the linear index of a location) or a lon x lat grid. These dimensions are identified by name; the search terms used can be set in defaults$varnames. Out of the box, for example, the package supports "lat", "latitude", "Latitude", and "latitude_1" as possible names for the "lat" dimension.
locations: The code expects there to be two location variables, lat and lon (CMIP5 syntax), giving the lat/lon location of every pixel in the file. The names of those variables can be any of the alternatives given by defaults$varnames - e.g. out of the box, the code also checks for "latitude", "longitude", etc. See set.defaults for information on adding naming conventions.
multiple runs: If there are multiple runs in the file, there should be a run variable/dimension in the file giving the run id as an integer.
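To see which files the default search string would pick up, and how a list of alternative names can resolve to a variable name, consider the following base R sketch. It is only an illustration: the filenames, the "tas" value for filevar, the "_CONUS" suffix, and the use of grepl and intersect are assumptions, not the package's actual implementation.

```r
# Hypothetical value standing in for defaults$filevar
filevar <- "tas"

# Default search string as described above: "[filevar]_day_.*nc"
search_str <- paste0(filevar, "_day_.*nc")

# Made-up filenames in a search directory
files <- c(
  "tas_day_CCSM4_rcp85_r1i1p1_20060101-21001231.nc",
  "tas_day_CCSM4_rcp85_r1i1p1_20060101-21001231_CONUS.nc",
  "pr_day_CCSM4_rcp85_r1i1p1_20060101-21001231.nc",
  "tas_Amon_CCSM4_rcp85_r1i1p1_200601-210012.nc"
)

# Files matching the search string (assumes regular-expression matching)
matched <- grepl(search_str, files)
matched_files <- files[matched]

# Resolve the latitude variable from the alternative names listed above
lat_names <- c("lat", "latitude", "Latitude", "latitude_1")
file_vars <- c("time", "latitude", "longitude", "tas")  # made-up file contents
lat_var <- intersect(lat_names, file_vars)[1]
```

Only the two daily "tas" files match. The monthly file and the "pr" file are skipped, which shows why other NetCDF files matching the pattern should be kept out of the search directory.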

If defaults$lat.clip and/or defaults$lon.clip exist, only the information and file locations of pixels within those lat/lon bounds are returned. $lat.clip and $lon.clip should be vectors of the form c(min,max), e.g. defaults$lat.clip=c(23,52), defaults$lon.clip=c(-125,-65) for a box around the continental USA. Longitude bounds can be entered in either a [-180 180] or a [0 360] format, regardless of the format used by the data being loaded; the two are matched in format before subsetting. The global.loc.idx (used for filenames) is unaffected by the subsetting, meaning that the indices are still counted with respect to the full lat/lon universe. This allows an initial subset to be expanded later without causing issues with the output file structure.
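The format matching and subsetting described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's code: the pixel coordinates are made up, and to_180 is a hypothetical helper name.

```r
# Made-up pixel coordinates, with longitudes stored in [0, 360]
lat <- c(20, 30, 45, 60)
lon <- c(250, 280, 10, 300)

# Bounds as in the example above, given in [-180, 180]
lat_clip <- c(23, 52)
lon_clip <- c(-125, -65)

# Hypothetical helper: convert longitudes from [0, 360] to [-180, 180]
to_180 <- function(x) ((x + 180) %% 360) - 180

# Bring the data longitudes into the same format as the bounds
lon_180 <- to_180(lon)

# Keep only pixels inside both the latitude and longitude bounds
keep <- lat >= lat_clip[1] & lat <= lat_clip[2] &
  lon_180 >= lon_clip[1] & lon_180 <= lon_clip[2]

# Indices of the retained pixels, still counted against the full set
kept_idx <- which(keep)
```

Because kept_idx refers to positions in the full coordinate vectors, widening the bounds later would add pixels without renumbering those already selected, which mirrors the behaviour described for the global index.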