Guide to Recent SoilProfileCollection Changes

This document performs some basic demonstrations of new and revised functionality of the SoilProfileCollection (SPC).

Changes to the object

The SoilProfileCollection() constructor returns an empty object containing default values.

str(aqp::SoilProfileCollection())
Formal class 'SoilProfileCollection' [package "aqp"] with 11 slots
  ..@ idcol       : chr "id"
  ..@ hzidcol     : chr "hzID"
  ..@ hzdesgncol  : chr(0) 
  ..@ hztexclcol  : chr(0) 
  ..@ depthcols   : chr [1:2] "top" "bottom"
  ..@ metadata    :List of 6
  .. ..$ aqp_df_class    : chr "data.frame"
  .. ..$ depth_units     : chr "cm"
  .. ..$ original.order  : int(0) 
  .. ..$ target.order    : int(0) 
  .. ..$ aqp_group_by    : chr ""
  ..@ horizons    :'data.frame':    0 obs. of  4 variables:
  .. ..$ id    : chr(0) 
  .. ..$ hzID  : chr(0) 
  .. ..$ top   : num(0) 
  .. ..$ bottom: num(0) 
  ..@ site        :'data.frame':    0 obs. of  1 variable:
  .. ..$ id: chr(0) 
  ..@ sp          :Formal class 'SpatialPoints' [package "sp"] with 3 slots
  .. .. ..@ coords     : num [1, 1] 0
  .. .. ..@ bbox       : logi [1, 1] NA
  .. .. ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
  .. .. .. .. ..@ projargs: chr NA
  ..@ diagnostic  :'data.frame':    0 obs. of  0 variables
  ..@ restrictions:'data.frame':    0 obs. of  0 variables

Which, in the new show method, appears in console as:

SoilProfileCollection with 0 profiles and 0 horizons
profile ID: id  |  horizon ID: hzID 
Depth range: NA - NA cm

----- Horizons (0 / 0 rows  |  4 / 4 columns) -----
[1] id     hzID   top    bottom
<0 rows> (or 0-length row.names)

----- Sites (0 / 0 rows  |  1 / 1 columns) -----
[1] id
<0 rows> (or 0-length row.names)

Spatial Data: [EMPTY]

Extending the object (data.table and tibble)

By inheritance, the S3 classes data.table and tibble (AKA tbl_df) can be used seamlessly [experimental] with most internal SPC methods. The object you initialize the SPC with defines the object type for its slots.

The methods defined for subclasses allow for additional efficiencies, enhanced compatibility with external workflows, informative output (e.g. pertaining to datatypes and units), aesthetics and more. Some specialty methods have not yet been converted over to using these functionalities (e.g. slice).

aqp_df_class

An accessor for the data.frame-like object type used in a SPC is: aqp_df_class(spc). The result is used for internal methods to ensure that slots are in the correct class.

The aqp_df_class record in @metadata is new to allow (and track) data.table and tibble subclasses of data.frame use within a SPC. It is a unit-length character value. The value stored refers to the class used for the @horizons, @site, @diagnostics and @restrictions slots.

The method is public so users can check the value if needed -- though there should rarely if ever be a "need" to use it. The default return value is "data.frame" but could be: "data.table", "tbl_df" OR some future data.frame subclass.

Internal methods for compatibility between data.frame subclasses

.as.data.frame.aqp

This is an internal wrapper method to ensure the correct object type is used. It applies only necessary conversions. The method is checking the class (class(df)[1]) of the input df (a data.frame-like object).

Normal operations that subset, join or modify data within the SPC regularly "check" object classes via .as.data.frame.aqp. If df has changed from the "target" class (usually aqp_df_class(spc)) it is coerced.

Conversions to objects other than data.frame require the namespaces for data.table or tibble to be installed. Relevant messages and warnings are generated with a fallback object type of data.frame.

For old objects without the metadata(spc)$aqp_df_class slot set (e.g from Rda) a large amount of warnings may be generated. Run spc <- rebuildSPC(spc) to fix this.

Uniqueness of data.table objects

data.table co-opts the [,j] index for its own purposes -- which, if the user desires this, can still be accessed from a particular slot e.g. horizons(yourDTspc)[,dtjindex]. However, internally, we use unambiguous references to ensure all code is compatible with all subclasses.

.data.frame.j

An internal method .data.frame.j is defined to ensure that data.frame -like j-index subsetting using multiple columns can be applied uniformly to all classes (whether objects are data.frame or tibble v.s. data.table).

This is not intended to be a method for users to have access to (though I find it quite useful as a data.table noob).

Integrity of the SoilProfileCollection

"Integrity" refers to the relative state of the data.frame-like slot components of a SoilProfileCollection and methods we use to maintain them.

Specifically, integrity is relevant to @site and @horizons slots, but also, @diagnostic, @restrictions, @sp and @metadata.

Three major issues pertain to SoilProfileCollection "integrity":

  1. Initialization from data.frame, data.table or tibble

  2. JOINs

  3. Horizon and site ordering with respect to profile_id() output

Initialization from data.frame, data.table or tibble

Much of the internal bookkeeping of the SPC relies on the sort order that is derived from the input horizon data at object creation. It is assumed, but not strictly enforced everywhere, that the site IDs follow the exact same order and do not get rearranged.

JOINs

Following LEFT JOIN (site<-, horizons<-) to add data the object's site/horizon order must be preserved.

See: df_test.R for examples using data.frame , tibble and_data.table_.

SPC Integrity, spc_in_sync and horizonless sites

Two new functions (spc_in_sync and .coalesce.idx) have been added to aqp that address the identification of relative site/horizon state.

This is different from what was done in the past for SPC validity -- which was enforcing profile-level logic constraints. Profile-level checks ultimately are a little more involved than what we need for the object-wide method -- especially considering the relaxations that were made to create SPCs from "bad" data -- and future changes for horizonless sites.

Several additional diagnostics need to be added to keep track of profiles without horizon data.

As long as top depths are in order within site IDs, a SPC comprised of one or more topologically "invalid" profiles can still have "valid" object integrity. That is, the results for the topologically valid profiles are "valid" -- but any that are invalid -- for any of the reasons identified in checkHzDepthLogic may either a) not work with certain functions or b) return bad / NA results.

There is no explicit need to enforce a specific order on the site ID, as long as the order relative to e.g. a character sorted order is known. I walked back my original changes that allowed for data order to be preseved, but I think this is a good goal and achievable.

Now that spc_in_sync has been implemented as the internal validity method we can continuously test the constraints we have defined at the OBJECT level. Modifications incrementally will allow us to achieve horizonless sites (#137) and arbitrary site sorting.

reorderHorizons

I think the current behavior of [,SoilProfileCollection-method NOT re-arranging data is worth keeping for the sake of efficiency and backwards compatibility, but I have implemented a reorderHorizons method. reorderHorizons is primarily intended as a proof of concept/for testing "fixes" to corrupted data.

When coupled with a replacement of a (re-sorted) site slot reorderHorizons could be used to define a SPC re-ordering method.

We would need to require that depths respect initial input order of IDs [only sort top depths within IDs]. Several functions (such as sim profile_compare) that rely on specific sort orders would be broken (for SPCs made by this method).

library(aqp)

df <- data.frame(id = 1, top=0, bottom=1)
depths(df) <- id ~ top + bottom

# Case 1; this gets first profile
df[1,]

# Case 2: this gets first horizon
df[1,]

# Case 3: this returns an _empty_ SPC (after fix for #89)
df[,0]

# Case 4: incidentally, so does this (since there are currently no indexing errors generated)
df[1234,]

Using zero (or zero-length) indexes

Empty SPCs (including df[,0] df[0,] df[numeric(0),] df[,numeric(0)]) look like this:

> SoilProfileCollection with 0 profiles and 0 horizons
> profile ID: id  |  horizon ID: hzID 
> Depth range: NA - NA cm
> 
> ----- Horizons (0 / 0 rows  |  4 / 4 columns) -----
>   [1] id     top    bottom hzID  
> <0 rows> (or 0-length row.names)
> 
> ----- Sites (0 / 0 rows  |  1 / 1 columns) -----
>   [1] id
> <0 rows> (or 0-length row.names)

In principle one can use a 0 j-index subset with a SPC as they could with a data.frame . Since SPCs must juggle several different objects with the same site index, there is opportunity to reinterpret how this should work.

The following behavior relates to concept of site data "built" from horizon by normalization in depths<- or site<- (formula method).

The issue here is that by "allowing" sites to persist without horizons we might retain the sites after some subsetting operations -- which could affect calculation of object length.

If made default, it may affect some historic logic in undesirable ways -- and will impact internal functioning of the SPC in "unexpected" (though probably not "wrong") ways too.

Currently:

  1. df[,0] drops all horizon data from a multi-profile SPC df.

  2. If a j-index subset results in removal of all horizons in a profile (e.g. a profile doesn't have 5 horizons, but an j-index of 5 was used) then the corresponding site information is removed as well (#89)

  3. If sites are removed by subsetting, the length of the SPC object decreases accordingly.

  4. Since no i index is specified, and no profiles can have a 0th horizon, df[,0] will remove site, diagnostics, restriction, spatial, etc. from every profile creating an EMPTY SPC.

  5. Similarly, any index that results in all profiles having all sites or horizons out of bounds will return an empty result. e.g. df[,978] or df[123,]. [I think these should throw "index out of bounds" errors [only if no profiles have a corresponding index]

dropsite = FALSE

drop = FALSE-type (e.g. dropsite) argument could be used to trigger horizonless site "preserving" behavior.

Case 3 shown above df[1, 0, dropsite = FALSE] would create (and SPC would consider "valid") something like the following result:

> SoilProfileCollection with 1 profiles and 0 horizons
> profile ID: id  |  horizon ID: hzID 
> Depth range: ? - ? cm
> 
>  ----- Horizons (0 / 0 rows  |  4 / 4 columns) -----
>    [1] id     top    bottom hzID  
>  <0 rows> (or 0-length row.names)
> 
> ----- Sites (1 / 1 rows  |  1 / 1 columns) -----
>  id
>   1

Generalizing the integrity method for horizonless sites

Since horizonless sites have no horizons, they will not show up in anything based on horizon data -- which will affect integrity calculations for the object [as currently written].

To overcome this we need:

  1. Count site IDs that are without horizons in spc_in_sync check number 1

  2. $[n_{sites.in.site}] = [n_{sites.coalesced.from.horizons}] + [n_{horizonless.sites}]$

  3. Ignore horizonless site IDs in check number 2 against the coalesced horizon-level site IDs.

  4. Add a check to make sure that all sites that are "supposed" to be horizonless are horizonless.

  5. Enhancement: define the argument dropsite for [,SoilProfileCollection-method

  6. Enhancement: define method appendSite<- to add horizonless sites to an existing SPC safely

  7. Enhancement: add basic checks to replaceHorizons<- to ensure IDs, sorting and metadata are OK [relative to site]

Backwards compatibility

In order to preserve historic functionality depths<- (and for general simplicity of interfaces to the object) we should still allow promotion from horizon-level data.frame containing NA in depths.

Within .initSPCfromMF the profiles where every horizon has top and bottom depth equal to NA will have their horizon records removed -- setting them up in the object as truly "horizonless".

The key issue with this is it would not offer an opportunity to use e.g. site(spc) <- ~ var1 + var2 + var3 to normalize site variables from horizon [if the horizon records were already removed by depths<-].

appendSites<-

We could add another method for appending to site data for horizonless sites -- say from a separate data.frame containing just site level data (that may or may not overlap with existing sites?). I don't think we currently have functionality to support this aside from rbind [replacing the site slot] as the site<- LEFT JOIN data.frame method will only handle existing sites.

An append method would theoretically allow you to build an SPC "in reverse" directly from an empty SPC object... and starting with site data... which I am not sure is a great idea. This is partially due to some extra error checking I added to the base constructor to allow an empty object to "work".

library(aqp)

your.site.data <- read.csv("foo.csv")
your.hz.data <- read.csv("foo2.csv")

spc <- SoilProfileCollection()

# hypothetical set up site slot first? you would need to use default ID names and such
appendSites(spc) <- your.site.data # 

# presuming horizonless sites were allowed, we would be OK up to this point
# but the user could potentially build something corrupt with replaceHorizons and/or direct slot edits
# you would still mostly be limited to default ID names (I think) unless we change replaceHorizons
#   - could take ID + horizon ID as arguments -- which would have to match any some site data and exist in horizon data
replaceHorizons(spc) <- your.hz.data

replaceHorizons<- etc.

Basic checks in replaceHorizons and other unconventional entry points to make sure everything is in place would be worth while. I think we can make sure SPC init methods get triggered internally. i.e if you call replaceHorizons, should it look in site and call initSPCfromMF, rebuild site and then put any extra data (e.g. non-ID columns, rows from horizonless sites) back in?



ncss-tech/aqp documentation built on April 19, 2024, 5:38 p.m.