In paulhegedus/OFPE: On-Farm Precision Experiments (OFPE) Data Management and Analysis Tools

Introduction

This tutorial covers the data aggregation process. This step requires an OFPE formatted database as created on this page or as done in Vignette 1 of this package. This vignette also requires the user to have completed Vignette 2 or have data imported into an OFPE formatted database. This includes both data collected on-farms and data collected from remote sensing sources. Please refer to the OFPE Technical Website for more information. This tutorial covers the aggregation of the data imported to the database from various sources to the response variables of interest, yield and protein. This creates an 'aggregated' dataset stored in the 'farmername_a' schema in the database that holds data containing information from all of the disparate sources at a common location.

Data aggregation is required for the analysis and prescription generation steps of the OFPE worfklow. Data is gathered from many different sources, but to be useful for statistical analysis and simulation, the data must be brought together into one dataset per field/year of interest in order to use for the OFPE data cycle.

The process for aggregating data in the the database is outlined and a more detailed description in the activity diagram on this page, where a component diagram can also be found. This vignette outlines step 3 of the OFPE data cycle, executed using the R package. This is represented by step 3 in the blue ring of the figure below. There is an associated page on the OFPE web application (green ring) that utilizes these R classes and functions to achieve the same effect in a user friendly environment.

knitr::include_graphics('ofpe_data_workflow.png')

Resources

The resources below are strongly recommended as supplemental information regarding the use and intent of this vignette and associated functions.

OFPE Project Website: Project information and products from the MSU OFPE project.
OFPE Technical Website: Website with more detailed descriptions of the OFPE data cycle and workflow, as well as tutorials for external data processes that cannot be performed in R.
OFPE Overview: Page of the OFPE Technical Website describing an overview of the OFPE data cycle and workflow.
OFPE Data Aggregation: Detailed description of the OFPE on-farm data import process.

Again, it is assumed that the user has completed Vignette 1 and 2 and understands the database creation, management, and import methods described at this site.

Workflow

Raw data from various sources need to be wrangled and aggregated together before statistical analysis and model development. Because yield and protein data are related to crop production and quality, the source of winter wheat value, covariate and explanatory data are used to enrich yield and protein datasets Due to equipment differences, yield and protein are gathered at different temporal resolutions (3 and 10 seconds, respectively) and subsequent spatial resolution, so are treated as separate datasets and not combined based on interpolation or estimation of protein data. The OFPE project aims to use the finest resolution of data possible to make decisions with data that has as little natural variation removed as possible.

The activity workflow for enriching yield and protein datasets and a more detailed description is described on this page, where a component diagram can also be found.

1.0 Set-Up

The user must have access to an OFPE formatted database for using the vignette below. If using your own data and database, modify the pertinent information to your paths and file/folder/database names. This vignette does not use the OFPEDATA package directly, however references data from the package that was imported to the example database in Vignettes 1 and 2.

1.1 Load Packages

#devtools::install_github("paulhegedus/OFPE")
library(OFPE)

1.2 Connect to Database

First, a connection is formed to the database created in this tutorial. This vignette, and the rest of the vignettes are working with an example database named 'OFPE' (as named in the tutorial above and exemplified in Vignettes 1 and 2). This class defaults to a PostgreSQL database.

dbCon <- DBCon$new(
  user = "postgres",
  password = "<your_password>",
  dbname = "<your_db_name>", 
  host = "localhost",
  port = "5432"
)
OFPE::removeTempTables(dbCon$db) # removes temporary tables

2.0 Aggregate Data

The steps for aggregating covariate information to yield or protein data consists of selecting the options and inputs related to the aggregation process and then executing the aggregation workflow. Both of these processes use R6 classes. The 'AggInput' class holds all of the options related to the aggregation workflow and is required for instantiating the 'AggDat' class. Note that you can only aggregate data that you have collected and imported into the database. This will not go collect your data, that is what Vignettes 1 and 2 cover.

2.1 Initialize AggInput Class

This class holds information related to the data aggregation workflow seen in the activity diagrams on this page.

The aggregation process includes multiple options. First, the user must choose the field to aggregate data in. If using the interactive feature, the user can supply a shapefile bounding box to use for querying the database for information available to aggregate, otherwise the field selection is used for querying using database relations. The user then has the choice of what response variable to aggregate data to. Yield and protein are separated because they are collected at different resolutions. This also includes the 'Satellite' option, in which farmers can aggregate remotely sensed data for any year they have data for. The aggregation of 'Satellite' data is important because it will enable the user to simulate these years in the analysis and simulation step of the OFPE data workflow. The user also selects the 'current year', or the year of interest to collect data. This year represents the year that the crop was harvested and observed. The 'previous year' is the last year a crop was harvested in the field.

The user then selects the appropriate experimental variable (as-applied nitrogen or seeding rates). The option is provided to aggregate data to the raw locations ('Observed') of the yield or protein data, or to aggregate data to the centroids of 10m grid cells ('Grids'). The 'Satellite' option uses the 'Grid' option by default since there are no observed locations. The user also has the option on the range of data to use for aggregating. In winter-wheat systems, farmers face a decision point at March 30th when they must decide how much N fertilizer to apply. This also is relevant for spring wheat farmers deciding on seeding rates. When making decisions farmers will not have information past this point, so the OFPE ethos is that no data is used for characterizing crop responses that won't be available for the farmer at this decision point. The user has the option, however, to collect data from the 'current year' up to the 'Decision Point' at March 30th, or from the 'Full Year' through December 31st.

When selecting options interactively, the 'AggInputs' class will guide the user through selecting the specific data in the database to use for all years specified. The user is also guided through the selection of the columns of interest within each file. If not using the interactive functions, these must be supplied. For each year, response, and explanatory variable, the corresponding filenames in the database must be selected for accuracy and to ensure the user is using the data that they intend. For each file selected, the user needs to specify the column that contains the information of interest, for example in yield data, the user must select the column that corresponds to the measurement for yield. If the user is aggregating as-applied nitrogen fertilizer, the user also needs to specify if a/the conversion rate from the applied product to lbs-N/acre is needed.

aggInputs <- AggInputs$new(dbCon)
aggInputs$selectInputs()

2.2 Aggregate Data

After the user selects the inputs, they can pass the 'AggInputs' class to the 'AggDat' class to execute the aggregation action. This process takes the AggInputs class and executes the aggregation process depending on these inputs. Most operations occur within the PostgreSQL database for speed and memory purposes. The response variable data is separated from the raw file to make the base of the aggregation file based on whether using the observed or gridded locations. The 'previous year' response variable data are then extracted from the raw files and associated with the 'current year' data by intersection of grid cells or nearest neighbor to observed data. This is repeated for the experimental variable, with any product conversion applied. Blank application areas indicate the application of 0 fertilizer rates, and filled in appropriately before a 30m buffer is taken from the field edge as part of the cleaning process. Across the whole field, data falling outside of 4SD from the mean is removed, as well as any negative yield and protein values (as they are impossible). If the grid option is selected, this process is repeated with 1SD within each gridcell. After all on-farm data has been aggregated, the remotely sensed raster data is extracted to the locations of the aggregated data. Finally, if SSURGO data is present it is added, otherwise skipped.

If the 'Satellite' response variable is chosen, then the process begins with the creation of 10m centroid locations and the extraction of remotely sensed and SSURGO data if present. Aggregated data are saved in the 'farmername_a' schemas in tables appropriately labeled based on the response variable.

aggDat <- AggDat$new(aggInputs)
aggDat$aggregateData()

Now the process is complete and the specified data has been aggregated together and saved if the user indicated to do so. Proceed by repeating this step and aggregating data for previous year's crop responses in the same field, other crop responses in the same year and field, the two former in a different field, and for satellite data in years that you will want available to simulate for.

Conclusion

This vignette provided a demonstration of how to aggregate data that you have imported to your database. Data aggregation is required for the analysis and prescription generation steps of the OFPE worfklow. Data is gathered from many different sources, but to be useful for statistical analysis and simulation, the data must be brought together into one dataset per field/year of interest in order to use for the OFPE data cycle. Because the user will be aggregating many fields and years for potentially multiple response variables, as well as satellite data, it is possible to initialize AggInputs classes with user defined options upon intialization rather than with the selectInputs() method. The user can then use do.call in their initialization lists to create a list of AggInputs classes that are instantiated. Then the user can lapply over this list with a function that initializes and executes the aggregateData() method of the AggDat class.

A diagram demonstrating the implementation and existence of modules related to the data aggregation step can be seen here, diagramming the relationship of the classes used in this step.

Whenever a database connection is open it needs to be closed.