The primary focus on this package is its data sets. The philosophy of the package is largely twofold:
1. Minimize changes to data that result in loss of information
2. R functions should not be necessary for one to benefit from the package, which emphasizes the data sets
With respect to the first aspect of the package's philosophy, the package has 4 types of data sets:
1. Raw trawl data files as recieved by data providers (.zip of .csv, .dat, etc)
2. Raw trawl data files that have been converted to R objects (raw.___
)
3. Clean trawl data files have standardized naming conventions, units, and minor corrections (clean.___
)
4. Gridded data sets as R objects (surface and bottom temperature, depth)
The raw data (1) can be found using system.file(package='trawlData')
, which will point you to the 'inst' directory. Modifications to these files by the package maintainers are rare and minor, and are made only to enable R to read them. The raw data files as R objects (2) often constitute merging multiple raw data files, and so in some cases minor modifications to content type had to be made; but modifications are minor. The clean trawl data as R objects (3) have been greatly modified from the original version, but modifications generally take the form of corrections (taxonomy, confirmed typos), formatting (capitalization, date/time formats), or units (area, mass, latitude/ longitude coordinates). Cleaning emphasizes maintaining as much of the original information as possible; no aggregation is performed, no columns are deleted, and no rows are dropped. However, columns are added, included one indicating which rows likely represent low-quality observations that should be deleted, as well as a column indicating why a row has been recommended for deletion. Finally, gridded data sets (4) contain environmental data that are in formats consistent with the raster
package.
With respect to the second aspect of the package's philosophy, I encourage users to do the following:
data(package="trawlData") # list data files
For each of the data files listed there, please explore using using ?
. Documentation of these data sets is an ongoing effort, and is still in its infancy as of v0.2.0; however, for objects that are data.frame
s or data.table
s the help file should at least provide an accurate description of the dimensions of the data set and the column names/ classes.
The data sets are the most valuable part of the package. However, we have added value by trying to standardize the content among these data sets, fixing errors, and formatting the content so that it will be readily usable for most analyses.
The bulk of the code that generates these data sets is included in the package as functions. Some of the code that automates these procedures are not in the package, but are still available on the package's GitHub repo (mostly under the pkgBuild
folder). Let's take, for example, the Newfoundland data set. Like all regions, this survey region has 3 data sets associated with it: raw data files (zipped up under inst/newf/newf/zip
), the raw R object (raw.newf
), and the clean R object (clean.newf
). To illustrate how the functions in the package emphasize the data, I will show how to use the package functions to recreate the Newfoundland .RData files. A similar process can be applied to any region.
library(data.table) # fast library(LaF) # for reading newf, which is .fwf library(trawlData) newf <- read.trawl("newf") # just supply any region abbreivation all.equal(newf, raw.newf) # the result of read.trawl() is the raw.__ object str(newf) # alternatively, check out ?raw.newf tables() # at the time of writing this vignette, newf is 151 MB.
We use the data.table
package to handle as many aspects of data storage and manipulation as possible. Why? It is remarkably fast and versatile. For all regions except newf, data read-in is largely handled by data.table::fread()
; however, the Newfoundland data sets are fix-width-format (fwf), and we use the LaF
package to read in these data formats because they are not supported by fread
.
Note that we just recreated the raw.newf
object very succinctly using the read.trawl()
function. Any region name can be passed to that function to read in the zip file ("neus"
is the only region that does not have raw data files zipped)
Cleaning the data sets requires more steps than reading. Because of the large impact that cleaning data can have on any analysis, we take care to organize and document the procedures we use in trawlData
; by documenting and modularizing these procedures, we offer users both guidance and freedom.
The cleaning procedures that produce the clean.___ objects are grouped into 5 functions, arranged by the type of cleaning task that they perform. These functions are intended to be executed in a specific order (for reason which may become obvious), but there are currently no checks as to whether a particular object has already been modified by the 'prerequisite' cleaning functions. The 5 functions are:
clean.names
standardizes column names clean.format
standardizes classes, units, and fixes typos; minimal date formatting clean.names
clean.columns
adds columns that can be calculated from current columns, formats dates clean.tax
fixes taxonomy, adds ecological information rfishbase
and taxize
packages, but huge manual effort too clean.trimRow
adds columns indicating which rows should be dropped keep.row
column (TRUE means keep, FALSE drop) row_flag
column, which helps to diagnose why a row is recommended for deletion: As you can see, these functions do a lot. Keeping track of what they do is essential to making sure that the corrections to the data are appropriate for your project and analysis. Organizing these common alterations to the trawl data, and giving them names, will allow you to more readily communicate to colleagues and others who interact with your research what aspects of your data you altered. Furthermore, this organization makes it easy to find an tailor alterations. Most clean.__
functions have two mains parts: 1) a function that manipulates all regions based on the same rules, and that then calls 2) a region-specific function to address the needs of the data from that region alone. For example, you may examine the clean.trimRow
function and its subfuction, trawlData:::clean.trimRow
:
print(trawlData::clean.trimRow) # only the Spp2 flag is added here # But we can see the call to switch(), # which specifies region-specific functions; # Let's examine one: print(head(trawlData:::clean.trimRow.newf, 5)) # use ::: b/c this is not an exported function # ... print(tail(trawlData:::clean.trimRow.newf, 5))
Thus, we can treat the functions as a notebook for the procedures. Even if we never use them, knowing they exist and reading them informs us about how why we recommend certain changes, which better enables users to an informed decision about when to accept and when to deviate from a prescribed procedure.
Below we use these functions to recreate the contents to raw.newf
.
clean.names(newf, reg="newf") # changes made in-place clean.format(newf, reg="newf") clean.columns(newf, reg="newf") newf <- clean.tax(newf, reg="newf") # only function requiring reassignment clean.trimRow(newf, reg="newf") all.equal(newf, clean.newf)
Up to this point, very little information has been lost relative to original data files, and in fact much has been added. However, all of this information is obviously cumbersome to work with. When you first load trawlData
, you will have the clean.___ data available to you, which is where we left off with newf
. Let's trim it down to something more manageable. First, we'll remove bad rows.
(newf <- newf[(keep.row)])
Then we can drop columns. Although you can use standard data.table
syntax (or data.frame
, if you convert it first) to drop columns, clean.trimCol
can be a handy function to work with. There are 3 ways to use the function. First, you can simply specify the columns you want to keep via the cols
argument. But cols
already has a list of recommended columns, so it may be more convenient to simply c.add=
columns to that list, and/or c.drop=
columns. Note that columns non-existent columns are specified, no action is taken. Below is an example where we make use of c.add=
and c.drop=
.
clean.trimCol( X=newf, c.add=c("trophicLevel"), c.drop=c("ref","species", "genus", "weight","cnt", "cntcpue", "season", "effort", "cntcpue","stemp","keep.row" ) )
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.