presentation/presentation.md

Data Management Tips

author: Charlie Labuzzetta date: September 2, 2020 autosize: true incremental: false depends: dplyr, dataManagement, tidyr width: 1440 height: 900

What is Data Management?

The 3 R's of Data Management:

Redundancy: Backup

incremental: true

For important data / files, you need to have a backup of your data (at least one, but preferrably more)!

Example:

Reliability: Personal File System

Separate file system into major professional events. Further breakdown by courses, research projects, extracurricular programs, personal files.

phd - courses - stat_500 - stat_510 - research - bmp - conformal - datafewsion - personal

BoxSync

Iowa State University provides free unlimited cloud storage via Box.com

A helpful tool to enable automatic backup of your data to Box.com, is to use BoxSync. - Store folders / files you want to backup in the BoxSync folder - As you add / edit files in these folders, changes are automatically backed up to Box.com

Follow the link below for more information on installing BoxSync on Windows or Mac computers:

https://support.box.com/hc/en-us/articles/360043697194-Installing-Box-Sync

BoxSync Example

Research Data

Within your backed up personal file system, you will likely have data related to your research.

After you have a organized file system, you can worry about your research projects:

GitHub

GitHub is a web platform for tracking changes to your computer codes and scripts, that you should be using in combination with RStudio, whenever you work with data!

If you don't have an account: https://www.github.com

Once you have an account, add the free education pack through ISU: https://education.github.com/pack - Free GitHub Pro while a student - Extra Storage - Create a free personal webpage (Example: https://labuzzetta.github.io) - Private repositories

RStudio: Give up on Excel and learn R!

R Data Projects

Using GitHub and RStudio, you can create a data project which is:

R Data Project Format

Project Subdirectories:

Example: STRIPS-2 Data

Dr. Lisa Schulte-Moore's STRIPS (Science-based Trials of Rowcrops Integrated with Prairie Strips) Project is one group that has transitioned to using the R Data Package Format.

And that's just 2015-2016!

As an example, we will look into the challenges of storing and analyzing (fake) data similar to those collected by the STRIPS project and examine how the R Data Package format would be useful.

Data storage and analysis

When we take a look back our raw data, there may be many inconsistencies that we'd like to change:

Naming Standards - Poor Example

In the column names, look for: - Inconsistency - Spaces / Weird Characters

|Observer |Location |SPECIES |Date |Time | distance(m)|How recognized? | making sound| |:--------|:-----------------|:-------|:----------|:-------------------|-----------:|:---------------|------------:| |JD |Riverside |AMRO |2016-05-11 |1899-01-01 19:01:00 | 45|Visual | 0| |JD |Riverside Park |EABL |05/11/2016 |1899-01-01 19:53:00 | 90|visual | 0| |mj |Backyard |DOWO |2016-05-16 |1899-01-01 16:27:00 | 24|Visual | 1| |Jd |Backyard |AMGO |05-17-2016 |1899-01-01 06:55:00 | 53|Visual | 0| |Mj |backyard |CLSW |2016-06-03 |1899-01-01 21:43:00 | 31|visual | 0| |Mj |backyard |GHOW |06-03-2016 |1899-01-01 22:24:00 | 20|auditory | 1| |jd |Park |EABL |06/06/2016 |1899-01-01 18:28:00 | 34|visual | 0| |MJ |in Riverside park |AMRO |2016-06-17 |1899-01-01 14:43:00 | 20|heard it | 1| |MJ |Riverside |GBHE |2016-06-17 |1899-01-01 15:07:00 | 60|visual | 0| |mj |RP |PUFI |06/17/2016 |15:45:00 | 40|visual | 0| |mj |RP |EABL |2016-06-17 |16:00:00 | 34|visual | 0| |Jd |yard |AMGO |2016-06-18 |07:42:00 | 23|visual | 0|

Naming Standards - Better?

In the column names, look for: - full words - lower case with underscores

| sighting_id| observer_id| location_id|species_id |date |visual_recognition |audio_recognition | |-----------:|-----------:|-----------:|:----------|:-------------------|:------------------|:-----------------| | 1| 492| 1|AMRO |2016-05-11 19:01:00 |yes |no | | 2| 492| 1|EABL |2016-05-11 19:53:00 |yes |no | | 3| 213| 2|DOWO |2016-05-16 16:27:00 |yes |yes | | 4| 492| 3|AMGO |2016-05-17 06:55:00 |yes |no | | 5| 213| 2|CLSW |2016-06-03 21:43:00 |yes |no | | 6| 213| 2|GHOW |2016-06-03 22:24:00 |no |yes | | 7| 492| 1|EABL |2016-06-06 18:28:00 |yes |no | | 8| 213| 1|AMRO |2016-06-17 14:43:00 |yes |yes | | 9| 213| 1|GBHE |2016-06-17 15:07:00 |yes |no | | 10| 213| 1|PUFI |2016-06-17 15:45:00 |yes |no | | 11| 213| 1|EABL |2016-06-17 16:00:00 |yes |no | | 12| 492| 3|AMGO |2016-06-18 07:42:00 |yes |no |

Discussion - Naming Standards

There are many ways to name files / columns with consistency: - lowercase_underscore - camelCaseNoSpaces - CapitalizeEveryWord - PLEASE_DONT_DO_THIS

Suggestion: lowercase_underscore

Important People in the "R" World suggest: http://r-pkgs.had.co.nz/style.html

Data Types - Dates

There are so many ways to store dates: - 2000-07-04 14:00:00 - 2000-07-04 2:00 PM - 2000/07/04 02:00:00 PM - 04/07/2000

|Observer |Location |SPECIES |Date |Time | distance-meters|How recognized? | making sound| |:--------|:-----------------|:-------|:----------|:-------------------|---------------:|:---------------|------------:| |JD |Riverside |AMRO |2016-05-11 |1899-01-01 19:01:00 | 45|Visual | 0| |JD |Riverside Park |EABL |05/11/2016 |1899-01-01 19:53:00 | 90|visual | 0| |mj |Backyard |DOWO |2016-05-16 |1899-01-01 16:27:00 | 24|Visual | 1| |Jd |Backyard |AMGO |05-17-2016 |1899-01-01 06:55:00 | 53|Visual | 0| |Mj |backyard |CLSW |2016-06-03 |1899-01-01 21:43:00 | 31|visual | 0| |Mj |backyard |GHOW |06-03-2016 |1899-01-01 22:24:00 | 20|auditory | 1| |jd |Park |EABL |06/06/2016 |1899-01-01 18:28:00 | 34|visual | 0| |MJ |in Riverside park |AMRO |2016-06-17 |1899-01-01 14:43:00 | 20|heard it | 1| |MJ |Riverside |GBHE |2016-06-17 |1899-01-01 15:07:00 | 60|visual | 0| |mj |RP |PUFI |06/17/2016 |15:45:00 | 40|visual | 0| |mj |RP |EABL |2016-06-17 |16:00:00 | 34|visual | 0| |Jd |yard |AMGO |2016-06-18 |07:42:00 | 23|visual | 0|

Data Types - Dates

Suggestion: When possible, YYYY-mm-dd or YYYY/mm/dd or YYYYmmdd, and hh:mm:ss or hhmmss - Stores in chronological order - Bypasses AM/PM Errors - Reduces Month/Day confusion - May be combined into a single column when analyzing raw data

Main objective: Choose one preferred format

Suggestion: If using R, look into the lubridate package

| sighting_id| observer_id| location_id|species_id |date |time |visual_recognition |audio_recognition | |-----------:|-----------:|-----------:|:----------|:-------------|:-----------|:------------------|:-----------------| | 1| 492| 1|AMRO |2016-05-11 |19:01:00 |yes |no | | 2| 492| 1|EABL |2016-05-11 |19:53:00 |yes |no | | 3| 213| 2|DOWO |2016-05-16 |16:27:00 |yes |yes | | 4| 492| 3|AMGO |2016-05-17 |06:55:00 |yes |no | | 5| 213| 2|CLSW |2016-06-03 |21:43:00 |yes |no | | 6| 213| 2|GHOW |2016-06-03 |22:24:00 |no |yes | | 7| 492| 1|EABL |2016-06-06 |18:28:00 |yes |no | | 8| 213| 1|AMRO |2016-06-17 |14:43:00 |yes |yes | | 9| 213| 1|GBHE |2016-06-17 |15:07:00 |yes |no | | 10| 213| 1|PUFI |2016-06-17 |15:45:00 |yes |no | | 11| 213| 1|EABL |2016-06-17 |16:00:00 |yes |no | | 12| 492| 3|AMGO |2016-06-18 |07:42:00 |yes |no |

Data Types - Units

How/where should measurement units be stored? - In raw data, note the unit as full words in the column name after a dash - But we will want to remove these units from the polished data - Put the units in documentation / metadata files - Consider changing booleans to "Yes"/"No" - Use appropriate data types (integers vs decimals) and check your abbreviations

What's wrong with the table below?

|Observer |Location |SPECIES |Date |Time | distance-meters|How recognized? | making sound| |:--------|:-----------------|:-------|:----------|:-------------------|---------------:|:---------------|------------:| |JD |Riverside |AMRO |2016-05-11 |1899-01-01 19:01:00 | 45|Visual | 0| |JD |Riverside Park |EABL |05/11/2016 |1899-01-01 19:53:00 | 90|visual | 0| |mj |Backyard |DOWO |2016-05-16 |1899-01-01 16:27:00 | 24|Visual | 1| |Jd |Backyard |AMGO |05-17-2016 |1899-01-01 06:55:00 | 53|Visual | 0| |Mj |backyard |CLSW |2016-06-03 |1899-01-01 21:43:00 | 31|visual | 0| |Mj |backyard |GHOW |06-03-2016 |1899-01-01 22:24:00 | 20|auditory | 1| |jd |Park |EABL |06/06/2016 |1899-01-01 18:28:00 | 34|visual | 0| |MJ |in Riverside park |AMRO |2016-06-17 |1899-01-01 14:43:00 | 20|heard it | 1| |MJ |Riverside |GBHE |2016-06-17 |1899-01-01 15:07:00 | 60|visual | 0| |mj |RP |PUFI |06/17/2016 |15:45:00 | 40|visual | 0| |mj |RP |EABL |2016-06-17 |16:00:00 | 34|visual | 0| |Jd |yard |AMGO |2016-06-18 |07:42:00 | 23|visual | 0|

R Data Package Review

Again, when making an R Data Package, look at: https://github.com/jarad/RDataPackageTemplate - Store raw data in /data-raw - Clean raw data with R scripts - Import clean data to R package - Easily manipulate and analyze clean data

Database Design - Making data easier to analyze

How to think about database design:

A Simple Example:

Single Table Format:

| type | color | store_1_price | store_2_price | |-------|-------|---------------|---------------| | apple | red | 0.29 | 0.27 |

Database Design - Making data easier to analyze

Instead we could use a relational database format:

Fruits

| fruit_id | type | color | |----|-------|-------| | 29 | apple | red |

Stores

| fruit_id | price | store_id | |----|-------|----------| | 29 | 0.29 | 1 | | 29 | 0.27 | 2 |

Database Design - A More Complex Example

Consider the following scenario: - An organization wants to have a bird watching competition - Competitors will register with: - Name, Birthday, Email - Competitors will record where, when and which birds they see - List of locations (Name, Address, GPS Coordinates) - List of observations (Who, When, Where, Which Species, Observation details) - List of all species observed (Common Name, Species, Genus)

How could this data be managed?

Example - Not the best management

|Observer |Location |SPECIES_1 |Date_1 |Time_1 |SPECIES_2 |Date_2 |Time_2 | |:--------|:-----------------|:---------|:----------|:-------------------|:---------|:----------|:-------------------| |JD |Riverside |AMRO |2016-05-11 |01/01/1899 19:01:00 |EABL |2016-05-11 |01/01/1899 19:53:00 | |Jd |Backyard |AMGO |2016-05-17 |01/01/1899 06:55:00 |AMGO |2016-06-18 |01/01/1899 07:42:00 | |mj |Backyard |DOWO |2016-05-16 |01/01/1899 16:27:00 |CLSW |2016-06-03 |01/01/1899 21:43:00 | |MJ |in Riverside park |AMRO |2016-06-17 |01/01/1899 14:43:00 |GBHE |2016-06-17 |01/01/1899 15:07:00 |

Avoid using repetitive groups of columns

See dataManagement::sightings_really_bad for the full table.

Example - Better Managements

| observer_id|first_name |last_name |birth_date |email | |-----------:|:----------|:---------|:----------|:-------------------| | 492|John |Doe |1987-09-03 |john.doe@gmail.com | | 213|Mary |Jane |1959-02-27 |mary.jane@gmail.com |

| location_id|location_name |street_address |city |state |country | |-----------:|:--------------|:----------------|:-----------|:-----|:-------| | 1|Riverside Park |100 State St |La Crosse |WI |USA | | 3|Backyard |418 Red Apple Dr |La Crescent |MN |USA |

| sighting_id| observer_id| location_id|species_id |date | |-----------:|-----------:|-----------:|:----------|:-------------------| | 11| 213| 1|EABL |2016-06-17 16:00:00 | | 12| 492| 3|AMGO |2016-06-18 07:42:00 |

|species_id |genus |species |common_name | |:----------|:------|:-------|:------------------| |EABL |Sialia |sialis |Eastern Bluebird | |AMGO |Spinus |tristis |American Goldfinch |

Database Design

Relational database design helps to: - Reduce the amount of repeat data - Helps standardize data tables - Can reduce errors and NA entries - Makes it easier to understand the metadata / documentation files

Look further at the following tables and documentation: - dataManagement::observer - dataManagement::location - dataManagement::sighting - dataManagement::species

To view documentation for a dataset: - Use ? before the dataset name - Example: ?dataManagement::sighting

Example R Data Project

https://github.com/labuzzetta/apsimo

Review

The 3 R's of Data Management:

Action Items

Additional Topics

Accessing this presentation

You can find this presentation and the associated data on my GitHub page:

https://github.com/labuzzetta/dataManagement

Instructions for installing the package from GitHub are listed on the webpage.

A link to an online version of the presentation alone is:

https://rpubs.com/clabuzze/datamanagement



clabuzze/dataManagement documentation built on Sept. 8, 2020, 10:37 a.m.