README.md

(PART) Creating a Covid19R Package

Building a New Data Package

So, you want to create an R data package of Covid-19 data for the project? Great! You’re welcome to create your own from scratch, or you can use the package template we’ve created, which is particularly useful for first-time package creators. Either way, let’s go through the steps you’ll need to follow, including some best practices, from getting started, to letting us know it works, all the way to CRAN submission!

Before you start:

How should I divide the data?

You might be interested in making a package that brings in multiple different data sets for a given area. Or the data source you’re accessing might contain multiple types of data, or data at different levels of spatial organization. Should you deploy these as one big long data set or as multiple data sets?

There can be different reasons for taking either path. In general, we advise you to think about how an end-user will use a single data set. Assume that they initially have minimal information about your dataset (hopefully they won’t, but nuanced dataset details can be difficult to grasp at first), but want to create a clean, clear, accurate analysis or visualization. For example, the NY Times reports both state- and county-level data and multiple data types. In covid19nytimes, we deploy one state-level data set and one county-level data set. This minimizes confusion and possible mistakes: county-level data sums to the state-level data, so if both were in one data set, a user could over-aggregate and get 2X the number of cases. Within each data set, however, multiple data types are reported, as they can easily be filtered or even shown together.
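To make the double-counting risk concrete, here is a small sketch with made-up numbers. The data frame, its column names, and the counts are all hypothetical and are not real NY Times figures.

```r
# Hypothetical combined data set that mixes state- and county-level rows.
combined <- data.frame(
  location      = c("Massachusetts", "Suffolk", "Middlesex"),
  location_type = c("state", "county", "county"),
  value         = c(300, 120, 180),
  stringsAsFactors = FALSE
)

# A naive total over every row counts each case twice.
naive_total <- sum(combined$value)

# Restricting to one spatial level gives the intended total.
state_total  <- sum(combined$value[combined$location_type == "state"])
county_total <- sum(combined$value[combined$location_type == "county"])

naive_total   # 600
state_total   # 300
county_total  # 300
```

Shipping the two levels as separate data sets removes the need for the end-user to remember the filter.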

Other data sets are more complex. In the JHU data, for example, information for some countries is reported at the province level, and for others at the country level. Because it’s one global dataset, the whole set is returned together. The location_type column clearly shows what is aggregated and what is not, and with a simple tidyr::separate(), country- and province-level data can be split for easier aggregation and display.
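As a rough illustration of that split, the sketch below uses a tiny made-up data frame. The "Province, Country" string format, the column names, and the location_type values are assumptions for the example and may not match the real JHU data.

```r
library(tidyr)

# Hypothetical rows; the "Province, Country" format is an assumption.
jhu_like <- data.frame(
  location      = c("Hubei, China", "France"),
  location_type = c("state", "country"),
  value         = c(10, 20),
  stringsAsFactors = FALSE
)

# Province-level rows: split the location into two columns.
provinces <- jhu_like[jhu_like$location_type == "state", ]
provinces <- separate(provinces, location,
                      into = c("province", "country"), sep = ", ")

# Country-level rows can be used as they are.
countries <- jhu_like[jhu_like$location_type == "country", ]
```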

In essence, how the raw data is structured will inform you how to split or not split the final tidy data.

Packages to use to help yourself out

If this is your first time writing a package, there are a few packages that will greatly help you develop it. I’ll also presume you’re doing this within RStudio, which has a variety of tools to make your life easier when building and deploying packages.

Files to edit on start:

OK, you’re ready and raring to go. We’re going to write this as if you are using the template. Adapt as needed if you are rolling your own.

The meat of your task for the library

All packages in the covid19R project have, at minimum, two functions. One function returns all of the information about the datasets in the package. The second function refreshes a dataset to the most current version. If there are multiple datasets per package, only one get_info function is needed. However, each dataset should have its own refresh function. This is for two reasons. First, each dataset might require different code to parse it. Second, the covid19R data harvesting scripts use the names of your datasets to dynamically call the refresh functions. Along the way, there are a few other helper files to set up in the R directory.
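The sketch below shows one way this pairing of functions could look, and how a harvesting script might call refresh functions by name. The package name mypackage, the dataset names, the columns, and the refresh_ naming pattern are all hypothetical. The template's own files define the real conventions.

```r
# One info function for the whole (hypothetical) package.
get_info_mypackage <- function() {
  data.frame(
    data_set_name = c("mypackage_states", "mypackage_counties"),
    stringsAsFactors = FALSE
  )
}

# One refresh function per dataset. A real one would download and
# tidy the source data here instead of returning fixed values.
refresh_mypackage_states <- function() {
  data.frame(location = "Massachusetts", value = 300)
}

refresh_mypackage_counties <- function() {
  data.frame(location = c("Suffolk", "Middlesex"), value = c(120, 180))
}

# A harvesting script can build each function name from the dataset
# name and call it dynamically.
info <- get_info_mypackage()
refreshed <- lapply(info$data_set_name, function(name) {
  do.call(paste0("refresh_", name), list())
})
names(refreshed) <- info$data_set_name
```

Because the call is built from the dataset name, a refresh function whose name does not follow the pattern would not be found by this kind of script.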

Three columns ask for info from our controlled vocabulary - data types, location types, and spatial extent of dataset. If you have multiple entries for any of these, separate entries by a comma. This will make it easier for end-users to search through information about all datasets and find yours! If you have new types you need to add to our controlled vocabulary, file an issue with the appropriate template, and we’ll add it! We want to bring in all types of data!
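To show why comma-separated entries are convenient for searching, here is a minimal sketch. The info table, its column names, and the vocabulary terms are hypothetical stand-ins, not the project's actual vocabulary.

```r
# Hypothetical info table combining entries from several packages.
info <- data.frame(
  data_set_name = c("mypackage_states", "otherpackage_global"),
  data_types    = c("cases_total, deaths_total", "cases_total"),
  stringsAsFactors = FALSE
)

# TRUE for each row whose comma-separated entry contains `type`.
has_type <- function(entries, type) {
  vapply(strsplit(entries, ","), function(x) type %in% trimws(x), logical(1))
}

with_deaths <- info$data_set_name[has_type(info$data_types, "deaths_total")]
with_deaths  # "mypackage_states"
```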

Remember, each dataset that your package provides needs one complete set of information.

For different types, we employ a standardized vocabulary which you must conform to. See here for documentation. If you have a data type, location type, or location standard that we do not have, great! We are always looking to expand! Submit an issue and request that we add the new type!

If you would like a local dataset to accompany this package

It is often helpful for a new user to have a demo dataset to work with, rather than having to refresh the whole thing. Also, data source standards sometimes change, and you will want to compare the new incoming data to what it previously looked like. For that reason, in the data-raw directory, we have provided a file DATASET.R, which you can edit for each dataset you scrape to save a frozen version that can be deployed with your package. As it will be static and not updating, we recommend labelling it *_demo, as we have shown in the example. This is not required, but recommended. If you are not going to do this, feel free to delete the data-raw directory as well as R/data.R.
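A rough sketch of freezing a demo copy and later comparing fresh data against it follows. The refresh function and object names are hypothetical stand-ins. The sketch writes to a temporary file so it can run outside a package, whereas a script in data-raw would more typically use usethis::use_data().

```r
# Stand-in for a real refresh function (hypothetical).
refresh_mypackage_states <- function() {
  data.frame(
    date     = as.Date("2020-04-01"),
    location = "Massachusetts",
    value    = 300
  )
}

# Freeze a copy of the data as it looks today.
mypackage_states_demo <- refresh_mypackage_states()

# Inside a package project, data-raw/DATASET.R would typically end with:
#   usethis::use_data(mypackage_states_demo, overwrite = TRUE)
# Here the object is saved to a temporary .rda file instead.
demo_path <- file.path(tempdir(), "mypackage_states_demo.rda")
save(mypackage_states_demo, file = demo_path)

# Later: load the frozen copy and compare its columns to fresh data.
frozen <- new.env()
load(demo_path, envir = frozen)
fresh <- refresh_mypackage_states()
same_columns <- identical(names(fresh), names(frozen$mypackage_states_demo))
```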

Documenting your functions and data

Vignettes

Tests

We have provided two example tests using testthat which provide bare minimum checks in the directory tests and associated subdirectories. Edit and use these to make sure whatever incoming data from your source meets your expectations, particularly as you get this package ready to push to the public. Run the tests using the Tests option in the Build tab in RStudio.
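As an illustration of the kind of bare-minimum check meant here, the sketch below tests a hypothetical refresh function. The function name and the column names are assumptions for the example, so check them against the project's data standard before reusing them.

```r
library(testthat)

# Stand-in for a real refresh function (hypothetical).
refresh_mypackage_states <- function() {
  data.frame(
    date          = as.Date("2020-04-01"),
    location      = "Massachusetts",
    location_type = "state",
    data_type     = "cases_total",
    value         = 300,
    stringsAsFactors = FALSE
  )
}

test_that("refreshed data has the expected shape", {
  dat <- refresh_mypackage_states()
  expected_cols <- c("date", "location", "location_type", "data_type", "value")

  expect_s3_class(dat, "data.frame")
  expect_true(all(expected_cols %in% names(dat)))
  expect_false(anyNA(dat$value))
  expect_true(all(dat$value >= 0))
})
```

Checks like these fail loudly when the upstream source changes its layout, which is usually the first sign that a refresh function needs updating.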

Files to edit and things to do for release to the public

Making your package a part of the Covid19R Project

OK! You’re there! It works, and your build is more or less clean (at least, only notes). Close your issue about developing a new package and… file a new issue to onboard this package with the onboarding template! We’ll take a look, test it out, and if it’s ready, we’ll add it in! Nice work! (and if it’s not, we’ll help you fix it)

Submitting to CRAN

YOUR_PACKAGE_NAME

Badges: Lifecycle: maturing | CRAN status | Travis build status

The YOUR_PACKAGE package harvests the data made freely available by the XXX. See USEFUL_URL_ABOUT_DATA for more.

Installation

ONLY INCLUDE IF SUBMITTED TO/ON CRAN: You can install the released version of YOUR_PACKAGE from CRAN with:

install.packages("YOUR_PACKAGE")

Or the latest development version from GitHub:

devtools::install_github("USER_OR_ORG/YOUR_PACKAGE")

Data

The package has the data from XXXXXX. The package comes with static data that was downloaded at the time of the last package update.

library(YOUR_LIBRARY_NAME)

# Avoids the pipe so no extra package needs to be attached
knitr::kable(head(DATA))

Getting the Most Up to Date Data

To get the most updated data, run the following functions

Columns

The data follows the covid19R standard for tidy Covid-19 data. The data columns are as follows:

Sample visualization



Covid19R/covid19_package_template documentation built on June 29, 2020, 9:37 p.m.