```r
library(countyweather)
options("noaakey" = Sys.getenv("noaakey"))
knitr::opts_chunk$set(collapse = TRUE, error = TRUE)
```
While data from weather stations is available at the specific location of each station, it is often useful to have estimates of daily or hourly weather aggregated at a larger spatial level. For U.S.-based studies, it can be particularly useful to pull time series of weather by county. For example, the health data used in U.S. environmental epidemiology studies is often aggregated at the county level, so the ability to create county-level weather datasets is very useful for environmental epidemiology applications.
This package builds on functions from the `rnoaa` package to identify weather stations within a county based on its FIPS code and then pull weather data for a specified date range from those weather stations. It then does some additional cleaning and aggregating to produce a single, county-level weather dataset. Further, it maps the weather stations used for that county and date range and allows you to create and write datasets for many different counties using a single function call.
If you are pulling weather data from a single weather station, you should use `rnoaa` directly. However, `countyweather` allows you to pull and aggregate data from weather stations more easily at the county level for the U.S.
To use this package, you will need an API key from NOAA to be able to access the weather data. This API key is input with some of your data requests to NOAA within functions in this package. You can request an API key from NOAA here: http://www.ncdc.noaa.gov/cdo-web/token. You should keep this key private.
Once you have this NOAA API key, you'll need to pass it through to some of the functions in this package that pull data from NOAA. The most secure way to use this API key is to store it in your `.Renviron` configuration file. Then you can save it as the value of an object in R code or R Markdown documents without having to include the key itself in the script. To store the NOAA API key in your `.Renviron` configuration file, first check whether you already have an `.Renviron` file in your home directory. You can check this by running the following from your R command line:
```r
any(grepl("^\\.Renviron", list.files("~", all.files = TRUE)))
```
If this call returns `TRUE`, then you already have an `.Renviron` file.
If you already have it, open that file (for example, with `system("open ~/.Renviron")`). If you do not yet have an `.Renviron` file, open a new text file (in RStudio, do this by navigating to File > New File > Text File) and save this text file as `.Renviron` in your home directory. If prompted with a complaint, you DO want to use a filename that begins with a dot (`.`).
Once you have opened or created an `.Renviron` file, type the following into the file, replacing "your_emailed_key" with the actual string that NOAA emails you:
```
noaakey=your_emailed_key
```
Do not put quotation marks or anything else around your key. Do make sure to add a blank line as the last line of the file. If you find you're having problems getting this to work, go back and confirm that you've included a blank line as the last line in your `.Renviron` file; this is the most common reason for this step not working.
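As a quick sanity check on this setup (a sketch only; it assumes a standard home-directory `.Renviron` containing the `noaakey=` line described above), you can confirm from R that the file defines the key:

```r
# Locate the .Renviron file in the home directory
renviron_path <- file.path(Sys.getenv("HOME"), ".Renviron")

if (file.exists(renviron_path)) {
  # readLines() warns about an "incomplete final line" if the file does not
  # end with a newline -- the most common cause of set-up problems here
  renviron_lines <- readLines(renviron_path)
  # TRUE if some line defines noaakey
  any(grepl("^noaakey=", renviron_lines))
}
```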
Next, you'll need to restart R. Once you restart R, you can get the value of this NOAA API key from `.Renviron` anytime with the call `Sys.getenv("noaakey")`. Before using functions that require the API key, set the `noaakey` option to your NOAA API key by running:
```r
options("noaakey" = Sys.getenv("noaakey"))
```
This will pull your NOAA API key from the `.Renviron` file and save it as the `noaakey` option, which functions in this package need to pull weather data from NOAA's web services. You will want to put this line of code as one of the first lines of code in any R script or R Markdown file you write that uses functions from this package.
Weather data is collected at weather stations, and there are often multiple weather stations within a county. The `countyweather` package allows you to pull weather data from all stations in a specified county over a specified date range. The two main functions in the `countyweather` package are `daily_fips` and `hourly_fips`, which pull daily and hourly weather data, respectively. By default, the weather data pulled from all weather stations in a county will then be averaged for each time point to create an average time series of daily or hourly measurements for that county. There is also an option that allows the user to opt out of the default aggregation across weather stations and instead pull separate time series for each weather station in the county; this option is explained in more detail later in this document. Opting out of the default aggregation can be useful if you would like to use a method other than a simple average to aggregate across weather stations within a county.
Throughout, functions in this package identify a county using the county's Federal Information Processing Standard (FIPS) code. FIPS codes are 5-digit codes that uniquely identify every U.S. county. The first two digits of a county FIPS code specify the state and the last three specify the county within the state. This package pulls data based on FIPS designations as of the 2010 Census. Users will not be able to pull data for the few FIPS codes that have undergone substantial changes since 2010; for a list of those codes, see the Census Bureau's summary of these counties for the 2010s.
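For example, the state and county portions of a FIPS code can be extracted with base R string functions (using Miami-Dade's FIPS code, which appears in the examples below):

```r
fips <- "12086"
state_fips <- substr(fips, 1, 2)   # "12", the state code for Florida
county_fips <- substr(fips, 3, 5)  # "086", the county code within the state
```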
Currently, this package can pull daily and hourly weather data for variables like temperature and precipitation. For resources with complete lists of weather variables available through this package, as well as sources of this weather data, see the section later in this document titled "More on the weather data".
The `daily_fips` function can be used to pull daily weather data for all weather stations within the geographic boundaries of a county. This daily weather data comes from NOAA's Global Historical Climatology Network. When pulling data for a county, the user can specify date ranges (`date_min`, `date_max`), which weather variables to include in the output dataset (`var`), and restrictions on how much non-missing data a weather station must have over the time period to be included when generating daily county average values (`coverage`). This function will pull any available data for weather stations in the county under the specified restrictions and output both a dataset of average daily observations across all county weather stations and a map plotting the stations used in the county-wide averaged data.
Here is an example of creating a dataset with daily precipitation for Miami-Dade county (FIPS code = 12086) for August 1992, when Hurricane Andrew struck:
```r
andrew_precip <- daily_fips(fips = "12086", date_min = "1992-08-01",
                            date_max = "1992-08-31", var = "prcp")
save(andrew_precip, file = "data/andrew_precip.RData")
load("data/andrew_precip.RData")
names(andrew_precip)
```
The output from this function call is a list that includes three elements: a daily time series of weather data for the county (`andrew_precip$daily_data`), a dataframe with metadata about the weather stations used to create the time series data as well as statistical information about the weather values pulled from these stations (`andrew_precip$station_metadata`), and a map showing the locations of weather stations included in the county-averaged dataset (`andrew_precip$station_map`).
Here are the first few rows of the `daily_data` dataset:

```r
head(andrew_precip$daily_data)
```
The dataset includes columns for date (`date`), precipitation (in mm, `prcp`), and the number of stations used to calculate each daily average precipitation observation (`prcp_reporting`).
This function performs some simple data cleaning and quality control on the weather data originally pulled from NOAA's web services; see the "More on the weather data" section later in this document for more details, including the units for the weather observations collected by this function.
Here is a plot of this data, with colors used to show the number of stations included in each daily observation:
```r
library(ggplot2)
ggplot(andrew_precip$daily_data, aes(x = date, y = prcp,
                                     color = prcp_reporting)) +
  geom_line() +
  geom_point() +
  theme_minimal() +
  xlab("Date in 1992") +
  ylab("Daily rainfall (mm)") +
  scale_color_continuous(name = "# stations\nreporting")
```
From this plot, you can see both the extreme precipitation associated with Hurricane Andrew (Aug. 24) and that the storm knocked out quite a few of the weather stations normally available.
A map is also included in the output of `daily_fips`, as the `station_map` element, showing the stations used for the county average:

```r
andrew_precip$station_map
```
This map uses U.S. Census TIGER/Line shapefiles (vintage 2011) and functions from the `ggmap` package to overlay weather station locations on a map showing the county's boundaries.
The `station_metadata` dataframe gives information about all of the stations contributing data to the `daily_data` dataframe, as well as information about how the values from each station vary within each weather variable. If a weather station contributes data for multiple variables, it will show up in this dataframe multiple times. Here's what the `station_metadata` dataframe looks like for the `andrew_precip` list:

```r
andrew_precip$station_metadata
```
For each station, the dataframe gives an `id` and `name`, as well as `latitude` and `longitude`. `var` indicates the variable for which the station is pulling data. If a station is contributing data for multiple variables, that station will show up in the dataframe once for each of those variables. For each variable and station combination, the dataframe also shows `calc_coverage`, which is the calculated percent of non-missing values. You can filter on this value using the `daily_fips` option `coverage`. `standard_dev` gives the standard deviation for each sample of weather data from each station and weather variable, `min` and `max` give the minimum and maximum values, and `range` gives the range of these values. These last four statistical calculations (standard deviation, maximum, minimum, and range) are only included for the seven core hourly weather variables (`wind_direction`, `wind_speed`, `ceiling_height`, `visibility_distance`, `temperature`, `temperature_dewpoint`, and `air_pressure`; for more details on these variables, see the "More on the weather data" section below). The values of these columns are set to `NA` for other variables, such as quality flag data.
If you are interested in looking at the weather values for certain stations, you can use the `average_data = FALSE` option in `daily_fips`. For more on this option and a few others, see the "Further options available in the package" section below.
You can use the `hourly_fips` function to pull hourly weather data by county from NOAA's Integrated Surface Data (ISD) weather dataset. In this case, NOAA's web services will not identify weather stations by FIPS, so instead this function pulls all stations within a certain radius of the county's geographic center to represent weather within that county. While there are seven main weather variables that are possible to pull (listed below in the "More on the weather data" section), `temperature` and `wind_speed` tend to be non-missing most often.
An estimated radius is calculated for each county using 2010 U.S. Census Land Area data; each county is assumed to be roughly circular. The calculated radius (in km), as well as the longitude and latitude of the geographic center of each county, are included as elements in the list returned by `hourly_fips`.
Here is an example of pulling hourly data for Miami-Dade for the year of Hurricane Andrew. While daily weather data can be pulled using a date range specified to the day, hourly data can only be pulled by year (for one or multiple years) using the `year` argument:
```r
andrew_hourly <- hourly_fips(fips = "12086", year = 1992,
                             var = c("wind_speed", "temperature"))
save(andrew_hourly, file = "data/andrew_hourly.RData")
load("data/andrew_hourly.RData")
names(andrew_hourly)
```
The output from this call is a list object that includes six elements. `andrew_hourly$hourly_data` is an hourly time series of weather data for the county. The other five elements, `station_metadata`, `station_map`, `radius`, `lat_center`, and `lon_center`, are explained in more detail below.
Here are the first few rows of the `hourly_data` dataset:

```r
head(andrew_hourly$hourly_data)
```
If you need to get the timestamp of each observation in local time, you can use the `add_local_time` function from the `countytimezones` package to do that:
```r
library(countytimezones)
andrew_hourly_data <- as.data.frame(andrew_hourly$hourly_data)
andrew_hourly_data <- add_local_time(df = andrew_hourly_data, fips = "12086",
                                     datetime_colname = "date_time")
save(andrew_hourly_data, file = "data/andrew_hourly_data.RData")
load("data/andrew_hourly_data.RData")
head(andrew_hourly_data)
```
Here is a plot of hourly wind speeds for Miami-Dade County, FL, for the month of Hurricane Andrew:
```r
library(dplyr)
library(lubridate)

to_plot <- andrew_hourly$hourly_data %>%
  filter(months(date_time) == "August")
ggplot(to_plot, aes(x = date_time, y = wind_speed,
                    color = wind_speed_reporting)) +
  geom_line() +
  theme_minimal() +
  xlab("Date in August 1992") +
  ylab("Wind speed (m / s)") +
  scale_color_continuous(name = "# stations\nreporting")
```
Again, the intensity of conditions during Hurricane Andrew is clear, as is the reduction in the number of reporting stations during the storm.
The list object returned by `hourly_fips` also includes a map of station locations (`station_map`):

```r
andrew_hourly$station_map
```
Because hourly data is pulled by radius from each county's geographic center, this plot includes the calculated radius from which stations are pulled. This radius is calculated for each county using 2010 U.S. Census Land Area data. U.S. Census TIGER/Line shapefiles, used to provide county outlines, are included on this plot as well. Because stations are pulled within a radius of the county's center, stations from outside the county's boundaries may sometimes provide data for that county.
Other list elements returned by `hourly_fips` include `station_metadata`, `radius`, `lat_center`, and `lon_center`. `radius` is the estimated radius (in km) for the county, calculated using 2010 U.S. Census Land Area data under the assumption that the county is roughly circular. `lat_center` and `lon_center` are the latitude and longitude of the geographic center of the county, respectively.
The `station_metadata` dataframe gives information about all of the stations contributing data to the `hourly_data` dataframe, as well as information about how the values from each station vary within each weather variable. If a weather station contributes data for multiple variables, it will show up in this dataframe multiple times. Here's what the `station_metadata` dataframe looks like for the `andrew_hourly` list:

```r
andrew_hourly$station_metadata
```
`usaf` and `wban` are station IDs. `station` is a unique identifier for each station: the `usaf` and `wban` IDs pasted together, separated by "-". (Note: values for `wban` or `usaf` are sometimes missing, originally indicated by "99999" or "999999", which could result in a `station` value like `722024-NA`.) `station_name` is the name for each station, and `var` indicates the variable for which the station is pulling data. If a station is contributing data for multiple variables, that station will show up in the dataframe once for each of those variables. For each variable and station combination, the dataframe also shows `calc_coverage`, which is the calculated percent of non-missing values. You can filter on this value using the `hourly_fips` option `coverage`. `standard_dev` gives the standard deviation for each sample of weather data from each station and weather variable, and `range` gives the range of these values. Here, we can see in row 7 that the OPA LOCKA station has a very low percent coverage for temperature (0.0018) and a correspondingly high standard deviation (17.94). If you are interested in looking at the weather values for certain stations, you can use the `average_data = FALSE` option in `hourly_fips`. For more on this option and a few others, see the "Further options available in the package" section below. The dataframe also gives each station's country, state, elevation (in meters), the earliest and latest dates for which the station has available data (`begin` and `end`, respectively), longitude, and latitude.
There are a few functions that allow the user to write out daily or hourly time series datasets for many different counties to a specified local directory, as well as plots of this data. For daily weather data, see the functions `write_daily_timeseries` and `plot_daily_timeseries`. For hourly data, see `write_hourly_timeseries` and `plot_hourly_timeseries`.
For example, if we wanted to compare daily weather in the month of August for three counties in southern Florida, we could run:
```r
fl_counties <- c("12086", "12087", "12011")
write_daily_timeseries(fips = fl_counties, date_min = "1992-08-01",
                       date_max = "1992-08-31", var = "prcp",
                       out_directory = "~/Documents/andrew_data")
```
The `write_daily_timeseries` function saves each county's time series as a separate file in a subdirectory called "data" of the directory specified in the `out_directory` option. The `data_type` argument allows the user to specify either .rds or .csv files (the default is to write .rds files). Each file is a time series dataframe of daily weather data. A dataframe of station metadata is saved in a second subdirectory called "metadata", and maps showing the locations of weather stations contributing to each time series are saved in a subdirectory called "maps". If you include a county without available data in the `fips` argument, no file will be created for that county.
The function `plot_daily_timeseries` creates and saves plots for each of these files. (Note: the `data_type` argument for this function also defaults to reading .rds files, so if you chose to write .csv files, make sure to change that argument in this function as well, to `data_type = "csv"`.)
```r
plot_daily_timeseries("prcp", data_directory = "~/Documents/andrew_data/data",
                      plot_directory = "~/Documents/andrew_data/plots",
                      date_min = "1992-08-01", date_max = "1992-08-31")
```
Here's an example of what the timeseries plots for the three Florida counties would look like:
```r
two <- daily_fips(fips = "12087", date_min = "1992-08-01",
                  date_max = "1992-08-31", var = "prcp")
save(two, file = "data/two.RData")
load("data/two.RData")

three <- daily_fips(fips = "12011", date_min = "1992-08-01",
                    date_max = "1992-08-31", var = "prcp")
save(three, file = "data/three.RData")
load("data/three.RData")

oldpar <- par(mfrow = c(1, 3))
df <- andrew_precip$daily_data
plot(df$date, df$prcp, type = "l", col = "red", main = "12086",
     xlab = "date", ylab = "prcp",
     xlim = c(as.Date("1992-08-01"), as.Date("1992-08-31")))
df2 <- two$daily_data
plot(df2$date, df2$prcp, type = "l", col = "red", main = "12087",
     xlab = "date", ylab = "prcp",
     xlim = c(as.Date("1992-08-01"), as.Date("1992-08-31")))
df3 <- three$daily_data
plot(df3$date, df3$prcp, type = "l", col = "red", main = "12011",
     xlab = "date", ylab = "prcp",
     xlim = c(as.Date("1992-08-01"), as.Date("1992-08-31")))
par(oldpar)
```
`coverage`

For `hourly_fips`, `daily_fips`, and the time series functions, the user can choose to filter out any stations that report variables for less than a certain percent of the specified date range (`coverage`). For example, if you were to set `coverage` to 0.90, only stations that reported non-missing values at least 90% of the time over the specified date range would be included in your data.
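For example, a call like the following (a sketch of the Miami-Dade precipitation pull from earlier, not run here) would keep only stations reporting precipitation on at least 90% of days in the range:

```r
# Stations with less than 90% non-missing prcp values are dropped
# before the county average is calculated
andrew_precip_90 <- daily_fips(fips = "12086", date_min = "1992-08-01",
                               date_max = "1992-08-31", var = "prcp",
                               coverage = 0.90)
```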
`average_data`

In both `daily_fips` and `hourly_fips`, the default is to return a single daily average for the county for each day in the time series, giving the value averaged across all available stations on that day. However, there is also an option called `average_data` which allows the user to specify whether they would like the weather data returned before it has been averaged across stations. If this argument is set to `FALSE`, the functions will return separate daily data for each station in the county. For our Hurricane Andrew example, we can specify `average_data = FALSE`:
```r
not_averaged <- daily_fips(fips = "12086", date_min = "1992-08-01",
                           date_max = "1992-08-31", var = "prcp",
                           average_data = FALSE, station_label = TRUE)
save(not_averaged, file = "data/not_averaged.RData")
load("data/not_averaged.RData")
not_averaged_data <- not_averaged$daily_data
head(not_averaged_data)
unique(not_averaged_data$id)
```
In this example, there are six stations contributing weather data to the time series. We can plot the data by station, with a different color for each station, to get a sense of how values from the stations compare and which stations were presumably knocked out by the storm:
```r
library(ggplot2)
ggplot(not_averaged_data, aes(x = date, y = prcp, colour = id)) +
  geom_line() +
  theme_minimal()
```
It might be interesting here to compare this plot with the station map, this time with station labels included (done using `station_label = TRUE` when we pulled this data using `daily_fips`):

```r
not_averaged$station_map
```
The hourly Integrated Surface Data includes quality codes for each of the main weather variables. (For more information about the hourly weather variables, see the "More on the weather data" section below.) We can use these codes to remove suspect or erroneous values from our data. The values in `wind_speed_quality`, for example, take on the following values, pulled from the ISD documentation file:
```r
library(knitr)
qual <- data.frame(code = c(0, 1, 2, 3, 4, 5, 6, 7, 9),
                   definition = c("Passed gross limits check",
                                  "Passed all quality control checks",
                                  "Suspect",
                                  "Erroneous",
                                  "Passed gross limits check, data originate from an NCEI data source",
                                  "Passed all quality control checks, data originate from an NCEI data source",
                                  "Suspect, data originate from an NCEI data source",
                                  "Erroneous, data originate from an NCEI data source",
                                  "Passed gross limits check if element is present"))
kable(qual, format = "markdown")
```
Because it doesn't make sense to average these codes across stations, the codes should only be pulled when using the option to pull station-specific values (`average_data = FALSE`).
```r
ex <- hourly_fips("12086", 1992, var = c("wind_speed", "wind_speed_quality"),
                  average_data = FALSE)
save(ex, file = "data/ex.RData")
load("data/ex.RData")
ex_data <- ex$hourly_data
head(ex_data)
```
We can replace all wind speed observations with quality codes of 2, 3, 6, or 7 with `NA`s:

```r
ex_data$wind_speed_quality <- as.numeric(ex_data$wind_speed_quality)
ex_data$wind_speed[ex_data$wind_speed_quality %in% c(2, 3, 6, 7)] <- NA
```
Functions in this package that pull daily weather values (`daily_fips()`, for example) pull data from the Daily Global Historical Climatology Network (GHCN-Daily) through NOAA's FTP server. The data is archived at the National Centers for Environmental Information (NCEI), formerly the National Climatic Data Center (NCDC), and spans from the 1800s to the current year.
Users can specify which weather variables they would like to pull. The five core daily weather variables are precipitation (`prcp`), snowfall (`snow`), snow depth (`snwd`), maximum temperature (`tmax`), and minimum temperature (`tmin`). The daily weather data is filtered so that included weather variables fall within a range of possible values; these ranges were chosen to include national maximum recorded values.
```r
library(knitr)
daily_vars <- data.frame(variables = c("prcp", "snow", "snwd", "tmax", "tmin"),
                         description = c("precipitation", "snowfall",
                                         "snow depth", "maximum temperature",
                                         "minimum temperature"),
                         units = c("mm", "mm", "mm", "degrees Celsius",
                                   "degrees Celsius"),
                         most_extreme_value = c("1100 mm", "1600 mm",
                                                "11500 mm", "57 degrees C",
                                                "-62 degrees C"))
kable(daily_vars, format = "markdown",
      col.names = c("Variable", "Description", "Units", "Most extreme value"))
```
`tmax`, `tmin`, and `prcp` were originally recorded in tenths of units (tenths of degrees Celsius and tenths of millimeters, respectively), and are listed as such in NOAA documentation. These values are converted to standard units (degrees Celsius and mm) in `countyweather` output.
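As a quick illustration of this conversion (the raw value here is made up for the example):

```r
# A raw GHCN-Daily tmax value of 253 means 253 tenths of a degree Celsius;
# countyweather output would report this as 25.3 degrees C
raw_tmax <- 253
tmax_celsius <- raw_tmax / 10
tmax_celsius  # 25.3
```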
There are several additional, non-core variables available. For example, `acmc` gives the "average cloudiness midnight to midnight from 30-second ceilometer data (percent)." The complete list of available weather variables can be found under 'element' in the GHCND readme file.
While functions in this package return a cleaned and aggregated dataset, @menne2012overview give more information about the raw data in the GHCND database.
Hourly weather data in this package is pulled from NOAA's Integrated Surface Data (ISD), and is available from 1901 to the current year. The data is archived at the National Centers for Environmental Information (NCEI) (formerly the National Climatic Data Center (NCDC)), and is also pulled through NOAA's FTP server.
The seven core hourly weather variables are `wind_direction`, `wind_speed`, `ceiling_height`, `visibility_distance`, `temperature`, `temperature_dewpoint`, and `air_pressure`. Values in the following table were pulled from the ISD documentation file.
```r
library(pander)
hourly_vars <- data.frame(
  Variable = c("wind_direction", "wind_speed", "ceiling_height",
               "visibility_distance", "temperature", "temperature_dewpoint",
               "air_pressure"),
  Description = c("The angle, measured in a clockwise direction, between true north and the direction from which the wind is blowing",
                  "The rate of horizontal travel of air past a fixed point",
                  "The height above ground level of the lowest cloud or obscuring phenomena layer aloft with 5/8 or more summation total sky cover, which may be predominately opaque, or the vertical visibility into a surface-based obstruction",
                  "The horizontal distance at which an object can be seen and identified",
                  "The temperature of the air",
                  "The temperature to which a given parcel of air must be cooled at constant pressure and water vapor content in order for saturation to occur",
                  "The air pressure relative to Mean Sea Level"),
  Units = c("Angular Degrees", "Meters per Second", "Meters", "Meters",
            "Degrees Celsius", "Degrees Celsius", "Hectopascals"),
  Minimum = c("1", "0", "0", "0", "-93.2", "-98.2", "860"),
  Maximum = c("360", "90", "22000 (indicates 'Unlimited')", "160000", "61.8",
              "36.8", "1090"))
pander::pander(hourly_vars, split.cell = 75, split.table = Inf)
```
There are other columns available in addition to these weather variables, such as quality codes (e.g., `wind_direction_quality`); each of the main weather variables has a corresponding quality code that can be pulled by adding `_quality` to the end of the variable name.
For more information about the weather variables described in the above table and other available columns, see the ISD documentation file.
The following warning message will come up after running functions pulling daily data if there isn't available data (for your specified date range, coverage, and weather variables) for a particular station or stations:

```
In rnoaa::meteo_pull_monitors(monitors = stations, keep_flags = FALSE, :
  The following stations could not be pulled from the GHCN ftp:
  USR0000FTEN
  Any other monitors were successfully pulled from GHCN.
```
The following error message will come up after running functions pulling hourly data (`hourly_fips()`) if there isn't available data for any of the stations in your specified county. Note: some weather variables tend to be missing more often than others.

```
Error in isd_monitors_data(fips = fips, year = x, var = var, radius = radius):
  None of the stations had available data.
```
The following error message will come up after running `write_daily_timeseries` or `write_hourly_timeseries` if the function is unable to pull data for a particular FIPS code in your `fips` vector:

```
Unable to pull weather data for FIPS code "(specified fips code)" for the specified percent coverage, year(s), and/or weather variables.
```
If you run functions that use NOAA API calls without first requesting an API key from NOAA and setting up the key in your R session, you will see the following error message:
```
Error in getOption("noaakey", stop("need an API key for NOAA data")) :
  need an API key for NOAA data
```
You might also see this warning message:
```
Warning message:
Error: (400) - Token parameter is required.
```
If you get one of these messages, run the code:
```r
options("noaakey" = Sys.getenv("noaakey"))
```
and then try again. If you still get an error, you may not have set up your NOAA API key correctly in your `.Renviron` file. See the "Required set-up" section of this document for more details on doing that correctly.
Sometimes, some of NOAA's web services will be off-line. In this case, you may get an error message when you try to pull data like:
```
Error in gzfile(file, mode) : cannot open the connection
```
or
```
Error in tt$results : $ operator is invalid for atomic vectors
In addition: Warning message:
Error: (500) - An error occured while servicing your request.
```
In this case, re-starting your R session might fix the problem. If not, wait a few hours and then try again.
If you get other error messages or run into problems with this package, please submit a reproducible example on this repository's Issues page.