knitr::opts_chunk$set(echo = TRUE)
library(senamhiR)
library(dplyr)

The package provides an automated solution for the acquisition of archived Peruvian climate and hydrology data directly within R. The data was compiled from the Senamhi website, and contains all of the data that was available as of April 10, 2018. This data was originally converted from HTML, and is now accessible via an API hosted by the package author.

It is important to note that the data on the Senamhi website has not undergone quality control; however, this package includes a helper function to perform the most common quality control operations on the temperature variables. More functions will be added in the future.

Installing

This package is under active development, and is not available from the official Comprehensive R Archive Network (CRAN). To make installation easier, I have written a script that should facilitate the installation of the package and its dependencies. Use the following command to run this script:

``` {r, eval = FALSE}
source("https://gitlab.com/snippets/1793256/raw")
install("senamhiR")
```

_Note: It is always a good idea to review code before you run it. Click the URL in the above command to see the commands that we will run to install._

Once the package and its dependencies have been installed, load **senamhiR** with:
``` {r, eval = FALSE}
library(senamhiR)
```

Basic workflow

The functions contained in the senamhiR package allow for the discovery and visualization of meteorological and hydrological stations, and the acquisition of daily climate data from these stations.

station_search()

To search for a station by name, use the station_search() function. For instance, to search for a station with the word 'Santa' in the station name, use the following code:

station_search("Santa")

Note that the tbl_df object (a special sort of data.frame) won't print more than the first 10 rows by default. To see all of the results, you can wrap the command in View() so that it becomes View(station_search("Santa")).

Note that you can also use wildcards as supported by glob2rx() from the utils package by passing the argument glob = TRUE, as in the following example.

station_search("San*", glob = TRUE)

You can filter your search results by region, by station type, by a given period, and by proximity to another station or a vector of coordinates. You can use any combination of these four filters in your search (a combined example follows the single-filter examples below). The function is fully documented, so take a look at ?station_search. Let's see some examples.

Find all stations in the San Martín Region

station_search(region = "SAN MARTIN")

Find stations named "Santa", with data available from 1971 to 2000

station_search("Santa", period = 1971:2000)

Find all stations between 0 and 100 km from Station No. 000401

station_search(target = "000401", dist = 0:100)

Find all stations that are within 50 km of Machu Picchu

station_search(target = c(-13.163333, -72.545556), dist = 0:50)
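As noted above, these filters can also be combined in a single search. A hypothetical combined example (the region name and period here are purely illustrative) might look like:

station_search("Santa", region = "ANCASH", period = 1981:2010)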

Acquire data: senamhiR()

Once you have found your station of interest, you can download the daily data using the eponymous senamhiR() function. The function takes two arguments, station and year. If year is left blank, the function will return all available archived data.

If I wanted to download data for Requena (station no. 000280) from 1981 to 2010, I could use:

requ <- senamhiR("000280", 1981:2010)

Note: Since the StationID numbers contain leading zeros, any station ID that is fewer than six characters long will be padded with zeros, e.g. 280 becomes 000280.
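If you ever need to produce a padded ID yourself (for instance, to build a vector of station IDs for other tools), base R can do this; this is just a convenience tip, not something the package requires:

formatC(280, width = 6, flag = "0")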

requ

Make sure to use the assignment operator (<-) to save the data into an R object, otherwise the data will just print out to the console and won't be saved in memory.

For easier station visualization

map_stations()

Sometimes a long list of stations is hard to visualize spatially. The map_stations() function helps to overcome this. It takes a list of stations and shows them on a map powered by the Leaflet library. Like the other functions, it can take the result of a station search directly as its list of stations, as per the example below. Note that this mapping functionality requires the leaflet package, which is not installed as a dependency of senamhiR.

Show a map of all stations that are between 30 and 50 km of Machu Picchu

map_stations(station_search(target = c(-13.163333, -72.545556), dist = 30:50), zoom = 7)

Quality control functions

There are two functions included to perform some basic quality control.

quick_audit()

The quick_audit() function will return a tibble listing the percentage or number of missing values for a station. For instance, the following command will return the percentage of missing values in our 30-year Requena data set:

quick_audit(requ, c("Tmax", "Tmin"))

Use report = "n" to show the number of missing values. Use by = "month" or by = "year" to show missing data by month or year. For instance, the number of days for which Mean Temperature was missing at Tocache in 1980:

``` {r, eval = FALSE}
toca <- senamhiR("000463", year = 1980)
quick_audit(toca, "Tmean", by = "month", report = "n")
```

qc()

There is an incomplete and experimental function to perform automated quality control on data acquired through this package. For now, the package tests temperature and river level only. The logic used for these two types of data is different. Note that these methods are not necessarily statistically robust, and have not been subjected to rigorous testing. Your mileage may vary. In all cases, the original values are archived in an "Observations" column, so you can always restore the original values manually.

Temperature variables

Case 1: Missing decimal point

Any number above 100 °C or below -100 °C is tested:

If the number appears to be missing a decimal point (e.g. 324 -> 32.4; 251 -> 25.1), we try dividing that number by 10. If the result is within 1.5 standard deviations of all values 30 days before and after the day in question, we keep the result; otherwise, we discard it.

If the number seems to be the result of some other typographical error (e.g. 221.2), we discard the data point.
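As a rough illustration of this Case 1 logic (a minimal sketch, not the package's actual implementation), a check along these lines could be written as follows, where x is a daily temperature series and i is the index of the suspect value:

``` {r, eval = FALSE}
# Sketch of a "missing decimal point" check: divide the suspect value by 10
# and keep it only if it falls within 1.5 standard deviations of the values
# observed 30 days before and after. Assumes `x` is a plain numeric vector.
fix_missing_decimal <- function(x, i, window = 30, n_sd = 1.5) {
  candidate <- x[i] / 10                                   # e.g. 324 -> 32.4
  idx <- max(1, i - window):min(length(x), i + window)
  neighbours <- x[setdiff(idx, i)]
  if (abs(candidate - mean(neighbours, na.rm = TRUE)) <=
      n_sd * sd(neighbours, na.rm = TRUE)) candidate else NA
}
```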

Case 2: Tmax < Tmin

We perform the same tests for both Tmax and Tmin. If the number is within 1.5 standard deviations of all values 30 days before and after the day in question, we leave the number alone. (Note: this is often the case for Tmin but seldom the case for Tmax.) If the number does not fall within 1.5 standard deviations, we perform an additional test to check whether the number is the result of a premature decimal point (e.g. 3.4 -> 34.0; 3 -> 30.0). In this case, we try multiplying the number by 10. If the new result is within 1.5 standard deviations of all values 30 days before and after the day in question, we keep the result; otherwise, we discard it.

I have less confidence in this solution than I do for Case 1.
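For completeness, here is a similarly rough sketch of the Case 2 decision (again, an illustration only, not the package's code): the suspect value is kept if it already looks plausible, multiplied by 10 if that makes it plausible, and discarded otherwise.

``` {r, eval = FALSE}
# Sketch of the "premature decimal point" check for a suspect value `x[i]`.
fix_premature_decimal <- function(x, i, window = 30, n_sd = 1.5) {
  idx <- max(1, i - window):min(length(x), i + window)
  neighbours <- x[setdiff(idx, i)]
  centre <- mean(neighbours, na.rm = TRUE)
  spread <- sd(neighbours, na.rm = TRUE)
  if (abs(x[i] - centre) <= n_sd * spread) return(x[i])    # leave the value alone
  candidate <- x[i] * 10                                   # e.g. 3.4 -> 34.0
  if (abs(candidate - centre) <= n_sd * spread) candidate else NA
}
```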

Example:

``` {r, eval = FALSE}
requ_dirty <- senamhiR("000280") # 1960 to 2018
requ_qc <- qc(requ_dirty)
requ_qc %>% filter(Observations != "") %>% select(Fecha, `Tmax (C)`, `Tmin (C)`, `Tmean (C)`, Observations)
```
Cases that are currently missed:
Cases where this function is plain wrong:

River level:

Case 1: Suspected decimal place shift

The function first calculates the daily range in river level across the four daily observations. If any range is greater than the (somewhat arbitrary) value of ten times the average range, we extract a slice of the level observations corresponding to two days before and two days after the day in question. We standardize the slice of data; if any single standardized value is above 1 (or below -1), we try to multiply (or divide) the value by 10. If the new value falls within 1.5 standard deviations of the mean of the "good" values, we keep the modified value and label it as a decimal place error; otherwise, we set the value to missing and label it as an error.
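To make that sequence of steps concrete, here is a minimal sketch of the slice test described above (not the package's actual implementation), assuming `lev` is the numeric vector of level observations for the suspect day plus two days on either side:

``` {r, eval = FALSE}
# Standardize the slice, flag values whose standardized score exceeds 1 in
# magnitude, try shifting the decimal place, and keep the shifted value only
# if it falls within 1.5 standard deviations of the "good" observations.
check_level_slice <- function(lev, n_sd = 1.5) {
  z <- as.vector(scale(lev))
  suspect <- abs(z) > 1
  good <- lev[!suspect]
  shifted <- ifelse(z > 1, lev / 10, ifelse(z < -1, lev * 10, lev))
  ok <- abs(shifted - mean(good)) <= n_sd * sd(good)
  ifelse(!suspect, lev, ifelse(ok, shifted, NA))
}
```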

Example:

``` {r, eval = FALSE}
options(tibble.width = Inf)
pico_dirty <- senamhiR("230715") # 2003 to 2018
pico_qc <- qc(pico_dirty)
pico_qc %>% filter(Observations != "") %>% select(Fecha, starts_with("Nivel"), Observations)
```
Cases that are currently missed:
Cases where this function is plain wrong:

Variables controlled for:

Beyond the temperature and river-level checks described above, no other variables are currently tested. This data should not be considered "high quality"; use of the data is your own responsibility. Note that all values that are modified from their original values will be recorded in a new "Observations" column in the resulting tibble.

Disclaimer

The package outlined in this document is published under the GNU General Public License, version 3 (GPL-3.0). The GPL is an open source, copyleft license that allows for the modification and redistribution of original works. Programs licensed under the GPL come with NO WARRANTY. In our case, a simple R package isn't likely to blow up your computer or kill your cat. Nonetheless, it is always a good idea to pay attention to what you are doing, to ensure that you have downloaded the correct data, and that everything looks ship-shape.

What to do if something doesn't work

If you run into an issue while you are using the package, you can email me and I can help you troubleshoot it. However, if the issue is related to the package code rather than an error on your part, you should contribute back to the open source community by reporting the issue. You can report any issues to me here on GitLab.

If that seems like a lot of work, just think about how much work it would have been to do everything this package does for you, or how much time went into writing these functions ... it is more than I'd like to admit!

Senamhi terms of use

Senamhi's terms of use are here, but as of this writing that link was redirecting to the Senamhi home page. An archived version is available here. The terms allow free and public access to the information on the Senamhi website, in both for-profit and non-profit applications. However, Senamhi stipulates that any use of the data must be accompanied by a disclaimer that Senamhi is the proprietor of the information. The following text is recommended (official text in Spanish):

A message similar to the English message above is printed to the R console whenever the package is loaded.


