R-language tools for working with THREDDS-served data from NERACOOS.
remotes::install_github("BigelowLab/neracoos")
library(neracoos)
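# Open the buoy A01 "CTD 1m" dataset as an ncdf4 connection, then read the
# time dimension and selected variables into a table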
X <- get_Maine_buoy(buoy = "A01", dataset = "CTD 1m", form = 'ncdf')
x <- nc_get_table(X,
                  dnames = "time",
                  vnames = c("conductivity",
                             "temperature",
                             "salinity",
                             "sigma_t"))
# # A tibble: 274,400 x 5
#    time                conductivity temperature salinity sigma_t
#    <dttm>                     <dbl>       <dbl>    <dbl>   <dbl>
#  1 2001-07-10 04:00:00         40.2        17.0     30.9    22.4
#  2 2001-07-10 05:00:01         40.2        17.0     31.0    22.4
#  3 2001-07-10 06:00:00         40.1        16.9     31.0    22.4
#  4 2001-07-10 07:00:00         40.0        16.8     30.9    22.5
#  5 2001-07-10 08:00:01         39.8        16.6     30.9    22.5
#  6 2001-07-10 09:00:00         39.4        16.2     30.9    22.5
#  7 2001-07-10 10:00:00         39.3        16.1     30.9    22.5
#  8 2001-07-10 11:00:01         38.9        15.6     30.9    22.6
#  9 2001-07-10 12:00:00         38.7        15.5     30.9    22.7
# 10 2001-07-10 13:00:00         39.0        15.7     30.8    22.6
# # … with 274,390 more rows
You can also retrieve the data as a tsibble. Observe that we retrieve one fewer record than for the tibble output. That's because transformation to a time series disallows duplicates, so we lose one record that is duplicated by time. tsibble does allow duplicated times as long as a key variable is also assigned - a key is like a grouping variable. In our simple usage we don't assign a key, so duplicates are disallowed.
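As a toy illustration of the key behavior (this example uses the tsibble package directly and is not part of the neracoos workflow):
library(tsibble)
# two records share the same timestamp
z <- tibble::tibble(
  time  = as.POSIXct(c("2001-07-10 04:00:00", "2001-07-10 04:00:00"), tz = "UTC"),
  buoy  = c("A01", "B01"),
  value = c(1, 2))
# as_tsibble(z, index = time)            # fails: duplicated index with no key to distinguish the rows
as_tsibble(z, index = time, key = buoy)  # works: each buoy is treated as its own series
Returning to the buoy data: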
x <- nc_get_table(X,
                  dnames = "time",
                  vnames = c("conductivity",
                             "temperature",
                             "salinity",
                             "sigma_t"),
                  form = "tsibble")
# # A tsibble: 274,399 x 5 [!] <UTC>
#    time                conductivity temperature salinity sigma_t
#    <dttm>                     <dbl>       <dbl>    <dbl>   <dbl>
#  1 2001-07-10 04:00:00         40.2        17.0     30.9    22.4
#  2 2001-07-10 05:00:01         40.2        17.0     31.0    22.4
#  3 2001-07-10 06:00:00         40.1        16.9     31.0    22.4
#  4 2001-07-10 07:00:00         40.0        16.8     30.9    22.5
#  5 2001-07-10 08:00:01         39.8        16.6     30.9    22.5
#  6 2001-07-10 09:00:00         39.4        16.2     30.9    22.5
#  7 2001-07-10 10:00:00         39.3        16.1     30.9    22.5
#  8 2001-07-10 11:00:01         38.9        15.6     30.9    22.6
#  9 2001-07-10 12:00:00         38.7        15.5     30.9    22.7
# 10 2001-07-10 13:00:00         39.0        15.7     30.8    22.6
# # … with 274,389 more rows
Let's be sure to close the ncdf4 connection now that we have our data in hand.
nc_close(X)
Note that tsibble supports both regular and irregular time series. You can see that this one is irregular, as the time spacing shows some variability. But note also the [!] in the printout header; if this were a regular time series, the interval would be shown there instead. We can regularize this time series by judicious rounding, but first we should know more about the sampling frequency.
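If you prefer to check the regularity programmatically rather than by reading the header, tsibble's own helpers can report it (a quick sketch):
# is the index flagged as regular, and what interval does tsibble record?
tsibble::is_regular(x)   # expected FALSE here, matching the [!] flag in the header
tsibble::interval(x)     # the interval object shown in the printout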
Let's find the differences between consecutive observations in minutes.
dt <- diff_time(x$time, units = "mins")
summary(as.numeric(dt))
#      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
#     30.00     30.00     30.00     37.17     30.00 283350.00
Well, that is surprising... the median is 30 minutes, but the head of the table would lead us to believe the observations are hourly. And the large maximum tells us there is a break in the series somewhere that is about 197 days long.
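# count how many of the consecutive differences fall into 30-minute wide bins
# starting at 15 minutes (the last bin is open ended)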
ints_30 <- seq(from = 15, to = 120 + 30, by = 30)
ix <- as.vector(table(findInterval(as.numeric(dt), ints_30)))
names(ix) <- c("15-44", "45-74", "75-104", "105-134", "135-Inf")
ix
#  15-44 45-74 75-104 105-134 135-Inf
# 223578 50758      2      16      44
So, most, but not all, are within a +/-15 minute band around 30 minutes (between 15 and 44 minutes). It's a mixed bag. Let's see what the actual whole minute distribution looks like.
m <- format(x$time, "%M")
table(m)
#     00    29    30    59
# 110773 37269 74535 51822
Ah, so they are pretty much on the hour and the half hour. We can reasonably assign them to the nearest 30-minute interval.
x <- regularize(x, unit = "30 min")
# # A tsibble: 274,399 x 5 [30m] <UTC>
#    time                conductivity temperature salinity sigma_t
#    <dttm>                     <dbl>       <dbl>    <dbl>   <dbl>
#  1 2001-07-10 04:00:00         40.2        17.0     30.9    22.4
#  2 2001-07-10 05:00:00         40.2        17.0     31.0    22.4
#  3 2001-07-10 06:00:00         40.1        16.9     31.0    22.4
#  4 2001-07-10 07:00:00         40.0        16.8     30.9    22.5
#  5 2001-07-10 08:00:00         39.8        16.6     30.9    22.5
#  6 2001-07-10 09:00:00         39.4        16.2     30.9    22.5
#  7 2001-07-10 10:00:00         39.3        16.1     30.9    22.5
#  8 2001-07-10 11:00:00         38.9        15.6     30.9    22.6
#  9 2001-07-10 12:00:00         38.7        15.5     30.9    22.7
# 10 2001-07-10 13:00:00         39.0        15.7     30.8    22.6
# # … with 274,389 more rows
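Conceptually, the regularization amounts to rounding each timestamp to the nearest 30 minutes and rebuilding the index. A rough sketch of that idea using lubridate (an illustration only, not necessarily how regularize() is implemented):
x_rounded <- x |>
  tibble::as_tibble() |>                                               # set the tsibble index aside
  dplyr::mutate(time = lubridate::round_date(time, "30 minutes")) |>   # snap to the nearest half hour
  dplyr::distinct(time, .keep_all = TRUE) |>                           # rounding can create duplicate times
  tsibble::as_tsibble(index = time)                                    # rebuild with a regular 30-minute index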
Regularizing allows for gap analysis and (simple) filling.
has_gaps(x)
# # A tibble: 1 x 1
#   .gaps
#   <lgl>
# 1 TRUE
scan_gaps(x)
# # A tsibble: 65,555 x 1 [30m] <UTC>
#    time
#    <dttm>
#  1 2001-07-10 04:30:00
#  2 2001-07-10 05:30:00
#  3 2001-07-10 06:30:00
#  4 2001-07-10 07:30:00
#  5 2001-07-10 08:30:00
#  6 2001-07-10 09:30:00
#  7 2001-07-10 10:30:00
#  8 2001-07-10 11:30:00
#  9 2001-07-10 12:30:00
# 10 2001-07-10 13:30:00
# # … with 65,545 more rows
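For a more compact view, tsibble also provides count_gaps(), which summarizes each contiguous run of missing times (output not shown here):
# one row per contiguous gap, with .from, .to and .n columns
count_gaps(x)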
y <- fill_gaps(x,
               conductivity = median(conductivity, na.rm = TRUE),
               salinity = -99)
# # A tsibble: 339,954 x 5 [30m] <UTC>
#    time                conductivity temperature salinity sigma_t
#    <dttm>                     <dbl>       <dbl>    <dbl>   <dbl>
#  1 2001-07-10 04:00:00         40.2        17.0     30.9    22.4
#  2 2001-07-10 04:30:00         34.7          NA      -99      NA
#  3 2001-07-10 05:00:00         40.2        17.0     31.0    22.4
#  4 2001-07-10 05:30:00         34.7          NA      -99      NA
#  5 2001-07-10 06:00:00         40.1        16.9     31.0    22.4
#  6 2001-07-10 06:30:00         34.7          NA      -99      NA
#  7 2001-07-10 07:00:00         40.0        16.8     30.9    22.5
#  8 2001-07-10 07:30:00         34.7          NA      -99      NA
#  9 2001-07-10 08:00:00         39.8        16.6     30.9    22.5
# 10 2001-07-10 08:30:00         34.7          NA      -99      NA
# # … with 339,944 more rows
You can see that the time series has been populated with missing records at 30-minute intervals, and that the variables have been assigned either NA or a value where one was specified. For instance, where gaps were filled for salinity we assigned -99, while for conductivity we replaced the missing values with the median conductivity of the gappy input.
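If constant fill values are not what you want, the gap-filling step composes with other fillers. For example, a last-observation-carried-forward fill via tidyr (a sketch, not part of the output above):
# insert the missing records as NA, then carry the previous observation forward
y2 <- fill_gaps(x) |>
  tidyr::fill(conductivity, temperature, salinity, sigma_t, .direction = "down")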