R/importNOAA.R
In worldmet: Import Surface Meteorological Data from NOAA Integrated Surface Database (ISD)

Documented in importNOAA

#' Import Meteorological data from the NOAA Integrated Surface Database (ISD)
#'
#' This is the main function to import data from the NOAA Integrated Surface
#' Database (ISD). The ISD contains detailed surface meteorological data from
#' around the world for over 30,000 locations. For general information of the
#' ISD see
#' [https://www.ncei.noaa.gov/products/land-based-station/integrated-surface-database](https://www.ncei.noaa.gov/products/land-based-station/integrated-surface-database)
#' and the map here
#' [https://gis.ncdc.noaa.gov/maps/ncei](https://gis.ncdc.noaa.gov/maps/ncei).
#'
#' Note the following units for the main variables:
#'
#' \describe{
#'
#' \item{date}{Date/time in POSIXct format. **Note the time zone is GMT (UTC)
#' and may need to be adjusted to merge with other local data. See details
#' below.**}
#'
#' \item{latitude}{Latitude in decimal degrees (-90 to 90).}
#'
#' \item{longitude}{Longitude in decimal degrees (-180 to 180). Negative numbers
#' are west of the Greenwich Meridian.}
#'
#' \item{elevation}{Elevation of site in metres.}
#'
#' \item{wd}{Wind direction in degrees. 90 is from the east.}
#'
#' \item{ws}{Wind speed in m/s.}
#'
#' \item{ceil_hgt}{The height above ground level (AGL) of the lowest cloud or
#' obscuring phenomena layer aloft with 5/8 or more summation total sky cover,
#' which may be predominantly opaque, or the vertical visibility into a
#' surface-based obstruction.}
#'
#' \item{visibility}{The visibility in metres.}
#'
#' \item{air_temp}{Air temperature in degrees Celcius.}
#'
#' \item{dew_point}{The dew point temperature in degrees Celcius.}
#'
#' \item{atmos_pres}{The sea level pressure in millibars.}
#'
#' \item{RH}{The relative humidity (%).}
#'
#' \item{cl_1,  ...,  cl_3}{Cloud cover for different layers in Oktas (1-8).}
#'
#' \item{cl}{Maximum of cl_1 to cl_3 cloud cover in Oktas (1-8).}
#'
#' \item{cl_1_height, ..., cl_3_height}{Height of the cloud base for each later
#' in metres.}
#'
#' \item{precip_12}{12-hour precipitation in mm. The sum of this column should
#' give the annual precipitation.}
#'
#' \item{precip_6}{6-hour precipitation in mm.}
#'
#' \item{precip}{This value of precipitation spreads the 12-hour total across
#' the previous 12 hours.}
#'
#'
#' \item{pwc}{The description of the present weather description (if
#' available).}
#'
#' }
#'
#' The data are returned in GMT (UTC). It may be necessary to adjust the time
#' zone when combining with other data. For example, if air quality data were
#' available for Beijing with time zone set to "Etc/GMT-8" (note the negative
#' offset even though Beijing is ahead of GMT. See the `openair` package and
#' manual for more details), then the time zone of the met data can be changed
#' to be the same. One way of doing this would be `attr(met$date, "tzone") <-
#' "Etc/GMT-8"` for a meteorological data frame called `met`. The two data sets
#' could then be merged based on `date`.
#'
#' @section Parallel Processing:
#'
#'   If you are importing a lot of meteorological data, this can take a long
#'   while. This is because each combination of year and station requires
#'   downloading a separate data file from NOAA's online data directory, and the
#'   time each download takes can quickly add up. [importNOAA()] and
#'   [importNOAAlite()] can use parallel processing to speed downloading up,
#'   powered by the capable `{mirai}` package. If users have any `{mirai}`
#'   "daemons" set, these functions will download files in parallel. The
#'   greatest benefits will be seen if you spawn as many daemons as you have
#'   cores on your machine, although one fewer than the available cores is often
#'   a good rule of thumb. Your mileage may vary, however, and naturally
#'   spawning more daemons than station-year combinations will lead to
#'   diminishing returns.
#'
#'   ```
#'   # set workers - once per session
#'   mirai::daemons(4)
#'
#'   # import lots of data - NB: no change in importNOAA()!
#'   big_met <- importNOAA(code = "037720-99999", year = 2010:2020)
#'   ```
#'
#' @param code The identifying code as a character string. The code is a
#'   combination of the USAF and the WBAN unique identifiers. The codes are
#'   separated by a \dQuote{-} e.g. `code = "037720-99999"`.
#' @param year The year to import. This can be a vector of years e.g. `year =
#'   2000:2005`.
#' @param hourly Should hourly means be calculated? The default is `TRUE`. If
#'   `FALSE` then the raw data are returned.
#' @param source The NOAA ISD service stores files in two formats; as delimited
#'   CSV files (`"delim"`) and as fixed width files (`"fwf"`). [importNOAA()]
#'   defaults to `"delim"` but, if the delimited data store is down, users may
#'   wish to try `"fwf"` instead. Both data sources should be identical to one
#'   another.
#' @param quiet If `FALSE`, print missing sites / years to the screen, and show
#'   a progress bar if multiple sites are imported.
#' @param path If a file path is provided, the data are saved as an rds file at
#'   the chosen location e.g.  `path = "C:/Users/David"`. Files are saved by
#'   year and site.
#' @param n.cores No longer recommended; please set [mirai::daemons()] in your R
#'   session. This argument is provided for back compatibility, and is passed to
#'   the `n` argument of [mirai::daemons()] on behalf of the user. Any set
#'   daemons will be reset once the function completes. Default is `NULL`, which
#'   means no parallelism. `n.cores = 1L` is equivalent to `n.cores = NULL`.
#' @export
#' @return Returns a data frame of surface observations. The data frame is
#'   consistent for use with the `openair` package. Note that the data are
#'   returned in GMT (UTC) time zone format. Users may wish to express the data
#'   in other time zones, e.g., to merge with air pollution data.
#'
#' @family NOAA ISD functions
#' @author David Carslaw
#' @examples
#'
#' \dontrun{
#' # import some data
#' beijing_met <- importNOAA(code = "545110-99999", year = 2010:2011)
#'
#' # importing lots of data? use mirai for parallel processing
#' mirai::daemons(4)
#' beijing_met2 <- importNOAA(code = "545110-99999", year = 2010:2025)
#' }
importNOAA <- function(
  code = "037720-99999",
  year = 2014,
  hourly = TRUE,
  source = c("delim", "fwf"),
  quiet = FALSE,
  path = NA,
  n.cores = NULL
) {
  source <- rlang::arg_match(source)

  if (!is.null(n.cores)) {
    if (!rlang::is_integerish(n.cores)) {
      cli::cli_abort(
        "{.field n.cores} should be an integer. You have provided {.type {n.cores}}."
      )
    }
    if (n.cores > 1L) {
      rlang::check_installed(c("mirai", "carrier"))
      if (mirai::daemons_set()) {
        cli::cli_warn(
          "{.fun mirai::daemons} have already been set. Ignoring {.field n.cores}."
        )
      } else {
        on.exit(mirai::daemons(0))
        cli::cli_warn(
          c(
            "!" = "Using {.fun mirai::daemons} to set multiple workers. These will be reset at the end of the function call.",
            "i" = "It is now preferred that users use {.fun mirai::daemons} directly over the use of {.field n.cores}.",
            "i" = "See {.url https://openair-project.github.io/worldmet/articles/worldmet.html} for an example."
          ),
          .frequency = "regularly",
          .frequency_id = "worldmet_mirai"
        )
        mirai::daemons(n.cores)
      }
    }
  }

  # main web site https://www.ncei.noaa.gov/products/land-based-station/integrated-surface-database
  # formats document https://www.ncei.noaa.gov/data/global-hourly/doc/isd-format-document.pdf
  # brief csv file description https://www.ncei.noaa.gov/data/global-hourly/doc/CSV_HELP.pdf
  # gis map https://gis.ncdc.noaa.gov/map/viewer/#app=cdo&cfg=cdo&theme=hourly&layers=1
  # go through each of the years selected, use parallel processing

  # sites and years to process
  site_process <- expand.grid(
    code = code,
    year = year,
    stringsAsFactors = FALSE
  )

  # read in files - in_parallel defaults to sequential if no cores, will
  # use parallelism if daemons are set
  if (source == "delim") {
    dat <-
      purrr::pmap(
        site_process,
        purrr::in_parallel(
          \(code, year) getDatDelim(code = code, year = year, hourly = hourly),
          getDatDelim = getDatDelim,
          hourly = hourly
        ),
        .progress = !quiet
      ) |>
      purrr::list_rbind()
  }

  if (source == "fwf") {
    dat <-
      purrr::pmap(
        site_process,
        purrr::in_parallel(
          \(code, year) {
            getDatFwf(
              code = code,
              year = year,
              hourly = hourly,
              precip = TRUE,
              PWC = TRUE
            )
          },
          getDatFwf = getDatFwf,
          hourly = hourly
        ),
        .progress = !quiet
      ) |>
      purrr::list_rbind()
  }

  if (is.null(dat) || nrow(dat) == 0) {
    # get source not used
    trysource <- c("fwf", "delim")
    trysource <- trysource[trysource != source]

    # message
    cli::cli_inform(
      c(
        "x" = "Specified {.field site}-{.field year} combinations do not exist.",
        "i" = "Is the ISD service down? Check {.url https://www.ncei.noaa.gov/data/global-hourly/}.",
        "i" = 'Try {.code importNOAA(..., source = "{trysource}")} to access an alternative data store.'
      )
    )
    return()
  }

  # check to see what is missing and print to screen
  actual <- dplyr::select(dat, "code", "date", "station") |>
    dplyr::mutate(year = as.numeric(format(.data$date, "%Y"))) |>
    dplyr::slice(1, .by = c("code", "year"))

  actual <- dplyr::left_join(site_process, actual, by = c("code", "year"))

  if (length(which(is.na(actual$date))) > 0 && !quiet) {
    cli::cli_h1("The following sites / years are missing:")
    cli::cli_ul(id = "worldmetmissing")
    missing <- dplyr::filter(actual, is.na(.data$date))
    for (i in seq_len(nrow(missing))) {
      df_i <- missing[i, ]
      cli::cli_li(
        items = "{.field site:} {df_i$code} in {.field year:} {df_i$year}"
      )
    }
    cli::cli_end(id = "worldmetmissing")
  }

  if (!is.na(path)) {
    if (!dir.exists(path)) {
      cli::cli_warn("Directory does not exist; file not saved.", call. = FALSE)
      return()
    }

    # save as year / site files
    writeMet <- function(dat) {
      saveRDS(
        dat,
        paste0(path, "/", unique(dat$code), "_", unique(dat$year), ".rds")
      )
      return(dat)
    }

    dat |>
      dplyr::mutate(year = format(.data$date, "%Y")) |>
      (\(x) split(x, x[c("code", "year")]))() |>
      purrr::map(writeMet)
  }

  return(dat)
}

#' @noRd
getDatDelim <- function(code, year, hourly) {
  ## location of data
  file.name <- paste0(
    "https://www.ncei.noaa.gov/data/global-hourly/access/",
    year,
    "/",
    gsub(pattern = "-", "", code),
    ".csv"
  )

  # suppress warnings because some fields might be missing in the list
  # Note that not all available data is returned - just what I think is most useful
  met_data <- try(
    suppressWarnings(readr::read_csv(
      file.name,
      col_types = readr::cols_only(
        STATION = readr::col_character(),
        DATE = readr::col_datetime(format = ""),
        SOURCE = readr::col_double(),
        LATITUDE = readr::col_double(),
        LONGITUDE = readr::col_double(),
        ELEVATION = readr::col_double(),
        NAME = readr::col_character(),
        REPORT_TYPE = readr::col_character(),
        CALL_SIGN = readr::col_double(),
        QUALITY_CONTROL = readr::col_character(),
        WND = readr::col_character(),
        CIG = readr::col_character(),
        VIS = readr::col_character(),
        TMP = readr::col_character(),
        DEW = readr::col_character(),
        SLP = readr::col_character(),
        AA1 = readr::col_character(),
        AW1 = readr::col_character(),
        GA1 = readr::col_character(),
        GA2 = readr::col_character(),
        GA3 = readr::col_character()
      ),
      progress = FALSE
    )),
    silent = TRUE
  )

  if (class(met_data)[1] == "try-error") {
    met_data <- NULL
    return()
  }

  met_data <- dplyr::rename(
    met_data,
    code = "STATION",
    station = "NAME",
    date = "DATE",
    latitude = "LATITUDE",
    longitude = "LONGITUDE",
    elev = "ELEVATION"
  )

  met_data$code <- code

  # separate WND column

  if ("WND" %in% names(met_data)) {
    met_data <- tidyr::separate(
      data = met_data,
      col = "WND",
      into = c("wd", "x", "y", "ws", "z")
    )

    met_data <- dplyr::mutate(
      met_data,
      wd = as.numeric(.data$wd),
      wd = ifelse(.data$wd == 999, NA, .data$wd),
      ws = as.numeric(.data$ws),
      ws = ifelse(.data$ws == 9999, NA, .data$ws),
      ws = .data$ws / 10
    )
  }

  # separate TMP column
  if ("TMP" %in% names(met_data)) {
    met_data <- tidyr::separate(
      data = met_data,
      col = "TMP",
      into = c("air_temp", "flag_temp"),
      sep = ","
    )

    met_data <- dplyr::mutate(
      met_data,
      air_temp = as.numeric(.data$air_temp),
      air_temp = ifelse(.data$air_temp == 9999, NA, .data$air_temp),
      air_temp = .data$air_temp / 10
    )
  }

  # separate VIS column
  if ("VIS" %in% names(met_data)) {
    met_data <- tidyr::separate(
      data = met_data,
      col = "VIS",
      into = c("visibility", "flag_vis1", "flag_vis2", "flag_vis3"),
      sep = ",",
      fill = "right"
    )

    met_data <- dplyr::mutate(
      met_data,
      visibility = as.numeric(.data$visibility),
      visibility = ifelse(
        .data$visibility %in% c(9999, 999999),
        NA,
        .data$visibility
      )
    )
  }

  # separate DEW column
  if ("DEW" %in% names(met_data)) {
    met_data <- tidyr::separate(
      data = met_data,
      col = "DEW",
      into = c("dew_point", "flag_dew"),
      sep = ","
    )

    met_data <- dplyr::mutate(
      met_data,
      dew_point = as.numeric(.data$dew_point),
      dew_point = ifelse(.data$dew_point == 9999, NA, .data$dew_point),
      dew_point = .data$dew_point / 10
    )
  }
  # separate SLP column
  if ("SLP" %in% names(met_data)) {
    met_data <- tidyr::separate(
      data = met_data,
      col = "SLP",
      into = c("atmos_pres", "flag_pres"),
      sep = ",",
      fill = "right"
    )

    met_data <- dplyr::mutate(
      met_data,
      atmos_pres = as.numeric(.data$atmos_pres),
      atmos_pres = ifelse(
        .data$atmos_pres %in% c(99999, 999999),
        NA,
        .data$atmos_pres
      ),
      atmos_pres = .data$atmos_pres / 10
    )
  }

  # separate CIG (sky condition) column
  if ("CIG" %in% names(met_data)) {
    met_data <- tidyr::separate(
      data = met_data,
      col = "CIG",
      into = c("ceil_hgt", "flag_sky1", "flag_sky2", "flag_sky3"),
      sep = ",",
      fill = "right"
    )

    met_data <- dplyr::mutate(
      met_data,
      ceil_hgt = as.numeric(.data$ceil_hgt),
      ceil_hgt = ifelse(.data$ceil_hgt == 99999, NA, .data$ceil_hgt)
    )
  }

  ## relative humidity - general formula based on T and dew point
  met_data$RH <- 100 *
    ((112 - 0.1 * met_data$air_temp + met_data$dew_point) /
      (112 + 0.9 * met_data$air_temp))^8

  if ("GA1" %in% names(met_data)) {
    # separate GA1 (cloud layer 1 height, amount) column
    met_data <- tidyr::separate(
      data = met_data,
      col = "GA1",
      into = c(
        "cl_1",
        "code_1",
        "cl_1_height",
        "code_2",
        "cl_1_type",
        "code_3"
      ),
      sep = ","
    )

    met_data <- dplyr::mutate(
      met_data,
      cl_1 = as.numeric(.data$cl_1),
      cl_1 = ifelse(
        (is.na(.data$cl_1) & .data$ceil_hgt == 22000),
        0,
        .data$cl_1
      ),
      cl_1 = ifelse(.data$cl_1 == 99, NA, .data$cl_1),
      cl_1_height = as.numeric(.data$cl_1_height),
      cl_1_height = ifelse(.data$cl_1_height == 99999, NA, .data$cl_1_height)
    )
  }

  if ("GA2" %in% names(met_data)) {
    met_data <- tidyr::separate(
      data = met_data,
      col = "GA2",
      into = c(
        "cl_2",
        "code_1",
        "cl_2_height",
        "code_2",
        "cl_2_type",
        "code_3"
      ),
      sep = ","
    )

    met_data <- dplyr::mutate(
      met_data,
      cl_2 = as.numeric(.data$cl_2),
      cl_2 = ifelse(.data$cl_2 == 99, NA, .data$cl_2),
      cl_2_height = as.numeric(.data$cl_2_height),
      cl_2_height = ifelse(.data$cl_2_height == 99999, NA, .data$cl_2_height)
    )
  }

  if ("GA3" %in% names(met_data)) {
    met_data <- tidyr::separate(
      data = met_data,
      col = "GA3",
      into = c(
        "cl_3",
        "code_1",
        "cl_3_height",
        "code_2",
        "cl_3_type",
        "code_3"
      ),
      sep = ","
    )

    met_data <- dplyr::mutate(
      met_data,
      cl_3 = as.numeric(.data$cl_3),
      cl_3 = ifelse(.data$cl_3 == 99, NA, .data$cl_3),
      cl_3_height = as.numeric(.data$cl_3_height),
      cl_3_height = ifelse(.data$cl_3_height == 99999, NA, .data$cl_3_height)
    )
  }

  ## for cloud cover, make new 'cl' max of 3 cloud layers
  if ("cl_3" %in% names(met_data)) {
    met_data$cl <- pmax(
      met_data$cl_1,
      met_data$cl_2,
      met_data$cl_3,
      na.rm = TRUE
    )
  }

  # PRECIP AA1
  if ("AA1" %in% names(met_data)) {
    met_data <- tidyr::separate(
      data = met_data,
      col = "AA1",
      into = c("precip_code", "precip_raw", "code_1", "code_2"),
      sep = ","
    )

    met_data <- dplyr::mutate(
      met_data,
      precip_raw = as.numeric(.data$precip_raw),
      precip_raw = ifelse(.data$precip_raw == 9999, NA, .data$precip_raw),
      precip_raw = .data$precip_raw / 10
    )

    # deal with 6 and 12 hour precip
    id <- which(met_data$precip_code == "06")

    if (length(id) > 0) {
      met_data$precip_6 <- NA
      met_data$precip_6[id] <- met_data$precip_raw[id]
    }

    id <- which(met_data$precip_code == "12")

    if (length(id) > 0) {
      met_data$precip_12 <- NA
      met_data$precip_12[id] <- met_data$precip_raw[id]
    }
  }

  # weather codes, AW1

  if ("AW1" %in% names(met_data)) {
    met_data <- tidyr::separate(
      data = met_data,
      col = "AW1",
      into = c("pwc", "code_1"),
      sep = ",",
      fill = "right"
    )

    met_data <- dplyr::left_join(met_data, worldmet::weatherCodes, by = "pwc")
    met_data <- dplyr::select(met_data, -"pwc") |>
      dplyr::rename(pwc = "description")
  }

  ## select the variables we want
  met_data <- dplyr::select(
    met_data,
    dplyr::any_of(c(
      "date",
      "code",
      "station",
      "latitude",
      "longitude",
      "elev",
      "ws",
      "wd",
      "air_temp",
      "atmos_pres",
      "visibility",
      "dew_point",
      "RH",
      "ceil_hgt",
      "cl_1",
      "cl_2",
      "cl_3",
      "cl",
      "cl_1_height",
      "cl_2_height",
      "cl_3_height",
      "pwc",
      "precip_12",
      "precip_6",
      "precip"
    ))
  )

  ## present weather is character and cannot be averaged, take first
  if ("pwc" %in% names(met_data) && hourly) {
    pwc <- met_data[c("date", "pwc")]
    pwc$date2 <- format(pwc$date, "%Y-%m-%d %H") ## nearest hour
    tmp <- pwc[which(!duplicated(pwc$date2)), ]
    dates <- as.POSIXct(paste0(unique(pwc$date2), ":00:00"), tz = "GMT")

    pwc <- data.frame(date = dates, pwc = tmp$pwc)
  }

  ## average to hourly
  if (hourly) {
    met_data <-
      worldmet_time_average(
        met_data,
        avg.time = "hour",
        type = c("code", "station")
      )
  }

  ## add pwc back in
  if (exists("pwc")) {
    met_data <- dplyr::left_join(met_data, pwc, by = "date")
  }

  ## add precipitation - based on 12 HOUR averages, so work with hourly data

  ## spread out precipitation across each hour

  ## only do this if precipitation exists
  if ("precip_12" %in% names(met_data) && hourly) {
    ## make new precip variable
    met_data$precip <- NA

    ## id where there is 12 hour data
    id <- which(!is.na(met_data$precip_12))

    if (length(id) == 0L) {
      return()
    }

    id <- id[id > 11] ## make sure we don't run off beginning

    for (i in seq_along(id)) {
      met_data$precip[(id[i] - 11):id[i]] <- met_data$precip_12[id[i]] / 12
    }
  }

  # replace NaN with NA
  met_data[] <- lapply(met_data, function(x) {
    replace(x, is.nan(x), NA)
  })

  return(dplyr::as_tibble(met_data))
}

#' 'Fixed width file' helpers.
#' @noRd
getDatFwf <- function(code, year, hourly, precip, PWC) {
  procAdditFwf <- function(add, dat, precip, PWC) {
    # function to process additional data such as cloud cover

    # consider first 3 layers of cloud GA1, GA2, GA3
    dat <- extractCloudFwf(add, dat, "GA1", "cl_1")
    dat <- extractCloudFwf(add, dat, "GA2", "cl_2")
    dat <- extractCloudFwf(add, dat, "GA3", "cl_3")

    # 6 and 12 hour precipitation
    if (precip) {
      dat <- extractPrecipFwf(add, dat, "AA112", "precip_12")
      dat <- extractPrecipFwf(add, dat, "AA106", "precip_6")
    }

    if (PWC) {
      dat <- extractCurrentWeatherFwf(add, dat, "AW1")
    }

    return(dat)
  }

  extractPrecipFwf <- function(add, dat, field = "AA112", out = "precip_12") {
    # fields that contain search string
    id <- grep(field, add)

    # variables for precip amount
    dat[[out]] <- NA

    if (length(id) > 1) {
      # location of begining of AA1 etc

      loc <- sapply(id, function(x) regexpr(field, add[x]))

      # extract amount of rain

      amnt <- sapply(seq_along(id), function(x) {
        substr(add[id[x]], start = loc[x] + 5, stop = loc[x] + 8)
      })

      miss <- which(amnt == "9999") # missing
      if (length(miss) > 0) {
        amnt[miss] <- NA
      }

      amnt <- as.numeric(amnt) / 10

      dat[[out]][id] <- amnt
    }

    return(dat)
  }

  extractCloudFwf <- function(add, dat, field = "GA1", out = "cl_1") {
    # 3 fields are used: GA1, GA2 and GA3

    height <- paste0(out, "_height") # cloud height field

    # fields that contain search string
    id <- grep(field, add)

    # variables for cloud amount (oktas) and cloud height
    dat[[out]] <- NA
    dat[[height]] <- NA

    if (length(id) > 1) {
      # location of begining of GA1 etc

      loc <- sapply(id, function(x) regexpr(field, add[x]))

      # extract the variable
      cl <- sapply(seq_along(id), function(x) {
        substr(add[id[x]], start = loc[x] + 3, stop = loc[x] + 4)
      })
      cl <- as.numeric(cl)

      miss <- which(cl == 99) # missing

      if (length(miss) > 0) {
        cl[miss] <- NA
      }

      # and height of cloud
      h <- sapply(seq_along(id), function(x) {
        substr(add[id[x]], start = loc[x] + 6, stop = loc[x] + 11)
      })
      h <- as.numeric(h)

      miss <- which(h == 99999)
      if (length(miss) > 0) {
        h[miss] <- NA
      }

      dat[[out]][id] <- cl
      dat[[height]][id] <- h
    }

    return(dat)
  }

  extractCurrentWeatherFwf <- function(add, dat, field = "AW1") {
    # fields that contain search string
    id <- grep(field, add)

    if (length(id) > 1) {
      # name of output variable
      dat[["pwc"]] <- NA

      # location of begining of AW1
      loc <- sapply(id, function(x) regexpr(field, add[x]))

      # extract the variable
      pwc <- sapply(seq_along(id), function(x) {
        substr(add[id[x]], start = loc[x] + 3, stop = loc[x] + 4)
      })
      pwc <- as.character(pwc)

      # look up code in weatherCodes.RData

      desc <- sapply(pwc, function(x) {
        weatherCodes$description[which(weatherCodes$pwc == x)]
      })

      dat[["pwc"]][id] <- desc
    } else {
      return(dat)
    }

    return(dat)
  }

  weatherCodes <- worldmet::weatherCodes

  # location of data
  file.name <- paste0(
    "https://www1.ncdc.noaa.gov/pub/data/noaa/",
    year,
    "/",
    code,
    "-",
    year,
    ".gz"
  )

  # Download file to temp directory
  tmp <- paste0(tempdir(), basename(file.name))

  # deal with any missing data, issue warning
  bin <- try(utils::download.file(file.name, tmp, quiet = TRUE, mode = "wb"))

  if (inherits(bin, "try-error")) {
    return(NULL)
  }

  column_widths <- c(
    4,
    6,
    5,
    4,
    2,
    2,
    2,
    2,
    1,
    6,
    7,
    5,
    5,
    5,
    4,
    3,
    1,
    1,
    4,
    1,
    5,
    1,
    1,
    1,
    6,
    1,
    1,
    1,
    5,
    1,
    5,
    1,
    5,
    1
  )

  # mandatory fields, fast read
  dat <- readr::read_fwf(
    tmp,
    readr::fwf_widths(column_widths),
    col_types = "ccciiiiicccccccicciccccccccccccccc"
  )

  # additional fields, variable length, need to read eveything
  add <- readLines(tmp)

  # Remove select columns from data frame
  dat <- dat[, c(2:8, 10:11, 13, 16, 19, 21, 25, 29, 31, 33)]

  # Apply new names to the data frame columns
  names(dat) <- c(
    "usaf",
    "wban",
    "year",
    "month",
    "day",
    "hour",
    "minute",
    "latitude",
    "longitude",
    "elev",
    "wd",
    "ws",
    "ceil_hgt",
    "visibility",
    "air_temp",
    "dew_point",
    "atmos_pres"
  )

  # Correct the latitude values
  dat$latitude <- as.numeric(dat$latitude) / 1000

  # Correct the longitude values
  dat$longitude <- as.numeric(dat$longitude) / 1000

  # Correct the elevation values
  dat$elev <- as.numeric(dat$elev)

  # Correct the wind direction values
  dat$wd <-
    ifelse(dat$wd == 999, NA, dat$wd)

  # Correct the wind speed values
  dat$ws <-
    ifelse(dat$ws == 9999, NA, dat$ws / 10)

  # Correct the temperature values
  dat$air_temp <- as.numeric(dat$air_temp)
  dat$air_temp <- ifelse(dat$air_temp == 9999, NA, dat$air_temp / 10)

  # Correct the visibility values
  dat$visibility <- as.numeric(dat$visibility)
  dat$visibility <- ifelse(
    dat$visibility %in% c(9999, 999999),
    NA,
    dat$visibility
  )

  # Correct the dew point values
  dat$dew_point <- as.numeric(dat$dew_point)
  dat$dew_point <- ifelse(dat$dew_point == 9999, NA, dat$dew_point / 10)

  # Correct the atmospheric pressure values
  dat$atmos_pres <- as.numeric(dat$atmos_pres)
  dat$atmos_pres <- ifelse(dat$atmos_pres == 99999, NA, dat$atmos_pres / 10)

  # Correct the ceiling height values
  dat$ceil_hgt <- as.numeric(dat$ceil_hgt)
  dat$ceil_hgt <- ifelse(dat$ceil_hgt == 99999, NA, dat$ceil_hgt)

  # relative humidity - general formula based on T and dew point
  dat$RH <- 100 *
    ((112 - 0.1 * dat$air_temp + dat$dew_point) /
      (112 + 0.9 * dat$air_temp))^8

  dat$date <- ISOdatetime(
    dat$year,
    dat$month,
    dat$day,
    dat$hour,
    dat$minute,
    0,
    tz = "GMT"
  )

  # drop date components that are not needed
  dat <- dplyr::select(
    dat,
    -dplyr::all_of(c("year", "month", "day", "hour", "minute"))
  )

  # process the additional data separately
  dat <- procAdditFwf(add, dat, precip, PWC)

  # for cloud cover, make new 'cl' max of 3 cloud layers
  dat$cl <- pmax(dat$cl_1, dat$cl_2, dat$cl_3, na.rm = TRUE)

  # additional condition, when ceil_height = 22000 and cl_1 is NA, assume no cloud
  dat <- dat |>
    dplyr::mutate(
      cl = ifelse(
        (is.na(.data$cl_1) & .data$ceil_hgt == 22000),
        0,
        .data$cl
      ),
      cl_1 = ifelse(
        (is.na(.data$cl_1) & .data$ceil_hgt == 22000),
        0,
        .data$cl_1
      )
    )

  # select the variables we want
  dat <- dat[
    names(dat) %in%
      c(
        "date",
        "usaf",
        "wban",
        "station",
        "ws",
        "wd",
        "air_temp",
        "atmos_pres",
        "visibility",
        "dew_point",
        "RH",
        "ceil_hgt",
        "latitude",
        "longitude",
        "elev",
        "cl_1",
        "cl_2",
        "cl_3",
        "cl",
        "cl_1_height",
        "cl_2_height",
        "cl_3_height",
        "pwc",
        "precip_12",
        "precip_6"
      )
  ]

  # present weather is character and cannot be averaged, take first
  if ("pwc" %in% names(dat) && hourly) {
    pwc <- dat[c("date", "pwc")]
    pwc$date2 <- format(pwc$date, "%Y-%m-%d %H") # nearest hour
    tmp <- pwc[which(!duplicated(pwc$date2)), ]
    dates <- as.POSIXct(paste0(unique(pwc$date2), ":00:00"), tz = "GMT")

    pwc <- data.frame(date = dates, pwc = tmp$pwc)
    PWC <- TRUE
  }

  # average to hourly
  if (hourly) {
    dat <- worldmet_time_average(
      dat,
      avg.time = "hour",
      type = c("usaf", "wban")
    )
  }

  # add precipitation
  if (precip) {
    # spread out precipitation across each hour
    # met data gives 12 hour total and every other 6 hour total

    # only do this if precipitation exists
    if (all(c("precip_6", "precip_12") %in% names(dat))) {
      # make new precip variable
      dat$precip <- NA

      # id where there is 6 hour data
      id <- which(!is.na(dat$precip_6))
      id <- id[id < (nrow(dat) - 6)] # make sure we don't run off end

      # calculate new 6 hour based on 12 hr total - 6 hr total
      dat$precip_6[id + 6] <- dat$precip_12[id + 6] - dat$precip_6[id]

      # ids for new 6 hr totals
      id <- which(!is.na(dat$precip_6))
      id <- id[id > 6]

      # Divide 6 hour total over each of 6 hours
      for (i in seq_along(id)) {
        dat$precip[(id[i] - 5):id[i]] <- dat$precip_6[id[i]] / 6
      }
    }
  }

  # return other meta data
  meta <- getMeta(returnMap = FALSE, plot = FALSE, end.year = "all")
  info <- meta[meta$code == code, ]

  dat$station <- as.character(info$station)
  dat$latitude <- info$latitude
  dat$longitude <- info$longitude
  dat$elev <- info$`elev(m)`
  dat$code <- code

  # rearrange columns, one to move to front
  move <- c("date", "code", "station")
  dat <- dat[c(move, setdiff(names(dat), move))]

  # replace NaN with NA
  dat <- dplyr::mutate(
    dat,
    dplyr::across(dplyr::everything(), \(x) dplyr::if_else(is.nan(x), NA, x))
  )

  # reorder to match delimited data
  dat <-
    dplyr::relocate(
      dat,
      dplyr::any_of(
        c(
          "code",
          "station",
          "date",
          "latitude",
          "longitude",
          "elev",
          "ws",
          "wd",
          "air_temp",
          "atmos_pres",
          "visibility",
          "dew_point",
          "RH",
          "ceil_hgt",
          "cl_1",
          "cl_2",
          "cl_3",
          "cl",
          "cl_1_height",
          "cl_2_height",
          "cl_3_height",
          "precip_12",
          "precip_6",
          "pwc",
          "precip"
        )
      )
    )

  return(dplyr::tibble(dat))
}
Any scripts or data that you put into this service are public.
worldmet documentation built on Nov. 7, 2025, 5:07 p.m.
rdrr.io home R language documentation Run R code online
CRAN packages Bioconductor packages R-Forge packages GitHub packages
Note that we can't provide technical support on individual packages. You should contact the package authors for that.
worldmet
Import Surface Meteorological Data from NOAA Integrated Surface Database (ISD)

R/importNOAA.R
In worldmet: Import Surface Meteorological Data from NOAA Integrated Surface Database (ISD)

Defines functions getDatFwf getDatDelim importNOAA

Documented in importNOAA

Try the worldmet package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

worldmet Import Surface Meteorological Data from NOAA Integrated Surface Database (ISD)

R/importNOAA.R In worldmet: Import Surface Meteorological Data from NOAA Integrated Surface Database (ISD)

Defines functions getDatFwf getDatDelim importNOAA

Documented in importNOAA

Try the worldmet package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

worldmet
Import Surface Meteorological Data from NOAA Integrated Surface Database (ISD)

R/importNOAA.R
In worldmet: Import Surface Meteorological Data from NOAA Integrated Surface Database (ISD)