R/eq_clean_data.R

#' Import and clean NOAA data for use in analytics and visualization
#'
#' This function imports and cleans NOAA data for further analysis. While it
#' can be called directly by the user, it is more likely to be used inside
#' other functions that produce a more valuable output than the cleaned data
#' itself. As is common, this function uses the \code{tidyverse} packages.
#' Normally the author prefers the \code{data.table} package for speed;
#' however, the NOAA set is very small and users are more likely to be
#' familiar with \code{tidyverse} syntax. See
#' \url{https://www.ngdc.noaa.gov/nndc/struts/results?&t=101650&s=225&d=225}
#' for data definitions. NOAA data goes back thousands of years. For this
#' reason the raw data contains negative and positive years, the sign
#' indicating B.C. versus A.D. respectively.
#'
#' @param dataframe A \code{data.frame} of dirty NOAA data that has already been
#' read in. This argument is ignored if the \code{file} argument is provided.
#' Passing \code{file} is the more compact method of cleaning data, as it does
#' the import directly.
#'
#' @param file A character string giving the file that contains the 'dirty'
#'   data. This requires the full path if the file is not in the working
#'   directory. If a vector is given multiple data sets will be produced.
#'
#' @param dayfill A number indicating the day of the month to use when the day
#' is missing. Missing days are common for old earthquakes, for which precise
#' dates are not available. The default is the first of the month.
#'
#' @param monthfill A number indicating the month to use when the month is
#' missing. Missing months are common for old earthquakes, for which precise
#' dates are not available. The default is July, the middle of the year.
#'
#' @param delim A character string dictating how the dirty data file is
#'   delimited. This value is passed directly to the \code{delim} argument
#'   of \code{readr::read_delim}. NOAA data is currently stored as tab
#'   delimited, hence the default of \code{"\t"}. This parameter is mainly
#'   future proofing, in case NOAA changes its format or the user has an
#'   internal data collection step that copies the data into a differently
#'   delimited format. Non-delimited file types are not supported.
#'
#' @param ... Additional arguments, intended primarily for the support
#' functions imported from other packages.
#'
#' @importFrom readr read_delim
#' @importFrom lubridate make_date
#'
#' @export
#'
#' @examples
#' \dontrun{
#'
#' #for a file from NOAA, point the function at the file and a cleaned data.frame is returned
#' eq_clean_data(file="datafromnoaa.txt")
#'
#' #if you have already imported the data and just want it cleaned to the standard of this package
#' data<-read.delim("file.csv")
#'
#' eq_clean_data(dataframe=data)
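#'
#' #a sketch with arbitrarily chosen fill values: missing months become January
#' #and missing days the 15th, instead of the defaults of July and the 1st
#' eq_clean_data(dataframe=data,dayfill=15,monthfill=1)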
#'
#' }
#'


eq_clean_data<-function(dataframe,file=NULL,
                            dayfill=1,monthfill=7,
                            delim="\t",...){

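  #Read from file when one is given; otherwise use the data.frame passed in.
  #Extra arguments in ... are forwarded to readr::read_delim.
  if(!is.null(file)){
    tmpdata<-readr::read_delim(file=file,delim=delim,...)
  } else {
    tmpdata<-dataframe
  }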

  #Fill in missing data for day and month with the parameters passed
  tmpdata[is.na(tmpdata$DAY),c("DAY")]<-dayfill
  tmpdata[is.na(tmpdata$MONTH),c("MONTH")]<-monthfill

  #Create a full date column from components
  tmpdata$DATE<-lubridate::make_date(year=tmpdata$YEAR,
                                     month=tmpdata$MONTH,
                                     day=tmpdata$DAY)
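  #Move DATE so that it sits immediately before YEAR
  #Built from the remaining column positions so it also works when YEAR is the first column
  yearcol<-match("YEAR",names(tmpdata))
  datecol<-match("DATE",names(tmpdata))
  othercols<-setdiff(seq_along(tmpdata),datecol)
  tmpdata<-tmpdata[,append(othercols,datecol,after=match(yearcol,othercols)-1)]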

  #Strip the country name out of the Location name and camel case it
  #Depends on the assumption that ":  " separates Country and Location
  #The assignment required this to be a second function; see eq_location_clean for its documentation.
  tmpdata$LOCATION_NAME<-eq_location_clean(tmpdata)

  #Convert the columns we need to numeric
  tmpdata$LATITUDE<-as.numeric(tmpdata$LATITUDE)
  tmpdata$LONGITUDE<-as.numeric(tmpdata$LONGITUDE)
  tmpdata$DEATHS<-as.numeric(tmpdata$DEATHS)
  tmpdata$EQ_PRIMARY<-as.numeric(tmpdata$EQ_PRIMARY)
  tmpdata$TOTAL_DEATHS<-as.numeric(tmpdata$TOTAL_DEATHS)

  return(tmpdata)
}
JJNewkirk/NOAAEQ documentation built on May 27, 2019, 1:12 p.m.