In DanielAyllon/EarthQuakeAnalyzer: Clean, analyze and visualize earthquake data from the NOAA Significant Earthquakes dataset

knitr::opts_chunk$set(echo = TRUE,message=F, warning=F)

R EarthQuakeAnalyzer package

The R EarthQuakeAnalyzer package was developed to work with the NOAA Significant Earthquakes dataset.

The package incorporate tools for processing and visualizing the dataset obtained from the U.S. National Oceanographic and Atmospheric Administration (NOAA) on significant earthquakes around the world. This dataset contains information about 5,933 earthquakes over an approximately 4,000 year time span. The dataset is included in the package in a data frame format - EarthquakeDataFrame - to be loaded through the data() function. Likewise, the raw data - signif.txt - is stored in the inst/extdata - directory so that it can be accessed.

It includes tools for: (1) reading and cleaning the dataset, (2) adding a geom instance to a ggplot object to plot a timeline of selected earthquakes providing information about the magnitude and number of deaths, and (3) creating a leaflet fully interactive map showing the location of epicenters of selected earthquakes with information regarding the magnitude and number of deaths.

Required packages

The R Assignment package requires functions from the readr, dplyr, tidyr, lubridate, stringr, ggplot2, grid and leaflet R packages.

library(readr)
library(dplyr)
library(tidyr)
library(lubridate)
library(stringr)
library(ggplot2)
library(grid)
library(leaflet)

Functions for processing and cleaning the data

eq_clean_data() Function

This function reads the raw earthquake dataset in a tab-delimited format to clean some of the variables. The function first checks whether the file to be read actually exists.If it does not exist, then it downloads from the NOOA website.Then, the function adds a date column created by uniting the year, month, day of each earthquake and converting it to the Date class. It filters the earthquakes without a complete date.

It has only one argument, filename, a character vector of the name of the tab-delimited text file to be read.

Here, it is read and stored the file:

eq_clean_data<-function(filename) {
        if(!file.exists(filename)) {
                utils::download.file("https://www.ngdc.noaa.gov/nndc/struts/results?type_0=Exact&query_0=$ID&t=101650&s=13&d=189&dfn=signif.txt",filename)
        }
        EarthquakeData<-readr::read_delim(filename,delim = "\t") %>%
                dplyr::filter(!is.na(YEAR), YEAR > 0, !is.na(MONTH), !is.na(DAY)) %>%
                dplyr::mutate(YEAR = as.character(YEAR)) %>%
                dplyr::mutate(NEWYEAR=ifelse(nchar(YEAR)==1,paste0("000",YEAR),
                                      ifelse(nchar(YEAR)==2,paste0("00",YEAR),
                                             ifelse(nchar(YEAR)==3,paste0("0",YEAR),YEAR)))) %>%
                tidyr::unite(DATETIME, NEWYEAR, MONTH, DAY) %>%
                dplyr::mutate(DATETIME = lubridate::ymd(DATETIME)) %>%
                dplyr::mutate(LATITUDE = as.numeric(LATITUDE),LONGITUDE = as.numeric(LONGITUDE),EQ_PRIMARY = as.numeric(EQ_PRIMARY),TOTAL_DEATHS = as.numeric(TOTAL_DEATHS),YEAR = as.numeric(YEAR))
        EarthquakeData
}

tail(eq_clean_data("signif.txt"))

eq_location_clean() Function

The package includes a function to clean up the "LOCATION_NAME" column from either the downloaded raw dataset or the processes dataframe, as cleaned by the eq_clean_data() function.

The function returns a dataframe with the column "LOCATION_NAME" cleaned by stripping out the country name (including the colon) and convert names to title case (as opposed to all caps).The dataframe to be cleaned is given by the input parameter EarthquakeData, the only argument of the function.

Here, the dataframe processed in the previous example gets the "LOCATION_NAME" cleaned:

eq_location_clean<-function(EarthquakeData) {
        n_earthquakes<-length(EarthquakeData$LOCATION_NAME)
        clean_loc_name<-data.frame()
        for(i in 1:n_earthquakes){
                myname<-as.character(EarthquakeData$LOCATION_NAME[[i]])
                colon_loc<-regexpr("\\:[^\\:]*$", myname)[[1]]
                myname<-substring(myname,colon_loc + 1) %>% 
                        stringr::str_trim(.) %>%
                        stringr::str_to_title(.)
                clean_loc_name[i,1]<-myname
        }
        colnames(clean_loc_name)<-"LOCATION_NAME"
        myData<-dplyr::select(EarthquakeData,-LOCATION_NAME) 
        myData<-cbind(myData,clean_loc_name)
        myData
}

eq_clean_data("signif.txt") %>% eq_location_clean() %>% select(DATETIME,EQ_PRIMARY,LOCATION_NAME) %>% tail()

Functions for plotting the data with ggplot2

geom_timeline() Function

This function reads the raw earthquake dataset in a dataframe format to create a ggplot geom that plots a time line of earthquakes ranging from the minimum to the maximum dates with a point for each earthquake.

The x aesthetic is a date and an optional y aesthetic is a factor that stratifies the data by country, in which case multiple time lines will be plotted, one for each country.

In addition to showing the dates on which the earthquakes occur, it is also possible to show the magnitudes (i.e. Richter scale value) and the number of deaths associated with each earthquake, which are input via the size and fill and colour optional aesthetics.

In the example, the function plots a time line with all earthquakes since year 2000, the colour of the points indicating the number of deaths:

GeomTimeline<- ggplot2::ggproto("GeomTimeline", ggplot2::Geom,
                          required_aes = "x",
                          default_aes = ggplot2::aes(y=0.1,fill = "blue",colour="blue",
                                                     size=0.01,alpha=0.7,stroke = 0.5,
                                                     shape = 21),
                          draw_key = ggplot2::draw_key_point,
                          draw_panel = function(data, panel_scales, coord) {

                 mydata<-dplyr::mutate(data,x=lubridate::as_date(x)) %>%
                 dplyr::mutate(x=lubridate::decimal_date(x)) %>%
                 dplyr::mutate(y=as.numeric(y)) %>%
                 dplyr::filter(!is.na(x)) %>%
                 dplyr::filter(!is.na(fill)) %>%
                 dplyr::filter(!is.na(size)) %>%
                 dplyr::mutate(size=size/100)

           xmin<-min(mydata$x)
           xmax<-max(mydata$x)
           panel_scales$x.range<-c(xmin,xmax)


           coords <- coord$transform(mydata, panel_scales)

           grid::pointsGrob(
                   x = grid::unit(coords$x,"native"),
                   y = grid::unit(coords$y,"native"),
                   pch = coords$shape,
                   size = grid::unit(coords$size,"native"),
                   gp = grid::gpar(
                           col = coords$colour,
                           alpha=coords$alpha,
                           fill = coords$fill)
                   )
                          }
)


geom_timeline <- function(mapping = NULL, data = NULL, stat = "identity", position = "identity", na.rm = FALSE, show.legend = NA, inherit.aes = TRUE, ...) {
        ggplot2::layer(
                geom = GeomTimeline, mapping = mapping,  
                data = data, stat = stat, position = position, 
                show.legend = show.legend, inherit.aes = inherit.aes,
                params = list(na.rm = na.rm, ...)
        )
}

quakedata<-eq_clean_data("signif.txt") %>%
  filter(year(DATETIME)>2000 & COUNTRY==c("USA","CHINA"))
ggplot() + geom_timeline(data = quakedata,aes(x = DATETIME, y = COUNTRY, size = EQ_PRIMARY, colour = TOTAL_DEATHS))

geom_timeline_label() Function

This function reads an Earthquake dataset in a dataframe format to create annotations to the plotted data.This geom adds a vertical line to each data point with a text annotation showing the location (or other information) of the earthquake attached to each line.

The x aesthetic is a date and an optional y aesthetic is a factor that stratifies the data by country, in which case multiple time lines will be plotted, one for each country. The label aesthetic typically indicates the name of the location of the earthquake (but other information can be dislayed).

There is also the option to subset to n_max number of earthquakes, where the function takes the n_max largest (by magnitude) earthquakes. The aesthetics magnitude sets the column conatining the info of earthquake's magnitude.

In the example, the function plots a time line with all earthquakes in USA nad China since year 2000, the colour of the points indicating the number of deaths and their size the magnitude of the earthquakes, and shows the name of the ten largest ones:

GeomTimelineLabel<- ggplot2::ggproto("GeomTimelineLabel", ggplot2::Geom,
                                required_aes =c("x","label"),
                                default_aes = ggplot2::aes(y=0.1,n_max=NULL,magnitude=NULL,colour="grey"),
                                draw_key = ggplot2::draw_key_text,
                                draw_panel = function(data, panel_scales, coord, fontsize, angle) {

     n_earthquakes<-data$n_max[1]
     mydata<-dplyr::mutate(data,x=lubridate::as_date(x)) %>%
             dplyr::mutate(x=lubridate::decimal_date(x)) %>%
             dplyr::mutate(y=as.numeric(y)) %>%
             dplyr::filter(!is.na(x)) %>%
             dplyr::filter(!is.na(magnitude))


     myannotationdata<-dplyr::arrange(mydata,dplyr::desc(magnitude)) %>%
             dplyr::slice(1:n_earthquakes)


     xmin<-min(mydata$x)
     xmax<-max(mydata$x)
     panel_scales$x.range<-c(xmin,xmax)


     coords <- coord$transform(mydata, panel_scales)
     coordsannot<-coord$transform(myannotationdata, panel_scales)


     linesQuakes<-
     grid::segmentsGrob(
             x0 = grid::unit(coordsannot$x,"native"),
             y0 = grid::unit(coordsannot$y,"native"),
             x1 = grid::unit(coordsannot$x,"native"),
             y1 = grid::unit((coordsannot$y)+0.1,"native"),
             gp = grid::gpar(col = coords$colour)
     )

     textsquakes<-
     grid::textGrob(
             label=coords$label,
             x = grid::unit(coordsannot$x,"native"),
             y = grid::unit((coordsannot$y)+0.12,"native"),
             just = "left",
             rot = angle,
             gp = grid::gpar(col = "grey50",
                             fontsize = fontsize)
             )

     grid::gTree(children=grid::gList(linesQuakes,textsquakes))
                                }
)


geom_timeline_label <- function(mapping = NULL, data = NULL, stat = "identity", position = "identity", na.rm = FALSE, show.legend = NA, inherit.aes = TRUE,fontsize = 10,
                                angle = 45, ...) {
        ggplot2::layer(
                geom = GeomTimelineLabel, mapping = mapping,
                data = data, stat = stat, position = position,
                show.legend = show.legend, inherit.aes = inherit.aes,
                params = list(na.rm = na.rm,fontsize = fontsize,angle = angle, ...)
        )
}

quakedata<-eq_clean_data("signif.txt") %>% eq_location_clean() %>%
  filter(year(DATETIME)>2000 & COUNTRY==c("USA","CHINA")) 
ggplot() + geom_timeline(data = quakedata,aes(x = DATETIME, y = COUNTRY, size = EQ_PRIMARY, colour = TOTAL_DEATHS)) + geom_timeline_label(data = quakedata,aes(x = DATETIME, label = LOCATION_NAME,y = COUNTRY, n_max = 10, magnitude = EQ_PRIMARY)) + scale_color_gradient()

Functions for mapping the data in a in a leaflet interactive map

eq_map() Function

This function reads a filtered data frame with earthquakes selected to visualize. The function maps the epicenters (LATITUDE/LONGITUDE) and annotates each point in a pop-up window containing annotation data stored in a column of the data frame. Each earthquake is shown with a circle, and the radius of the circle is proportional to the earthquake's magnitude.

It requires the input parameters dataframe, a filtered dataframe, and annot_col, a character vector that indicates which column is used for the annotation in the pop-up window.

Here, the function creates a leaflet interactive map showing the epicenter of the earthquakes in Mexico since year 2000:

eq_map<-function(dataframe,annot_col){
        leaflet::leaflet() %>%
        leaflet::addTiles() %>%
        leaflet::addCircleMarkers(
                data = dataframe,
                radius = ~ ifelse(!is.na(EQ_PRIMARY),EQ_PRIMARY,2),
                lng = ~ LONGITUDE,
                lat = ~ LATITUDE,
                popup = ~ dataframe[[annot_col]])
}

eq_clean_data("signif.txt") %>%
dplyr::filter(COUNTRY == "MEXICO" & lubridate::year(DATETIME) >= 2000) %>% 
eq_map(annot_col = "DATETIME")

eq_create_label() Function

This function takes the filtered and cleaned dataframe as an argument and creates an HTML label that can be used as the annotation text in a leaflet map created by eq_map().

This function puts together a character string for each earthquake that shows the cleaned location (as cleaned by the eq_location_clean() function), the magnitude, and the total number of deaths, with boldface labels for each. If an earthquake is missing values for any of these, both the label and the value are skipped for that element of the tag.

The only input parameter is dataframe, a dataframe that has been cleaned by the eq_location_clean() function.

Here, the function creates a label to be used in a leaflet interactive map showing the epicenter of the earthquakes in Mexico since year 2000:

eq_create_label<-function(dataframe){
        mydataframe<-dplyr::select(dataframe,LATITUDE,LONGITUDE,LOCATION_NAME,EQ_PRIMARY,TOTAL_DEATHS) %>%
                dplyr::mutate(popup_text=paste("<b>Location:</b>", LOCATION_NAME, "<br />"),popup_text=ifelse(!is.na(EQ_PRIMARY),paste(popup_text,"<b>Magnitude:</b>", EQ_PRIMARY, "<br />"),popup_text),
popup_text=ifelse(!is.na(TOTAL_DEATHS),paste(popup_text,      "<b>Total deaths:</b>", TOTAL_DEATHS, "<br />"),popup_text))
}

eq_clean_data("signif.txt") %>% 
dplyr::filter(COUNTRY == "MEXICO" & lubridate::year(DATETIME) >= 2000) %>% 
eq_location_clean() %>%
eq_create_label() %>% 
eq_map(annot_col = "popup_text")