knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

We can use rsinaica to find out which city is the most ozone-polluted in all of Mexico.

First, we load the packages

## Auto-install required R packages
packs <- c("dplyr", "ggplot2", "gghighlight", "lubridate", "anomalize", 
           "aire.zmvm", "tidyr", "zoo", "plotly", "rsinaica")
success <- suppressWarnings(sapply(packs, require, character.only = TRUE))
if (length(names(success)[!success])) {
  install.packages(names(success)[!success])
  sapply(names(success)[!success], require, character.only = TRUE)
}

Download

Then we download the data for the whole year of 2017 using the sinaica_param_data function. Since the maximum data range we can download is 1 month, we have to use a little mapply magic to download the entire year.

# Download all O3 pollution data in 2017
o3_2017 <- bind_rows(
  mapply(sinaica_param_data,
         "O3",
         seq(as.Date("2017-01-01"), as.Date("2017-12-01"), by = "month"),
         seq(as.Date("2017-02-01"), as.Date("2018-01-01"), by = "month") - 1,
         remove_extremes = FALSE,
         SIMPLIFY = FALSE)
)

This is what the data looks like:

knitr::kable(head(o3_2017))

Cleanup

Once we've downloaded the data we filter values below .3 ppm since they're probably calibration errors, and also filter some stations that never report correct ozone values. Given that ozone production depends on chemical reactions between oxides of nitrogen and volatile organic compounds in the presence of sunlight, it’s extremely unlikely to be present in large quantities at night or in the early morning, so we can filter high ozone values during those times.

# o3_2017[which(o3_2017$value_actual != o3_2017$value_original),]
# o3_2017[which(!is.na(o3_2017$date_validated)),]

# Only include stations that reported for more than 80% of days (292)
df_filtered <- o3_2017 %>%
  filter(!is.na(value)) %>%
  add_count(station_id, network_name) %>%
  ## these stations are mostly error values
  filter(!station_id %in% c(173, #"Facultad Psicología" SLP
                            72, #"Bomberos" Irapuato
                            31, #CBTIS ags 
                            303, #Instituto Educativo ags
                            39, #COCABH Mexicali
                            56 #Conalep torreon
                            )) %>%
  # filter values above .095 (bad (MALA) air quality)
  # that occurred during the night, since ozone is a photosensitive
  # pollutant
  filter(!(value > .0955 & hour %in% c(0:11, 21:23))) %>%
  filter(value < .3)

We'll be taking the average of the maximum daily value among all stations in each network for the entire year of 2017, and only include stations that reported data at least 80% of the days.

df_max <- df_filtered %>%
  group_by(date, network_name) %>%
  summarise(max = max(value, na.rm = TRUE)) %>%
  ungroup() %>%
  add_count(network_name) %>%
  filter(n > (365*.8))  %>%
  select(-n) %>%
  arrange(network_name)

When plotting the daily maximums we can see that there are still some obvious errors in the data even after removing the stations that always report erroneous data and the extreme ozone values outside peak daylight hours.

ggplot(df_max, aes(as.Date(date), max)) +
  geom_line(size = .3, color = "black") +
  facet_wrap(~ network_name, ncol = 3) +
  ggtitle("Maximum daily ozone concentration by network") +
  xlab("date") +
  ylab("daily maximum hourly ozone concentration (ppm)") +
  theme_bw() +
  theme(axis.text.x=element_text(angle=60,hjust=1))

Anomalies

We can use the anomalize package to detect and remove the extreme values, while being careful not to remove the high ozone levels that happen in Mexico City, Guadalajara, and Monterrey.

df_max <- df_max %>%
  ungroup() %>%
  group_by(network_name) %>%
  mutate(date = as.Date(date)) %>%
  time_decompose(max) %>%
  anomalize(remainder, method = "iqr", alpha = .03) %>%
  time_recompose()

# Anomaly Visualization
df_max %>% plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.25) +
  labs(title = "Tidyverse Anomalies", subtitle = "STL + GESD Methods") 

df_max <- filter(df_max, anomaly != "Yes")

Besides removing the outliers, after looking at the plots it was decided to remove the whole network of Piedras Negras since it only reported ozone concentrations close to zero during the first few months of 2017.

df_max <- filter(df_max, network_name != "Piedras Negras")

Most Ozone-Polluted City

And here is the most ozone-polluted city in Mexico: the Valle de México metro area

ggplot(df_max, aes(as.Date(date), observed,
                   group = network_name, color = network_name)) +
  geom_line()  +
  theme_bw() +
  ggtitle("Pollution measuring network with highest ozone pollution values in 2017",
          subtitle = "Based on the average of the maximum daily hourly value. Source: SINAICA") +
  xlab("date") +
  ylab(expression(paste(O[3], " ppm"))) +
  gghighlight(mean(observed, na.rm = TRUE), max_highlight = 1L)

Top polluted cities

knitr::kable(
  df_max %>%
    group_by(network_name) %>%
    summarise(mean = mean(observed, na.rm = TRUE)) %>%
    arrange(-mean) %>%
    head(10)
  )

Perhaps not surprisingly, the most ozone-polluted cities in Mexico are also the among the most populous. It's kind of embarrassing that Puebla doesn't have a working air quality network since it is the fourth biggest metro area by population.

top3 <- filter(df_max, network_name %in% c("Valle de México",
                                           "Guadalajara",
                                           "Toluca"))[, 1:3]

top3 <- spread(top3, network_name, observed)
top3$date <- as.Date(as.Date(top3$date))
x <- list(
  title = "date"
)
y <- list(
  title = "daily maximum ozone concentration (ppm)"
)

plot_ly(as.data.frame(top3), x = ~date, y = ~`Valle de México`, name = "Valle de México" ,
        type = 'scatter', mode = 'lines', line = list(color = '#e41a1c'), width = .5) %>%
  add_trace(y = ~Guadalajara, name = 'Guadalajara', mode = 'lines', line = list(color = '#377eb8'), width = .5) %>%
  add_trace(y = ~Toluca, name = 'Toluca', mode = 'lines', line = list(color = '#4daf4a'), width = .5) %>%
  layout(title = "Top 3 most ozone-polluted cities in Mexico", xaxis = x, yaxis = y)

Days with bad air quality

Number of days with bad air quality (Índice IMECA MALO)

df_max %>%
  group_by(network_name) %>%
  filter(observed > 95.5/1000 ) %>%
  summarise(count = n()) %>%
  arrange(-count) %>%
  head() %>%
  knitr::kable()

Number of days with very bad air quality (Índice IMECA MUY MALO).

df_max %>%
  group_by(network_name) %>%
  filter(observed > 155/1000) %>%
  summarise(count = n()) %>%
  arrange(-count) %>%
  head() %>%
  knitr::kable()

Here we see that Guadalajara actually had more very bad days. If this had happened in Mexico City, a phase I smog alert would have been issued and 40% of cars would have been banned from taking to the roads, but Guadalajarans are to smart to implement something like this, since it doesn't seem to work.



diegovalle/rsinaica documentation built on March 23, 2022, 8:58 a.m.