knitr::opts_chunk$set(echo = TRUE,
                      fig.width = 4,
                      fig.height = 4)
library(dplyr)
library(tidyr)
library(ggplot2)
library(cism)

cism, the R package of the Centro de Investigação em Saude de Manhiça, is a collaborative effort to improve the way that CISM researchers interact with and utilize data.

Objectives

The objectives of the R CISM package are simple:

  1. Make retrieving data more simple.

  2. Make data analysis more reproducible.

  3. Make interaction with data more pleasurable and beautiful.

Set up

Before using the cism package, you'll need to (1) install and (2) configure credentials.

1. Install

To use the cism package, you'll first need to install it. With a good internet connection, run the following from within R.

if(!require(devtools)) install.packages("devtools")
install_github('joebrew/cism')

Once you've installed the package, you can use its functionality in any R script by simply including the following line:

library(cism)

2. Configure credentials

Accessing CISM databases requires permissions. All the functionality in the cism package can be used by calling functions with a user's credentials. However, both for the sake of ease as well as to avoid putting any confidential information into R code, it is highly recommended to create a credentials folder with a credentials.yaml file. The contents of this file should look something like this:

dbname: openhds
host: sap.manhica.net
port: 3306 # Use 4706 from outside of the CISM
user: your_user_name
password: your_pass_word

In the above, consider replacing dbname with the database you most often use. Replace port with 4706 if you work outside of the CISM. And replace the user and password fields with your name and password. If you use git, or any other form of version control, ensure that the credentials.yaml file is ignored by any repository viewable by others.

The credentials/credentials.yaml will be automatically detected by the cism package if in the same directory as the working directory, or if it is in a higher directory. Therefore, one should either create a credentials/credentials.yaml for each project, or place one credentials/credentials.yaml into a higher-level folder (for example "My Documents" for Windows users, or "/home/" for Linux users), under which all code resides.

Basic use

Retrieving data from CISM databases

With the cism package, getting data is simple. Assuming you've got a credentials/credentials.yaml in or above your working directory, you can simply point the get_data function at the database and table of your choice:

residency <- get_data(tab = 'residency', dbname = 'openhds')

The above will return the entire "residency" table from the "openhds" database.

If you're a more advanced user and want to filter, select or join directly in SQL before collecting the results into memory, you can write directly in SQL:

small_residency <- get_data(query = "SELECT uuid, startDate FROM residency limit 10", dbname = 'openhds')

Visualizing data

The cism package offers several tools for the visualization of data.

cism_plot

cism_plot is a simple wrapper for plotting a variable of any type, or 2 numeric variables. Its use is relatively straightforward.

We'll start by creating some fake data.

library(sp)
library(cism)
# Create a fake dataset of people
people <- data.frame(id = 1:1000,
                     sex = sample(c('Male', 'Female', 'Unknown'),
                                  size = 1000,
                                  replace = TRUE),
                     latitude = rnorm(mean = coordinates(man3)[,2],
                                      sd = 0.05,
                                      n = 1000),
                     longitude = rnorm(mean = coordinates(man3)[,1],
                                       sd = 0.05,
                                       n = 1000),
                     age = sample(seq(22, 
                                      26,
                                      length = 10000), 
                                  size = 1000, replace = TRUE),
                     money = sample(seq(0,
                                        100000,
                                        length = 1000000), size = 1000, replace = TRUE))

cism_plot intuitively handles the visualization of counts of categorical data:

cism_plot(x = people$sex)

It also handles numeric variables.

cism_plot(x = people$age)

If explicitly told to, it will treat a numeric variable as a categorical.

cism_plot(x = round(people$age), type = 'factor')

Finally, if given two (numeric) variables, CISM plot will generate a simple x-y chart.

cism_plot(x = people$age,
          y = people$money)

cism_map

Like cism_plot, cism_map strives to generate useful and aesthetically pleasing maps with the least amount of input necessary. In the below examples, we'll use the people dataset we generated above.

At its most simple level, cism_map can show the location of longitude and latitude.

cism_map(lng = people$longitude,
         lat = people$latitude)

By providing a "fortified SpatialPolygonsDataFrame" (of which several come with the cism package), the map will get adminsitrative outlines for the relevant areas.

cism_map(lng = people$longitude,
         lat = people$latitude,
         fspdf = man3_fortified)

If we want to see our points colored by a categorical variable, we simply indicate the variable in our call to cism_map.

cism_map(lng = people$longitude,
         lat = people$latitude,
         fspdf = man3_fortified,
         x = people$sex)

By the same token, points can also be colored by a continuous variable.

cism_map(lng = people$longitude,
         lat = people$latitude,
         fspdf = man3_fortified,
         x = people$money)

As with other ggplot objects, plots generated by cism_map or cism_plot can be given further modifications through additional calls to ggplot2 functions. For example:

cism_map(lng = people$longitude,
         lat = people$latitude,
         fspdf = man2_fortified,
         x = people$sex) +
  ggtitle('Spatial distribution by sex')

cism_map_interactive

cism_map_interactive is very similar to cism_map. Rather than build static maps, it builds interactive satellite-based maps.

To use, you'll need a dataset with at least longitude and latitude. Here we'll again use our people dataset:

cism_map_interactive(lng = people$longitude, 
                     lat = people$latitude)

We can add a SpatialPolygonsDataFrame, including those included in the cism package, with the spdf arugment.

cism_map_interactive(lng = people$longitude, 
                     lat = people$latitude, 
                     spdf = man3)

Our points can be simple representations of location (above), or we can color them by a categorical variable.

cism_map_interactive(lng = people$longitude,
                     lat = people$latitude,
                     spdf = man2,
                     x = people$sex)

Rather than a categorical variable (above), we can also color points by a continous variable.

cism_map_interactive(lng = people$longitude,
                     lat = people$latitude,
                     spdf = man2,
                     x = people$age)

Finally, the popup argument can be used to indicate what text should be shown when an icon is clicked.

cism_map_interactive(lng = people$longitude,
                     lat = people$latitude,
                     spdf = man2,
                     x = people$age,
                     popup = people$sex)

Built-in datasets

The cism package also includes several built-in datasets. These are as follows.

cat(paste0(gsub('.rda', '', dir('../data')), collapse = '\n'))

Spatial data

The names "moz", "man", and "mag" signify "Mozambique", "Manhiça", and "Magude" (respectively). The number therafter signifies the administrative level (0 = country, 1 = province, 2 = district, 3 = sub-district). The suffix "_fortified" means that the object is in a "fortified" (ie, ggplot2-compatible) dataframe format; objects without this suffix are SpatialPolygonsDataFrames.

Built-in datasets in the cism package are "lazily" loaded, mean that you can call them at any point after attaching the cism package by simply using their name. For example:

library(cism)
plot(moz2)

The above plots the SpatialPolygonsDataFrame of Mozambique at the 2nd administrative level (ie, district).

By the same token, I could also run the following to get a map of the Magude district at the third administrative level using the ggplot framework for plotting:

library(ggplot2)
library(cism)

ggplot() +
  geom_polygon(data = mag3_fortified,
               aes(x = long,
                   y = lat,
                   fill = id)) +
  coord_map()

Manipulating spatial data

The cism package has a function, ll_from_utm, for the conversion of UTM (Universal Transverse Mercator) coordinates to longitude and latitude. Its use is as follows:

# Create a dataset of some UTM coordinates
fake_data <- data_frame(id = 1:5,
                        x = c(480132,
                              481126,
                              481316,
                              478351,
                              478172),
                        y = c(7192161,
                              7192833,
                              7190708,
                              7189107,
                              7189615))
# Convert to latitude and longitude
converted <- ll_from_utm(x = fake_data[,c('x', 'y')],
                         zone = 36)
# See converted data
head(converted)
# Plot converted data
cism_map_interactive(lng = converted$x,
                     lat = converted$y,
                     point_size = 10)

Dataframes

The cism package also includes non-spatial data. The World Bank's "World Development Indicators" time series data is included for Mozambique in a (plot-friendly) long format. Here's an example of how to use.

# Attach cism package
library(cism)

# Bring wb (world bank) data into memory
wb <- wb

# Look at it
head(wb)

# Subset for only the homicide rate over time
homicides <- 
  wb %>%
  filter(indicator_name == 'Intentional homicides (per 100,000 people)') %>%
  filter(!is.na(value))

# Visualize
cism_plot(x = homicides$year,
          y = homicides$value,
          trend = TRUE) +
  labs(x = 'Year',
       y = 'Rate per 100,000',
       title = 'Homicide rate in Mozambique',
       subtitle = 'According to the World Bank')

There are r length(unique(wb$indicator_code)) development indicators in the dataset. These can be explored below.

DT::datatable(data_frame(`Indicator name` = sort(unique(wb$indicator_name))))

The cism package also includes the location (as well as some supplementary data) of all government health facilities ("unidades sanitarias") in Mozambique. This dataset, named us, can be called by running the following:

us <- cism::us

The data is in the following format:

knitr::kable(head(data.frame(us)))

The locations can be visualized as such:

library(ggplot2)
library(cism)
cism::cism_map(lng = us$longitude,
               lat = us$latitude,
               fspdf = cism::moz1,
               opacity = 0.8,
               point_size = 0.5) +
  labs(title = 'Health facilities')

Aesthetic tweaks

Theme

The cism plot comes with the theme_cism function, which can be appended to any ggplot2-style plot to make it more aesthetically compatible with the CISM's color scheme. For example, rather than

ggplot(data = people,
       aes(x = sex)) +
  geom_bar()

We can instead run

ggplot(data = people,
       aes(x = sex)) +
  geom_bar(fill = 'darkorange', color = 'darkgreen') +
  theme_cism()

Branding

We can also "brand" a plot, by running brand_cism(), which will add the website to the subtitle of the plot:

ggplot(data = people,
       aes(x = sex)) +
  geom_bar() +
  geom_bar(fill = 'darkorange', color = 'darkgreen') +
  theme_cism() +
  brand_cism()

Both theme_cism and brand_cism work on maps as well:

ggplot() +
  geom_polygon(data = mag3_fortified,
               aes(x = long,
                   y = lat,
                   fill = id)) +
  coord_map() +
  theme_cism() +
  brand_cism()

Advanced use

Fuzzy string-matching

It is often the case that datasets must be merged by name. This is problematic when one name is written slightly differently in one dataset than in another (for example, "Esmeralda" and "Esmerelda"). The function fuzzy_match employs the Jaro-Winkler distance algorithm to generate a "distance matrix" which shows how similar one name is to another (0 being identical, 1 being totally different).

The below shows an example of its use.

# Define a list of names
x <- c('joe', 'joseph', 'coloma', 'ramona', 'paloma', 'ramon', 'elena', 'beatriz')
# Define another list of names (with which we're going to match)
y <- c('helena', 'bea', 'jeo', 'joe', 'colombia', 'romania')
# Perform the matching
results <- fuzzy_match(x = x, y = y)
# View results
results

Note that the algorithm performs well. The name "bea" has very few matches, but is most closely matched to "beatriz". The name "joe" has an exact match (ie, a score of 0), but also gets an approximate match (ie, low score) for the name "joseph". "elena" is closely matched with "helena", and equally far (ie, high score) from names like "joe" and "bea".

We can visualize the results as well:

# Reshape for plotting
plot_data <- data.frame(results) %>%
  mutate(x = row.names(results)) %>%
  gather(y, score, helena:romania)
# Visualize results
ggplot(data = plot_data,
       aes(x = x,
           y = y,
           fill = score)) +
  geom_tile() +
  scale_fill_gradient2(name = 'Distance',
                       low = 'darkgreen',
                       mid = 'white',
                       high = 'darkorange',
                       midpoint = 0.5) +
  theme_cism() +
  labs(title = 'Fuzzy matching results',
       subtitle = 'Using the CISM fuzzy_match function') +
  theme(axis.text.x = element_text(angle = 90))

Bibliography management

The doi_to_bib function generates a bibtex format bibliography entry (ready for use by bibliographic management software or for use in .bib file when used with Rmarkdown). The process is simple.

  1. Find an article you wish to cite such as this one.

  2. Copy the doi from the article (usually at the top). In this case, it is 10.1038/s41598-017-08906-x.

  1. Run the following code.
doi_to_bib('10.1038/s41598-017-08906-x')

And you'll get the following output, ready to copy-paste into your bibliography management software or directly into your .bib file if using Rmarkdown.

doi_to_bib('10.1038/s41598-017-08906-x')


joebrew/cism documentation built on May 19, 2019, 2:58 p.m.