suppressPackageStartupMessages(library(rgdal))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(dplyr))

Here we have data provided by William Henderson, containing data from 5 riders.

bikeroutes.shapefile <- readOGR("../data/RIDE/william-trips-export.json", "OGRGeoJSON", p4s="+proj=tmerc +ellps=WGS84")

bikeroutes.df <- fortify(bikeroutes.shapefile)

head(bikeroutes.df)

ggplot(bikeroutes.df, aes(x=long, y=lat, group=group)) + geom_path()

Okay. The data produced by fortifying the bikeroutes dataframe is rather strange.

barplot(table(bikeroutes.df$order))
barplot(table(bikeroutes.df$piece))
barplot(table(bikeroutes.df$group))

Group has 8192 levels, but there were originally 7712 lines. How do we know which ride is which? The group has a strange system, where the numbers are zero-indexed and the decimal place corresponds to the piece. So to get an ID we can just

bikeroutes.df$group %>% as.character() %>% as.numeric() %>% ceiling() %>% summary()
bikeroutes.df$id <-
  bikeroutes.df$group %>% as.character() %>% as.numeric() %>% ceiling()

Really we need two dataframes: one for all the points that make up the paths and one just with an observation for each route.

rides <- bikeroutes.shapefile@data %>%
  select(rating,
         rating_text,
         owner__pk,
         original_trip_length,
         created,
         pk,
         rr_transformation) %>%
  # No need to have the actual primary keys here. Not good for security anyways.
  mutate(owner__pk = as.numeric(owner__pk),
         pk = as.numeric(pk)) %>%
  tbl_df()

Okay, but now we have another weird issue.

head(rides)
head(bikeroutes.df %>% filter(order == 1))

It seems like we have less data than we thought. It seems that there are four observations of every ride. One variable rr_transformation seems to explain on dimension of this duplication. This does explain why there are only 1928 primary keys and 7712 observations: $$1928 \times 4 = 7712.$$

# Add id to rides dataframe
rides$id <- 1:nrow(rides)

# Give a number to each of copies
rides$version <- as.factor(rep(c(1,2,3,4), 1928))

# Join over to paths
rides.final <- bikeroutes.df %>%
  inner_join(rides, by="id")

# Plot the different versions
rides.final %>%
  ggplot(aes(x = long, y = lat, group = group)) +
  geom_path(alpha=0.3, lineend = "butt") +
  coord_map(projection = "mercator",
            ylim = c(45.5, 45.525),
            xlim = c(-122.68, -122.64)) + 
  facet_wrap(~ version)

Okay, so the version I actually want are the 'simplify' versions. Let's select version 2.

rides.final <- rides.final %>% filter(version == "2")

This seems to include locations outside of Portland, OR. So let's filter this data.

not.in.portland <- rides.final %>%
  filter(lat < 45.462 | lat > 45.549 | long > -122.577 | long < -122.722) %>%
  distinct(pk) %>%
  .$pk

rides.final %>%
  filter(!pk %in% not.in.portland) %>%
  ggplot(aes(x = long, y = lat, group = group)) +
  geom_path(alpha=0.3, aes(col=rating), lineend = "butt") +
  coord_map(projection = "mercator") + 
  scale_color_gradient2(low = "yellow", mid = "green", high = "red")

That's quite a bit of coverage for only five people!

How do they seem to rate their rides?

rides %>% 
  ggplot(aes(x = rating_text)) + 
  geom_histogram() + 
  facet_grid(~owner__pk)

Alright let's save this data for now.

save(rides.final, file = "../data/bikeroutes.RData")
save(rides, file = "../data/rides.RData")


wjones127/thesis documentation built on May 4, 2019, 7:34 a.m.