knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

Identifying meaningful locations, such as home and/or work locations from mobile technology is an essential step in the field of mobility analysis. The `homelocator` library is designed for identifying home locations of user based on the spatio-temporally features contained within the mobility data, such as social media data, mobile phone data, mobile app data and so on.

Although there is a myriad of different approaches to inferring meaningful locations in the last 10 years, a consensus or single best approach has not emerged in the current state-of-the-art. The actual algorithms used are not always discussed in detail in publications and the source codes are seldom released in public, which makes comparing algorithms as well as reproducing work difficult.

So, the main objective for developing this package is to provide a consistent framework and interface for the adoption of different approaches so that researchers are able to write structured, algorithmic 'recipes', or use the existing embedded "recipes" to identify meaningful locations according to their research requirements. With this package, users are able to achieve an apples-to-apples comparison across approaches.

We hope that through packages like `homelocator`, future work that relies on the inference of meaningful locations becomes less ‘custom’ (with each researcher writing their own algorithm) but instead will use common, comparable algorithms in order to increase transparency and reproducibility in the field.

Load Library

# Load homelocator library
library(homelocator)
# Load other needed libraries
library(tidyverse)
library(here)

Test Data

To explore the basic data manipulation functions of homelocator, we'll use a test_sample dataset.

# load data
data("test_sample", package = "homelocator")

# show test sample 
test_sample %>% head(3)

Note: if you use your own dataset, please make sure that you have converted the timestamps to the appropriate local time zone.

Usage

Functions

Validate input dataset

The validate_dataset() function makes sure your input dataset contains all three necessary variables: user, location and timestamp. There are 4 arguments in this function:

df_validated <- validate_dataset(test_sample, user = "u_id", timestamp = "created_at", location = "grid_id", keep_other_vars = FALSE)

# show result 
df_validated %>% head(3)

Nest and Unnest

The nest_verbose() and unnest_verbose() functions work in the same way as nest and unnest functions in tidyverse but with some additional status information such as the elapsed running time.

# nest data
df_nested <- nest_verbose(df_validated, c("created_at", "grid_id"))

# show result 
df_nested %>% head(3)

# show result 
df_nested$data[[1]] %>% head(3)

# unnest data
df_unnested <- unnest_verbose(df_nested, data)

# show result 
df_unnested %>% head(3)

Double nest and Double unnest

The nest_double_nest() and unnest_double_nested() functions work in a similar way as nest_verbose() and unnest_verbose() functions but they apply to an already nested data frame. In other words, they map nest and unnest function to each element of a list created a double-nested data frame, or vice versa. This is an essential step in many home location algorithms as they often operate 'per user, per location'.

# double nest data (e.g., nesting column: created_at)
df_double_nested <- nest_nested(df_nested %>% head(100), c("created_at"))

# show result 
df_double_nested %>% head(3)

# show result 
df_double_nested$data[[1]] %>% head(3)

df_double_nested$data[[1]]$data[[1]] %>% head(3)

# unnest nested data 
df_double_unnested <- df_double_nested %>% 
  head(100) %>%  ## take first 100 rows for example
  unnest_double_nested(., data) 

# show result 
df_double_unnested %>% head(3)

Enrich timestamp

The enrich_timestamp() function creates additional variables that are derived from the timestamp column. These include the year, month, day, day of the week and hour of the day. These are often used/needed as intermediate variables in home location algorithms.

#original variables 
df_nested[1, ] %>% unnest(cols = c(data)) %>% head(3)

# create new variables from "created_at" timestamp column 
df_enriched <- enrich_timestamp(df_nested, timestamp = "created_at")

# show result 
df_enriched[1, ] %>% unnest(cols = c(data)) %>% head(3)

Summarize in nested and double nested dataframe

The summarise_nested() function works similar to dplyr's regular summarise function but operates within a nested tibble. summarise_double_nested() conversely operates within a double-nested tibble.

# summarize in nested dataframe 
# e.g., summarise total number of tweets and total number of places per user
df_summarize_nested <- summarise_nested(df_enriched, 
                                        n_tweets = n(),
                                        n_locs = n_distinct(grid_id))

# show result 
df_summarize_nested %>% head(3)

# summarize in double nested dataframe 
# take first 100 users for example
# e.g summarise total number of tweets and totla number of distinct days 
df_summarize_double_nested <- summarise_double_nested(df_enriched %>% head(100), 
                    nest_cols = c("created_at", "ymd", "year", "month", "day", "wday", "hour"), 
                    n_tweets = n(), n_days = n_distinct(ymd))

# show result 
df_summarize_double_nested[1, ]
df_summarize_double_nested[1, ]$data[[1]] %>% head(3)

Remove active users

The remove_top_users() function allows to remove top N percent of active users based on the total number of data points per user. Although the majority of users are real people, some accounts are run by algorithms or 'bots', whereas others can be considered as spam accounts. Removing a certain top N percent of active users is an oft-used approach to remove such accounts and reduce the number of such users in the final dataset.

# remove top 1% active users (e.g based on the frequency of tweets sent by users)
df_removed_active_users <- remove_top_users(df_summarize_nested, user = "u_id", 
                                            counts = "n_tweets", topNpct_user = 1) 

# show result 
df_removed_active_users %>% head(3)

Filter

The filter_verbose() function works in the same way as filter function in tidyverse but with additional information about the number of (filtered) users remaining in the dataset. And the filter_nested() function works in the similar way as filter_verbose() but is applied within a nested tibble.

# filter users with less than 10 tweets sent at less than 10 places 
df_filtered <- filter_verbose(df_removed_active_users, user = "u_id", 
                              n_tweets > 10 & n_locs > 10)

# show result 
df_filtered %>% head(3)

# filter tweets that sent on weekends and during daytime (8am to 6pm)
df_filter_nested <- filter_nested(df_filtered, user = "u_id", 
                                  !wday %in% c(1, 7), # 1 means Sunday and 7 means Saturday
                                  !hour %in% seq(8, 18, 1)) 

# show result 
df_filter_nested %>% head(3)

df_filter_nested$data[[1]] %>% head(3)

Mutate

The mutate_verbose() function works in the same way as mutate function in tidyverse but with additional information about the elapsed running time.

Function mutate_nested() works in the similar way as mutate_verbose() but it adds new variables inside a nested tibble.

## let's use pre-discussed functions to filter some users first 
colnmaes_data <- df_filtered$data[[1]] %>% names()
colnmaes_to_nest <- colnmaes_data[-which(colnmaes_data == "grid_id")] 

df_cleaned <- df_filtered %>% 
  summarise_double_nested(., nest_cols = colnmaes_to_nest,
                          n_tweets_loc = n(), # number of tweets sent at each location
                          n_hrs_loc = n_distinct(hour), # number of unique hours of sent tweets 
                          n_days_loc = n_distinct(ymd), # number of unique days of sent tweets 
                          period_loc = as.numeric(max(created_at) - min(created_at), "days")) %>% # period of tweeting 
  unnest_verbose(data) %>% 
  filter_verbose(., user = "u_id",
                 n_tweets_loc > 10 & n_hrs_loc > 10 & n_days_loc > 10 & period_loc > 10) 

# show cleaned dataset 
df_cleaned %>% head(3)

# ok, then let's apply the mutate_nested function to add four new variables: wd_or_wk, time_numeric, rest_or_work, wk.am_or_wk.pm
df_expanded <- df_cleaned %>% 
  mutate_nested(wd_or_wk = if_else(wday %in% c(1,7), "weekend", "weekday"),
                time_numeric = lubridate::hour(created_at) + lubridate::minute(created_at) / 60 + lubridate::second(created_at) / 3600, 
                rest_or_work = if_else(time_numeric >= 9 & time_numeric <= 18, "work", "rest"), 
                wk.am_or_wk.pm = if_else(time_numeric >= 6 & time_numeric <= 12 & wd_or_wk == "weekend", "weekend_am", "weekend_pm")) 

# show result 
df_expanded %>% head(3)

The function prop_factor_nested() allows you to calculate the proportion of categories for a variable inside the list-column and convert each categories to a new variable adding to the dataframe. For example, inside the list-column, the variable wd_or_wk has two categories named weekend or weekday, when you call prop_factor_nested() function, it calculates the proportion of weekend and weekday separately, and the results are then expanded to two new columns called weekday and weekend adding to the dataframe.

# categories for a variable inside the list-column: e.g weekend or weekday
df_expanded$data[[1]] %>% head(3)

# calculate proportion of categories for four variables: wd_or_wk, rest_or_work, wk.am_or_wk.pm
df_expanded <- df_expanded %>% 
  prop_factor_nested(wd_or_wk, rest_or_work, wk.am_or_wk.pm) 

# show result 
df_expanded %>% head(3)

Score

The score_nested() function allows you to give a weighted value for one or more variables in a nested dataframe.

The score_summary() function summarises all scored columns and return one single summary score per row.

## let's add two more variables before we do the scoring 
df_expanded <- df_expanded %>% 
  summarise_nested(n_wdays_loc = n_distinct(wday),
                   n_months_loc = n_distinct(month))

df_expanded %>% head(3)

# when calculating scores, you can give weight to different variables, but the total weight should add up to 1
df_scored <- df_expanded %>% 
  score_nested(., user = "u_id", location = "grid_id", keep_original_vars = F,
               s_n_tweets_loc = 0.1 * (n_tweets_loc/max(n_tweets_loc)),
               s_n_hrs_loc = 0.1 * (n_hrs_loc/24), 
               s_n_days_loc = 0.1 * (n_days_loc/max(n_days_loc)),
               s_period_loc = 0.1 * (period_loc/max(period_loc)),
               s_n_wdays_loc = 0.1 * (n_wdays_loc/7),
               s_n_months_loc = 0.1 * (n_months_loc/12),
               s_weekend = 0.1 * (weekend),
               s_rest = 0.2 * (rest),
               s_wk_am = 0.1 * (weekend_am))

df_scored %>% head(3)
df_scored$data[[1]]

### we can replace the score function by mutate function 
df_scored_2 <- df_expanded %>% 
  nest_verbose(-u_id) %>% 
  mutate_nested(s_n_tweets_loc = 0.1 * (n_tweets_loc/max(n_tweets_loc)),
                s_n_hrs_loc = 0.1 * (n_hrs_loc/24), 
                s_n_days_loc = 0.1 * (n_days_loc/max(n_days_loc)),
                s_period_loc = 0.1 * (period_loc/max(period_loc)),
                s_n_wdays_loc = 0.1 * (n_wdays_loc/7),
                s_n_months_loc = 0.1 * (n_months_loc/12),
                s_weekend = 0.1 * (weekend),
                s_rest = 0.2 * (rest),
                s_wk_am = 0.1 * (weekend_am))

df_scored_2 %>% head(3)
df_scored_2$data[[1]]
# sum varialbes score for each location 
df_score_summed <- df_scored %>% 
  score_summary(., user = "u_id", location = "grid_id", starts_with("s_"))

df_score_summed %>% head(3)
df_score_summed$data[[1]]

Extract locations

The extract_location() function allows you to sort the locations of each each user based on descending value of selected column, and return the top N location(s) for each user. The extracted location(s) then represent the most likely home location of the user.

# extract homes for users based on score value (each user return 1 most possible home)
df_home <- df_score_summed %>% 
  extract_location(., user = "u_id", location = "grid_id", show_n_loc = 1, keep_score = F, desc(score))

df_home %>% head(3)

# extract homes for users and keep scores of locations
df_home <- df_score_summed %>% 
  extract_location(., user = "u_id", location = "grid_id", show_n_loc = 1, keep_score = T, desc(score))

df_home %>% head(3)
df_home$data[[3]]

Spread

The spread_nested() function works in the same way as spread function in tidyverse but works inside a nested data frame. It allows to spread a key-value pairs across multiple columns.

# let's add one timeframe variable first and calculate the number of data points at each timeframe
df_timeframe <- df_enriched %>% 
  mutate_nested(timeframe = if_else(hour >= 2 & hour < 8, "Rest", if_else(hour >= 8 & hour < 19, "Active", "Leisure")))

colnames_nested_data <- df_timeframe$data[[1]] %>% names()
colnames_to_nest <- colnames_nested_data[-which(colnames_nested_data %in% c("grid_id", "timeframe"))]

df_timeframe <- df_timeframe %>%
  head(20) %>% # take first 20 users as example
  summarise_double_nested(., nest_cols = colnames_to_nest, 
                            n_points_timeframe = n()) 
df_timeframe$data[[2]]

# spread timeframe in nested dataframe with key is timeframe and value is n_points_timeframe
df_timeframe_spreaded <- df_timeframe %>% 
    spread_nested(., key_var = "timeframe", value_var = "n_points_timeframe") 

df_timeframe_spreaded$data[[2]]

Arrange

The arrange_nested() function works in the same way as arrange function in tidyverse but works inside a nested data frame. It allows to sort rows by variables.

df_enriched$data[[3]]  

df_arranged <- df_enriched %>% 
  arrange_nested(desc(hour)) # arrange the hour in descending order

df_arranged$data[[3]]

The arrange_double_nested() function works in the similar way as arrange_nested() function, but it applies to a double nested data frame. You need to choose the columns to nest inside a nested dataframe and then sort rows by selected column.

- `df`: a nested dataframe 
- `nest_cols`: name of columns to nest in existing list-column
- `...`: comma separated list of unquoted variable names
# get the name of columns to nest 
colnames_nested_data <- df_enriched$data[[1]] %>% names()
colnmaes_to_nest <- colnames_nested_data[-which(colnames_nested_data %in% c("grid_id"))]

df_double_arranged <- df_enriched %>% 
  head(20) %>% # take the first 20 users for example
  arrange_double_nested(., nest_cols = colnmaes_to_nest, 
                        desc(created_at)) # sort by time in descending order


# original third user
df_enriched[3, ]
# third user data points 
df_enriched$data[[3]]
# arranged third user
df_double_arranged[3, ]
# arranged time 
df_double_arranged$data[[3]]$data[[2]]

Top N

The top_n_nested() function works in the same way as top_n function in tidyverse but for nested dataframe. It allows to select the top entries in each group, ordered by wt.

df_enriched$data[[2]]

## get the top 1 row based on hour 
df_top_1 <- df_enriched %>% 
  top_n_nested(., n = 1, wt = "hour")

df_top_1$data[[2]]

Identify location(s) with embedded recipes

To use the embedded recipes to identify the home location for users, you can use identify_location() function.

Current available recipes:

# recipe: homelocator -- HMLC
identify_location(test_sample, user = "u_id", timestamp = "created_at", 
                  location = "grid_id", show_n_loc = 1, recipe = "HMLC")

# recipe: Frequency -- FREQ
identify_location(test_sample, user = "u_id", timestamp = "created_at", 
                  location = "grid_id", show_n_loc = 1, recipe = "FREQ")

# recipe: Online Social Network Activity -- OSNA
identify_location(test_sample, user = "u_id", timestamp = "created_at", 
                  location = "grid_id", show_n_loc = 1, recipe = "OSNA")

# recipe: Online Social Network Activity -- APDM
## APDM recipe strictly returns the most likely home location
## It is important to load the neighbors table before you use the recipe!!
## example: st_queen <- function(a, b = a) st_relate(a, b, pattern = "F***T****")
##          neighbors <- st_queen(df_sf) ===> convert result to dataframe 
data("df_neighbors", package = "homelocator")
identify_location(test_sample, user = "u_id", timestamp = "created_at", 
                  location = "grid_id", recipe = "APDM", keep_score = F)


spatialnetworkslab/homelocator documentation built on May 7, 2023, 8:18 p.m.