```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
Identifying meaningful locations, such as home and/or work locations, from mobile technology is an essential step in the field of mobility analysis. The `homelocator` library is designed to identify the home locations of users based on the spatio-temporal features contained within mobility data, such as social media data, mobile phone data, mobile app data and so on.

Although a myriad of different approaches to inferring meaningful locations has been proposed over the last 10 years, a consensus or single best approach has not emerged in the current state-of-the-art. The actual algorithms used are not always discussed in detail in publications and the source code is seldom released publicly, which makes comparing algorithms, as well as reproducing work, difficult.

The main objective of this package is therefore to provide a consistent framework and interface for the adoption of different approaches, so that researchers are able to write structured, algorithmic 'recipes', or use the existing embedded 'recipes', to identify meaningful locations according to their research requirements. With this package, users are able to make an apples-to-apples comparison across approaches.

We hope that through packages like `homelocator`, future work that relies on the inference of meaningful locations becomes less 'custom' (with each researcher writing their own algorithm) and instead uses common, comparable algorithms in order to increase transparency and reproducibility in the field.
```r
# Load homelocator library
library(homelocator)

# Load other needed libraries
library(tidyverse)
library(here)
```
To explore the basic data manipulation functions of `homelocator`, we'll use a `test_sample` dataset.

```r
# load data
data("test_sample", package = "homelocator")

# show test sample
test_sample %>% head(3)
```
Note: if you use your own dataset, please make sure that you have converted the timestamps to the appropriate local time zone.
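If they are not, a minimal sketch of such a conversion (assuming, purely for illustration, that the timestamps were parsed as UTC and the study area uses the Asia/Singapore time zone) could rely on `lubridate::with_tz()`:

```r
library(dplyr)
library(lubridate)

# hypothetical example data: timestamps parsed as UTC, study area in Singapore
my_data <- tibble(u_id = 1L,
                  created_at = ymd_hms("2015-07-01 03:20:00", tz = "UTC"))

# with_tz() changes the displayed time zone without altering the underlying instant;
# use force_tz() instead if the clock times are already local but mislabelled as UTC
my_data <- my_data %>%
  mutate(created_at = with_tz(created_at, tzone = "Asia/Singapore"))
```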
The `validate_dataset()` function makes sure your input dataset contains all three necessary variables: user, location and timestamp. There are 4 arguments in this function:

- `user`: name of column that holds a unique identifier for each user.
- `timestamp`: name of column that holds the specific timestamp for each data point. This timestamp should be in `POSIXct` format.
- `location`: name of column that holds a unique identifier for each location.
- `keep_other_vars`: option to keep or remove any other variables of the input dataset. The default is `FALSE`.

```r
df_validated <- validate_dataset(test_sample,
                                 user = "u_id", timestamp = "created_at",
                                 location = "grid_id", keep_other_vars = FALSE)
# show result
df_validated %>% head(3)
```
The `nest_verbose()` and `unnest_verbose()` functions work in the same way as the nest and unnest functions in the tidyverse, but with some additional status information such as the elapsed running time.

- `df`: a dataframe
- `...`: for `nest_verbose()`, the selected columns to nest; for `unnest_verbose()`, the list-column to unnest.

```r
# nest data
df_nested <- nest_verbose(df_validated, c("created_at", "grid_id"))

# show result
df_nested %>% head(3)
df_nested$data[[1]] %>% head(3)

# unnest data
df_unnested <- unnest_verbose(df_nested, data)

# show result
df_unnested %>% head(3)
```
The `nest_nested()` and `unnest_double_nested()` functions work in a similar way to the `nest_verbose()` and `unnest_verbose()` functions, but they apply to an already nested data frame. In other words, they map the `nest` and `unnest` functions to each element of a list-column, creating a double-nested data frame, or vice versa. This is an essential step in many home location algorithms as they often operate 'per user, per location'.

- `df`: a nested dataframe
- `...`: for `nest_nested()`, the selected columns to nest; for `unnest_double_nested()`, the list-column to unnest.

```r
# double nest data (e.g., nesting column: created_at)
df_double_nested <- nest_nested(df_nested %>% head(100), c("created_at"))

# show result
df_double_nested %>% head(3)
df_double_nested$data[[1]] %>% head(3)
df_double_nested$data[[1]]$data[[1]] %>% head(3)

# unnest nested data
df_double_unnested <- df_double_nested %>%
  head(100) %>% ## take first 100 rows for example
  unnest_double_nested(., data)

# show result
df_double_unnested %>% head(3)
```
The `enrich_timestamp()` function creates additional variables that are derived from the timestamp column. These include the year, month, day, day of the week and hour of the day. These are often used/needed as intermediate variables in home location algorithms.

- `df`: a nested dataframe
- `timestamp`: name of column that holds the specific timestamp for each data point. This timestamp should be in `POSIXct` format.

```r
# original variables
df_nested[1, ] %>% unnest(cols = c(data)) %>% head(3)

# create new variables from "created_at" timestamp column
df_enriched <- enrich_timestamp(df_nested, timestamp = "created_at")

# show result
df_enriched[1, ] %>% unnest(cols = c(data)) %>% head(3)
```
The `summarise_nested()` function works similarly to dplyr's regular `summarise` function but operates within a nested tibble; `summarise_double_nested()` correspondingly operates within a double-nested tibble.

- `df`: a nested dataframe
- `nest_cols`: a selection of columns to nest in the existing list-column
- `...`: name-value pairs of summary functions

```r
# summarise in nested dataframe
# e.g., summarise total number of tweets and total number of places per user
df_summarize_nested <- summarise_nested(df_enriched,
                                        n_tweets = n(),
                                        n_locs = n_distinct(grid_id))
# show result
df_summarize_nested %>% head(3)

# summarise in double-nested dataframe (take first 100 users for example)
# e.g., summarise total number of tweets and total number of distinct days
df_summarize_double_nested <- summarise_double_nested(df_enriched %>% head(100),
                                                      nest_cols = c("created_at", "ymd", "year", "month", "day", "wday", "hour"),
                                                      n_tweets = n(),
                                                      n_days = n_distinct(ymd))
# show result
df_summarize_double_nested[1, ]
df_summarize_double_nested[1, ]$data[[1]] %>% head(3)
```
The `remove_top_users()` function allows you to remove the top N percent of active users based on the total number of data points per user. Although the majority of users are real people, some accounts are run by algorithms or 'bots', whereas others can be considered spam accounts. Removing a certain top N percent of active users is an oft-used approach to remove such accounts and reduce the number of such users in the final dataset.

- `df`: a dataframe with columns for the user id and data point counts
- `user`: name of column that holds a unique identifier for each user
- `counts`: name of column that holds the data point frequency for each user
- `topNpct_user`: a decimal number that represents the percentage of users to remove

```r
# remove top 1% of active users (e.g., based on the frequency of tweets sent by users)
df_removed_active_users <- remove_top_users(df_summarize_nested, user = "u_id",
                                            counts = "n_tweets", topNpct_user = 1)
# show result
df_removed_active_users %>% head(3)
```
The `filter_verbose()` function works in the same way as the filter function in the tidyverse, but with additional information about the number of (filtered) users remaining in the dataset. The `filter_nested()` function works in a similar way to `filter_verbose()` but is applied within a nested tibble.

- `df`: a dataframe with columns for the user id and the variables that you are going to apply the filter to. If a column is not in the dataset, you need to add it before you apply the filter.
- `user`: name of column that holds a unique identifier for each user
- `...`: logical predicates defined in terms of the variables in `df`. Only rows matching the conditions are kept.

```r
# keep users with more than 10 tweets sent at more than 10 places
df_filtered <- filter_verbose(df_removed_active_users, user = "u_id",
                              n_tweets > 10 & n_locs > 10)
# show result
df_filtered %>% head(3)

# remove tweets sent on weekends or during daytime (8am to 6pm)
df_filter_nested <- filter_nested(df_filtered, user = "u_id",
                                  !wday %in% c(1, 7), # 1 means Sunday and 7 means Saturday
                                  !hour %in% seq(8, 18, 1))
# show result
df_filter_nested %>% head(3)
df_filter_nested$data[[1]] %>% head(3)
```
The `mutate_verbose()` function works in the same way as the mutate function in the tidyverse, but with additional information about the elapsed running time.

- `df`: a dataframe
- `...`: name-value pairs of expressions

The `mutate_nested()` function works in a similar way to `mutate_verbose()` but it adds new variables inside a nested tibble.

- `df`: a nested dataframe
- `...`: name-value pairs of expressions

```r
# let's use the previously discussed functions to filter some users first
colnames_data <- df_filtered$data[[1]] %>% names()
colnames_to_nest <- colnames_data[-which(colnames_data == "grid_id")]

df_cleaned <- df_filtered %>%
  summarise_double_nested(., nest_cols = colnames_to_nest,
                          n_tweets_loc = n(), # number of tweets sent at each location
                          n_hrs_loc = n_distinct(hour), # number of unique hours of sent tweets
                          n_days_loc = n_distinct(ymd), # number of unique days of sent tweets
                          period_loc = as.numeric(max(created_at) - min(created_at), "days")) %>% # period of tweeting
  unnest_verbose(data) %>%
  filter_verbose(., user = "u_id",
                 n_tweets_loc > 10 & n_hrs_loc > 10 & n_days_loc > 10 & period_loc > 10)

# show cleaned dataset
df_cleaned %>% head(3)

# then apply the mutate_nested function to add four new variables:
# wd_or_wk, time_numeric, rest_or_work, wk.am_or_wk.pm
df_expanded <- df_cleaned %>%
  mutate_nested(wd_or_wk = if_else(wday %in% c(1, 7), "weekend", "weekday"),
                time_numeric = lubridate::hour(created_at) + lubridate::minute(created_at) / 60 + lubridate::second(created_at) / 3600,
                rest_or_work = if_else(time_numeric >= 9 & time_numeric <= 18, "work", "rest"),
                wk.am_or_wk.pm = if_else(time_numeric >= 6 & time_numeric <= 12 & wd_or_wk == "weekend", "weekend_am", "weekend_pm"))

# show result
df_expanded %>% head(3)
```
The `prop_factor_nested()` function allows you to calculate the proportion of each category of a variable inside the list-column and converts each category into a new variable added to the dataframe. For example, inside the list-column, the variable `wd_or_wk` has two categories named weekend and weekday; when you call the `prop_factor_nested()` function, it calculates the proportion of weekend and weekday separately, and the results are then expanded into two new columns called `weekday` and `weekend` added to the dataframe.

- `df`: a nested dataframe
- `var`: name of column to calculate inside the list-column

```r
# categories for a variable inside the list-column: e.g., weekend or weekday
df_expanded$data[[1]] %>% head(3)

# calculate proportion of categories for three variables: wd_or_wk, rest_or_work, wk.am_or_wk.pm
df_expanded <- df_expanded %>%
  prop_factor_nested(wd_or_wk, rest_or_work, wk.am_or_wk.pm)

# show result
df_expanded %>% head(3)
```
The `score_nested()` function allows you to give a weighted value to one or more variables in a nested dataframe.

- `df`: a nested dataframe by user
- `user`: name of column that holds a unique identifier for each user
- `location`: name of column that holds a unique identifier for each location
- `keep_original_vars`: option to keep or drop the original variables
- `...`: name-value pairs of expressions

The `score_summary()` function summarises all scored columns and returns one single summary score per row.

- `df`: a dataframe
- `user`: name of column that holds a unique identifier for each user
- `location`: name of column that holds a unique identifier for each location
- `...`: names of scored columns

```r
# let's add two more variables before we do the scoring
df_expanded <- df_expanded %>%
  summarise_nested(n_wdays_loc = n_distinct(wday),
                   n_months_loc = n_distinct(month))
df_expanded %>% head(3)

# when calculating scores, you can give weight to different variables,
# but the total weight should add up to 1
df_scored <- df_expanded %>%
  score_nested(., user = "u_id", location = "grid_id", keep_original_vars = F,
               s_n_tweets_loc = 0.1 * (n_tweets_loc / max(n_tweets_loc)),
               s_n_hrs_loc = 0.1 * (n_hrs_loc / 24),
               s_n_days_loc = 0.1 * (n_days_loc / max(n_days_loc)),
               s_period_loc = 0.1 * (period_loc / max(period_loc)),
               s_n_wdays_loc = 0.1 * (n_wdays_loc / 7),
               s_n_months_loc = 0.1 * (n_months_loc / 12),
               s_weekend = 0.1 * (weekend),
               s_rest = 0.2 * (rest),
               s_wk_am = 0.1 * (weekend_am))
df_scored %>% head(3)
df_scored$data[[1]]

## we can replace the score function with the mutate function
df_scored_2 <- df_expanded %>%
  nest_verbose(-u_id) %>%
  mutate_nested(s_n_tweets_loc = 0.1 * (n_tweets_loc / max(n_tweets_loc)),
                s_n_hrs_loc = 0.1 * (n_hrs_loc / 24),
                s_n_days_loc = 0.1 * (n_days_loc / max(n_days_loc)),
                s_period_loc = 0.1 * (period_loc / max(period_loc)),
                s_n_wdays_loc = 0.1 * (n_wdays_loc / 7),
                s_n_months_loc = 0.1 * (n_months_loc / 12),
                s_weekend = 0.1 * (weekend),
                s_rest = 0.2 * (rest),
                s_wk_am = 0.1 * (weekend_am))
df_scored_2 %>% head(3)
df_scored_2$data[[1]]

# sum the variable scores for each location
df_score_summed <- df_scored %>%
  score_summary(., user = "u_id", location = "grid_id", starts_with("s_"))
df_score_summed %>% head(3)
df_score_summed$data[[1]]
```
The `extract_location()` function allows you to sort the locations of each user based on the descending value of a selected column, and returns the top N location(s) for each user. The extracted location(s) then represent the most likely home location of the user.

- `df`: a nested dataframe by user
- `user`: name of column that holds a unique identifier for each user
- `location`: name of column that holds a unique identifier for each location
- `show_n_loc`: a single number that decides the number of locations to be extracted
- `keep_score`: option to keep or remove the columns with scored values
- `...`: name of column(s) used to sort the locations

```r
# extract homes for users based on score value (each user returns 1 most probable home)
df_home <- df_score_summed %>%
  extract_location(., user = "u_id", location = "grid_id",
                   show_n_loc = 1, keep_score = F, desc(score))
df_home %>% head(3)

# extract homes for users and keep the scores of locations
df_home <- df_score_summed %>%
  extract_location(., user = "u_id", location = "grid_id",
                   show_n_loc = 1, keep_score = T, desc(score))
df_home %>% head(3)
df_home$data[[3]]
```
The `spread_nested()` function works in the same way as the spread function in the tidyverse, but works inside a nested data frame. It allows you to spread key-value pairs across multiple columns.

- `df`: a nested dataframe
- `key_var`: column name or position of the key
- `value_var`: column name or position of the value

```r
# let's add one timeframe variable first and calculate the number of data points in each timeframe
df_timeframe <- df_enriched %>%
  mutate_nested(timeframe = if_else(hour >= 2 & hour < 8, "Rest",
                                    if_else(hour >= 8 & hour < 19, "Active", "Leisure")))

colnames_nested_data <- df_timeframe$data[[1]] %>% names()
colnames_to_nest <- colnames_nested_data[-which(colnames_nested_data %in% c("grid_id", "timeframe"))]

df_timeframe <- df_timeframe %>%
  head(20) %>% # take first 20 users as example
  summarise_double_nested(., nest_cols = colnames_to_nest,
                          n_points_timeframe = n())
df_timeframe$data[[2]]

# spread timeframe in the nested dataframe, with key = timeframe and value = n_points_timeframe
df_timeframe_spreaded <- df_timeframe %>%
  spread_nested(., key_var = "timeframe", value_var = "n_points_timeframe")
df_timeframe_spreaded$data[[2]]
```
The `arrange_nested()` function works in the same way as the arrange function in the tidyverse, but works inside a nested data frame. It allows you to sort rows by variables.

- `df`: a nested dataframe
- `...`: comma-separated list of unquoted variable names

```r
df_enriched$data[[3]]

# arrange the hour in descending order
df_arranged <- df_enriched %>%
  arrange_nested(desc(hour))
df_arranged$data[[3]]
```
The `arrange_double_nested()` function works in a similar way to the `arrange_nested()` function, but it applies to a double-nested data frame. You need to choose the columns to nest inside the nested dataframe and then sort the rows by the selected column.

- `df`: a nested dataframe
- `nest_cols`: name of columns to nest in existing list-column
- `...`: comma-separated list of unquoted variable names

```r
# get the names of columns to nest
colnames_nested_data <- df_enriched$data[[1]] %>% names()
colnames_to_nest <- colnames_nested_data[-which(colnames_nested_data %in% c("grid_id"))]

# sort by time in descending order
df_double_arranged <- df_enriched %>%
  head(20) %>% # take the first 20 users for example
  arrange_double_nested(., nest_cols = colnames_to_nest, desc(created_at))

# original third user
df_enriched[3, ]
# third user data points
df_enriched$data[[3]]
# arranged third user
df_double_arranged[3, ]
# arranged time
df_double_arranged$data[[3]]$data[[2]]
```
The `top_n_nested()` function works in the same way as the top_n function in the tidyverse, but for a nested dataframe. It allows you to select the top entries in each group, ordered by `wt`.

```r
df_enriched$data[[2]]

# get the top 1 row based on hour
df_top_1 <- df_enriched %>%
  top_n_nested(., n = 1, wt = "hour")
df_top_1$data[[2]]
```
To use the embedded recipes to identify the home locations of users, you can use the `identify_location()` function.

- `df`: a dataframe with columns for the user id, location and timestamp
- `user`: name of column that holds a unique identifier for each user
- `timestamp`: name of the timestamp column. Should be `POSIXct`
- `location`: name of column that holds a unique identifier for each location
- `recipe`: embedded algorithm used to identify the most probable home locations for users
- `show_n_loc`: number of potential homes to extract
- `keep_score`: option to keep or remove the calculated result/score per user per location

Current available recipes:

- `recipe_HMLC`
- `recipe_FREQ`
- `recipe_OSNA`: Efstathiades et al. 2015
- `recipe_APDM`: Ahas et al. 2010

```r
# recipe: homelocator -- HMLC
identify_location(test_sample, user = "u_id", timestamp = "created_at",
                  location = "grid_id", show_n_loc = 1, recipe = "HMLC")

# recipe: Frequency -- FREQ
identify_location(test_sample, user = "u_id", timestamp = "created_at",
                  location = "grid_id", show_n_loc = 1, recipe = "FREQ")

# recipe: Online Social Network Activity -- OSNA
identify_location(test_sample, user = "u_id", timestamp = "created_at",
                  location = "grid_id", show_n_loc = 1, recipe = "OSNA")

# recipe: APDM (Ahas et al. 2010)
## APDM recipe strictly returns the most likely home location
## It is important to load the neighbors table before you use the recipe!!
## example: st_queen <- function(a, b = a) st_relate(a, b, pattern = "F***T****")
## neighbors <- st_queen(df_sf) ===> convert result to dataframe
data("df_neighbors", package = "homelocator")
identify_location(test_sample, user = "u_id", timestamp = "created_at",
                  location = "grid_id", recipe = "APDM", keep_score = F)
```
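The following is a rough sketch of how such a neighbours table might be built with the `sf` package. It assumes (hypothetically) that `df_sf` is an sf object of grid-cell polygons with a `grid_id` column; the exact column names expected by the `APDM` recipe are not shown here, so compare the result against `data("df_neighbors", package = "homelocator")` before use.

```r
library(sf)
library(tibble)

# hypothetical input: df_sf is an sf object of grid polygons with a grid_id column
st_queen <- function(a, b = a) st_relate(a, b, pattern = "F***T****")

# st_queen() returns, for each grid cell, the indices of its queen-contiguity neighbours
nb <- st_queen(df_sf)

# flatten that list into a long dataframe of (grid cell, neighbouring grid cell) pairs;
# the column names here are illustrative only
df_neighbors_sketch <- tibble(
  grid_id  = rep(df_sf$grid_id, lengths(nb)),
  neighbor = df_sf$grid_id[unlist(nb)]
)
```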