In tjmahr/fillgaze: Interpolate Missing Eyetracking Data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "fig/README-"
)

fillgaze

The goal of fillgaze is to provide helper functions for interpolating missing eyetracking data.

Installation

You can install fillgaze from github with:

# install.packages("devtools")
devtools::install_github("tjmahr/fillgaze")

Package overview

This package was created in response to a very strange file of eyetracking data.

df <- readr::read_csv("inst/test-gaze.csv")

Here is the problem with this file:

library(dplyr, warn.conflicts = FALSE)
library(ggplot2)

ggplot(head(df, 40)) + 
  aes(x = Time - min(Time)) + 
  geom_hline(yintercept = 0, size = 2, color = "white") + 
  geom_point(aes(y = GazeX, color = "GazeX")) +
  geom_point(aes(y = GazeY, color = "GazeY")) + 
  labs(x = "Time (ms)", y = "Screen location (pixels)", 
       color = "Variable")

Every second or third point is incorrectly placed offscreen, indicated by a negative pixel values for the gaze locations. It is physiologically impossible for a person's gaze to oscillate so quickly and with such magnitude (the gaze is tracked on a large screen display).

We would like to interpolate spans of missing data using neighboring points. That's the point of this package. The steps to solve the problem involve:

[x] Converting offscreen values into proper NA values.
[x] Identifying and describing gaps of missing values (streaks of successive NAs).
[x] Interpolating the values in a gap.
[x] Writing tests to confirm that everything works as expected. :innocent:

Setting values in several columns to `NA`

We need to mark offscreen points as properly missing data. set_values_to_na() takes a dataframe and named filtering predicates. Here's the basic usage.

set_values_to_na(dataframe, {col_name} = {function to determine NA values})

The values that return TRUE for each function are replaced with NA values. For example, set_values_to_na(df, var1 = ~ .x < 0) would:

look for the column var1 in the dataframe,
check which values of .x < 0 are true where .x is a placeholder/pronoun for the values in df$var1,
and replace those values where the test is TRUE with NA.

library(fillgaze)
original_df <- df

df <- df %>% 
  set_values_to_na(
    GazeX = ~ .x < -100, 
    GazeY = ~ .x < -100, 
    LEyeCoordX = ~ .x < -.1, 
    LEyeCoordY = ~ .x < -.1,
    REyeCoordX = ~ .x < -.1, 
    REyeCoordY = ~ .x < -.1)

# Before and after on some of the GazeX values
data_frame(before = head(original_df$GazeX), after = head(df$GazeX))

Now, those offscreen points will not be plotted because they are NA.

last_plot() %+% head(df, 40)

Finding gaps in the data

We can use find_gaze_gaps() to locate the gaps in a column of data. This function mostly is used internally. Users are not expectedly to routinely use this function, but I cover it here because the function for filling gaps relies on the data in this dataframe.

find_gaze_gaps(df, GazeX) %>% 
  print(width = 120)

Each row describes a gap in the column.

start_row and end_row contain row numbers of nearest non-NA values.
na_rows is the number of successive NAs in the gap.
start_valueand end_value contain the nearest non-NA values.
change_value is the difference between start_value and end_value.

The function also measure the duration of the gap (change_time). By default, it uses row numbers (.rowid) to measure duration. We can use an explicit column to use as the measure of time.

find_gaze_gaps(df, GazeX, time_var = Time) %>% 
  print(width = 120)

The function also respects dplyr grouping, so that e.g., false gaps are not found between trials.

df %>% 
  group_by(Trial) %>% 
  find_gaze_gaps(GazeX)

Interpolating values in gaps

fill_gaze_gaps() will fill in the gaps in selected columns. We can set limits on which gaps are filled:

max_na_rows: don't fill gaps with more than successive NA rows than max_na_rows
max_duration: don't fill gaps with a duration larger than max_duration
max_sd: don't fill gaps where the relative change in the variable is more than max_sd standard deviations in magnitude

df <- df %>% 
  group_by(Trial) %>% 
  fill_gaze_gaps(GazeX, time_var = Time, max_na_rows = 5)

In this example, only GazeX has been interpolated. The median value is used. We can compare the results with the GazeY column.

df %>% select(Time:GazeY)

fill_gaze_gaps() also works with the variable selection helpers from dplyr/tidyselect.

df <- df %>% 
  fill_gaze_gaps(GazeX, GazeY, matches("EyeCoord"), 
                 time_var = Time, max_na_rows = 5, max_sd = 2) %>% 
  ungroup()
df

was_offscreen <- (original_df$GazeX < -100) %>% 
  ifelse("Interpolated", "Original\nRaw Data")

last_plot() %+% 
  head(df, 40) + 
  aes(alpha = head(was_offscreen, 40), shape = head(was_offscreen, 40)) + 
  scale_alpha_discrete(name = "Point", range = c(.3, 1)) + 
  labs(shape = "Point")