Before using this package a number of steps are required: First, your eye gaze data must have been collected using an SR Research Eyelink eye tracker. Second, your data must have been exported using SR Research Data Viewer software. For this basic example, it is assumed that you have specified an interest period relative to the onset of the critical stimulus in Data Viewer. However, this package is also able to preprocess data without a specified relative interest period. If you have not aligned your data to a particular message in Data Viewer, please refer to the Message Alignment vignette for functions related to this.
The Sample Report should be exported along with all available columns (this will ensure that you have all of the necessary columns for the functions contained in this package to work). Additionally, it is preferable to export to a .txt file rather than a .xlsx file.
The following preprocessing assumes that, in your experiment, interest area IDs and Labels were assigned consistently to the object types displayed on the screen. For example, in a typical VWP experiment, the target was always in interest area 1, the competitor was always in interest area 2, et cetera. This is typically done by dynamically moving the interest areas trial-by-trial to correspond with the position of the objects. If, instead, your interest areas were static and you have columns indicating the location of each object for each trial, you will need to reassign your interest areas. Specific functions for this are available in this package; please see the Interest Areas vignette for illustration. Once that is complete, you can follow the preprocessing procedure below. Note that the functions presented here are capable of handling data with a maximum of 8 interest areas. If you have more than 8 interest areas, it is necessary to adjust the source code to accommodate the number needed (please contact the package maintainer for an example).
Lastly, the functions included here internally make use of dplyr for manipulating and restructuring data.
For more information about dplyr, please refer to its reference manual and extensive collection of vignettes.
knitr::opts_chunk$set(fig.width=6, fig.height=4, warning=FALSE)
First, load the sample report. By default, Data Viewer will assign "." to missing values; therefore it is important to include this in the na.strings parameter, so R will know how to handle any missing data.
library(VWPre)
VWdat <- read.table("1000HzData.txt", header = T, sep = "\t", na.strings = c(".", "NA"))
However, for the purposes of this vignette we will use the sample dataset included in the package.
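To see why the na.strings setting matters when loading a Sample Report, the following minimal sketch reads a small tab-delimited text twice, once without and once with "." declared as a missing value. The toy text and its values are made up for illustration; only base R is used.

```r
# Toy tab-delimited text standing in for a Sample Report (hypothetical values)
txt <- "RECORDING_SESSION_LABEL\tTRIAL_INDEX\tRIGHT_INTEREST_AREA_ID\nS01\t1\t.\nS01\t1\t2\nS01\t2\t.\n"

# Without na.strings, the "." entries force the column to be read as character
raw <- read.table(text = txt, header = TRUE, sep = "\t",
                  stringsAsFactors = FALSE)

# With na.strings, "." becomes NA and the column is read as integer
clean <- read.table(text = txt, header = TRUE, sep = "\t",
                    na.strings = c(".", "NA"), stringsAsFactors = FALSE)

class(raw$RIGHT_INTEREST_AREA_ID)
class(clean$RIGHT_INTEREST_AREA_ID)
sum(is.na(clean$RIGHT_INTEREST_AREA_ID))
```

Comparing the two classes shows how a forgotten na.strings argument can silently turn numeric columns into text.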
In order for the functions in the package to work appropriately, the data need to be in a specific format.
The prep_data function checks for the presence and class of specific columns (including TRIAL_INDEX) to ensure that they exist in the data and are appropriately assigned (e.g., categorical variables are encoded as factors).
It also checks for further columns, including LEFT_GAZE_Y, which are not required for basic preprocessing but are needed by some of the other functions in the package.
The Subject parameter is used to specify the column corresponding to the subject identifier.
Typical Data Viewer output contains a column called RECORDING_SESSION_LABEL, which contains the subject identifier.
The function will rename it Subject and will ensure it is encoded as a factor.
If your data contain a column corresponding to an item identifier, please specify it in the Item parameter.
In doing so, the function will standardize the name of the column to Item and will ensure it is encoded as a factor. If you don't have an item identifier column, you can omit this parameter; its default value is NA.
Lastly, a new column called Event will be created, which indexes each unique recording sequence and corresponds to the combination of Subject and TRIAL_INDEX. This Event variable is required internally for subsequent operations. Should you choose to define the Event variable differently, you can override the default; however, do so cautiously, because Event must uniquely index each time sequence in the data and a poorly chosen definition may affect subsequent operations.
Upon completion, the function prints a summary indicating the results.
dat0 <- prep_data(data = VWdat, Subject = "RECORDING_SESSION_LABEL", Item = "itemid")
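As a rough illustration of what an event index of this kind looks like, the sketch below builds one from a subject column and a trial column using base R. It is not the package's internal code; the toy data frame and the use of interaction() are assumptions made for the example.

```r
# Toy data: two subjects, two trials each, two samples per trial (hypothetical)
toy <- data.frame(
  Subject     = factor(rep(c("S01", "S02"), each = 4)),
  TRIAL_INDEX = rep(rep(1:2, each = 2), times = 2)
)

# One factor level per Subject-by-trial combination;
# drop = TRUE discards combinations that never occur
toy$Event <- interaction(toy$Subject, toy$TRIAL_INDEX, drop = TRUE)

nlevels(toy$Event)  # number of unique recording sequences
table(toy$Event)    # number of samples in each event
```

Whatever definition is used, the key property is the one shown here: every recording sequence gets exactly one level, and no level spans two sequences.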
At this point, it is safe to remove the columns which were output by Data Viewer but are not needed for preprocessing with this package.
Removing these will reduce the amount of system memory consumed and result in a final dataset that consumes less disk space.
This is done straightforwardly using the function rm_extra_DVcols.
By default, it will remove all the Data Viewer columns that are not needed for preprocessing (if they are present in the data).
However, if desired, it is possible to keep specific columns from this set using the Keep parameter, which accommodates a string or character vector.
If using the sample data set included in this package, it is not necessary to do this step, as these columns have already been removed.
dat0 <- rm_extra_DVcols(dat0, Keep = c("RIGHT_PUPIL_SIZE", "LEFT_PUPIL_SIZE"))
When the data were loaded, samples that were outside of any interest area were labeled as NA.
The relabel_na function examines the interest area ID and LABEL columns for each eye (e.g., RIGHT_INTEREST_AREA_LABEL) for cells containing NAs.
It then assigns 0 to the ID columns and "Outside" to the LABEL columns to indicate those eye gaze samples which fell outside of the interest areas defined in the study.
The number of interest areas you defined in your experiment should be supplied to the parameter NoIA.
dat1 <- relabel_na(data = dat0, NoIA = 4)
Notice that the output informs us that the number of levels of LEFT_INTEREST_AREA_LABEL does not match the number of interest areas given in NoIA.
This is because we only have data from the right eye (hence, all samples in LEFT_INTEREST_AREA_LABEL are listed as "Outside").
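The relabelling described above can be pictured with a small base R sketch. This is only an illustration of the idea on toy data, not the implementation used by relabel_na; the label values are made up.

```r
# Toy data with NAs where the gaze fell outside every interest area
toy <- data.frame(
  RIGHT_INTEREST_AREA_ID    = c(1, NA, 2, NA),
  RIGHT_INTEREST_AREA_LABEL = factor(c("Target", NA, "Competitor", NA))
)

# ID column: NA becomes 0
toy$RIGHT_INTEREST_AREA_ID[is.na(toy$RIGHT_INTEREST_AREA_ID)] <- 0

# LABEL column: convert to character, fill in "Outside", re-encode as a factor
lab <- as.character(toy$RIGHT_INTEREST_AREA_LABEL)
lab[is.na(lab)] <- "Outside"
toy$RIGHT_INTEREST_AREA_LABEL <- factor(lab)

levels(toy$RIGHT_INTEREST_AREA_LABEL)
```

Going through a character vector avoids the warning R gives when a value that is not an existing level is assigned into a factor.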
The subsequent preprocessing requires that the interest area IDs are numerically coded, with values ranging from 0 (i.e., outside all interest areas) up to a maximum of 8.
So, it's important to check that the IDs present in the data set conform to this. The
check_ia function does just this and indicates how those IDs are mapped to the interest area labels.
check_ia(data = dat1)
If your interest area IDs do not conform to the required coding, or you would like to create new labels for your existing interest areas, please consult the Interest Areas vignette. That vignette illustrates how to relabel existing interest area codings (as well as remap the gaze data to entirely new interest areas, should you so desire).
The function create_time_series creates a time series (a new column called Time), which is required for subsequent processing, plotting, and modeling of the data.
It is common to export a period of time prior to the onset of the stimulus as a baseline. In this case, an adjustment (equal to the duration of the baseline period) must be applied to the time series, specified in the Adjust parameter.
In effect, the adjustment simply subtracts the given value from each time point.
So, a positive value will shift the zero point forward (making the initial zero a negative time value), while a negative value will shift the zero point backward (making the initial zero a positive time value).
An example illustrating this can be found in the Message Alignment vignette.
In the example below, the data were exported with a 100ms pre-stimulus interval.
dat2 <- create_time_series(data = dat1, Adjust = 100)
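The effect of the adjustment can be sketched in a few lines of base R. The example assumes a timestamp column and an event index on toy data and uses a 2 ms baseline for brevity; it illustrates the arithmetic only and is not the package's own code.

```r
# Toy data: two events, four samples each, recorded at 1000 Hz (hypothetical)
toy <- data.frame(
  Event     = rep(c("S01.1", "S01.2"), each = 4),
  TIMESTAMP = c(5000:5003, 9000:9003)
)
Adjust <- 2  # hypothetical 2 ms baseline for this toy example

# Time within each event: offset from the event's first timestamp, minus Adjust
first_stamp <- ave(toy$TIMESTAMP, toy$Event, FUN = min)
toy$Time <- toy$TIMESTAMP - first_stamp - Adjust

toy$Time
```

With a positive adjustment, the first sample of each event gets a negative time and the zero point lands where the baseline ends.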
Note that if you have used the align_msg function (illustrated in the Message Alignment vignette), you may need to specify a column name in Adjust.
That column can be used to apply the adjustment specific to each recording event.
Consult that vignette for further details.
check_time_series can be used to verify the time series.
It outputs the unique start times present in the data.
These will be the same standardized time point relative to the stimulus if you have exported your data from Data Viewer with a pre-defined interest period relative to a message.
By specifying the parameter ReturnData = T, the function can return a summary data frame that can be used to inspect the start time of each event.
As you can see below, by providing Adjust with a positive value, we have effectively shifted the zero point forward along the number line, causing the first sample to have a negative time value.
check_time_series(data = dat2)
Another way to check that your time series has been created correctly is to use the check_msg_time function.
By providing the appropriate message text, we can see that the onset of our target now occurs at Time = 0.
Note that the Msg parameter can handle exact matches or matches based on regular expressions.
As with check_time_series, the parameter ReturnData = T will return a summary data frame that can be used to inspect the message time of each event.
check_msg_time(data = dat2, Msg = "TargetOnset")
If you do not remember the messages in your data, you can output all existing messages and their corresponding timestamps using check_all_msgs.
Optionally, the output of the function can be saved by specifying the parameter ReturnData = T.
check_all_msgs(data = dat2)
Depending on the design of the study, right, left, or both eyes may have been recorded during the experiment.
Data Viewer outputs gaze data by placing it in separate columns for each eye.
However, it is preferable to have gaze data in a single set of columns, regardless of which eye was recorded during the experiment.
The function select_recorded_eye provides the functionality for this purpose, returning three new columns.
The function select_recorded_eye requires that the parameter Recording be specified.
This parameter tells the function which eye(s) were recorded.
It takes one of four possible strings: "LandR", "LorR", "L", or "R".
"LandR" should be used when any participant had both eyes recorded.
"LorR" should be used when some participants had their left eye recorded and others had their right eye recorded.
"L" should be used when all participants had their left eye recorded.
"R" should be used when all participants had their right eye recorded.
If in doubt, use the function check_eye_recording, which will do a quick check to see if LEFT_INTEREST_AREA_ID and RIGHT_INTEREST_AREA_ID contain data. It will then suggest the appropriate Recording parameter setting.
When in complete doubt, use "LandR".
The "LandR" setting requires an additional parameter (WhenLandR) to be specified.
This instructs the function to select either the right eye or the left eye when data exist for both.
check_eye_recording(data = dat2)
After executing, the function prints a summary of the output.
While the function check_eye_recording indicated that the parameter Recording should be set to "R", the example below sets the parameter to "LandR", which can act as a "catch-all".
Consequently, in the summary, it can be seen that there were only recordings in the right eye.
dat3 <- select_recorded_eye(data = dat2, Recording = "LandR", WhenLandR = "Right")
Prior to binning the data, some researchers might prefer to remove trials with excessive trackloss.
Data Viewer does not provide a specific column for trackloss, but it can be determined from a combination of information, namely the column In_Blink and/or the X and Y gaze coordinates.
The function mark_trackloss uses this information to determine the status of a given sample.
The parameter Type can be set to "Blink", "OffScreen", or "Both".
When set to "OffScreen" or "Both", ScreenSize must be supplied as a numeric vector of the X and Y dimensions of the computer screen used during the experiment.
dat3 <- mark_trackloss(dat3, Type = "Both", ScreenSize = c(1920, 1080))
Once the samples corresponding to trackloss have been identified, events with less than the required amount of quality data can be removed from the data set, using the function rm_trackloss_events.
The parameter RequiredData represents the percentage of (non-trackloss) data required in order to retain the event.
In the example below, each event must contain 75% quality data; in other words, no more than 25% trackloss.
dat3 <- rm_trackloss_events(dat3, RequiredData = 75)
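The filtering logic can be sketched with base R on toy data. The column name Trackloss and the use of "greater than or equal to" at the threshold are assumptions for this illustration; the package's own column names and boundary handling may differ.

```r
# Toy data: a logical flag marking trackloss samples (hypothetical column name)
toy <- data.frame(
  Event     = rep(c("E1", "E2"), each = 4),
  Trackloss = c(FALSE, FALSE, FALSE, TRUE,   # E1: 75% usable samples
                TRUE,  TRUE,  FALSE, FALSE)  # E2: 50% usable samples
)
RequiredData <- 75

# Percentage of non-trackloss samples in each event
pct_good <- tapply(!toy$Trackloss, toy$Event, mean) * 100

# Keep only events that meet the threshold (assumed inclusive here)
keep <- names(pct_good)[pct_good >= RequiredData]
kept <- toy[toy$Event %in% keep, ]

unique(kept$Event)
```

Note that whole events are dropped, not individual samples, so each retained event still has a complete time series.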
In order to obtain proportion looks, it is necessary to bin the data.
That is, group samples in chunks of time, count the number of samples in each of the interest areas, and calculate the proportions based on the counts.
The sampling rate at which the eye gaze data were recorded must be provided.
For Eyelink trackers, this is typically 250Hz, 500Hz, or 1000Hz.
If in doubt, use the function check_samplingrate to determine it.
The sampling rate can then be supplied to the function bin_prop.
Note that the check_samplingrate function returns a printed message indicating the sampling rate(s) present in the data.
Optionally, it can return a new column called SamplingRate by specifying the parameter ReturnData as TRUE. In the event that data were collected at different sampling rates, this column can be used to subset the dataset by the sampling rate before proceeding to the next processing step.
The function bin_prop calculates the proportion of looks (samples) to each interest area in a particular span of time (bin size).
In order to do this, it is necessary to supply the parameters NoIA, BinSize, and SamplingRate.
BinSize should be specified in milliseconds, representing the chunk of time within which to calculate the proportions.
Not all bin sizes work for all sampling rates, due to downsampling constraints.
If unsure which are appropriate for your current sampling rate, use the ds_options function.
When provided with the current sampling rate in SamplingRate (see above), the function will return a printed summary of the bin size options and their corresponding downsampled rate. By default, this returns the whole number downsampling rates users are likely to want; however, it can also return all possible (valid) downsampling rates, even if they are not round numbers.
ds_options(SamplingRate = 1000)
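The relationship between sampling rate, bin size, and downsampled rate is simple arithmetic, sketched below in base R. The list of candidate bin sizes is arbitrary, and treating a bin size as usable only when it holds a whole number of samples is an assumption made for this illustration, not a statement of the exact rule ds_options applies.

```r
SamplingRate  <- 1000                 # original sampling rate in Hz
ms_per_sample <- 1000 / SamplingRate  # time between samples in ms

# Candidate bin sizes in ms (an arbitrary list chosen for illustration)
BinSize <- c(2, 4, 5, 8, 10, 20, 25, 40, 50, 100)

opts <- data.frame(
  BinSize       = BinSize,
  SamplesPerBin = BinSize / ms_per_sample,
  DownsampledHz = 1000 / BinSize
)
opts

# At 250 Hz (one sample every 4 ms), only multiples of 4 ms hold whole samples
valid_250 <- BinSize %% (1000 / 250) == 0
BinSize[valid_250]
```

For the 1000 Hz data used in this vignette, a 20 ms bin therefore contains 20 samples and yields one data point every 20 ms.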
The SamplingRate parameter in bin_prop should be specified in Hertz (see check_samplingrate), representing the original sampling rate of the data, and the BinSize should be specified in milliseconds (see ds_options), representing the span of time over which to calculate the proportion.
The bin_prop function returns new columns for each interest area ID, with names beginning with IA_ and ending in an extension.
The extension '_C' indicates the count of samples in the bin and the extension '_P' indicates the proportion.
dat4 <- bin_prop(dat3, NoIA = 4, BinSize = 20, SamplingRate = 1000)
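The binning step itself can be pictured with a short base R sketch: assign each sample to a bin, count samples per interest area, and divide by the bin total. The toy data and column names are assumptions for illustration, and the sketch handles a single event only; it is not the implementation used by bin_prop.

```r
# Toy data: eight samples at 1000 Hz (one per ms) with an interest area ID each
toy <- data.frame(
  Time  = 0:7,
  IA_ID = c(1, 1, 0, 2, 2, 2, 1, 0)
)
BinSize <- 4  # ms

# Assign each sample to the bin that starts at or before it
toy$Bin <- floor(toy$Time / BinSize) * BinSize

# Count samples per interest area within each bin, then convert to proportions
counts <- table(Bin = toy$Bin, IA = toy$IA_ID)
props  <- prop.table(counts, margin = 1)

counts
props
```

Each row of the proportion table sums to 1, and the data now contain one row per bin instead of one row per sample.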
In performing the calculation, the function effectively downsamples the data.
To check this and to know the new sampling rate, simply call the function check_samplingrate again.
Proportions are inherently bound between 0 and 1 and are therefore not suitable for many types of analysis. Logits provide a transformation resulting in an unbounded measure (as well as weights which estimate the variance). The calculations contained in this package are based on: Barr, D. J., (2008) Analyzing 'visual world' eyetracking data using multilevel logistic regression, Journal of Memory and Language, 59(4), 457--474. However, they have been modified to allow greater flexibility.
When using an empirical logit transformation it is important to keep two things in mind. The first is the number of observations (or samples) on which to base the calculation. Typically, this is the number of samples per bin, which varies depending on your original sampling rate and bin size.
The package provides a function for determining the number of samples per bin present in the data.
However, a user may choose to define a different number of observations (because the number of samples is inherently linked to the sampling rate).
Note, though, that changing this value can drastically impact the results of the transformation and weight calculations.
There are some safeguards within the transformation function to prevent users from choosing inadvisable values (though these safeguards can be overridden with a dedicated parameter).
So, if in doubt, it is safest to use the number of samples per bin actually present in your data.
The second thing to keep in mind is the constant to be added in the transformation.
Note that by default the calculation uses a constant of 0.5; however, the user can specify a different value to be used.
If you are interested in visualizing the effect of both number of observations and constant on the result of the empirical logit transformation and weight calculations, please refer to the Plotting vignette, which illustrates and discusses a function for this purpose.
The function transform_to_elogit transforms the proportions to empirical logits and also calculates a weight for each value.
The weight estimates the variance in each bin (because the variance of the logit depends on the mean).
This is particularly important for regression analyses and should be specified in the model call as the inverse of the weight (i.e., 1 divided by the corresponding weight column).
As mentioned above, the function takes the number of observations in the parameter ObsPerBin. Here we use the number of samples per bin present in the data.
dat5 <- transform_to_elogit(dat4, NoIA = 4, ObsPerBin = 20)
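For orientation, the sketch below computes an empirical logit and its weight in the standard form given by Barr (2008), using a count derived from the proportion and the number of observations per bin. Since the package states that its calculations have been modified for flexibility, treat this as an approximation of the idea and not as the exact code behind transform_to_elogit; the variable names are illustrative.

```r
prop      <- c(0, 0.25, 0.5, 1)  # proportion of looks to one interest area
ObsPerBin <- 20                  # number of observations per bin
Constant  <- 0.5                 # constant added to avoid log(0) and division by 0

# Number of samples on the interest area in each bin
y <- prop * ObsPerBin

# Empirical logit and its weight (variance estimate)
elogit <- log((y + Constant) / (ObsPerBin - y + Constant))
wts    <- 1 / (y + Constant) + 1 / (ObsPerBin - y + Constant)

round(elogit, 3)
round(wts, 3)
```

The constant keeps the result finite at proportions of exactly 0 and 1, and the weight is largest at those extremes, which is why its inverse is passed to the regression.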
Some researchers may prefer to perform a binomial analysis.
Therefore, the function create_binomial uses (previously calculated) proportions and number of observations to create a success/failure column for each IA.
This column is then a suitable response variable for logistic regression of the time series.
As with the empirical logit transformation, a user may choose to define a number of observations that is different from the number of samples per bin.
Because this can create artifacts in the scaling or more samples than are present in the data, safeguards are in place to prevent users from choosing inadvisable values (though these safeguards can be overridden with a dedicated parameter).
dat5a <- create_binomial(data = dat4, NoIA = 4, ObsPerBin = 20)
rm(dat4, dat5a)
gc()
By default the function will create a success/failure column for each IA in the data; however, it is also possible to create a custom column comparing looks between two specific interest areas.
This is done by specifying the parameter
CustomBinom with a vector of two integers (e.g., CustomBinom = c(1,2)) in which the two integers correspond to the IDs of the desired interest areas.
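A success/failure response of this kind can be sketched in base R as a two-column matrix of counts, which glm accepts directly as a binomial response. The rounding step and variable names are assumptions for the illustration and do not necessarily match what create_binomial does internally.

```r
prop      <- c(0, 0.25, 0.5, 1)  # proportion of looks to one interest area
ObsPerBin <- 20                  # number of observations per bin

# Convert proportions back to counts of looks (successes) and non-looks (failures)
successes <- round(prop * ObsPerBin)
failures  <- ObsPerBin - successes
resp      <- cbind(successes, failures)

# Intercept-only logistic regression on the success/failure counts
fit <- glm(resp ~ 1, family = binomial)
plogis(coef(fit))  # back-transformed overall proportion of looks
```

A custom comparison between two interest areas follows the same pattern, with the counts for one area as successes and the counts for the other as failures.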
For advanced users who have worked with the package functions before and who are familiar with the required steps and output, there is a meta-function, called
fasttrack, which runs through the previous functions and outputs a dataframe with either empirical logits or binomial data.
Note that using this function will still require the user to manually remove unneeded columns (see above).
This meta-function takes as parameters all the required arguments to the component functions.
Also, this function assumes that dynamic interest areas were used and do not need to be relabelled/reassigned.
It also assumes an interest period was defined in Data Viewer relative to the critical stimulus, thus not requiring separate message alignment.
Again, this is only recommended for users who have previously worked with visual world data and the functions contained in this package, and who are confident that their data meet the requirements/assumptions of this meta-function.
dat5b <- fasttrack(data = VWdat, Subject = "RECORDING_SESSION_LABEL", Item = "itemid", EventColumns = c("Subject", "TRIAL_INDEX"), NoIA = 4, Adjust = 100, Recording = "LandR", WhenLandR = "Right", BinSize = 20, SamplingRate = 1000, ObsPerBin = 20, Constant = 0.5, Output = "ELogit")
Some may wish to rename the interest area columns created by the functions to something more meaningful than the numeric coding scheme.
To do so, use the function rename_columns. This will replace the numeric codes in the IA_ column names with the labels you supply.
The operation is performed on all the IA_ columns for up to 8 interest areas.
dat6 <- rename_columns(dat5, Labels = c(IA1="Target", IA2="Rhyme", IA3="OnsetComp", IA4="Distractor"))
You can now check the column names in the data.
rm(dat5, dat6)
gc()
Before embarking on a statistical analysis, it is probably necessary to take a couple of steps, such as paring down the data to only include the columns which will be needed later and ensuring the data are ordered appropriately.
This is straightforward using dplyr.
library(dplyr)
FinalDat <- dat5 %>%
  # Select just the columns you want
  select(Subject, Item, Time, starts_with("IA"), Event, TRIAL_INDEX, Rating, Exp) %>%
  # Order the data by Subject, Trial, and Time
  arrange(Subject, TRIAL_INDEX, Time)
Save the resulting dataset to a .rda file and use compression to make it more compact (though this will add to the amount of time it takes to save).
save(FinalDat, file = "FinalDat.rda", compress = "xz")
You are now ready to plot your data. Please refer to the Plotting vignette for details on the various plotting functions contained in the package.