knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
# local path to the package repository, used below to load files saved while building this vignette
filelocation <- "/Users/erinbuchanan/GitHub/Research/2_projects/PSA_Projects/SPAML/semanticprimeR"

Libraries

library(dplyr)
library(semanticprimeR)
library(udpipe)
library(stopwords)
library(tibble)
library(rio)

Power Functions

During this project, we proposed a new way to calculate "power" for adaptive sampling of items using accuracy in parameter estimation and bootstrapping/simulation methods. You can review the preprint here: https://osf.io/preprints/osf/e3afx

This package has many vignettes for different types of data. Use vignette(package = "semanticprimeR") to review what is available, and vignette(topic = "montefinese_vignette", package = "semanticprimeR") to view a specific vignette.
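
For example, from the R console:

# list all vignettes available in the package
vignette(package = "semanticprimeR")

# open one specific vignette
vignette(topic = "montefinese_vignette", package = "semanticprimeR")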

The information presented here is a shortened version of the paper to show off the functionality of the package.

Simulate Population

Let's say we want to run a priming study but do not have previous data. We can use simulate_population() to create some sample data to use for estimating sample size for adaptive testing and stopping rules.

df <- simulate_population(mu = 25, # mean priming in ms
                    mu_sigma = 5, # standard deviation of the item means
                    sigma = 10, # population standard deviation for items
                    sigma_sigma = 3, # standard deviation of the standard deviation of items
                    number_items = 75, # number of priming items 
                    number_scores = 100, # a population of values to simulate
                    smallest_sigma = 1, # smallest possible item standard deviation 
                    min_score = -25, # min ms priming
                    max_score = 100, # max ms priming 
                    digits = 3)

head(df)

Calculate Cutoff

From this dataframe, we can calculate the standard error of each item, which is used to determine if the item has reached the stopping rule (i.e., if the standard error reaches a defined value, the item is considered accurately measured, and we can stop sampling it). To get the defined value, we use the 40th percentile of the standard errors of the items.

cutoff <- calculate_cutoff(population = df, # pilot data or simulated data
  grouping_items = "item", # name of the item indicator column
  score = "score", # name of the dependent variable column
  minimum = min(df$score), # minimum possible/found score
  maximum = max(df$score)) # maximum possible/found score

cutoff$se_items # all standard errors of items
cutoff$sd_items # standard deviation of the standard errors
cutoff$cutoff # 40th percentile cutoff score
cutoff$prop_var # proportion of possible variance 
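
As a rough sanity check, you can compute the item standard errors by hand and take the 40th percentile yourself. This is just a sketch that reuses the item and score columns from the simulated data above; it should agree closely with cutoff$cutoff.

# hand calculation of each item's standard error
item_se <- tapply(df$score, df$item, function(x) sd(x) / sqrt(length(x)))

# the 40th percentile of those standard errors (compare with cutoff$cutoff)
quantile(item_se, probs = .40)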

Simulate Samples

Next, we use simulation to pretend we have samples from a starting size (we suggest 20) up to the maximum size we are willing to collect. These samples are used to count how many items achieve our stopping rule at each sample size, which tells us how many participants are needed to accurately measure most items.

samples <- simulate_samples(start = 20, # starting sample size
  stop = 400, # stopping sample size
  increase = 5, # increase bootstrapped samples by this amount
  population = df, # population or pilot data
  replace = TRUE, # bootstrap with replacement? 
  nsim = 500, # number of simulations to run
  grouping_items = "item") # item column label 

# since simulate_samples() is slow, we saved the output and read it back in
# save(samples, file = "data/simulatePriming.RData")
load(paste0(filelocation, "/vignettes/data/simulatePriming.Rdata"))
head(samples[[1]])
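
samples is a list of bootstrapped datasets, so you can quickly check what was generated before moving on (a small sketch):

# how many bootstrapped datasets were created
length(samples)

# rows and columns in the first bootstrapped dataset
dim(samples[[1]])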

Calculate Proportions of Items Measured

From those samples and our estimated cutoff score, we can calculate the proportion of items below our suggested stopping rule at each sample size.

proportion_summary <- calculate_proportion(samples = samples, # samples list
  cutoff = cutoff$cutoff, # cut off score 
  grouping_items = "item", # item column name
  score = "score") # dependent variable column name 

head(proportion_summary)
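
To see what calculate_proportion() is summarizing, you can compute the same idea by hand for a single bootstrapped sample: the share of items whose standard error has reached (is at or below) the cutoff. This sketch reuses the item and score columns from the simulated data:

# standard error of each item in the first bootstrapped sample
one_sample <- samples[[1]]
one_se <- tapply(one_sample$score, one_sample$item, function(x) sd(x) / sqrt(length(x)))

# proportion of items that have reached the cutoff
mean(one_se <= cutoff$cutoff)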

Final Sample Size

As you will note in our paper, we determined that this simulation procedure needs a correction to approximate traditional interpretations of power. You can use "power" levels like 80 percent, 90 percent, and so on, similarly to traditional power: use higher numbers if you want to be more stringent and make sure all items are "well measured" (and, correspondingly, you will get higher estimated sample sizes). We suggest using the 80% level as a minimum sample size, the cutoff as the stopping rule while running the study if you use adaptive sampling (or simply to review the data), and a higher level, like 90%, as a maximum sample size to collect.

corrected_summary <- calculate_correction(
  proportion_summary = proportion_summary, # prop from above
  pilot_sample_size = 100, # number of participants in the pilot data 
  proportion_variability = cutoff$prop_var, # proportion variance from cutoff scores
  power_levels = c(80, 85, 90, 95)) # what levels of power to calculate 

corrected_summary

Create Stimuli Functions

Once you've determined your sample sizes, you will want to create stimuli. These functions show how we created stimuli from the OpenSubtitles project using the subs2vec project.

Get Models

You can review the available models from subs2vec, merged with the available data in OpenSubtitles, by using the data() function. Use ?subsData to view information about the columns and the dataset.

data("subsData")
head(tibble(subsData))

We only worked with languages for which we could use a part of speech tagger. We recommend udpipe, a great package with many taggers. The language model necessary for each language is shown in the udpipe_model column.
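
For example, the model needed for Afrikaans can be pulled straight from subsData:

# look up the udpipe model for Afrikaans by its two letter language code
subsData$udpipe_model[subsData$language_code == "af"]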

In this example, let's use Afrikaans, which is one of the smaller datasets. Be warned that the datasets can be very large to download and use on your computer. Use the import_subs() function to download and import the files you are interested in.

For language, please use the two-letter language code shown in the language_code column of subsData.

You then need to pick what to download: for example, what = "subs_count" for the unigram frequency list or what = "subs_vec" for the embedding vectors, both shown below.

You may see some warnings based on file formatting.

af_freq <- import_subs(
  language = "af",
  what = "subs_count"
)
# when building this vignette, we read a previously downloaded copy
af_freq <- import(paste0(filelocation, "/vignettes/subs_count/af/dedup.af.words.unigrams.txt"))
head(af_freq)

We then used udpipe to filter our possible options. You may have other criteria, but here's an example of how we tagged concepts (for their main part of speech, given no sentence context here). When you use this function, it will download the model necessary for tagging.

# tag with udpipe
af_tagged <- udpipe(af_freq$unigram, 
                    object = subsData$udpipe_model[subsData$language_code == "af"], 
                    parser = "none")

head(tibble(af_tagged))

We then lowercased the lemmas, kept only lemmas of at least three characters, kept only nouns, verbs, adjectives, and adverbs, removed stopwords (in case they were incorrectly tagged), removed tokens containing numbers, merged the frequencies back in, and dropped duplicate lemmas, as shown below.

We only used the top 10,000 words for the next section, but this selection will depend on your use case as well.

# word_choice
word_choice <- c("NOUN", "VERB", "ADJ", "ADV")

# lower case
af_tagged$lemma <- tolower(af_tagged$lemma)

# three characters 
af_tagged <- subset(af_tagged, nchar(af_tagged$lemma) >= 3)

# only nouns, verbs, adjectives, and adverbs
af_tagged <- subset(af_tagged, upos %in% word_choice)

# removed stop words just in case they were incorrectly tagged
af_tagged <- subset(af_tagged, !(lemma %in% stopwords(language = "af", source = "stopwords-iso")))

# removed things with numbers
af_tagged <- subset(af_tagged, !(grepl("[0-9]", af_tagged$sentence)))

# merge frequency back into tagged list
# merge by sentence so one to one match
colnames(af_freq) <- c("sentence", "freq")
af_final <- merge(af_tagged, af_freq, by = "sentence", all.x = TRUE)

head(tibble(af_final))

# eliminate duplicates by lemma
af_final <- af_final[order(af_final$freq, decreasing = TRUE) , ]
af_final <- af_final[!duplicated(af_final$lemma), ]

# grab top 10K
af_top <- af_final[1:10000 , ]

Next, we used import_subs() again to import the embeddings for the subtitles.

af_dims <- import_subs(
  language = "af",
  what = "subs_vec"
)
af_dims <- read.table("subs_vec/af/subs.af.1e6.txt", quote = "\"")
head(tibble(af_dims))

In our case, we want to use the tokens as row names, so we move the first column into the row names and then delete it, leaving a tokens by 300-dimension matrix.

# lower case
af_dims$V1 <- tolower(af_dims[ , 1]) # first column is always the tokens

# eliminate duplicates 
af_dims <- subset(af_dims, !duplicated(af_dims[ , 1]))

# make row names
rownames(af_dims) <- af_dims[ , 1]
af_dims <- af_dims[ , -1]
head(tibble(af_dims))

Calculate Similarity

We can then use the calculate_similarity() function to get the similarity values for all words based on the dimension matrix. The underlying calculation is the cosine between the dimension vectors of the two words.

af_cosine <- 
  calculate_similarity(
    words = af_final$sentence, # the tokens you want to filter
    dimensions = af_dims, # the matrix of items 
    by = 1 # 1 for rows, 2 for columns 
)
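
If you want to see the underlying calculation, the cosine between any two tokens can be computed by hand from their rows of the dimension matrix. This is just an illustration; the two row indices are arbitrary:

# cosine similarity between the first two tokens in the dimension matrix
a <- unlist(af_dims[1, ])
b <- unlist(af_dims[2, ])
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))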

The top_n() function can be used to calculate the top n cosine values for each token in the similarity matrix. Please note: it will always return the token-token combination as 1 (the token related to itself), so you should ask for n + 1 cosines and then filter out the token-token combinations. Big thanks to Brenton Wiernik, who figured out how to make this computationally efficient.

# get the top 5 related words 
af_top_sim <- semanticprimeR::top_n(af_cosine, 6)
af_top_sim <- subset(af_top_sim, cue!=target)

head(af_top_sim)

Create Pseudowords

We originally set up a function that creates pseudowords by replacing a number of characters based on the bigrams in the token. We recommend the Wuggy-based function below instead, but you can also do this simple replacement.

af_top_sim$fake_cue <- fake_simple(af_top_sim$cue)
# you'd want to also do this based on target depending on your study 
head(af_top_sim)

You can also use the Wuggy algorithm with fake_Wuggy(). This function is not fast, and it gets slower as the word list you create from gets larger. It returns a dataframe of options to use for pseudowords; see ?fake_Wuggy for a description of the columns.

af_wuggy <- fake_Wuggy(
  wordlist = af_final$sentence, # full valid options in language
  language_hyp = paste0(filelocation,"/inst/latex/hyph-af.tex"), # path to hyphenation.tex 
  lang = "af", # two letter language code
  replacewords = unique(af_top_sim$cue[1:20]) # words you want to create pseudowords for  
)

head(tibble(af_wuggy))



Get Priming Data

You can load one of the many files included in the SPAML release; use the primeData dataset to see what we have available. The datasets are broken into a couple of types, which you can see in primeData below.

Each of the following files has a codebook, found at: https://github.com/SemanticPriming/SPAML/tree/master/05_Data/codebooks

Load Available Data

data("primeData")
head(primeData)

Once you decide which file you would like to download and import, you can use import_prime() to import that file. Note that some of the full_data datasets are quite large and may take a while to download and/or import directly. You can also use the direct links in the primeData file to download them. Some files are heavily compressed in .gz format; if you aren't familiar with the command line, I recommend 7-Zip to unzip these: https://www.wikihow.com/Extract-a-Gz-File

You can also import them directly into R with the rio package (which is what this function does; however, it downloads the file each time, so I'd recommend downloading once and then putting the import into your code directly with rio::import("filepath")).
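
A download-once workflow might look like the sketch below; the URL is a placeholder rather than a real link, and the local file name is up to you:

# hypothetical one-time setup: download the file once using the link shown in primeData
# download.file("https://...", destfile = "es_words.csv") # placeholder URL

# later runs only need to import the local copy
es_words <- rio::import("es_words.csv")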

Import Specific Data

In this example, we import the stimuli dataset for Spanish, which includes the trials, type of trial information, and the cosine calculated from subs2vec.

es_words <- import_prime("es_words.csv")
es_words <- import(paste0(filelocation, '/vignettes/procedure_stimuli/es/es_words.csv'))
head(es_words)

Match to LAB Data

Load Available Data

To review the available data from the Linguistic Annotated Bibliography (LAB), you can use data("labData"), which includes information about the available datasets overall and which of them are included in our LAB data release for merging.

data("labData")
head(tibble(labData))

# import_lab() also loads this dataset 

# ?labData # use this to learn about the dataset 

Load Filtered Metadata

If you want to find specific types of LAB data, you can filter using the language and/or variables arguments.

saved <- import_lab(language = "English", variables = c("aoa", "freq"))
# possible datasets that are English, aoa, and frequency
head(tibble(saved))

saved <- import_lab(language = "Spanish", variables = c("aoa"))

head(tibble(saved))

Load Specific Data

es_aos <- import_lab(bibtexID = "Alonso2015", citation = TRUE)
# save(es_aos, file = "lab_data/es_aos.Rdata")
load(paste0(filelocation, "/vignettes/lab_data/es_aos.Rdata"))
es_aos$citation

head(tibble(es_aos$loaded_data))
es_sim <- import_lab(bibtexID = "Cabana2024_R1", citation = TRUE)
# save(es_sim, file = "lab_data/es_sim.Rdata")
load(paste0(filelocation,"/vignettes/lab_data/es_sim.Rdata"))
es_sim$citation

head(tibble(es_sim$loaded_data))

Match To Prime Data

es_words_merged <- es_words %>%
  # merge with the cue word (will be .x variables)
  left_join(es_aos$loaded_data, 
            by = c("es_cue" = "word_spanish")) %>% 
  # merge with the target word (will be .y variables)
  left_join(es_aos$loaded_data, 
            by = c("es_target" = "word_spanish")) %>% 
  # merge with free association similarity
  left_join(es_sim$loaded_data, 
            by = c("es_cue" = "cue",
                   "es_target" = "response"))

head(tibble(es_words_merged))

Other Cool Stuff

We used lab.js for this project. The datasets you get from lab.js are stored in a SQLite file, which is not super fun to process, so the lab.js developers wrote a function to do that. We included that function here as processData(), and you can see that we used it in our data processing files. It's here if you want to use it yourself on lab.js projects.

df <- processData("data.sqlite")

