knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

semdistflow

Overview

‘semdistflow’ transforms any user-specified text into sequential bigrams (e.g., ‘The dog drinks the milk’ becomes dog-drink, drink-milk). The package computes two measures of semantic distance for every running bigram in a language transcript. Users have many options for structuring their texts and tailoring the output to their own needs (e.g., omitting stopwords, lemmatizing tokens, choosing the dimensionality of the word embeddings).
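To make the idea of "running bigrams" concrete, here is a minimal base R sketch that pairs each token with the one that follows it. The helper `make_bigrams()` is hypothetical and is not part of semdistflow; the token vector is assumed to be what remains after stopword removal and lemmatization.

```r
# Hypothetical helper for illustration only (not a semdistflow function):
# pair every token with its immediate successor.
make_bigrams <- function(tokens) {
  if (length(tokens) < 2) {
    # Fewer than two tokens means there are no pairs to form
    return(data.frame(word1 = character(0), word2 = character(0)))
  }
  data.frame(
    word1 = head(tokens, -1),  # all tokens except the last
    word2 = tail(tokens, -1),  # all tokens except the first
    stringsAsFactors = FALSE
  )
}

# Assumed tokens after cleaning "The dog drinks the milk"
tokens <- c("dog", "drink", "milk")
bigrams <- make_bigrams(tokens)
bigrams
```

A text with n tokens yields n - 1 running bigrams, so the three tokens above produce two pairs.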

Installation

Install semdistflow from GitHub by typing the following in your console or script (make sure you have devtools installed):

# install.packages("devtools")
devtools::install_github("Reilly-ConceptsCognitionLab/semdistflow")

The main functions

  1. readme() reads a .txt file into R, appends a document id based on its filename, and formats the text as a dataframe.
  2. cleanme() uses many regular expressions to clean and format the text. These steps include omitting contractions, converting to lowercase, omitting numbers, and omitting stopwords, among others.
  3. distme() computes two metrics of semantic distance for each running pair of words in the language sample you just cleaned. These are output as a vector of word pairs.
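This README does not spell out the two distance metrics. The column name `flipped_cosine.dist` in the example further down suggests a cosine-based measure, so the sketch below shows one common definition, cosine distance as 1 minus cosine similarity, applied to each running pair. The function `cosine_distance()` and the three-dimensional vectors are invented for illustration; they are not the package's code or its embeddings.

```r
# Assumed definition for illustration: cosine distance = 1 - cosine similarity
cosine_distance <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy embeddings, made up for this sketch (real embeddings have many more dimensions)
embeddings <- rbind(
  dog   = c(1, 0, 0),
  drink = c(1, 1, 0),
  milk  = c(0, 1, 0)
)

tokens <- rownames(embeddings)

# Distance between each token and the token that follows it
pair_dist <- vapply(
  seq_len(length(tokens) - 1),
  function(i) cosine_distance(embeddings[tokens[i], ], embeddings[tokens[i + 1], ]),
  numeric(1)
)
names(pair_dist) <- paste(head(tokens, -1), tail(tokens, -1), sep = "-")
pair_dist
```

Larger values indicate word pairs whose vectors point in more different directions, which is the usual reading of a greater semantic distance.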

Example of Cleaning Function

This is a basic example which shows you how the cleanme function works:

library(semdistflow)
library(tidyverse)

# Build a one-row dataframe with a document id and its raw text
doc_id <- "fox"
doc_text <- "The quick brown fox jumps over the lazy dog."
fox_text <- as.data.frame(cbind(doc_id, doc_text))
fox_text

# Clean the raw text
fox_clean <- cleanme(fox_text)
fox_clean
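For readers without the package installed, the following base R sketch mimics a few of the cleaning steps listed for cleanme() (lowercasing, dropping numbers and punctuation, removing stopwords). The function `toy_clean()` and its tiny stopword list are hypothetical stand-ins; the actual cleanme() implementation and its output may differ.

```r
# Hypothetical stand-in for illustration; this is NOT the cleanme() implementation
toy_clean <- function(text, stopwords = c("the", "over", "a", "an")) {
  text <- tolower(text)                      # convert to lowercase
  text <- gsub("[0-9]+", " ", text)          # omit numbers
  text <- gsub("[[:punct:]]+", " ", text)    # omit punctuation
  words <- strsplit(trimws(text), "\\s+")[[1]]
  words <- words[!words %in% stopwords]      # omit stopwords (toy list)
  paste(words, collapse = " ")
}

toy_result <- toy_clean("The quick brown fox jumps over the lazy dog.")
toy_result
```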

Example of Semantic Distance Function

This is a basic example which shows you how the semantic distance function works:

# Split the cleaned text into one token per row
fox_token <- fox_clean %>%
  group_by(doc_id, doc_text) %>%
  tidytext::unnest_tokens(word, doc_clean, drop = FALSE)

# Add a lemmatized form of each token
fox_token$lemma <- textstem::lemmatize_words(fox_token$word)
fox_token

# Compute semantic distance for each running word pair
fox_dist <- bigram_cos_sim(targetdf = fox_token, lookupdb = semdist15, colname1 = lemma, colname2 = word, flipped = TRUE)
fox_dist

# Plot distance across the sequence of word pairs, labeling each pair
ggplot(fox_dist, aes(x = as.numeric(row.names(fox_dist)), y = flipped_cosine.dist)) +
  geom_line(color = "#02401BD9", size = 1) +
  theme_classic() +
  xlab(NULL) +
  ylab(NULL) +
  geom_label(aes(label = pair), size = 3, data = fox_dist)


bzuck-temple/TextDistanceBeta documentation built on Jan. 29, 2023, 6:37 p.m.