These examples assume the lingmatch package is loaded:
library('lingmatch')
If the text you want to analyze is already in R, you can process it:
# individual words
data = lma_process(texts)

# with a dictionary
data = lma_process(texts, dict = 'inquirer', dir = '~/Dictionaries')

# with a latent semantic space
data = lma_process(texts, space = 'glove', dir = '~/Latent Semantic Spaces')
Or, you can just calculate similarity:
# pairwise cosine similarity in terms of all words
sims = lingmatch(texts)$sim

# pairwise Canberra similarity in terms of function word categories
sims = lingmatch(texts, type = 'lsm')$sim

# pairwise cosine similarity in terms of latent semantic space dimensions
sims = lingmatch(texts, type = 'lss')$sim
Or, if you have processed data (such as LIWC output), you can enter that:
# if all dictionary categories are found in the input, only those variables
# will be used
sims = lingmatch(data, dict = 'function')$sim

# otherwise, enter just the columns you want as part of the comparison
sims = lingmatch(data[, c('cat1', 'cat2', 'catn')])$sim
Continue for more about loading text into R, processing texts, and measuring similarity, or see the comparisons guide for more about defining comparisons.
You will need a path to the file containing your texts. You could...

- Browse for the file with file.choose(), which returns the path.
- Enter the full path (Windows paths can use / or \\ as separators), e.g., 'c:/users/name/documents/texts.txt' (Windows), '/home/Name/Documents/texts.txt' (Linux), or '/users/name/documents/texts.txt' (Mac). You can run normalizePath('example') to see a path in your system's full form.
- Enter a path relative to your home directory (run path.expand('~') to see the full path), e.g., '~/texts.txt'.
- Enter a path relative to your working directory (see it with getwd(), set it with setwd()), e.g., 'texts.txt' or '../texts.txt'.
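Before reading a file, you can confirm that R can find it at the path you entered (a quick base R check; 'texts.txt' stands in for your own file):

# TRUE if a file exists at the path
file.exists('texts.txt')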
In the following examples, just the relative path to the file will be shown, as if the working directory were set to the folder containing the files.
When there is one entry per line:
texts = readLines('texts.txt')
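Note that readLines keeps blank lines as empty entries, so you may want to drop those (a small base R sketch):

# remove empty entries, such as blank lines between texts
texts = texts[texts != '']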
When you want to segment a single file:
# with multiple lines between entries
segs = read.segments('texts.txt')

# into 5 even segments
segs = read.segments('texts.txt', 5)

# into 100-word chunks
segs = read.segments('texts.txt', segment.size = 100)

# then get texts from segs
texts = segs$text
When you want to read in multiple files from a folder:
texts = read.segments('foldername')$text
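To check which files are in the folder before reading them in, you can list its contents (base R; 'foldername' is a placeholder):

# list the files the folder contains
list.files('foldername')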
When your files contain only text, you can also enter the path directly into lingmatch functions without loading them first:
results = lingmatch('texts.txt')
When texts are in a column of a spreadsheet, stored in a plain-text file:
# comma delimited
data = read.csv('data.csv')

# tab delimited (sometimes with extension .tsv)
data = read.delim('data.txt')

# other delimiters; define with the sep argument
# (you might also need to change the quote or other arguments
# depending on your file's format)
data = read.delim('data.txt', sep = 'delimiting character')

# then get texts from data
texts = data$name_of_text_column
Install and load the readtext package:
install.packages('readtext')
library('readtext')
From a .doc or .docx file:
texts = readtext('texts.docx')$text

# this returns all lines in one, so you could
# use read.segments to split them up if needed
texts = read.segments(texts)$text
From a .xls or .xlsx file:
texts = readtext('data.xlsx')$name_of_text_column
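The readxl package is an alternative reader for Excel files; a minimal sketch, assuming your spreadsheet has a text column named name_of_text_column:

# install.packages('readxl')
library('readxl')
data = read_excel('data.xlsx')
texts = data$name_of_text_column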
Processing represents texts numerically, and that representation defines what counts as a match between them.
For example, matching on structural features (e.g., number of words and their average length) gives a sense of how similar texts are in form:
structural_features = lma_meta(texts)
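To see which structural features are calculated, you can look at the result's column names (a quick check):

# names of the calculated structural features
colnames(structural_features)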
You could also look at exact matching between words by making a document-term matrix:
# all words
dtm = lma_dtm(texts)

# excluding stop words (function words) and rare words (those appearing in
# fewer than 3 texts)
dtm = lma_dtm(texts, exclude = 'function', dc.min = 2)
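A document-term matrix has a row per text and a column per unique term, which you can inspect with base functions (a quick sketch; lma_dtm returns a sparse matrix by default):

# rows are texts, columns are terms
dim(dtm)

# most frequent terms across all texts
head(sort(colSums(as.matrix(dtm)), decreasing = TRUE))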
The raw texts in the next examples are processed with the lma_dtm function using its defaults, but you could also enter a document-term matrix in place of texts, processed separately as in the previous examples.
Function word categories are another set of structure-related features, which give a sense of how stylistically similar texts are:
function_cats = lma_termcat(texts, lma_dict())
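lma_dict returns a list of function word category patterns, so you can check which categories are covered (using its defaults):

# names of the default function word categories
names(lma_dict())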
To get at similarity in something like tone, you could use a sentiment dictionary:
sentiment = lma_termcat(texts, 'huliu', dir = '~/Dictionaries')
To get at similarity in overall meaning, you could use a content-analysis-focused dictionary like the General Inquirer:
inquirer_cats = lma_termcat(texts, 'inquirer', dir = '~/Dictionaries')
Or a set of embeddings:
glove_dimensions = lma_lspace(
  lma_dtm(texts), 'glove',
  dir = '~/Latent Semantic Spaces'
)
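These dictionary and space examples read files from the dir location; if you don't have them downloaded yet, lingmatch includes download helpers (a sketch of their basic use; see ?download.dict and ?download.lspace for options):

# download the General Inquirer dictionary
download.dict('inquirer', dir = '~/Dictionaries')

# download the glove latent semantic space
download.lspace('glove', dir = '~/Latent Semantic Spaces')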
Once you have processed texts, you can measure matching between them.
You could calculate similarity between each of them with different metrics:
# inverse Canberra distance
can_sims = lma_simets(function_cats, metric = 'canberra')

# cosine similarity
cos_sims = lma_simets(function_cats, metric = 'cosine')
Or between each text and the average across texts, with all available metrics:
sims_to_mean = lma_simets(function_cats, colMeans(function_cats))
Or just between the first and second text:
lma_simets(function_cats[1,], function_cats[2,])