This package consists of R functions to drive the Java code in jca tools. You can find that here. It's a light wrapper and not thoroughly tested, but it might do what you want. The first task, and probably the hardest, is to get Java and R working together. That's covered in the other vignette.

In this one we exercise the package on a small content analysis example.

Describing documents

Let's start by describing some documents. We'll use a replication data set from Bara et al.'s analysis of a parliamentary debate on relaxing the abortion laws that took place in Britain in the 1960s. The data has been scraped from Hansard and each speaker's contributions to the debate concatenated. Consequently each document is named after the speaker whose contributions it contains.

First we load the package and find the data folder

library(rjca)
deb <- system.file("extdata", "debate-by-speaker", package="rjca")
dir(deb) ## list the files 

In case you are wondering, the prefix on the speaker names indicates whether they abstained, voted yes or voted no after the debate.

Let's compute some summary statistics

desc <- jca_desc(deb)
head(desc) ## the first few rows

This function computes, for each document, the number of words, the number of word types, the number of words that occurred exactly once, the proportion of all the words used across the corpus that appear in the document, the ratio of word types to word tokens, and the number of sentences.

All the jca_ functions run their Java in the background and drop the results into a file or folder. This function returned a data frame, but it also reported the temporary location where the output landed, so you can look at it more closely if you want.

Also common to all the functions is the option to supply information about the documents, e.g. their file encoding and their locale. Here we've taken the system default locale. If we wanted to specify that the documents were encoded in KOI8-R (a Russian encoding) then we would instead use

desc <- jca_desc(deb, encoding='KOI8-R')

As it happens this would not change much, since the text is almost entirely ASCII. For other documents, specifying the encoding will be vital for recognizing the words they contain.
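
If the package also exposes the locale as a parameter, as the description above suggests, a call might look like the following. The parameter name locale is my assumption, so check the function's help page

desc <- jca_desc(deb, encoding='KOI8-R', locale='ru') ## 'locale' is an assumed parameter name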

It is worth noting that, whatever encoding the documents arrive in, the output of these functions will be UTF-8. This is taken into account by the package, but you may need to know that if you read in the file and folder output with other programs on a system that does not have a system UTF-8 encoding (due to mistaken setup, deliberate perversity, or MS Windows - but I repeat myself).

Counting words

To get a word frequency matrix for these documents

wfmat <- jca_word(deb)
dim(wfmat)

The output is a sparse matrix so we aren't storing the thousands of zero counts that inevitably occur. If you want to work with it further, remember to load the Matrix package.
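
For example, to get the total token count for each document, something like this should work (a sketch assuming documents are the rows; transpose first if the matrix comes out the other way around)

library(Matrix)
rowSums(wfmat) ## total tokens per document, assuming documents are rows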

Sometimes we'd prefer to do some filtering before constructing the matrix. This function allows you to remove numbers, currency amounts, and stop words (if you provide a file containing the words), and to reduce words to their stems in several languages using the Snowball stemmer.

For example, if we wanted to remove numbers and currency and apply a stemmer

wfmat2 <- jca_word(deb, no.currency=TRUE, no.numbers=TRUE, stemmer='english')
dim(wfmat2) ## now rather smaller
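
The stop word removal needs that file of words. As a sketch, the call might look like the following, although the parameter name stop.words is a guess at the interface; check the function's help page for the real name

wfmat3 <- jca_word(deb, stop.words='my-stopwords.txt') ## 'stop.words' is an assumed parameter name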

Counting categories

Counting words is fun, but counting categories is funner. Using a content analysis dictionary also allows us to define and count phrases.

We'll use the dictionary for the Bara et al. study. This lives next to the documents in the package

dict <- system.file("extdata", "bara-et-al.ykd", package="rjca")

The file is in Yoshikoder format, an XML format that represents hierarchically structured content analysis dictionaries and is designed to work with the Yoshikoder software. However, the jca_cat and jca_conc functions accept dictionary files in several formats, recognizing each from the suffix of the file.

The VBPro format is particularly convenient for throwing together a dictionary by hand in a hurry. Such dictionaries look like

>>Advocacy<<
object*
override
>>Religion<<
relig*
catholic church

where the category names are anything between >> and << and the words underneath are word or phrase patterns to match, one per line. You can write one of these in any text editor and hand it to the function as a dictionary (to be quite sure you're matching what you think you're matching, save the dictionary file as 'UTF-8').

This format is reasonably flexible; you can use wildcards and multiword patterns, but the dictionary has no nested categories so you'll have to fake those with your category naming convention.
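
For example, you could fake a nested moral category with church and ethical subcategories via the naming convention, and write the file from R. This is just a sketch: the category names and patterns are made up for illustration, and I'm assuming .vbpro is a suffix the functions recognize

vbp <- c('>>moral_church<<', 'relig*', 'catholic church',
         '>>moral_ethical<<', 'ethic*', 'conscience')
con <- file('moral.vbpro', open='w', encoding='UTF-8')
writeLines(vbp, con) ## saved as UTF-8, as recommended above
close(con)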

OK, back to the content analysis. Let's run this dictionary over the debate documents

debca <- jca_cat(dict, deb)
head(debca) ## a bit messy

The root node of the dictionary gives it its name, 'debate'. Hence the first column 'debate' contains the count of words or phrases (there are only words in this dictionary) that match any pattern in any category. The next column 'debate_advocacy' is named to indicate that 'advocacy' is a subcategory of 'debate'. Note that the counts in the subcategories of 'debate' will add up to the count in the 'debate' column, provided every pattern sits inside some subcategory (see the next section for the exceptions), and so on nestedly downwards were there more structure in this dictionary.

The document word counts are included so that it is possible to say, e.g., that McNamara deploys words in the medical category at a rate of about 32 words per thousand (because 96/3009 * 1000 = 31.9) in comparison to Deedes, who uses them at about 19 per thousand.

By dividing the subcategory counts by the 'debate' column we get the proportion of each speaker's categorised words that fall on, e.g., the medical topic, disregarding all the words that are not categorised by the dictionary.
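
For example, assuming the medical counts land in a column called 'debate_medical' under the naming scheme above

with(debca, debate_medical / debate) ## proportion of categorised words that are medical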

For more complex dictionaries there are some subtleties to using this flattened output. For most purposes you can just use the convenience function dict_slice.

For example, here we'll slice the dictionary at level 1, that is, at the top-level categories one level down from the root (any lower-level category counts are folded into these)

debca1 <- dict_slice(debca, level=1)

The top of the output now looks more friendly

knitr::kable(head(debca1))

The next section has a discussion of the ins and outs of hierarchical dictionaries and pattern matching. Alternatively you could just skip to the next major section to see how to make concordances.

Hierarchical dictionaries

Because the output mirrors the hierarchical structure of the dictionary (and has an extra column of word counts), simple row totals will almost never add up to the number of matches in the document. This matters when you want to treat the output as a contingency table, and it gets more extreme the more nesting there is among the dictionary categories.

For example, assume that the moral category has within it two subcategories: 'church', which distinguishes religious and church-related language, and 'ethical', for general ethical considerations. Further assume that all moral patterns are in one or other of these two subcategories. Then the output would contain two extra columns corresponding to the subcategories

 debate, ... debate_moral, debate_moral_church, debate_moral_ethical, ...
 201,    ... 14,           6,                   8, ...

This structure is convenient if we want to be able to merge or separate the subcategories of moral in later analysis, but it does mean that the row of category counts now adds up to 215, not 201. On the other hand, we are guaranteed that 6 + 8 = 14.

But not even this will be true if there are patterns considered to be moral but neither church nor ethical, that is: patterns associated with moral but not with any of its subcategories. If these patterns were matched to 2 words in the first document then the output would look instead like

 debate, ... debate_moral, debate_moral_church, debate_moral_ethical, ...
 201, ...    16,           6,                   8, ...

The upshot is: be careful what parts of the output you pass on to other tools.

If you are planning to analyse your data using statistical tools that assume a contingency table structure, then the duplicated counts from a hierarchical dictionary are probably not what you want. In these cases, pick a level of analysis and then remove the columns counting the sub and super categories with dict_slice.
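
For example, if your dictionary had the moral subcategories sketched above, you could keep just that level of the hierarchy

debca2 <- dict_slice(debca, level=2) ## only the second-level categories (church, ethical, ...)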

Pattern matching

Dictionaries whose patterns contain wildcards (basically the *) or match multiple words bring some subtle issues with word counting.

The first issue arises when multiple patterns match the same word token. For example, the Bara et al. dictionary has no wildcards and therefore lists as matches in the advocacy category objection, objectionable, and objections. If we replaced the first two with objection* but kept objections then we would double count every instance of objections in a document - once as an exact match to objections and then again as a match to objection*. With some care this kind of overlap can be avoided, but it is not always knowable, in advance of seeing a document, whether some of its words would match more than one pattern.

The tools avoid this issue by counting 'token coverage' rather than 'token matches'. If, for the sake of the example, objection* and objections were the only two patterns in advocacy and they each matched the 120th, 175th, and 210th words in some document, then we would record 3 tokens matched to advocacy, even though 6 word-to-pattern matches were actually made.
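
In other words, coverage is the size of the union of the matched token positions, not the sum of the pattern match counts. A toy illustration in R

hits_exact    <- c(120, 175, 210) ## token positions matched by 'objections'
hits_wildcard <- c(120, 175, 210) ## token positions matched by 'objection*'
length(c(hits_exact, hits_wildcard))     ## 6 word-to-pattern matches
length(union(hits_exact, hits_wildcard)) ## but only 3 tokens covered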

Counting coverage rather than matches also solves a similar problem generated by multiple word matches. If we counted matches using the patterns united kingdom and united then every instance of 'united' that preceded 'kingdom' would be counted twice - once for matching the first half of united kingdom and once for matching united. Adding wildcards makes this kind of overlap even harder to avoid by examining the patterns. As before, if united kingdom matches the 25th and 26th word tokens in a document then united also matches the 25th, but only the number of tokens covered by patterns in the category is recorded - in this case 2.

Caveats about pattern matching

Counting token coverage rather than matches removes the risk of double counting within each category. It also removes the risk of double counting within a hierarchy of nested categories (technically, the identities of the matched word tokens are 'percolated' up the category hierarchy from the lowest level). However, it does not prevent a word being double counted by two categories at the same level.

For example, if the church and ethical subcategories of moral contained patterns that matched the same words, then each instance of those words in a document would increment both subcategories. This is not, in general, possible to prevent. (Sometimes it is not even problematic, depending on the purpose to which the dictionary is put). Note, however, that the counts associated with their supercategory moral would be correct and would not contain overcounts.

Old school pattern matching

If for some reason you liked the possibility of double-counting words, perhaps because you are replicating the analysis of someone who used a tool that did things that way, then you can set the old.matching parameter to TRUE.
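
Assuming it is jca_cat that takes this parameter (an assumption on my part; check its help page), that would look like

debca_old <- jca_cat(dict, deb, old.matching=TRUE)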

Looking at words and categories in context

If we want to see what sorts of things the dictionary is picking up in the documents we can arrange all the category matches in their local context. This is a form of concordance, or keyword in context (KWIC). Let's have a look at the medical category.

debconc <- jca_conc(deb, dictionary=dict, category='medical')
head(debconc)

The words we are matching are aligned in the middle of the second column and the document in which they occur is noted for each instance in the left column. Here the words we are looking at in context are: 'doctor', 'physical', 'mental', 'medical' and 'doctors'.

If we are more interested in just one of these words, or in a completely new phrase, we can skip the dictionary and just hand in a pattern directly. Here we look at how each document uses a phrase

debconc2 <- jca_conc(deb, pattern='medical profession')
head(debconc2)

Searching instead for the pattern 'medical *' will show all pairs of words where the first is 'medical'. This might be useful to see how many uses of 'medical' are not about the profession itself but rather about medical reasoning, medical procedures, and other things. Very few, it turns out.
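
For example

debconc3 <- jca_conc(deb, pattern='medical *')
head(debconc3)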

The asterisk is a wildcard that can be embedded in strings too, e.g. 'profess*' will pick up all words starting with these letters.

If you just want to view the concordance, you can also ask to have it opened in a web browser by setting open.browser to TRUE. If you are actually more interested in sorting matches by pattern then you'll need to set prettyprint to FALSE, so that the output has three columns rather than two, where the second column starts with the words that the concordance line is centred on.
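
For example, to eyeball the matches to 'profess*' in a browser and then get output suitable for sorting

jca_conc(deb, pattern='profess*', open.browser=TRUE) ## opens in the browser
debconc4 <- jca_conc(deb, pattern='profess*', prettyprint=FALSE) ## three-column output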


