
Supreme

Supreme is a set of useful functions for my PhD thesis on applying LDA topic models to a corpus of Italian Supreme Court decisions. It is built on the tm and topicmodels packages, and its development is currently about 50% complete.

It has three main goals:

  1. converting the raw XML decisions into a data frame and a text corpus;

  2. building a document-term matrix and reducing its dimensionality;

  3. fitting LDA models in parallel over a grid of topic numbers and selecting the best one.

A topic model is a generative model that specifies a simple probabilistic procedure by which the documents in a corpus can be generated. In the Latent Dirichlet Allocation approach, a topic is a probability distribution over a fixed vocabulary of terms, each document is modeled as a mixture of k topics, and the mixing coefficients can be used to represent the documents as points on the (k-1)-dimensional simplex spanned by the topics. Approximate posterior inference is performed in order to learn the hidden topical structure from the observed data, i.e., the words in the documents.

LDA is the simplest topic model, and many richer models can be built from it.
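For reference, the generative process assumed by LDA (following Blei, Ng, and Jordan, reference 2 below) can be written as:

$$
\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad
z_{d,n} \mid \theta_d \sim \mathrm{Multinomial}(\theta_d), \qquad
w_{d,n} \mid z_{d,n} \sim \mathrm{Multinomial}(\beta_{z_{d,n}}),
$$

where $\theta_d$ is the topic mixture of document $d$, $z_{d,n}$ is the topic assignment of its $n$-th word, and $\beta_k$ is the term distribution of topic $k$. The posterior over $\theta$ and $z$ given the observed words is intractable, which is why approximate inference (variational EM or Gibbs sampling) is required.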

Full references can be found in:

  1. D.M. Blei and J.D. Lafferty, Topic Models, in: Text Mining: Classification, Clustering, and Applications, Chapman & Hall/CRC Press, 2009.

  2. D.M. Blei, A.Y. Ng, and M.I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research, 3, 2003, pp. 993–1022 (http://machinelearning.wustl.edu/mlpapers/paper_files/BleiNJ03.pdf).

  3. T.L. Griffiths and M. Steyvers, Finding scientific topics, Proceedings of the National Academy of Sciences, vol. 101, suppl. 1, 2004, pp. 5228–5235.

You can install the latest development version from GitHub with:

install.packages("devtools")
devtools::install_github("paolofantini/Supreme")

Input data

Data frame dfciv

A data frame with 15,485 rows and 15 variables describing the final decisions (sentenze, i.e., judgments) in civil matters delivered by the Italian Supreme Court during the year 2013.


You can obtain the data frame dfciv by:

  1. downloading the compressed XML file from here, and

  2. running the following code:

library(Supreme)

# Unzip the XML input file and parse it into a data frame.
xml_input <- unzip("../ISC_Civil_2013.zip")
dfciv_original <- xmlciv2df(xml_input)

# Keep only the final decisions.
dfciv_final <- subset(dfciv_original, tipoProv == "S")

# Keep only the 10 largest classes.
cl <- names(sort(table(dfciv_final$Idmateria), decreasing = TRUE))[1:10]
dfciv <- subset(dfciv_final, Idmateria %in% cl)

str(dfciv)

rm(dfciv_original, dfciv_final)

Labels: classes

classes is a vector of labels assigned to cases when they come to the Supreme Court from the lower courts.

We consider only the 10 largest classes.
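The package ships classes as a ready-made data object. As an illustration, a vector of this kind could be rebuilt from dfciv; note that the use of the Idmateria column here is an assumption based on the subsetting code above, not the package's documented construction:

library(Supreme)

data("dfciv")

# Hypothetical reconstruction of the label vector: one label per decision,
# taken from the Idmateria column (assumed to hold the class of each case).
classes <- factor(dfciv$Idmateria)
table(classes)  # size of each of the 10 classes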

Functions

Supreme implements the following functions (those used in this README):

  1. xmlciv2df(): parses the XML file of decisions into a data frame.

  2. dfciv2corpus(): builds a text corpus from the data frame dfciv.

  3. corpus2dtm(): builds a document-term matrix from the corpus.

  4. reduce_dtm(): reduces the number of columns (terms) of a document-term matrix.

  5. mcLDA(): runs multiple LDA models in parallel over a grid of topic numbers and selects the best one.

  6. logClass(): classifies documents from their topic mixtures and returns the test-set misclassification error used by mcLDA() for model selection.

Output data

Hard document-term matrix

A document-term matrix dtm with 15,485 rows and 52,504 columns. The rows of this matrix correspond to the documents and the columns to the terms; the entry $f_{i,j}$ gives the frequency of the j-th term in the i-th document. The number of rows equals the size of the corpus and the number of columns equals the size of the vocabulary.

dtm contains the original document-term matrix as obtained by applying the corpus2dtm() function to the original ISC corpus. Each row of this matrix represents a document as a simple bag of words, after removing punctuation, numbers, stop words and extra white space.

You can obtain the original corpus and the hard dtm by running the following code:

library(Supreme)

# Get (and save) the corpus from data frame dfciv.
data("dfciv")
corpus <- dfciv2corpus(dfciv, TRUE)

# Get dtm from corpus. 
data("italianStopWords")  # for removing italian stop words
dtm <- corpus2dtm(corpus)
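
As a quick sanity check, the dimensions and sparsity of the resulting matrix can be inspected with standard tm functions (not part of Supreme):

library(tm)

dim(dtm)                # 15485 documents x 52504 terms
inspect(dtm[1:5, 1:5])  # top-left corner of the matrix, with a sparsity summary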

Dimensionality reduction of (hard) dtm

reduce_dtm() reduces the number of columns (terms) of dtm by applying one of two methods:

  1. tfidf: keeps only the terms whose tf-idf (term frequency-inverse document frequency) score is high enough to be informative;

  2. lognet: keeps only the terms selected by a penalized (lasso) logistic regression of the class labels (target) on the term frequencies.

You can obtain the reduced document-term matrices dtm.tfidf and dtm.lognet by running the following code:

library(Supreme)

### tfidf method
data("dtm")
dtm.tfidf <- reduce_dtm(dtm, method = "tfidf", export = TRUE)

### lognet method
data("dtm")
data("classes")
dtm.lognet <- reduce_dtm(dtm, method = "lognet", target = classes, export = TRUE)
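
The internals of reduce_dtm() are not shown here, but the tfidf idea can be sketched with the well-known recipe from the topicmodels package vignette (Grün and Hornik): score each term by its mean tf-idf and drop the low-scoring half. A minimal sketch, assuming dtm is a tm DocumentTermMatrix:

library(tm)
library(slam)

data("dtm")

# Mean tf-idf of each term, over the documents in which it occurs.
tfidf <- tapply(dtm$v / row_sums(dtm)[dtm$i], dtm$j, mean) *
  log2(nDocs(dtm) / col_sums(dtm > 0))

# Keep only the terms scoring at or above the median tf-idf.
dtm.small <- dtm[, tfidf >= median(tfidf)]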

Multicore parallel runs of LDA models and best model selection

mcLDA() runs multiple LDA models in parallel by varying the number of topics k over a predefined grid of values, and performs model selection by applying the logClass() function to each model. A vector of test-set misclassification errors (e1.test) is returned, and the best model is the one with the minimum misclassification error.

In the following example, we run 8 different models in parallel on an 8-core CPU by assigning one model to each core.

library(Supreme)

data("dtm")
data("classes")
dtm.lognet <- reduce_dtm(dtm, method = "lognet", target = classes)

# 8 cores: one model for each core.
mc.lda.models <- mcLDA(dtm.lognet$reduced, lda.method = "VEM",
                       k.runs = list(from = 10, to = 45, steps = 5),
                       target = classes)

mcLDA() uses the parallel computing facilities provided by the parallel, doParallel and foreach packages; they should work fine on both Unix-like and Windows systems.
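
For readers unfamiliar with this pattern, here is an illustrative sketch of fitting LDA models in parallel with foreach and doParallel. It is not mcLDA()'s actual implementation, and the model-selection step via logClass() is omitted:

library(doParallel)
library(topicmodels)

cl <- makeCluster(8)       # one worker per core
registerDoParallel(cl)

ks <- seq(10, 45, by = 5)  # the same grid as k.runs above

# Fit one LDA model per value of k, each on its own worker.
models <- foreach(k = ks, .packages = "topicmodels") %dopar% {
  LDA(dtm.lognet$reduced, k = k, method = "VEM")
}

stopCluster(cl)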


