knitr::opts_chunk$set(
  echo = TRUE
  , eval = TRUE
  , warning = FALSE
  , message = FALSE
  , fig.align = 'center'
  , fig.height = 4
  , fig.width = 8
)

In this short tutorial, I show how the functions provided by the manifestoEnhanceR package can be used to convert a large batch of manifesto documents into XML documents enriched with document structure (headers, sentences, quasi-sentence nesting) and metadata (e.g., manifesto title or quasi-sentence CMP code).

I use the entire Swiss manifesto corpus and guide you step by step through the process of

querying manifestos from the Manifesto Project API,
converting the manifestoR package's custom data objects into a tidy data format (i.e., a data frame),
enhancing manifesto data frames with text-level information (e.g., quasi-sentence and chapter numbers), and
converting the enhanced manifesto data frames to XML documents.

Before we begin, a short note on why one would want to convert manifestoR data objects to XML files.

Background

By enhancing manifesto data frames, we reconstruct a manifesto's document structure. If you look at a manifesto in its print or online version, it starts with a title, followed by an introduction or summary of key points, and then a number of chapters, each comprising many sentences.

This document structure is not reflected in the way the Manifesto Project makes its data available. While this is not a problem if you are merely interested in their codings, it can be a problem if you want to use the manifesto text for other purposes, such as topic modeling or training a word embedding model. For these latter purposes, quasi-sentences are not an ideal unit of (dis)aggregation. Sentences, paragraphs or chapters seem more appropriate.

Now, it turns out that while paragraph structure cannot be recovered, we can infer sentence and chapter structure by combining the text of quasi-sentences and their codings. This is the purpose of the manifestoEnhanceR package.
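
To give an intuition for how sentence structure might be recovered from quasi-sentence text, here is a minimal sketch. It is not the package's implementation: the heuristic (a sentence ends where a quasi-sentence ends in terminal punctuation) and the toy text are my own assumptions, used for illustration only.

```r
library(dplyr)

# toy quasi-sentences (invented text, not taken from any manifesto)
qs <- tibble(
  text = c("We cut taxes", "and create jobs.", "We invest in rail.")
)

qs <- qs %>%
  mutate(
    # assumption: terminal punctuation marks the end of a natural sentence
    ends_sentence = grepl("[.!?]\\s*$", text),
    # a new sentence starts on the first row and after every sentence end
    sent_nr = cumsum(lag(ends_sentence, default = TRUE))
  )

qs
```

The first two quasi-sentences end up sharing one sentence number, which is the m:1 mapping that the enhanced data frames record.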

Step 0: Loading required packages

# if not installed: `install.packages("manifestoR")`
library(manifestoR)
# if not installed: `devtools::install_github("haukelicht/manifestoEnhanceR")`
library(manifestoEnhanceR)
library(dplyr)
library(tidyr)
library(purrr)
library(xml2)
library(rvest)
library(tibble) # provides enframe(), used in Step 5

Step 1: Load manifesto data

# set Manifesto Project API key (see `?mp_setapikey`)
mp_setapikey(file.path(Sys.getenv("SECRETS_PATH"), "manifesto_apikey.txt"))

# get CMP party-year data
cmp <- mp_maindataset(south_america = FALSE)

# subset to Swiss configurations
ch_configs <- cmp %>% 
  filter(countryname == "Switzerland") %>% 
  select(1:10)

# get Swiss manifestos
ch <- mp_corpus(countryname == "Switzerland", cache = TRUE)

Conversion to tidy data

The object ch is a 'ManifestoCorpus' object, which is essentially a list of 'ManifestoDocument' objects with some additional fancy attributes.

class(ch)
str(head(ch, 2), 1)
lapply(head(ch, 2), class)

We can tidy this up (i.e., convert it to a long tibble) by applying the first manifestoEnhanceR function we encounter in this tutorial, as_tibble.ManifestoCorpus:

man_dfs <- as_tibble(ch)
head(man_dfs)

man_dfs is a tibble (a fancy data frame) with a list-column called 'data'. In each row, column 'data' contains a manifesto tibble.

To this manifesto-level tibble, we can left-join manifesto- and party-level data:

man_sents <- man_dfs %>% 
  left_join(
    ch_configs %>% 
      transmute(manifesto_id = paste0(party, "_", date), party, partyname, partyabbrev) %>% 
      unique()
  )
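
If you want to see the list-column and join logic in isolation, the following sketch reproduces it on toy data. The objects toy_dfs and toy_configs are hypothetical stand-ins for man_dfs and ch_configs, the party names are placeholders, and the texts are invented. I also pass the join key explicitly via `by`, which is my choice for this sketch.

```r
library(dplyr)
library(tidyr)

# toy stand-in for man_dfs: one row per manifesto, text rows nested in 'data'
toy_dfs <- tibble(
  manifesto_id = c("43120_201110", "43110_198710"),
  data = list(
    tibble(text = c("First quasi-sentence", "second quasi-sentence."),
           cmp_code = c("402", "408")),
    tibble(text = "Entire uncoded manifesto text.", cmp_code = NA_character_)
  )
)

# toy stand-in for ch_configs (placeholder party names)
toy_configs <- tibble(
  party = c(43120, 43110),
  date = c(201110, 198710),
  partyname = c("Party A", "Party B")
)

# build the manifesto ID from party and date, then join on it
toy_sents <- toy_dfs %>%
  left_join(
    toy_configs %>%
      transmute(manifesto_id = paste0(party, "_", date), partyname) %>%
      distinct(),
    by = "manifesto_id"
  )

# unnesting yields one row per text line, with party data repeated
toy_long <- unnest(toy_sents, data)
toy_long
```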

Step 2: Define some test cases

To illustrate the other manifestoEnhanceR functions, we take four manifestos from the above corpus, each with some particular features (as commented below).

test1_df <- unnest(filter(man_sents, manifesto_id == "43120_201110"), data)
test2_df <- unnest(filter(man_sents, manifesto_id == "43110_198710"), data)
test3_df <- unnest(filter(man_sents, manifesto_id == "43110_199910"), data)
test4_df <- unnest(filter(man_sents, manifesto_id == "43120_201510"), data)

test2_df has just one row, and the single cell of column 'text' contains the entire, uncoded text of the manifesto. This is typical for very old manifestos or manifestos of very small parties, where the Manifesto Project team assigned no expert to split the manifesto into quasi-sentences and code it. We say that this manifesto has not been 'annotated'.
All other test cases have multiple rows. Each row corresponds to one coded quasi-sentence. Codes are recorded in column 'cmp_code'.
However, in contrast to test3_df, test2_df and test1_df have only numerical codes, that is, no 'H' code indicating document headers and titles.
test3_df and test4_df both have titles, but in the former case, which rows hold the manifesto title needs to be inferred from the ordering of text lines.

Does this sound confusing? It is. But no worries: the function enhance_manifesto_df() handles all these intricacies for you, as shown below.

Step 3: Enhance manifesto data frames

enhance_manifesto_df() returns a manifesto.df object that inherits from the input tibble, and simply adds four columns indicating document structure:

'qs_nr' (running quasi-sentence counter),
'sent_nr' (running sentence counter),
'role' (indicator, here 'qs' for all rows), and
'bloc_nr' (enumerates consecutive rows by 'role').

In addition, the returned manifesto.df object has two additional attributes:

'annotated': indicates whether or not the input has been annotated/coded by CMP experts.
'extra_cols': names of columns added by enhancing the data frame.

By enhancing manifesto data frames, we add bloc, sentence and quasi-sentence (qs) counters, as well as a role indicator ('qs', 'header', 'title', or 'meta').

As one natural sentence may contain multiple quasi-sentences, the latter map m:1 to the former. A bloc, in turn, is a number of consecutive rows that all have the same 'role'. The following roles are defined:

'qs': A quasi-sentence
'title': The manifesto title. This is/are the first row(s) with CMP code 'H' or NA.
'header': A chapter header. Rows after title (if any) with CMP code 'H' or NA.
'meta': In annotated manifestos containing 'H' codes (e.g. test case 4), the row(s) between 'title' and the first 'header' rows.
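
The role and bloc definitions above can be sketched on toy data as follows. This is a simplified illustration under my own assumptions, not the package's implementation: assign_roles() is a hypothetical helper, it treats the leading run of 'H'/NA rows as the title, and it ignores the 'meta' role entirely.

```r
library(dplyr)

# toy rows (invented text); codes follow the conventions described above
toy <- tibble(
  text = c("Our programme", "We cut taxes", "and create jobs.",
           "Environment", "We protect the Alps."),
  cmp_code = c("H", "402", "408", "H", "501")
)

# hypothetical helper: simplified role assignment without the 'meta' role
assign_roles <- function(cmp_code) {
  is_head <- is.na(cmp_code) | cmp_code == "H"
  # rows before the first coded quasi-sentence count as the title
  lead_run <- cumsum(!is_head) == 0
  case_when(
    lead_run ~ "title",
    is_head ~ "header",
    TRUE ~ "qs"
  )
}

toy <- toy %>%
  mutate(
    role = assign_roles(cmp_code),
    # a bloc is a run of consecutive rows sharing the same role
    bloc_nr = with(rle(role), rep(seq_along(lengths), lengths))
  )

toy
```

Counting the rows with role 'header' then gives the number of chapters in this toy document.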

This information is important if you want to count the number of chapters in a manifesto (= number of headers), or print only the title (if one exists). And as noted above, with this information at hand, one could easily aggregate quasi-sentences at the sentence or chapter level while omitting the title and other meta text.
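
As a rough sketch of such an aggregation, the following toy example collapses quasi-sentences into sentences while dropping title and header rows. The column names 'text', 'role', 'qs_nr' and 'sent_nr' follow the description above; the toy values, and the assumption that counters are NA for non-'qs' rows, are mine.

```r
library(dplyr)

# toy stand-in for an enhanced manifesto data frame (invented values)
toy_enhanced <- tibble(
  text = c("Our programme", "Economy", "We cut taxes",
           "and create jobs.", "We invest in rail."),
  role = c("title", "header", "qs", "qs", "qs"),
  qs_nr = c(NA, NA, 1L, 2L, 3L),
  sent_nr = c(NA, NA, 1L, 1L, 2L)
)

# keep quasi-sentences only, then paste them together per sentence
toy_sentences <- toy_enhanced %>%
  filter(role == "qs") %>%
  group_by(sent_nr) %>%
  summarise(
    n_qs = n(),
    text = paste(text, collapse = " "),
    .groups = "drop"
  )

toy_sentences
```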

test1_res <- enhance_manifesto_df(test1_df)
select(test1_res, 1:4, 17:25)
class(test1_res)
attr(test1_res, "annotated")
attr(test1_res, "extra_cols")
"title" %in% test1_res$role

test2_res <- enhance_manifesto_df(test2_df)
select(test2_res, 1:4, 17:25)
class(test2_res)
attr(test2_res, "annotated")

test3_res <- enhance_manifesto_df(test3_df)
class(test3_res)
attr(test3_res, "annotated")
"title" %in% test3_res$role

test4_res <- enhance_manifesto_df(test4_df)
class(test4_res)
attr(test4_res, "annotated")
"title" %in% test4_res$role

Step 4: Convert manifesto data frames to XML documents

Having enhanced our test case data frames, we can now use manifesto_df_to_xml() to convert them to XML. You have basically two options:

setting parse = FALSE returns the XML-formatted string
setting parse = TRUE (the default) returns an {xml2} xml_document object.

# test case 1: as string
test1_xml <- manifesto_df_to_xml(test1_res, parse = FALSE)
class(test1_xml)
length(test1_xml)
test1_xml <- manifesto_df_to_xml(filter(test1_res, bloc_nr < 3), parse = FALSE)
# cat(test1_xml)

# as XML document
test1_xml <- manifesto_df_to_xml(test1_res)
test1_xml <- manifesto_df_to_xml(man.df = filter(test1_res, bloc_nr %in% 9:9))
class(test1_xml)
test1_xml

# test case 2
test2_xml <- manifesto_df_to_xml(test2_res, parse = FALSE)
cat(test2_xml)
test2_xml <- manifesto_df_to_xml(test2_res)
class(test2_xml)
test2_xml

# test case 3
test3_xml <- manifesto_df_to_xml(man.df = test3_res, parse = FALSE)
class(test3_xml); length(test3_xml)
test3_xml <- manifesto_df_to_xml(test3_res)
class(test3_xml)
test3_xml
html_node(test3_xml, "title")

# test case 4
test4_xml <- manifesto_df_to_xml(man.df = test4_res, parse = FALSE)
class(test4_xml); length(test4_xml)
test4_xml <- manifesto_df_to_xml(test4_res)
class(test4_xml)
test4_xml
html_node(test4_xml, "title")

As shown in the last code block, you can access the different nodes and attributes of the returned XML documents using {rvest}'s html_*() functions.
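
The difference between an XML-formatted string and a parsed document, and how to query the latter, can also be illustrated with {xml2} alone. Note that the element and attribute names in this toy string ('manifesto', 'chapter', 'sentence', 'qs', 'cmp_code') are hypothetical; apart from the 'title' node queried above, I make no claim about the schema the package actually produces.

```r
library(xml2)

# toy XML string with hypothetical element and attribute names
xml_string <- paste0(
  '<manifesto id="toy_1">',
  '<title>Our programme</title>',
  '<chapter nr="1"><sentence nr="1">',
  '<qs cmp_code="402">We cut taxes</qs>',
  '<qs cmp_code="408">and create jobs.</qs>',
  '</sentence></chapter>',
  '</manifesto>'
)

# parsing the string yields an xml_document that can be queried with XPath
doc <- read_xml(xml_string)

title_text <- xml_text(xml_find_first(doc, "//title"))
codes <- xml_attr(xml_find_all(doc, "//qs"), "cmp_code")

title_text
codes
```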

Step 5 (optional): Apply to all manifestos in one pipeline

The code below simply combines the above steps into a single pipeline.

man_xmls <- man_sents %>% 
  # drop unnecessary columns
  select(-has_eu_code, -may_contradict_core_dataset, -md5sum_text, -md5sum_original, -annotations, -id) %>% 
  # unnest data
  unnest(data) %>% 
  # split by manifesto ID (-> list of DFs)
  split(.$manifesto_id) %>% 
  # apply enhance
  map(enhance_manifesto_df) %>% 
  # apply converter
  map(safely(manifesto_df_to_xml)) %>% 
  # gather all in list-column data frame 
  enframe() %>% 
  transmute(
    manifesto_id = name
    , xml = map(value, "result")
    , errors = map(map(value, "error"), "message")
    , n_errors = lengths(errors)
  )

Calling this code returns a tibble with list-column 'xml' containing the individual manifesto XML documents.

# the returned tibble contains xml_documents in the 'xml' list-column
head(man_xmls, 3)

To make sure that all conversions were successful, you can look at column 'n_errors':

# there were no errors raised
man_xmls %>% 
  filter(n_errors > 0) %>% 
  unnest(errors)
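
The error-collecting pattern used in the pipeline (wrapping the converter in safely() and then pulling out results and error messages) can be tried on toy input. In this sketch, to_sqrt() is a hypothetical converter that fails on non-numeric input; it merely stands in for manifesto_df_to_xml().

```r
library(dplyr)
library(purrr)
library(tibble)

# hypothetical converter: errors on input that is not a number
to_sqrt <- function(x) {
  n <- suppressWarnings(as.numeric(x))
  if (is.na(n)) stop("not a number: ", x)
  sqrt(n)
}

inputs <- list(a = "4", b = "oops", c = "9")

res <- inputs %>%
  # safely() captures errors instead of aborting the whole loop
  map(safely(to_sqrt)) %>%
  enframe() %>%
  transmute(
    id = name
    , result = map(value, "result")
    , errors = map(map(value, "error"), "message")
    , n_errors = lengths(errors)
  )

# rows with n_errors > 0 identify the failed conversions
filter(res, n_errors > 0)
```

Because each element is processed independently, one failing input does not prevent the others from being converted.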

Step 6 (optional): Write manifestos to disk

You can also loop over the XML documents and write them to disk.

# write to disk; walk2() is used because only the side effect matters
walk2(
  man_xmls$xml
  , man_xmls$manifesto_id
  , function(xml, id, path = file.path("your", "local", "path")) {
    write_xml(xml, file.path(path, paste0(id, ".xml")))
  }
)
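
If you first want to try the write step without touching your own folders, a minimal sketch using a temporary directory could look like this. The two toy documents, their IDs and the folder name 'manifesto_xmls' are made up for illustration.

```r
library(xml2)
library(purrr)

# toy documents and IDs (invented for illustration)
docs <- list(
  read_xml("<manifesto><title>A</title></manifesto>"),
  read_xml("<manifesto><title>B</title></manifesto>")
)
ids <- c("toy_1", "toy_2")

# write into a temporary folder, creating it if needed
out_dir <- file.path(tempdir(), "manifesto_xmls")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

out_files <- file.path(out_dir, paste0(ids, ".xml"))
walk2(docs, out_files, function(xml, file) write_xml(xml, file))

# check that the files exist and can be read back
file.exists(out_files)
reread_title <- xml_text(xml_find_first(read_xml(out_files[1]), "//title"))
```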