knitr::opts_chunk$set(
  echo = TRUE
  , eval = TRUE
  , warning = FALSE
  , message = FALSE
  , fig.align = 'center'
  , fig.height = 4
  , fig.width = 8
)

In this short tutorial, I show how to use functions provided by the manifestoEnhanceR package can be used to convert a large batch of manifesto documents into XML documents enriched with document structure (headers, sentences, quasi-sentence nesting) and meta data (e.g, manifesto title or quasi-sentence CMP code).

I use the entire Swiss manifesto corpus and you'll be guided step by step through the process of

  1. querying manifestos from the Manifesto Project API,
  2. covnerting the manifestoR package's custom data objects into a tidy data format (i.e., a data frame),
  3. enhancing manifesto data frames with text-level information (e.g., quasi-sentence and chapter numbers), and
  4. converting the enhanced manifesto data frames to XML documents.

Before we beginn with this, a short note on why one would want to convert manifestoR data object to XML files.

Background

By enhancing manifesto data frames, we reconstruct a manifestos document structure. If you look at a manifesto in its print or online version, it starts with a title, then comes an introduction or summary of key points, following by a number of chapters that comprise many more sentences.

This document structure is not reflected in the way how the Manifesto Project makes its data available. While this is not a problem if you are merely interested in their codings, it can be a problem in case you want to use the manifesto test for other puporses, such as topic modeling or for training a word embedding model. For these latter puproses, quasi-sentences are not an ideal unit of (dis)aggregation. Sentences, paragraphs or chapters seem more appropriate.

Now, it turns out while paragraph structure cannot be recovered, we can infer sentences and chapter structure by combining the text of quasi-sentences and their codings. This is the purpose of the manifestoEnhanceR package.

Step 0: Loading required pacakges

# if not installed: `install.packages("manifestoR")`
library(manifestoR)
# if not installed: `devtools::install_github("haukelicht/manifestoEnhanceR)`
library(manifestoEnhanceR)
library(dplyr)
library(tidyr)
library(purrr)
library(xml2)
library(rvest)

Step 1: Load manifesto data

# set Manifesto Project API key (see `?mp_setapikey`)
mp_setapikey(file.path(Sys.getenv("SECRETS_PATH"), "manifesto_apikey.txt")) 
# get CMP party-year data
cmp <- mp_maindataset(south_america = FALSE)

# subset to Swiss configurations
ch_configs <- cmp %>% 
  filter(countryname == "Switzerland") %>% 
  select(1:10)

# get Swiss manifestos
ch <- mp_corpus(countryname == "Switzerland", cache = TRUE)

Conversion to tidy data

The object ch is a 'ManifestoCorpus' object, which is essentially a list 'ManifestoDocument' objects with some additional fancy attributes.

class(ch)
str(head(ch, 2), 1)
lapply(head(ch, 2), class)

We can tidy this up (i.e., convert it to a long tibble) by applying the first manifestoEnhanceR function we encounter in this tutorial: as_tibble.ManifestoCorpus

man_dfs <- as_tibble(ch)
head(man_dfs)

man_dfs is a tibble (a fancy dataframe) with a list-column called 'data'. For each row, column data contains a manifesto tibble.

To this manifesto-level tibble, we can left-join manifesto- and party-level data:

man_sents <- man_dfs %>% 
  left_join(
    ch_configs %>% 
      transmute(manifesto_id = paste0(party, "_", date), party, partyname, partyabbrev) %>% 
      unique()
  )

Step 2: Define some test cases

To illustrate the other manifestoEnhanceR functions, we take four manifestos from the above corpus, each with some particular features (as commented below).

test1_df <- unnest(filter(man_sents, manifesto_id == "43120_201110"), data)
test2_df <- unnest(filter(man_sents, manifesto_id == "43110_198710"), data)
test3_df <- unnest(filter(man_sents, manifesto_id == "43110_199910"), data)
test4_df <- unnest(filter(man_sents, manifesto_id == "43120_201510"), data)

This sounds confusing? It is. But no worries. Function enhance_manifesto_df() handles all these intricacies for you as shown below.

Step 3: Enhance manifesto data frames

enhance_manifesto_df() returns a manifesto.df object that inherits from the input tibble, and simply adds four columns indicating document structure:

In addition, the returned manifesto.df obejct has two additional attributes:

By enhancing manifesto data frames, we add bloc, sentence and quasi-sentence (qs) counters, as well as a role indicators ('qs', 'header', 'title', or 'meta').

As one natrual sentence may contain multiple quasi-sentences, the latter map m:1 to the former. A bloc, in turn, is a number of consecutive rolws that all have the same 'role'. The following rows are defined:

This information is important if you want to count the number of chapters in a manifesto (= No. headers), or print only the title (if exists). And as noted above, having at hand this information, one could easily aggregate quasi-sentences at the sentence or chapter level while omitting title and other meta text.

test1_res <- enhance_manifesto_df(test1_df)
select(test1_res, 1:4, 17:25)
class(test1_res)
attr(test1_res, "annotated")
attr(test1_res, "extra_cols")
"title" %in% test1_res$role
test2_res <- enhance_manifesto_df(test2_df)
select(test2_res, 1:4, 17:25)
class(test2_res)
attr(test2_res, "annotated")
test3_res <- enhance_manifesto_df(test3_df)
class(test3_res)
attr(test3_res, "annotated")
"title" %in% test3_res$role
test4_res <- enhance_manifesto_df(test4_df)
class(test4_res)
attr(test4_res, "annotated")
"title" %in% test4_res$role

Step 4: Convert manifesto data frames to XML documents

Having enhanced our test case data frames, we can now use manifesto_df_to_xml() to convert them to XML. You have basically two options

# test case 1: as string
test1_xml <- manifesto_df_to_xml(test1_res, parse = FALSE)
class(test1_xml)
length(test1_xml)
test1_xml <- manifesto_df_to_xml(filter(test1_res, bloc_nr < 3), parse = FALSE)
# cat(test1_xml)

# as XML document
test1_xml <- manifesto_df_to_xml(test1_res)
test1_xml <- manifesto_df_to_xml(man.df = filter(test1_res, bloc_nr %in% 9:9))
class(test1_xml)
test1_xml
# test case 2
test2_xml <- manifesto_df_to_xml(test2_res, parse = FALSE)
cat(test2_xml)
test2_xml <- manifesto_df_to_xml(test2_res)
class(test2_xml)
test2_xml
# test case 3
test3_xml <- manifesto_df_to_xml(man.df = test3_res, parse = FALSE)
class(test3_xml); length(test3_xml)
test3_xml <- manifesto_df_to_xml(test3_res)
class(test3_xml)
test3_xml
html_node(test3_xml, "title")
# test case 4
test4_xml <- manifesto_df_to_xml(man.df = test4_res, parse = FALSE)
class(test4_xml); length(test4_xml)
test4_xml <- manifesto_df_to_xml(test4_res)
class(test4_xml)
test4_xml
html_node(test4_xml, "title")

As shown in the last code block, you can access the different nodes and attributes of the returned XML documents using {rvest}'s html_*.

Step 5 (optional): Apply to all manifestos in one pipeline

The below code simply combines the above steps in a single pipeline.

man_xmls <- man_sents %>% 
  # drop unneccessary columns
  select(-has_eu_code, -may_contradict_core_dataset, -md5sum_text, -md5sum_original, -annotations, -id) %>% 
  # unnest data
  unnest(data) %>% 
  # split by manifesto ID (-> list od DFs)
  split(.$manifesto_id) %>% 
  # apply enhance
  map(enhance_manifesto_df) %>% 
  # apply converter
  map(safely(manifesto_df_to_xml)) %>% 
  # gather all in list-column data frame 
  enframe() %>% 
  transmute(
    manifesto_id = name
    , xml = map(value, "result")
    , errors = map(map(value, "error"), "message")
    , n_errors = lengths(errors)
  )

Calling this code returns a tibble with list-column 'xml' containing the individual manifesto XML documents.

# the returned tibble contains xml_documents in the 'xml' list-column
head(man_xmls, 3)

To make sure that all covnersions were successful, you can look at column 'n_errors':

# there were no errors raised
man_xmls %>% 
  filter(n_errors > 0) %>% 
  unnest(errors)

Step 6 (optional): write manifestos to disk

You can alos loop over XML documents and write them to disk.

# write to disk
map2(
  man_xmls$xml
  , man_xmls$manifesto_id
  , function(xml, id, path = file.path("your", "local", "path")) {
  write_xml(xml, file.path(path, paste0(id, ".xml")))  
})


haukelicht/manifestoEnhanceR documentation built on March 30, 2020, 3:15 a.m.