---
title: 'tipitaka: An R package for analyzing the Pali Canon'
tags:
  - R
  - Tipitaka
  - Pali
  - Buddhism
authors:
  - name: Dan Zigmond
    orcid: 0000-0001-7138-5612
    affiliation: 1
affiliations:
  - name: Jikoji Zen Center, Los Gatos, CA
    index: 1
date: September 26, 2020
bibliography: references.bib
---
The Tipiṭaka or Pali Canon is the canonical scripture of Theravadin Buddhists worldwide. It purports to record the direct teachings of the historical Buddha. It was first recorded in written form in what is now Sri Lanka, likely around 100 BCE, in the Pali language, a Middle Indo-Aryan dialect. The goal of the `tipitaka` package is to make these texts available to students and researchers and allow them to apply the tools of computational linguistics using R.
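The title of the Pali Canon, Tipiṭaka, can be translated roughly as “three baskets”, and the Canon is composed of three distinct sets of scriptures:
- the Vinaya Piṭaka, the rules of monastic discipline
- the Sutta Piṭaka, the discourses
- the Abhidhamma Piṭaka, the systematic philosophical teachings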
Each of these is composed of several books, which in turn are often divided into chapters and verses. The Sutta Piṭaka is the most widely studied and so its divisions have particular significance. It contains four major collections of suttas or discourses (Dīgha Nikāya, Majjhima Nikāya, Saṃyutta Nikāya, and Aṅguttara Nikāya), plus a fifth collection (Khuddaka Nikāya) with a wide variety of generally shorter material.
Although the Tipiṭaka has been studied for nearly 2,000 years, it is not widely available in electronic form. The Pali Text Society (PTS) began publishing the Canon in Roman-script editions in the late 19th century [@PTS]. While these have become the standard reference for Western scholarship, they are not available electronically. In part for this reason, there are few published studies using the techniques of modern text mining and computational linguistics applied to these texts.
The version of the Tipiṭaka included here is based on the Chattha Sangāyana Tipiṭaka version 4.0 (CST4) published by the Vipassana Research Institute and received from them in April 2020 [@VRI]. This edition originated at the Sixth Buddhist Council, held in Burma from 1954 to 1956. It was first published in Burmese script after the Council meetings. The Vipassana Research Institute in India began printing it in Devanagari, and eventually in Roman and several other scripts, in 1990 and later published the results electronically as well. While the Vipassana Research Institute maintains interactive web-based access to these files, they cannot easily be downloaded for computational analysis. The `tipitaka` package aims to rectify this.
The `tipitaka` package primarily consists of the texts of the Tipiṭaka in various electronic forms, plus a few simple functions and data structures for working with the Pali language.
I have made a few edits to the CST4 files in creating this package:
- Where volumes were split across multiple files, they are combined here as a single volume.
- Where volume numbering was inconsistent with the widely used PTS scheme, I have tried to conform with PTS (but see below for exceptions).
- A very few typos that were found while processing have been corrected.
There is no universal script for Pali. Traditionally each Buddhist country uses its own script to write Pali phonetically: in Thai script in Thailand, Burmese in Burma, Sinhalese in Sri Lanka, etc. This package uses the Roman script and the diacritical system developed by the PTS, based on the system commonly used for transliterating Sanskrit.
The contents are organized into the following data structures:
- `tipitaka_raw`: the complete text of the Tipiṭaka
- `tipitaka_long`: the complete Tipiṭaka in "long" form
- `tipitaka_wide`: the complete Tipiṭaka in "wide" form
- `tipitaka_names`: the names and abbreviations of each book of the Tipiṭaka
- `sutta_pitaka`: the names and abbreviations of each volume of the Sutta Piṭaka
- `vinaya_pitaka`: the names and abbreviations of each volume of the Vinaya Piṭaka
- `abhidhamma_pitaka`: the names of each volume of the Abhidhamma Piṭaka
- `sati_sutta_raw`: the complete text of the Mahāsatipatthāna Sutta
- `sati_sutta_long`: the Mahāsatipatthāna Sutta in "long" form

The `_raw` forms are the unparsed text of the Tipiṭaka, with each volume provided as a separate row. The `_long` forms process the texts such that each row provides the count of one unique Pali word in one volume of the Tipiṭaka. For example, the first three rows are:
| book    | word    | n     | total  | freq       |
|---------|---------|-------|--------|------------|
| Abh.VII | paccayo | 13836 | 377230 | 0.03667789 |
| Abh.VII | pe      | 12912 | 377230 | 0.03422845 |
| Abh.VII | dhammo  | 12880 | 377230 | 0.03414363 |
This tells us that the word paccayo (cause; motive) occurs 13,836 times in the seventh volume of the Abhidhamma and represents roughly 3.7% of all words in that volume (13,836 / 377,230). This can be useful in creating "word clouds" and other representations of word frequency per volume.
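As a rough illustration of how the columns of a long-form table relate to one another, the following base-R sketch builds a tiny table of the same shape from an invented text snippet. The object name `toy_long`, the volume label `Toy.I`, and the sample text are all assumptions made for this example. This is not the processing used to build the package data.

```R
# Toy stand-in for one volume: an invented label and an invented snippet,
# not a quotation from the package data.
book <- "Toy.I"
text <- "evaṃ me sutaṃ ekaṃ samayaṃ bhagavā evaṃ me"

# Split on whitespace and count each distinct word.
words  <- strsplit(text, "\\s+")[[1]]
counts <- table(words)

# One row per unique word, mirroring the book/word/n/total/freq columns.
toy_long <- data.frame(
  book  = book,
  word  = names(counts),
  n     = as.integer(counts),
  total = length(words),
  stringsAsFactors = FALSE
)
toy_long$freq <- toy_long$n / toy_long$total

# Most frequent words first.
toy_long <- toy_long[order(-toy_long$n), ]
print(toy_long)
```

Here `freq` is simply `n / total`, which is the same relationship as in the rows above.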
The `_wide` forms transpose this data such that each row is a volume of the Tipiṭaka and each column is a unique Pali word, so that each cell gives the count of one word (the column) in one volume (the row). This is useful for computing the "distance" between various volumes by word frequency and for clustering volumes using these measures.
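The relationship between the two layouts can be sketched with a small invented table. The names `toy_long` and `toy_wide`, the volume labels, and the counts below are assumptions made for illustration. Base R's `tapply` stands in for whatever reshaping the package actually used.

```R
# Invented long-form counts for three hypothetical volumes.
toy_long <- data.frame(
  book = c("A", "A", "B", "B", "C"),
  word = c("dhammo", "pe", "dhammo", "bhikkhu", "pe"),
  n    = c(3, 1, 3, 2, 4),
  stringsAsFactors = FALSE
)

# Reshape to wide: one row per volume, one column per word,
# with 0 where a word does not occur in a volume.
toy_wide <- with(toy_long, tapply(n, list(book, word), sum, default = 0))
print(toy_wide)

# Euclidean distance between volumes, based on their word counts.
d <- dist(toy_wide)
print(d)
```

Volumes that share the same frequent words (here `A` and `B`) end up closer together than volumes that do not.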
The Mahāsatipatthāna Sutta is provided separately although it is also included as part of the Sutta Piṭaka, simply to give an example of one complete discourse. This is a particularly well-known discourse on the foundations of mindfulness.
Note that the Pali alphabet does not follow the alphabetical ordering of English or other Roman-script languages. For this reason, `tipitaka` includes `pali_alphabet`, giving the full Pali alphabet in order, and the functions `pali_lt`, `pali_gt`, `pali_eq`, and `pali_sort` for comparing and sorting Pali strings. Although `pali_sort` is based on Quicksort, this does not mean it is quick. Because of R's copy semantics, `pali_sort` creates many intermediate data structures and is quite slow for large word sets. It is provided primarily for sorting short lists of words for glossaries and the like.
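To make the ordering issue concrete, here is a minimal sketch of alphabet-aware sorting. It assumes the conventional Pali letter order listed in `toy_alphabet`, which may differ in detail from the package's `pali_alphabet`. The helpers `to_letters`, `pali_key`, and `toy_pali_sort` are hypothetical and not part of `tipitaka`. The sketch maps each word to a sortable key and then relies on R's native `order`, which is a different technique from a Quicksort built on pairwise comparisons and is not the package's implementation. It assumes lowercase input in precomposed Unicode containing only letters from the alphabet.

```R
# Assumed ordering: vowels, niggahita, then consonants by class.
toy_alphabet <- c("a", "ā", "i", "ī", "u", "ū", "e", "o", "ṃ",
                  "k", "kh", "g", "gh", "ṅ",
                  "c", "ch", "j", "jh", "ñ",
                  "ṭ", "ṭh", "ḍ", "ḍh", "ṇ",
                  "t", "th", "d", "dh", "n",
                  "p", "ph", "b", "bh", "m",
                  "y", "r", "l", "v", "s", "h", "ḷ")

# Split a word into Pali letters, treating aspirates such as "kh"
# as a single letter.
to_letters <- function(word) {
  chars <- strsplit(word, "")[[1]]
  out <- character(0)
  i <- 1
  while (i <= length(chars)) {
    pair <- if (i < length(chars)) paste0(chars[i], chars[i + 1]) else ""
    if (pair %in% toy_alphabet) {
      out <- c(out, pair)
      i <- i + 2
    } else {
      out <- c(out, chars[i])
      i <- i + 1
    }
  }
  out
}

# Encode a word as fixed-width letter ranks, e.g. "kh" -> "11".
pali_key <- function(word) {
  ranks <- match(to_letters(word), toy_alphabet)
  paste(sprintf("%02d", ranks), collapse = "")
}

# Sort by key; method = "radix" compares the keys byte by byte.
toy_pali_sort <- function(words) {
  keys <- vapply(words, pali_key, character(1), USE.NAMES = FALSE)
  words[order(keys, method = "radix")]
}

sorted <- toy_pali_sort(c("buddha", "dhamma", "ānanda", "khandha", "kamma"))
print(sorted)
```

Under the assumed ordering, "dhamma" sorts before "buddha" because "dh" precedes "b" in the Pali alphabet, the reverse of English alphabetical order.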
This package also includes `pali_stop_words`, a preliminary set of "stop words" for Pali, which is based on the words labeled as "indeclinable" or "participle" in the PTS Pali-English Dictionary [@PED], as well as the most common pronouns [@Geiger]. This is useful in semantic analysis where such very common words should be excluded.
The following are examples of simple analyses using the `tipitaka` package.
The `tipitaka_wide` structure is particularly well-suited to clustering applications. A simple cluster dendrogram of the Pali Canon can be created with just a few lines of R:
```R
library(tipitaka)
dist_m <- dist(tipitaka_wide)  # pairwise distances between volumes
cluster <- hclust(dist_m)      # hierarchical clustering
plot(cluster)
```
We can also perform basic k-means clustering using this same distance measure, illustrated here using the excellent `factoextra` package [@factoextra]:
```R
library(factoextra)
km <- kmeans(dist_m, 2, nstart = 25, algorithm = "Lloyd")
fviz_cluster(km, dist_m, labelsize = 12, repel = TRUE)
```
The `_long` forms are well-suited to creating word clouds based on word frequency, shown here with stop words removed and illustrated with the `wordcloud` package [@wordcloud]:
```R
library(wordcloud)
library(dplyr)
sati_sutta_long %>%
  anti_join(pali_stop_words, by = "word") %>%
  with(wordcloud(word, n, max.words = 40))
```

## Frequency by rank
Finally, we can plot Pali word frequency by rank across the entire *Tipiṭaka*, revealing a classic power law distribution:
```R
library(dplyr)
library(ggplot2)

# Total count of each word across all volumes, as a share of the whole canon
freq_by_rank <- tipitaka_long %>%
  group_by(word) %>%
  add_count(wt = n, name = "word_total") %>%
  ungroup() %>%
  distinct(word, .keep_all = TRUE) %>%
  mutate(tipitaka_total =
           sum(distinct(tipitaka_long, book,
                        .keep_all = TRUE)$total)) %>%
  transform(freq = word_total/tipitaka_total) %>%
  arrange(desc(freq)) %>%
  mutate(rank = row_number()) %>%
  select(-n, -total, -book)

# Log-log plot of frequency against rank
freq_by_rank %>%
  ggplot(aes(rank, freq)) +
  geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) +
  scale_x_log10() +
  scale_y_log10()
```
This is intended to be the first, preliminary release of `tipitaka`. Much more work remains to be done. The following are still in progress:
As mentioned above, `tipitaka` attempts to match the structure of the PTS edition of the Tipiṭaka, but it does not do so perfectly. The PTS and CST4 editions differ in the way they divide the Tipiṭaka into volumes. The resulting numbering in `tipitaka` is as follows:
A future revision to `tipitaka` will correct these inconsistencies and fully conform to PTS volume numbering.
Several features of Pali make it a somewhat tricky language for computational analysis:
Taken together, this means that words often appear in the Pali Canon in a vast array of different forms. For example, the Tipiṭaka contains 270 variations on the word bhikkhu (monk) if one counts all words beginning with the root “bhikkh”.
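A prefix count of this kind can be sketched in a few lines of base R. The vector `toy_words` below is an invented sample, not package data, so the count it yields is purely illustrative. With the package loaded, one might instead start from the unique values of the `word` column of `tipitaka_long`.

```R
# Invented sample of word forms; the real analysis would use the
# unique words found in the canon.
toy_words <- c("bhikkhu", "bhikkhū", "bhikkhave", "bhikkhuno",
               "bhagavā", "dhammo", "bhikkhu")

# Keep each distinct form that begins with the root "bhikkh".
variants <- unique(toy_words[startsWith(toy_words, "bhikkh")])
print(variants)
print(length(variants))
```

Matching on a shared prefix is crude: it would also pick up unrelated words that happen to start with the same letters, which is part of why a proper stemmer is needed.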
There are advantages and disadvantages to using the exact Pali syntax found in the Canon as the basis for linguistic analysis. By way of analogy, the English words monk and monks are obviously distinct, and different texts may vary in the relative frequency of each. On the other hand, something is clearly lost if we treat the two as entirely unrelated, with no more connection than that between monk and mouse. Yet that is exactly what we are doing when we treat, for example, bhikkhu and bhikkhū as entirely distinct words.
It would be very useful to provide a function to convert Pali words to their stem forms in addition to having every variant form available. However, developing an accurate Pali stemming algorithm will be a substantial undertaking. Some progress has been made by others (see, for example, @Basapur, @Elwert, and @Alfter), but no complete algorithm appears to be publicly available yet. This will be tackled in a future release.
Finally, a more efficient `pali_sort` would probably be useful. The current implementation is as much as two orders of magnitude slower than R's native sort. Rewriting the current algorithm in C++ would probably be sufficient to improve the performance substantially.
This package draws heavily on the `tidyverse` [@tidy] and `tidytext` [@tidytext] packages, as well as, of course, on the R statistical programming language [@R].