dnddata

This is a weekly updated dataset of characters submitted to my web applications printSheetApp and interactiveSheet. It is a superset of the dataset I previously released under oganm/dndstats, with a much larger sample size (7,946 characters) and more data fields. It was inspired by the FiveThirtyEight article on race/class proportions, and the data seems to correlate well with those results (see my dndstats article).

Along with a simple table (an R data.frame in the package), the data is also available in JSON format (an R list in the package). In the table version, some data fields encode complex information that is represented in a more readable manner in the JSON format. The data included is otherwise identical.

Usage/installation

If you are an R user, you can simply install this package and load it to access the dataset:

devtools::install_github('oganm/dnddata')
library(dnddata)

Try ?tables and ?lists to see the available objects and their descriptions.

If you are not an R user, access the files within the data-raw directory. The files are available as JSON and TSV. You can find the field descriptions below. The dnd_chars_all files contain every submitted character, while the dnd_chars_unique files are filtered to include only unique characters.
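For instance, the raw files can be read without installing the package. This is a minimal sketch; the file names below are assumptions based on the naming described above, so check the data-raw directory for the exact names.

library(jsonlite)

# flat table version (assumed file name)
dnd_chars_unique = read.delim('data-raw/dnd_chars_unique.tsv', stringsAsFactors = FALSE)

# nested list version, which preserves the grouped fields (assumed file name)
dnd_chars_unique_list = jsonlite::read_json('data-raw/dnd_chars_unique.json')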

Examples

I will be using the list form of the dataset as a basis here.

Let’s replicate that plot from FiveThirtyEight as I did in my original article.

library(purrr)
library(ggplot2)
library(magrittr)
library(dplyr)
library(reshape2)

# find all available races
races = dnd_chars_unique_list %>% 
    purrr::map('race') %>% 
    purrr::map_chr('processedRace') %>% trimws() %>% 
    unique %>% {.[.!='']}

# find all available classes
classes = dnd_chars_unique_list %>% 
    purrr::map('class') %>%
    unlist(recursive = FALSE) %>%
    purrr::map_chr('class') %>% trimws() %>%  unique

# create an empty matrix
coOccurenceMatrix = matrix(0, nrow = length(races), ncol = length(classes))
colnames(coOccurenceMatrix) = classes
rownames(coOccurenceMatrix) = races
# fill the matrix with co-occurences of race and classes
for(i in seq_along(races)){
    for(j in seq_along(classes)){
        # get characters with the right race
        raceSubset = dnd_chars_unique_list[dnd_chars_unique_list %>% 
                          purrr::map('race') %>% 
                          purrr::map_chr('processedRace') %>% {.==races[i]}]

        # get the characters with the right class. Weight multiclassed characters based on level
        raceSubset %>% purrr::map('class') %>% 
            purrr::map_dbl(function(x){
                x  %>% sapply(function(y){
                    (trimws(y$class) == classes[j])*y$level/(sum(map_int(x,'level')))
                }) %>% sum}) %>% sum -> coOcc

        coOccurenceMatrix[i,j] = coOcc
    }
}

# reorder the matrix a little bit
coOccurenceMatrix = 
    coOccurenceMatrix[coOccurenceMatrix %>% apply(1,sum) %>% order(decreasing = FALSE),
                            coOccurenceMatrix %>% apply(2,sum) %>% order(decreasing = TRUE)]

# calculate percentages
coOccurenceMatrix = coOccurenceMatrix/(sum(coOccurenceMatrix))* 100

# remove the rows and columns if they are less than 1%
coOccurenceMatrixSubset = coOccurenceMatrix[,!(coOccurenceMatrix %>% apply(2,sum) %>% {.<1})]
coOccurenceMatrixSubset = coOccurenceMatrixSubset[!(coOccurenceMatrixSubset %>% apply(1,sum) %>% {.<1}),]

# add in class and race sums
classSums = coOccurenceMatrix %>% apply(2,sum) %>% {.[colnames(coOccurenceMatrixSubset)]}
raceSums = coOccurenceMatrix %>% apply(1,sum) %>% {.[rownames(coOccurenceMatrixSubset)]}
coOccurenceMatrixSubset = cbind(coOccurenceMatrixSubset,raceSums)
coOccurenceMatrixSubset = rbind(Total = c(classSums,NA), coOccurenceMatrixSubset)
colnames(coOccurenceMatrixSubset)[ncol(coOccurenceMatrixSubset)] = "Total"

# ggplot
coOccurenceFrame = coOccurenceMatrixSubset %>% reshape2::melt()
names(coOccurenceFrame)[1:2] = c('Race','Class')
coOccurenceFrame %<>% mutate(fillCol = value*(Race!='Total' & Class!='Total'))
coOccurenceFrame %>% ggplot(aes(x = Class,y = Race)) +
    geom_tile(aes(fill = fillCol),show.legend = FALSE)+
    scale_fill_continuous(low = 'white',high = '#46A948',na.value = 'white')+
    cowplot::theme_cowplot() + 
    geom_text(aes(label = value %>% round(2) %>% format(nsmall=2))) + 
    scale_x_discrete(position='top') + xlab('') + ylab('') + 
    theme(axis.text.x = element_text(angle = 30,vjust = 0.5,hjust = 0)) 

Or try something new. Wonder which fighting style is more popular?

# tabulate the chosen fighting styles across unique characters
dnd_chars_unique_list %>% purrr::map('choices') %>% 
    purrr::map('fighting style') %>% 
    unlist %>%
    table %>% 
    sort(decreasing = TRUE) %>% 
    as.data.frame %>% 
    # the first column of the resulting data frame is named "."
    ggplot(aes(x = ., y = Freq)) +
    geom_bar(stat = 'identity') +
    cowplot::theme_cowplot() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

About the data

Column/element description

The list version of this dataset contains all of these fields, but they are organised a little differently, keeping related fields such as spells and processedSpells together.
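For a quick look at how the list version groups these fields, you can inspect a single character. This is just an exploratory sketch; the exact element names (for example spells) depend on the current version of the dataset.

# top-level fields of one character in the list version
names(dnd_chars_unique_list[[1]])

# nested fields such as spells keep the raw and processed values together
str(dnd_chars_unique_list[[1]]$spells, max.level = 2)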

Caveats

Possible Issues with data fields

Some data fields are more reliable than others. Below is a summary of the potential problems with each data field.

74% of all spells parsed did not require any modification. 21% could only be matched through the heuristics; a manual examination of a random selection of these matches revealed 2 mistakes out of 200. 5% of the spell entries were not matched to an official spell. Manual inspection of these entries showed that the common reasons for a failure to match are users writing the spell under the wrong spell level, writing class/race features such as blindsight as spells, or adding/removing more than 10 characters when writing the spell name, either through abbreviation or by adding extra information about the spell.

80% of all weapons parsed did not require any modification. 14% could only be matched through the heuristics; a manual examination of a random selection of these matches revealed 1 mistake out of 200. 6% of the weapon entries were not matched to an official weapon.
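To get a sense of what that 10-character threshold looks like in practice, here is a minimal edit-distance sketch against a list of official spell names. The spellNames vector, the example entry, and the use of adist are illustrative assumptions, not the actual matching code used to build the dataset.

# hypothetical vector of official spell names
spellNames = c('Fireball', 'Eldritch Blast', 'Cure Wounds')

# distance from a user-entered spell to each official name
entry = 'Eldritch Blast (agonizing)'
distances = adist(trimws(entry), spellNames, ignore.case = TRUE)

# entries far away from every official name are left unmatched
if (min(distances) > 10) {
    print('no match')
} else {
    print(spellNames[which.min(distances)])
}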

Possible issues with detection of unique characters

Identification of unique characters relies on some heuristics. I assume any characters with the same name and class are potentially the same character, and in these cases I keep the highest level one. Race and other properties are not considered, so some unique characters may be lost along the way. I have chosen to be less exact to reduce the number of possible test characters, since there were examples of people submitting essentially the same character with different races, presumably to test things out. For multiclassed characters, if a lower level character with the same name and a subset of the classes exists, it is removed, again leaving the character with the highest level.
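The core of the name/class rule can be sketched with dplyr on the table version. This is an illustration of the idea rather than the exact filtering code, and the column names used here are assumptions; see the field descriptions above for the actual names.

library(dplyr)

# keep the highest level character for every name/class combination
# (column names are assumptions, not necessarily the real ones)
unique_sketch = dnd_chars_all %>%
    group_by(name, class) %>%
    slice_max(level, n = 1, with_ties = FALSE) %>%
    ungroup()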

Possible issues with selection bias

This data comes from characters submitted to my web applications. The applications are written to support a popular third party character sheet app for mobile platforms. I have advertised my applications primarily on Reddit (r/dndnext and r/dnd), and I have seen them mentioned on a few other platforms through word of mouth. That means we are looking at subsamples of subsamples, all of which can introduce some amount of selection bias. Some characters could be thought experiments or made for testing purposes and never see actual game play.
