In rOpenGov/bibliographica: Bibliographic Data Analysis

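library(knitr)     # opts_chunk
library(ggplot2)   # ggplot, theme_set
library(dplyr)     # %>%, filter, group_by, tally
library(reshape2)  # melt
ntop <- 20
#opts_chunk$set(comment=NA, fig.width=6, fig.height=6)
opts_chunk$set(fig.path = paste0(output.folder, "figure/"))
theme_set(theme_bw(20))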

Document size comparisons

Dimension information is provided in the original raw data for `r sum(!is.na(df.orig$physical_dimension))` documents (`r round(100*mean(!is.na(df.orig$physical_dimension)),1)`%) but could not be interpreted for `r sum(!is.na(df.orig$physical_dimension) & (is.na(df$gatherings)))` documents (i.e. dimension information was successfully estimated for `r round(100 - 100 * sum(!is.na(df.orig$physical_dimension) & (is.na(df$gatherings)))/sum(!is.na(df.orig$physical_dimension)), 1)`% of the documents where this field was not empty).
Document size (area) information was obtained in the final preprocessed data for `r sum(!is.na(df$area))` documents (`r round(100*mean(!is.na(df$area)))`%). For the remaining documents, critical dimension information was not available or could not be interpreted; see the list of entries where document surface could not be estimated.
Document gatherings information is available in the original data for `r sum(!is.na(df$gatherings.original))` documents (`r round(100*mean(!is.na(df$gatherings.original)))`%), and after estimation for `r sum(!is.na(df$gatherings))` documents (`r round(100*mean(!is.na(df$gatherings)))`%) in the final preprocessed data.
Document height information is available in the original data for `r sum(!is.na(df$height.original))` documents (`r round(100*mean(!is.na(df$height.original)))`%), and after estimation for `r sum(!is.na(df$height))` documents (`r round(100*mean(!is.na(df$height)))`%) in the final preprocessed data.
Document width information is available in the original data for `r sum(!is.na(df$width.original))` documents (`r round(100*mean(!is.na(df$width.original)))`%), and after estimation for `r sum(!is.na(df$width))` documents (`r round(100*mean(!is.na(df$width)))`%) in the final preprocessed data.
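The availability figures above all follow the same pattern: count the non-missing entries of a field and express them as a share of all documents. A minimal sketch of that calculation is shown here; the helper name `field_coverage` is hypothetical (it is not part of bibliographica), and the toy vector stands in for a column such as `df$height`.

```r
# Hypothetical helper (not part of bibliographica): number and percentage
# of non-missing entries in a vector.
field_coverage <- function(x, digits = 1) {
  available <- !is.na(x)
  c(n = sum(available),
    percent = round(100 * mean(available), digits))
}

# Toy data standing in for a column such as df$height
height <- c(20, NA, 18, NA, 25)
field_coverage(height)
```

Wrapping the calculation in a helper would keep the counts and percentages consistent across fields, at the cost of slightly less transparent inline expressions.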

These tables can be used to verify the accuracy of the conversions from the raw data to final estimates:

The estimated dimensions are based on the following auxiliary information sheets:

Document dimension abbreviations
Standard sheet size estimates
Document dimension estimates (used when information is partially missing)
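One way to picture how such a sheet is used when information is partially missing: look up each document's gatherings class in the table and take the table height only where the document has no recorded height. The sketch below uses a toy table with made-up values; it does not reproduce the package's standard sheet size estimates, and the column names and gatherings labels are assumptions.

```r
# Toy lookup table; the values are illustrative only and do not
# reproduce the package's standard sheet size estimates.
sheets <- data.frame(
  gatherings = c("2fo", "4to", "8vo"),
  height = c(45, 30, 20),
  stringsAsFactors = FALSE
)

# Toy documents: some have a recorded height, some only a gatherings class
docs <- data.frame(
  gatherings = c("4to", "8vo", "2fo", NA),
  height = c(NA, 19, NA, NA),
  stringsAsFactors = FALSE
)

# Look up the table height for each document's gatherings class
estimate <- sheets$height[match(docs$gatherings, sheets$gatherings)]

# Keep the recorded height where available, otherwise use the estimate
docs$height.filled <- ifelse(is.na(docs$height), estimate, docs$height)
docs
```

A document with neither a height nor a gatherings class stays missing, which matches the idea that estimation is only possible when at least one of the two is known.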

Left: final gatherings vs. final document dimension (width x height). Right: original gatherings versus original heights where both are available. The point size indicates the number of documents for each case. The red dots indicate the estimated height that is used when only gathering information is available.

df <- df.preprocessed
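# Keep documents with both area and gatherings available
dfs <- df %>% filter(!is.na(area) & !is.na(gatherings))
dfs <- dfs[, c("gatherings", "area")]
# Count documents for each gatherings/area combination
dfm <- melt(table(dfs)) # TODO switch to gather here
names(dfm) <- c("gatherings", "area", "documents")
# Drop combinations that never occur: zero counts are undefined on the log10 size scale
dfm <- dfm[dfm$documents > 0, ]
dfm$gatherings <- factor(dfm$gatherings, levels = levels(df$gatherings))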
p <- ggplot(dfm, aes(x = gatherings, y = area)) 
p <- p + scale_y_continuous(trans = "log2")
p <- p + geom_point(aes(size = documents))
p <- p + scale_size(trans="log10")
p <- p + ggtitle("Gatherings vs. area")
p <- p + xlab("Size (gatherings)")
p <- p + ylab("Size (area)")
p <- p + coord_flip()
print(p)
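The TODO in the chunk above mentions replacing the `melt(table(...))` step. As one possible direction (a sketch, not the package authors' chosen approach), `dplyr::count()` returns one row per observed gatherings/area pair directly. The toy data frame below stands in for `dfs`.

```r
library(dplyr)

# Toy stand-in for the filtered data frame `dfs`
dfs <- data.frame(
  gatherings = c("4to", "4to", "8vo", "8vo", "8vo"),
  area = c(600, 600, 300, 300, 280)
)

# One row per observed gatherings/area pair, with the number of documents
dfm <- dfs %>% count(gatherings, area, name = "documents")
dfm
```

Because only observed pairs are returned, `area` stays numeric and no empty combinations reach the log-scaled point sizes.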

# Compare given dimensions to gatherings
# (not so much data with width so skip that)
df2 <- filter(df, !is.na(height) | !is.na(width))
df2 <- df2[!is.na(as.character(df2$gatherings)),]
df3 <- filter(df2, !is.na(height))
ss <- sheet_sizes()
df3$gathering.height.estimate <- ss[match(df3$gatherings, ss$gatherings),"height"]
df4 <- df3 %>% group_by(gatherings, height) %>% tally()
p <- ggplot(df4, aes(y = gatherings, x = height))
p <- p + geom_point(aes(size = n))
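# Red points: the height estimate used for each gatherings class (one point per class)
p <- p + geom_point(data = unique(df3[, c("gatherings", "gathering.height.estimate")]), aes(y = gatherings, x = gathering.height.estimate), color = "red")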
p <- p + ylab("Gatherings (original)") + xlab("Height (original)") 
p <- p + ggtitle("Gatherings vs. height")
print(p)

Left: Document dimension histogram (surface area); Right: title count per gatherings.

p <- ggplot(df, aes(x = area))
p <- p + geom_histogram() 
p <- p + xlab("Document surface area (log10)")
p <- p + ggtitle("Document dimension (surface area)")
p <- p + scale_x_log10()
print(p)

p <- ggplot(df, aes(x = gatherings)) 
p <- p + geom_bar()
# The number of digits in the largest count sets the upper limit of the log10 breaks
n <- nchar(max(na.omit(table(df$gatherings))))
p <- p + scale_y_log10(breaks=10^(0:n))
p <- p + ggtitle("Title count")
p <- p + xlab("Size (gatherings)")
p <- p + ylab("Title count")
p <- p + coord_flip()
print(p)
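The break positions for the logarithmic count axis in the chunk above are derived from the number of digits in the largest count. A small sketch of that calculation, using made-up counts rather than the real `table(df$gatherings)`:

```r
# Made-up title counts per gatherings class (integers, as returned by table())
counts <- c(`2fo` = 120L, `4to` = 4500L, `8vo` = 23000L)

# Number of digits in the largest count
n <- nchar(max(counts))

# Powers of ten from 1 up to 10^n, enough to cover the largest count
breaks <- 10^(0:n)
breaks
```

Since `10^n` always exceeds a count with `n` digits, the highest break lies above every bar in the plot.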