In rOpenGov/bibliographica: Bibliographic Data Analysis

Publication year

Publication year is available for `r sum(!is.na(df$publication_year))` documents (`r round(100*mean(!is.na(df$publication_year)))`%). The publication years span `r paste(range(na.omit(df$publication_year)), collapse = "-")`.

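# Title count per decade (requires dplyr and ggplot2)
library(dplyr)
library(ggplot2)
df <- df.preprocessed
# Keep decades before 2010; no grouping is needed because geom_bar() counts the rows per decade
df2 <- df %>% filter(publication_decade < 2010)
p <- ggplot(df2, aes(publication_decade)) +
     geom_bar() + scale_y_log10() +
     ggtitle("Title count timeline")
print(p)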

Publication frequency

Publication frequency information is available for `r sum(!is.na(df$publication_frequency_text))` documents (`r round(100*mean(!is.na(df$publication_frequency_text)))`%). The links below do not work when the corresponding lists are empty. The (estimated annual) frequencies are converted to plain text according to their closest match in this table.

Publication frequency accepted

Publication frequency conversions

Publication frequency discarded
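The closest-match conversion of estimated annual frequencies to plain text, mentioned above, could look roughly like the following sketch. The lookup table `freq_table` and the helper `closest_frequency_text()` are hypothetical names introduced for illustration. The actual table used by the package is not reproduced here, and "closest" is assumed to mean the smallest absolute difference.

```r
# Hypothetical lookup table (illustration only, not the package's table):
# estimated number of issues per year and the matching plain-text label.
freq_table <- data.frame(
  annual = c(1, 2, 4, 12, 52, 365),
  text   = c("Annual", "Semiannual", "Quarterly", "Monthly", "Weekly", "Daily"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map each annual frequency to the label of the
# nearest table entry. Assumption: nearest = smallest absolute difference.
closest_frequency_text <- function(annual, table = freq_table) {
  vapply(annual, function(a) {
    # Missing or non-positive frequencies cannot be matched
    if (is.na(a) || a <= 0) return(NA_character_)
    table$text[which.min(abs(table$annual - a))]
  }, character(1))
}

labels <- closest_frequency_text(c(1, 11, 50, NA))
print(labels)
```

On ties, `which.min()` returns the first minimum, so a frequency that is equally close to two entries receives the label of the lower one. A real implementation may prefer a different rule.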

Publication interval

Publication interval is available for `r sum(!is.na(df$publication_interval_from) | !is.na(df$publication_interval_till))` documents (`r round(100*mean(!is.na(df$publication_interval_from) | !is.na(df$publication_interval_till)))`%).

Publication interval accepted

Publication interval conversions

Publication interval discarded

Editions

Automated detection of potential first editions (first_edition field) identifies unique author-title pairs and proposes the first occurrence (earliest publication_year) as the first edition. If there are multiple instances from the same earliest year, they are all marked as potential first editions. Whether this information is readily available in MARC remains to be checked.
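The rule described above can be sketched in base R as follows. This is a minimal illustration, not the package's own implementation: the helper name `mark_first_editions()` and the toy data are made up, and rows with a missing publication year are assumed never to be flagged.

```r
# Toy catalogue: the column names follow the text, the rows are invented.
docs <- data.frame(
  author = c("Smith", "Smith", "Smith", "Jones", "Jones"),
  title  = c("Poems", "Poems", "Poems", "Essays", "Essays"),
  publication_year = c(1750, 1750, 1762, 1801, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: within each author-title pair, flag every row
# whose publication year equals the earliest known year of that pair.
mark_first_editions <- function(d) {
  key <- paste(d$author, d$title, sep = "\t")
  earliest <- ave(d$publication_year, key, FUN = function(y) {
    if (all(is.na(y))) NA_real_ else min(y, na.rm = TRUE)
  })
  # Assumption: documents without a publication year are not flagged
  d$first_edition <- !is.na(d$publication_year) &
    d$publication_year == earliest
  d
}

docs <- mark_first_editions(docs)
print(docs)
```

Both 1750 copies of the Smith title are flagged, which matches the rule that all instances from the same earliest year count as potential first editions.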

There are `r nrow(unique(df[, c("title", "author")]))` unique title-author pairs, and `r nrow(df %>% group_by(title, author) %>% tally() %>% filter(n > 1))` of those have multiple occurrences, sometimes with different publication years. The earliest occurrence is suggested as the first edition.

This figure shows the number of first editions per decade.

# First editions vs. total title count per decade (requires dplyr, reshape2 and ggplot2)
library(dplyr)
library(reshape2)
library(ggplot2)
df <- df.preprocessed
df <- df %>% group_by(publication_decade) %>%
             summarise(total = n(), first = sum(first_edition, na.rm = TRUE))
# Long format: one row per decade and series (total, first)
df2 <- melt(as.data.frame(df), id.vars = "publication_decade")
theme_set(theme_bw(20))
p <- ggplot(df2, aes(x = publication_decade, y = value, group = variable)) +
       geom_point(aes(shape = variable), size = 5) +
       geom_line() +   # shape is not a geom_line() aesthetic, so it is dropped here
       geom_smooth(aes(col = variable, fill = variable)) +
       xlab("Publication decade") +
       ylab("Titles (n)") +
       #scale_y_log10() +
       ggtitle("First editions vs. total title count")
print(p)