In rOpenGov/bibliographica: Bibliographic Data Analysis

Preprocessing summary

The data spanning years r min(na.omit(df$publication_year))-r max(na.omit(df$publication_year)) has been included and contains r nrow(df.preprocessed) documents (also other filter may apply depending on the data collection, see the source code for details.

Specific fields

Annotated documents

r nrow(df.orig) documents in the original raw data
r nrow(df) documents in the final preprocessed data (r round(100 * nrow(df)/nrow(df.orig), 2)%)

Fraction of documents with data:

availability <- field_availability(df)
print(availability$plot)

Same in exact numbers: documents with available/missing entries, and number of unique entries for each field. Sorted by missing data:

tab <- availability$table %<>% arrange(n)
names(tab) <- gsub("missing", "missing (%)", names(tab))
names(tab) <- gsub("available", "available (%)", names(tab))
names(tab) <- gsub("^n$", "available (n)", names(tab))
names(tab) <- gsub("unique_entries", "unique (n)", names(tab))
names(tab) <- gsub("field_name", "field name", names(tab))
kable(tab[, c(1, 3, 2, 4, 5)], digits = 1, caption = "Data availability")
rm(tab);gc()

Field conversions

This documents the conversions from raw data to the final preprocessed version (accepted, discarded, conversions). Only some of the key tables are explicitly linked below. The complete list of all summary tables is here.

Brief description of the fields:

Histograms of all entries for numeric variables

num <- c();
for (field in names(df)) {num[[field]] <- is.numeric(df[[field]])}
numeric.fields <- setdiff(names(which(num)), c("row.index", "original_row", "unity"))
for (field in numeric.fields) {
  x <- log10(min(df[[field]], na.rm = TRUE)/2 + df[[field]])
  x <- x[!is.na(x) & !is.nan(x)]
  if (length(x) > 0) {
    hist(x, 30,
       main = paste(field, "histogram"),
       ylab = "Documents",
       xlab = paste(field, "(log10)")
       )
  }
}

Histograms of the top entries for factor variables

Non-trivial factors with at least 2 levels are shown.

fac <- c(); for (field in names(df)) {fac[[field]] <- is.factor(df[[field]])}
factor.fields <- names(which(fac))
for (field in factor.fields) {
  n <- min(length(unique(df[[field]])), ntop)
  if (length(n) > 1) {
    p <- top_plot(df, field, n)  
    p <- p + ggtitle(paste("Top ", field))
    p <- p + scale_y_log10()
    p <- p + ylab("Documents (Log10)")
    print(p)
  }
}