The data spanning years `r min(na.omit(df$publication_year))`-`r max(na.omit(df$publication_year))` has been included and contains `r nrow(df.preprocessed)` documents (other filters may also apply depending on the data collection; see the source code for details).
`r nrow(df.orig)` documents in the original raw data

`r nrow(df)` documents in the final preprocessed data (`r round(100 * nrow(df)/nrow(df.orig), 2)`%)

Fraction of documents with data:
availability <- field_availability(df)
print(availability$plot)
Same in exact numbers: documents with available/missing entries, and number of unique entries for each field. Sorted by missing data:
# Sort by the number of available entries (most missing data first).
# %>% is used instead of %<>% so that availability$table is not overwritten.
tab <- availability$table %>% arrange(n)
# Make the column names readable
names(tab) <- gsub("missing", "missing (%)", names(tab))
names(tab) <- gsub("available", "available (%)", names(tab))
names(tab) <- gsub("^n$", "available (n)", names(tab))
names(tab) <- gsub("unique_entries", "unique (n)", names(tab))
names(tab) <- gsub("field_name", "field name", names(tab))
kable(tab[, c(1, 3, 2, 4, 5)], digits = 1, caption = "Data availability")
# Free memory
rm(tab)
gc()
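The internals of `field_availability()` are not shown in this section. As a rough illustration of what such a summary involves, the base R sketch below computes, for each field, the number and percentage of available entries, the percentage of missing entries, and the number of unique non-missing entries. The helper name `availability_sketch` and the toy data frame are hypothetical. Only the column names are taken from the renaming step above, and the actual function may work differently.

```r
# Hypothetical helper: summarize availability for each column of a data frame.
# Illustration only; this is not the package's field_availability().
availability_sketch <- function(d) {
  n_available <- vapply(d, function(x) sum(!is.na(x)), integer(1))
  n_unique <- vapply(d, function(x) length(unique(x[!is.na(x)])), integer(1))
  tab <- data.frame(
    field_name = names(d),
    n = n_available,
    available = 100 * n_available / nrow(d),
    missing = 100 * (nrow(d) - n_available) / nrow(d),
    unique_entries = n_unique,
    stringsAsFactors = FALSE,
    row.names = NULL
  )
  # Fewest available entries (most missing data) first
  tab[order(tab$n), ]
}

# Made-up example data
toy <- data.frame(
  title = c("A", "B", "C", "D"),
  publication_year = c(1700, NA, 1700, 1750),
  language = c(NA, NA, "fin", NA)
)
sketch_tab <- availability_sketch(toy)
print(sketch_tab)
```

Sorting by `n` in ascending order puts the fields with the most missing data at the top, which matches the ordering described for the table.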
This documents the conversions from raw data to the final preprocessed version (accepted, discarded, conversions). Only some of the key tables are explicitly linked below. The complete list of all summary tables is here.
Brief description of the fields:
# Identify numeric fields, excluding bookkeeping columns
num <- vapply(df, is.numeric, logical(1))
numeric.fields <- setdiff(names(which(num)), c("row.index", "original_row", "unity"))
for (field in numeric.fields) {
  # Shift by half the minimum before taking log10
  x <- log10(min(df[[field]], na.rm = TRUE)/2 + df[[field]])
  # Drop NA, NaN and infinite values, which hist() cannot handle
  x <- x[is.finite(x)]
  if (length(x) > 0) {
    hist(x, 30,
      main = paste(field, "histogram"),
      ylab = "Documents",
      xlab = paste(field, "(log10)"))
  }
}
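The histograms are drawn on a log10 scale after shifting each value by half of the field's minimum. A minimal sketch of that transformation on made-up vectors may make the effect easier to see. The function name `log10_shifted` and the example values are assumptions for illustration only.

```r
# Hypothetical helper illustrating the shifted log10 transform.
log10_shifted <- function(v) {
  # Half of the smallest observed value is used as the offset
  offset <- min(v, na.rm = TRUE) / 2
  x <- suppressWarnings(log10(offset + v))
  # Keep only finite results (drops NA, NaN, Inf and -Inf)
  x[is.finite(x)]
}

pages <- c(2, 8, NA, 98, 998)   # made-up values; the offset is 1
x <- log10_shifted(pages)
print(x)

# With a minimum of zero the offset is zero, so log10(0) = -Inf is dropped
zeros <- c(0, 10, 100)
print(log10_shifted(zeros))
```

When the minimum of a field is zero, the shift does not prevent `-Inf` values, so they have to be filtered out before the histogram is drawn.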
Non-trivial factors with at least 2 levels are shown.
# Identify factor fields
fac <- vapply(df, is.factor, logical(1))
factor.fields <- names(which(fac))
for (field in factor.fields) {
  # Number of entries to show, capped at ntop
  n <- min(length(unique(df[[field]])), ntop)
  # Plot only factors with at least 2 levels
  # (length(n) > 1 was never true, so nothing was plotted)
  if (n > 1) {
    p <- top_plot(df, field, n)
    p <- p + ggtitle(paste("Top ", field))
    p <- p + scale_y_log10()
    p <- p + ylab("Documents (Log10)")
    print(p)
  }
}
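`top_plot()` is not defined in this section, so its exact behavior is not documented here. As a rough idea of the kind of counts a "top entries" plot is presumably based on, the base R sketch below tabulates the most frequent levels of a made-up factor. The helper `top_counts`, the variable `ntop_example` (standing in for `ntop`, which is set elsewhere) and the data are all hypothetical.

```r
# Hypothetical helper: counts of the n most frequent levels of a factor.
top_counts <- function(f, n) {
  counts <- sort(table(f, useNA = "no"), decreasing = TRUE)
  counts[seq_len(min(n, length(counts)))]
}

lang <- factor(c("fin", "swe", "fin", "lat", "fin", "swe", NA))
ntop_example <- 2   # stands in for `ntop`
n <- min(length(unique(lang)), ntop_example)

# Only summarize factors with at least 2 entries to show
if (n > 1) {
  top <- top_counts(lang, n)
  print(top)
}
```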