inst/examples/new/analyses/pagecount.md

title: "Pagecount preprocessing summary"
author: "Leo Lahti / Computational History Group"
date: "2018-07-31"
output: markdown_document

Page counts

Average page counts

Mean and median page counts are calculated from the documents for which page count information was readily available. See also the corresponding numerical tables with page count estimates.

These estimates are used to fill in page count info for the remaining documents where page count info is missing.

For multi-volume documents, the average page counts are given per volume.

The page count estimates are calculated without plates. Plate information is added separately for each document on top of the page count estimate.
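The estimation steps described above can be illustrated with a minimal sketch. It is written in Python for illustration and is not the package's own code. The field names (`gatherings`, `pagecount`, `plates`), the choice of the median as the estimate, the grouping by gatherings class, and the sample records are all assumptions made for the example.

```python
from statistics import median

# Hypothetical records; field names and values are illustrative assumptions.
documents = [
    {"id": 1, "gatherings": "8vo", "pagecount": 120, "plates": 0},
    {"id": 2, "gatherings": "8vo", "pagecount": 200, "plates": 2},
    {"id": 3, "gatherings": "8vo", "pagecount": None, "plates": 4},
    {"id": 4, "gatherings": "4to", "pagecount": 40, "plates": 0},
    {"id": 5, "gatherings": "4to", "pagecount": None, "plates": 0},
]


def median_pagecounts(docs):
    """Median page count per gatherings class, from documents with a known count."""
    known = {}
    for doc in docs:
        if doc["pagecount"] is not None:
            known.setdefault(doc["gatherings"], []).append(doc["pagecount"])
    return {group: median(values) for group, values in known.items()}


def fill_pagecounts(docs):
    """Fill missing page counts with the group estimate, then add plates on top."""
    estimates = median_pagecounts(docs)
    filled = []
    for doc in docs:
        estimated = doc["pagecount"] is None
        if estimated:
            base = estimates.get(doc["gatherings"])  # None if the group has no data
            # The estimate excludes plates, so plates are added separately.
            total = base + doc["plates"] if base is not None else None
        else:
            # Assumption: an observed page count is kept unchanged.
            total = doc["pagecount"]
        filled.append({**doc, "pagecount": total, "estimated": estimated})
    return filled


filled = fill_pagecounts(documents)
```

In this toy data, document 3 receives the 8vo median (160) plus its 4 plates, and document 5 receives the 4to median (40). The `estimated` flag keeps original and estimated values distinguishable in later plots.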

plot of chunk size-pagecountsmulti2

Document size distribution

plot of chunk pagecountstat

Left: Gatherings vs. overall page counts (original + estimated). Right: Only the estimated page counts (for the 384 documents that have missing page count info in the original data):

## Error in grouped_indices_grouped_df_impl(.data): Column `pagecount` is unknown
## Error in FUN(X[[i]], ...): object 'documents' not found

plot of chunk size-estimated

Documents with missing pages over years

## Error in grouped_df_impl(data, unname(vars), drop): Column `publication_decade` is unknown
## Error in `$<-.data.frame`(`*tmp*`, na, value = logical(0)): replacement has 0 rows, data has 9616
## Error in FUN(X[[i]], ...): object 'na' not found
## Error in grouped_df_impl(data, unname(vars), drop): Column `publication_decade` is unknown
## Error in df2b$na[df2b$na == 0] <- NA: object 'df2b' not found
## Error in ggplot(df2b, aes(x = publication_decade, y = gatherings, size = na)): object 'df2b' not found
## Error in na.omit(df2b$na): object 'df2b' not found
## Error in FUN(X[[i]], ...): object 'na' not found

plot of chunk missingpages

Estimated paper consumption

Note: there are 0 documents that have some dimension info but for which sheet area could not be calculated.

## Error in grouped_df_impl(data, unname(vars), drop): Column `publication_year` is unknown
## Error in arrange_impl(.data, dots): incorrect size (1) at position 1, expecting : 9616
## Error in FUN(X[[i]], ...): object 'na' not found
## Error in arrange_impl(.data, dots): incorrect size (1) at position 1, expecting : 9616
## Error in FUN(X[[i]], ...): object 'na' not found

plot of chunk paperconsumption

## Error in grouped_df_impl(data, unname(vars), drop): Column `publication_year` is unknown
## Error in FUN(X[[i]], ...): object 'publication_year' not found
## Error in FUN(X[[i]], ...): object 'publication_year' not found

plot of chunk paperconsumption2b

## Error in grouped_df_impl(data, unname(vars), drop): Column `publication_decade` is unknown
## Error in FUN(X[[i]], ...): object 'publication_decade' not found
## Error in FUN(X[[i]], ...): object 'publication_decade' not found

plot of chunk pagecounts-gatherings-relab

## Error in grouped_df_impl(data, unname(vars), drop): Column `publication_decade` is unknown
## Error in FUN(X[[i]], ...): object 'publication_decade' not found

plot of chunk paperconsumption2

Pamphlets vs. Books

## Error in grouped_df_impl(data, unname(vars), drop): Column `publication_year` is unknown
## Error in FUN(X[[i]], ...): object 'publication_year' not found
## Error in FUN(X[[i]], ...): object 'publication_year' not found

plot of chunk doctypes

## Error in grouped_df_impl(data, unname(vars), drop): Column `publication_decade` is unknown
## Error in FUN(X[[i]], ...): object 'publication_decade' not found
## Error in FUN(X[[i]], ...): object 'publication_decade' not found

plot of chunk doctypes2

Nature of the documents over time

Estimated paper consumption by document size

## Error in grouped_df_impl(data, unname(vars), drop): Column `publication_year` is unknown
## Error in FUN(X[[i]], ...): object 'publication_year' not found

plot of chunk 20150611paris-paper6

Gatherings height: does it change over time? How is increased printing activity related to book size trends? Alternatively, we could use area (height x width), or the median over time. Note that only original (not augmented) dimension info is being used here.

## Error in grouped_df_impl(data, unname(vars), drop): Column `publication_year` is unknown
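One way to read the "median over time" idea is to group documents by publication decade and take the median height or area within each decade. The sketch below is in Python for illustration and is not the package's own code. The `publication_year` name is taken from the error messages above; the `height` and `width` fields, the helper names and the sample values are assumptions.

```python
from collections import defaultdict
from statistics import median


def decade(year):
    """Start year of the decade that contains `year`, e.g. 1705 -> 1700."""
    return (year // 10) * 10


def median_by_decade(records, value):
    """Median of value(record) per publication decade, skipping missing data."""
    groups = defaultdict(list)
    for rec in records:
        year = rec.get("publication_year")
        v = value(rec)
        if year is not None and v is not None:
            groups[decade(year)].append(v)
    return {d: median(vals) for d, vals in sorted(groups.items())}


def height(rec):
    return rec.get("height")


def area(rec):
    """Height x width, or None when either dimension is missing."""
    h, w = rec.get("height"), rec.get("width")
    return h * w if h is not None and w is not None else None


# Hypothetical records with original (not augmented) dimensions.
records = [
    {"publication_year": 1701, "height": 20.0, "width": 12.0},
    {"publication_year": 1705, "height": 30.0, "width": 20.0},
    {"publication_year": 1709, "height": 22.0, "width": None},
    {"publication_year": 1712, "height": 18.0, "width": 11.0},
]

median_heights = median_by_decade(records, height)
median_areas = median_by_decade(records, area)
```

Passing the measure as a function keeps the grouping logic in one place, so height and area can be compared on exactly the same decade bins.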

Page counts: do they change over time? It was also suggested that we could calculate some kind of factor for each time period based on this. In principle, we could calculate this separately for any given publication place as well, but let's discuss this later. It would help to specify some places of interest.

## Error in grouped_df_impl(data, unname(vars), drop): Column `publication_year` is unknown
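The text leaves the "factor for each time period" unspecified. One possible reading, sketched here in Python for illustration only, is the ratio of each period's median page count to the overall median. This definition, the function name `period_factors`, the `period_length` parameter and the sample data are all assumptions, not the authors' method.

```python
from collections import defaultdict
from statistics import median


def period_factors(docs, period_length=10):
    """Ratio of each period's median page count to the overall median.

    Documents lacking a publication year or page count are ignored.
    Raises statistics.StatisticsError if no usable documents remain.
    """
    usable = [
        d for d in docs
        if d["publication_year"] is not None and d["pagecount"] is not None
    ]
    overall = median(d["pagecount"] for d in usable)
    by_period = defaultdict(list)
    for d in usable:
        start = (d["publication_year"] // period_length) * period_length
        by_period[start].append(d["pagecount"])
    return {p: median(v) / overall for p, v in sorted(by_period.items())}


# Hypothetical documents for illustration.
docs = [
    {"publication_year": 1700, "pagecount": 100},
    {"publication_year": 1705, "pagecount": 300},
    {"publication_year": 1712, "pagecount": 50},
    {"publication_year": 1718, "pagecount": 150},
    {"publication_year": None, "pagecount": 80},
]

factors = period_factors(docs)
```

A factor above 1 would indicate a period whose documents tend to be longer than the overall median. Restricting `docs` to a single publication place before calling the function would give the per-place variant mentioned in the text.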

Same for documents that have a sufficient number of pages:

## Error in grouped_df_impl(data, unname(vars), drop): Column `publication_year` is unknown


COMHIS/estc documentation built on April 7, 2022, 4:53 p.m.