bibliographica: Bibliographic Data Analysis

title: "bibliographica vignette" author: "Leo Lahti, Niko Ilomaki, Mikko Tolonen" date: "2016-11-01" bibliography: - bibliography.bib - references.bib output: html_document

R tools for bibliographic data analysis

The bibliographica R package provides tools for automated extraction and analysis of bibliographic metadata collections (library catalogues) such as the ESTC. The toolkit includes functions to extract and clean up information from plain text library catalogue annotation fields following common standards such as the Bibliographic Processing Cataloging Rules. This package can be used in combination with any library catalogue that follows these common standards.

Your contributions, bug reports and suggestions are welcome!

We assume you have installed R. If you use RStudio, change the default encoding to UTF-8.

Installing the stable release version in R:

library(devtools)
install_github("ropengov/bibliographica")

Load the tools:

library(bibliographica)
library(knitr)
Sys.setlocale(locale="UTF-8")

## [1] ""

Below, you will find simple examples on the package functionality. In real studies the tools can be used to preprocess collections with millions of documents.

Estimate the total page count for two documents:

# unlist(polish_physical_extent(c("50 p.", "[6],viii,386p. ;"))$estimated.pages)

Extract and print document dimension information in a harmonized format:

res <- polish_dimensions("1/2fo (37 cm)")
knitr::kable(res)

|gatherings.original | width.original| height.original|obl.original |original |gatherings | width| height| obl| area| |:-------------------|--------------:|---------------:|:------------|:---------|:----------|-----:|------:|---:|----:| |2fo | NA| 37|FALSE |2fo 37 cm |2fo | 25| 37| 0| 925|

The missing fields can be estimated with the 'fill' argument:

res <- polish_dimensions("1/2fo (37 cm)", fill = TRUE)
knitr::kable(res)

Estimate the total page count:

unlist(polish_physical_extent(c("50 p.", "[6],viii,386p. ;"))$estimated.pages)

## NULL

Estimation of the missing information (gatherings, width, and/or height) is based on a ready-made dimension mapping table. This table can be changed by the user if necessary (see function arguments). The default table can be retrieved in R with:

dtab <- dimension_table()
kable(head(dtab)) # just show the first rows

|height |NA |1to |2long |2fo |2small |4long |4to |4small |8long |8vo |8small |12long |12mo |16long |16mo |18mo |24long |24mo |32mo |40mo |48mo |64mo |80mo |84mo |1to | |:------|:--|:---|:-----|:---|:------|:-----|:---|:------|:-----|:---|:------|:------|:----|:------|:----|:----|:------|:----|:----|:----|:----|:----|:----|:----|:---| |90 |60 |60 |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x | |110 |66 |73 |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x | |109 |66 |72 |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x | |108 |65 |72 |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x | |107 |64 |71 |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x | |106 |64 |70 |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |x |

Conversions between different versions of the gatherings names:

res <- gatherings_table()
knitr::kable(res)

|Alternate |Standard |Symbol |Name | |:---------|:--------|:------|:-------------------| |1to |1to |1° |sheet | |bs |bs |bs |broadside | |2long |2long |2° |folio | |2to |2fo |2° |folio | |2small |2small |2° |folio | |4long |4long |4° |quarto | |4to |4to |4° |quarto | |4small |4small |4° |quarto | |8long |8long |8° |octavo | |8to |8vo |8° |octavo | |8small |8small |8° |octavo | |12long |12long |12° |duodecimo | |12to |12mo |12° |duodecimo | |16to |16mo |16° |sextodecimo | |16long |16long |16° |sextodecimo | |18to |18mo |18° |octodecimo | |24to |24mo |24° |vigesimo-quarto | |24long |24long |24° |vigesimo-quarto | |32to |32mo |32° |trigesimo-segundo | |40to |40mo |40° |quadraquinto | |48to |48mo |48° |quadragesimo-octavo | |64to |64mo |64° |sexagesimo-quarto | |80to |80mo |80° |octogentesimo | |84to |84mo |84° |octoginta-quarto |

Pick information on the volume numbers:

# Volume number 3 from multi-volume document
unname(polish_physical_extent("v.3, 50 p"))

##             
## 1 50 3 NA NA

Pick information on the total volume count:

# Document with 4 volumes and missing page information
unlist(polish_physical_extent("4v."))

## pagecount volnumber  volcount     parts 
##        NA        NA         4        NA

Extract and print dimension information:

res <- polish_dimensions("1/2fo (37 cm)")
knitr::kable(res)

Also the missing fields can be estimated:

res <- polish_dimensions("1/2fo (37 cm)", fill = TRUE)
knitr::kable(res)

Estimation of the missing information (gatherings, width, and/or height) is based on a ready-made approximation table. The table can be changed by the user (see function arguments). The default table can be retrieved in R with:

dtab <- dimension_table()
kable(head(dtab)) # just print the first rows

Check approximated sheet area for folio. The area units are in cm2. Also other sheet types are available.

sheet_area("folio")

## [1] 1350

The sheet sizes are calculated in the above example according to the sheet size table. The table can be changed by the user (see the function arguments) but by default the functions use this mapping:

kable(sheet_area())

|format |gatherings | width| height| area| |:--------------------|:----------|-----:|------:|-------:| |sheet |1to | 60.0| 90.0| 5760.00| |broadside |bs | 60.0| 64.0| 3840.00| |folio-large |2long | 30.0| 53.0| 1749.00| |folio |2fo | 30.0| 45.0| 1350.00| |folio-small |2small | 25.0| 38.0| 950.00| |quarto-long |4long | 27.0| 35.0| 945.00| |quarto |4to | 22.0| 28.0| 616.00| |quarto-small |4small | 17.0| 20.0| 340.00| |octavo-small |8small | 11.0| 17.0| 187.00| |octavo |8vo | 13.0| 19.0| 247.00| |octavo-long |8long | 13.0| 22.0| 286.00| |duodecimo |12mo | 12.5| 19.0| 237.50| |duodecimo-long |12long | 13.0| 20.0| 260.00| |sextodecimo |16mo | 12.0| 15.0| 180.00| |sextodecimo-long |16long | 14.0| 18.0| 252.00| |octodecimo |18mo | 10.0| 16.0| 160.00| |vigesimo-quarto |24mo | 8.5| 12.5| 106.25| |vigesimo-quarto-long |24long | 9.5| 14.0| 133.00| |trigesimo-secundo |32mo | 7.0| 12.0| 84.00| |quadraquinto |40mo | 6.0| 11.0| 66.00| |quadragesimo-octavo |48mo | 6.5| 10.0| 65.00| |sexagesimo-quarto |64mo | 5.0| 7.0| 35.00| |octogentesimo |80mo | 4.0| 6.0| 24.00| |octoginta-quarto |84mo | 3.5| 5.0| 17.50|

Removing stopwords is often necessary in text analysis. The stopwords form multiple categories, such as individual letters, conjugates, special characters, or particular expressions. The definition of a stopword may also depend on a context.

The following example removes the term "well" and individual letters from the input vector:

remove_terms(c("a", "well", "james", "30 year war"), terms = "well")

## [1] "a"           NA            "james"       "30 year war"

For a full list of stopword and related functions, see the function documentation. We also provide some ready-made stopword lists that can be easily downloaded in R with the 'read.csv' function.

Several person name lists are available for validation purposes, including also gender information. These can be read with the functions firstnames, lastnames, and notnames (pseudonymes also coming). The name lists are collected by combining manually constructed lists and openly available material from various sources listed in the respective data folders.

This work can be freely used, modified and distributed under the Two-clause BSD license.

citation("bibliographica")

## 
## Kindly cite this R package as follows:
## 
##   (C) Leo Lahti, Hege Roivainen, Niko Ilomaki, Mikko Tolonen
##   (rOpenGov 2015-2016).  bibliographica R package URL:
##   http://github.com/ropengov/bibliographica
## 
## A BibTeX entry for LaTeX users is
## 
##   @Misc{,
##     title = {bibliographica: tools for bibliographic analysis},
##     author = {Leo Lahti and Hege Roivainen and Niko Ilomaki and Mikko Tolonen},
##     year = {2015-2016},
##   }
## 
## Many thanks for all contributors! See: http://ropengov.github.io

The generic tools of this package can be used in combination with packages that provide more specific tools for targeted data collections such as the ESTC or Fennica.

This vignette was created with

sessionInfo()

## R version 3.3.1 (2016-06-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rmarkdown_1.0.9016    bibliographica_0.2.30 knitr_1.14           
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.7        magrittr_1.5       munsell_0.4.3     
##  [4] tm_0.6-2           colorspace_1.2-7   R6_2.2.0          
##  [7] highr_0.6          stringr_1.1.0      plyr_1.8.4        
## [10] dplyr_0.5.0        tools_3.3.1        babynames_0.2.1   
## [13] parallel_3.3.1     grid_3.3.1         data.table_1.9.6  
## [16] gtable_0.2.0       genderdata_0.5.0   DBI_0.5-1         
## [19] htmltools_0.3.5    yaml_2.1.13        digest_0.6.10     
## [22] lazyeval_0.2.0     assertthat_0.1     tibble_1.2        
## [25] NLP_0.1-9          tidyr_0.6.0        reshape2_1.4.1    
## [28] ggplot2_2.1.0      formatR_1.4        stringdist_0.9.4.2
## [31] slam_0.1-38        evaluate_0.10      stringi_1.1.2     
## [34] sorvi_0.7.47       gender_0.5.1.9000  scales_0.4.0      
## [37] chron_2.3-47

rOpenGov/bibliographica documentation built on April 10, 2022, 8:51 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

rOpenGov/bibliographica
Bibliographic Data Analysis

vignettes/tutorial.md
In rOpenGov/bibliographica: Bibliographic Data Analysis

R tools for bibliographic data analysis

Installation

Harmonizing textual annotation fields in library catalogues

Page information

Dimension information

Volume information

Dimension information

Sheet size table

Stopwords

Person names

Licensing and Citations

Related work

Session info

R Package Documentation

Browse R Packages

We want your feedback!

rOpenGov/bibliographica Bibliographic Data Analysis

vignettes/tutorial.md In rOpenGov/bibliographica: Bibliographic Data Analysis

R tools for bibliographic data analysis

Installation

Harmonizing textual annotation fields in library catalogues

Page information

Dimension information

Volume information

Dimension information

Sheet size table

Stopwords

Person names

Licensing and Citations

Related work

Session info

R Package Documentation

Browse R Packages

We want your feedback!

rOpenGov/bibliographica
Bibliographic Data Analysis

vignettes/tutorial.md
In rOpenGov/bibliographica: Bibliographic Data Analysis