In jl5000/tidygedcom: Handle GEDCOM Files Using Tidyverse Principles

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Why tidyged?

One of the main characteristics I wanted for this package was to hide the complexity of the GEDCOM specification. I began with what I wanted the user interface to look like and worked backwards from there. I felt that adopting a tidyverse feel to the package would enhance readability and would significantly flatten the learning curve for those who are familiar with ggplot2.

Data structures

I spent a significant amount of time before writing any code considering my options for how the data would be stored under the hood. In one blog post I considered storing genealogical data in a relational table format as it is easier to deal with, but discounted it very quickly as it is not well suited to nested data (list columns are not ideal for this application).

I toyed with the idea of using an off-the-shelf open source product like GRAMPS but I found it awkward to use and wanted something where I was in complete control, taking full advantage of the strengths of R.

I also considered using data structures more suited to this type of data, such as JSON or graphs (using the igraph or data.tree package). However, I discovered it would be quite difficult representing some of the structures in the GEDCOM specification to my satisfaction.

It was some time before a better solution occurred to me, and that was to store the GEDCOM file almost as is, and just to split out the components of each line into their own columns, creating a tidy version of a GEDCOM file (hence the name tidyged). This would allow easy manipulation using existing tidyverse infrastructure, and some additional processing would be needed to deal with the ordered and nested nature of the file. Ironically, I had returned to the idea of a tidy dataframe that I had dismissed early on.

The tidyged object

The tidyged object is a tibble representation of a GEDCOM file. Before we see how this is structured, let's see an example of a GEDCOM file:

readLines(system.file("extdata", "555SAMPLE.GED", package = "tidyged.io"))

Lines in a GEDCOM file can have a number of components:

Level: The level in the hierarchical structure. This appears for every line;
Cross-reference identifier: A string (which looks like @XYZ@) that signals the beginning of a new record (apart from header and trailer);
Tag: A short string given immediately after the level or cross-reference identifier that indicates the type of information being provided on the line. These are controlled values. User-defined tags have been allowed in other GEDCOM programs, but they are discouraged here;
Cross-reference pointer: This links to another record in the file (which looks like @XYZ@). In the above example, the Family Group record beginning on line r which(readLines(system.file("extdata", "555SAMPLE.GED", package = "tidyged.io")) == "0 @F1@ FAM") references other Individual records who are members of the family;
Line value: The value associated with the tag. For example, on line 6, the CHARacter encoding for the file is given as UTF-8.

There will not be any lines that have all of these components. For example, the first line of records do not contain a line value. Also, a line will never have a cross-reference pointer and a line value defined. For this reason, the tidyged object treats cross-reference pointers as just another line value.

The tidyged representation of the above GEDCOM file is given below, using a sample dataset built into the package:

knitr::kable(tidyged::sample555)

The cross-reference identifier is given in the record column. Event though these aren't given for the header and trailer records they are assigned as "HD" and "TR" respectively in the tidyged object.

GEDCOM files can be imported and exported using the tidyged.io package. Many GEDCOM processors modify the data on import, perhaps ignoring custom tags. This package reads the file as-is, and the only modification of the file is in line with that recommended in the GEDCOM 5.5.5 specification, specifically; replacing double '@' symbols with single ones, merging CONC/CONT lines, and ensuring appropriate capitalisation of certain values. In general, the package sticks rigidly to the GEDCOM specification and does not use custom tags. There is a real issue with other desktop applications creating their own flavours of file and diverging from the specification; tidyged does not and will never do this.

If you want to modify single values in a GEDCOM or do a simple Find/Replace, then this is straightforward to achieve with a simple text editor.

Next article: The gedcompendium >

jl5000/tidygedcom documentation built on June 22, 2022, 5:16 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com