knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Why tidyged?

One of the main characteristics I wanted for this package was to hide the complexity of the GEDCOM specification. I began with what I wanted the user interface to look like and worked backwards from there. I felt that adopting a tidyverse feel to the package would enhance readability and would significantly flatten the learning curve for those who are familiar with ggplot2.

Data structures

I spent a significant amount of time before writing any code considering my options for how the data would be stored under the hood. In one blog post I considered storing genealogical data in a relational table format as it is easier to deal with, but discounted it very quickly as it is not well suited to nested data (list columns are not ideal for this application).

I toyed with the idea of using an off-the-shelf open source product like GRAMPS but I found it awkward to use and wanted something where I was in complete control, taking full advantage of the strengths of R.

I also considered using data structures more suited to this type of data, such as JSON or graphs (using the igraph or data.tree package). However, I found that it would be quite difficult to represent some of the structures in the GEDCOM specification to my satisfaction.

It was some time before a better solution occurred to me, and that was to store the GEDCOM file almost as is, and just to split out the components of each line into their own columns, creating a tidy version of a GEDCOM file (hence the name tidyged). This would allow easy manipulation using existing tidyverse infrastructure, and some additional processing would be needed to deal with the ordered and nested nature of the file. Ironically, I had returned to the idea of a tidy dataframe that I had dismissed early on.
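As a rough illustration of the idea, the sketch below splits a few GEDCOM lines into their components using base R only. It is not the package's actual implementation: the function name `split_gedcom_lines`, the column names, and the example lines are all assumptions made for this illustration.

```r
# Illustrative GEDCOM lines (made up for this sketch, not taken from the sample file)
lines <- c(
  "0 HEAD",
  "1 GEDC",
  "2 VERS 5.5.5",
  "0 @I1@ INDI",
  "1 NAME Joe /Bloggs/",
  "1 FAMS @F1@",
  "0 TRLR"
)

# Hypothetical helper: one row per line, one column per component
split_gedcom_lines <- function(lines) {
  # The level is everything before the first space
  level <- as.integer(sub(" .*$", "", lines))
  rest <- sub("^[0-9]+ ", "", lines)

  # An optional cross-reference identifier (e.g. @I1@) may follow the level
  has_xref <- grepl("^@[^@ ]+@ ", rest)
  xref <- ifelse(has_xref, sub(" .*$", "", rest), "")
  rest <- ifelse(has_xref, sub("^@[^@ ]+@ ", "", rest), rest)

  # The tag comes next; anything after it is the line value.
  # A cross-reference pointer (e.g. @F1@) simply ends up in the value column.
  tag <- sub(" .*$", "", rest)
  value <- ifelse(grepl(" ", rest, fixed = TRUE), sub("^[^ ]+ ", "", rest), "")

  data.frame(level, xref, tag, value, stringsAsFactors = FALSE)
}

tidy_lines <- split_gedcom_lines(lines)
tidy_lines
```

Once the lines are in this shape, ordinary data frame tools (filtering on `tag`, for instance) apply directly, which is the main attraction of the tidy layout.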

The tidyged object

The tidyged object is a tibble representation of a GEDCOM file. Before we see how this is structured, let's see an example of a GEDCOM file:

readLines(system.file("extdata", "555SAMPLE.GED", package = "tidyged.io"))

Lines in a GEDCOM file can have a number of components: a level, a cross-reference identifier, a tag, a line value, and a cross-reference pointer.

No line has all of these components. For example, the first line of a record does not contain a line value. Also, a line will never have both a cross-reference pointer and a line value defined. For this reason, the tidyged object treats cross-reference pointers as just another line value.

The tidyged representation of the above GEDCOM file is given below, using a sample dataset built into the package:

knitr::kable(tidyged::sample555)

The cross-reference identifier is given in the record column. Even though the header and trailer records have no cross-reference identifier in the file, they are assigned "HD" and "TR" respectively in the tidyged object.
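One way such a record column could be derived is sketched below, again in base R. This is an assumption about the approach rather than the package's own code: the helper name `add_record_column`, the intermediate `xref` column, and the small example data frame are hypothetical.

```r
# A small, made-up set of already-split GEDCOM lines
lines_df <- data.frame(
  level = c(0L, 1L, 0L, 1L, 0L),
  xref  = c("", "", "@I1@", "", ""),
  tag   = c("HEAD", "GEDC", "INDI", "SEX", "TRLR"),
  value = c("", "", "", "M", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: label every line with the record it belongs to.
# Assumes the first line is at level 0, as in any valid GEDCOM file.
add_record_column <- function(df) {
  starts <- df$level == 0L            # every level 0 line opens a new record
  ids <- df$xref[starts]
  ids[df$tag[starts] == "HEAD"] <- "HD"  # header has no identifier in the file
  ids[df$tag[starts] == "TRLR"] <- "TR"  # neither does the trailer
  df$record <- ids[cumsum(starts)]    # carry each identifier down its record
  df[, c("level", "record", "tag", "value")]
}

with_records <- add_record_column(lines_df)
with_records
```

Carrying the identifier down every line of a record means the nesting can be recovered from the `record` and `level` columns alone, without relying on row order across records.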

GEDCOM files can be imported and exported using the tidyged.io package. Many GEDCOM processors modify the data on import, perhaps ignoring custom tags. This package reads the file as-is, and the only modifications made are those recommended in the GEDCOM 5.5.5 specification, specifically: replacing double '@' symbols with single ones, merging CONC/CONT lines, and ensuring appropriate capitalisation of certain values. In general, the package sticks rigidly to the GEDCOM specification and does not use custom tags. There is a real issue with other desktop applications creating their own flavours of file and diverging from the specification; tidyged does not and will never do this.
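To make two of these import steps concrete, here is a minimal base R sketch of merging CONC/CONT lines and replacing doubled '@' symbols. It is an illustration under stated assumptions, not the code used by tidyged.io: the helper name `merge_continuations` and the example rows are hypothetical, and capitalisation handling is left out.

```r
# Made-up rows: a note split over CONC/CONT lines and an escaped '@'
raw_df <- data.frame(
  level = c(1L, 2L, 2L, 1L),
  tag   = c("NOTE", "CONC", "CONT", "EMAIL"),
  value = c("This is a lo", "ng note", "second line", "name@@example.com"),
  stringsAsFactors = FALSE
)

# Hypothetical helper. Assumes the first row is never a CONC/CONT line.
merge_continuations <- function(df) {
  is_cont <- df$tag %in% c("CONC", "CONT")
  # For each row, the index of the most recent row that is not a continuation
  parent <- cummax(ifelse(is_cont, 0L, seq_len(nrow(df))))
  for (i in which(is_cont)) {
    # CONT starts a new line of text; CONC continues the same line
    sep <- if (df$tag[i] == "CONT") "\n" else ""
    df$value[parent[i]] <- paste0(df$value[parent[i]], sep, df$value[i])
  }
  out <- df[!is_cont, , drop = FALSE]
  # Replace doubled '@' symbols with single ones
  out$value <- gsub("@@", "@", out$value, fixed = TRUE)
  rownames(out) <- NULL
  out
}

merged <- merge_continuations(raw_df)
merged
```

Merging on import means a long value lives in a single row, so it can be searched or edited as one string; the reverse split would only be needed when writing the file back out.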

If you want to modify single values in a GEDCOM file or do a simple Find/Replace, then this is straightforward to achieve with a simple text editor.




jl5000/tidygedcom documentation built on June 22, 2022, 5:16 p.m.