knitr::opts_chunk$set(echo = TRUE)
geoDiveR
Environment

Using geodiveR efficiently requires a Postgres server and a local DeepDive database. Postgres is a free and open-source database with support for a number of extensions, including geospatial analysis with PostGIS. Postgres is an application that runs in the background on your computer, or on a server in the cloud. It is accessed through a connection, using a username and a password, and a single instance of a Postgres server can contain one or many databases.
The geodiveR package assumes you have a Postgres database initialized on your computer that you will be using for your work. If you do not yet have one, follow these instructions; otherwise, skip ahead to the Creating a new Database section.
Postgres can be installed on almost every platform. Follow the installation instructions on the official Postgres webpage to install Postgres for your particular operating system.
Once you have installed Postgres you will need to create a new user for the server. This will let you give the user a different set of permissions, and provide some security if you choose to run this project elsewhere.
Although Postgres comes with a command-line tool, psql, many people find graphical interfaces less intimidating. There are a number of software applications that can interact with your Postgres installation. pgAdmin4 is a browser-based application that can be used with Postgres installations. Because Postgres is so widely supported, there is a long list of applications on the Postgres wiki. Among them, DBeaver is often recommended, along with dbForge and JetBrains' DataGrip.

With Postgres GUI tools, database management (such as creating a new database or adding a new user) may be relatively straightforward. A drawback is that some implementations may not be as feature-rich or as extensible as direct command-line interaction. We leave this choice to the user.
This step is optional. If you expect others to work collaboratively, or are hosting the project in the cloud, it may be worthwhile creating a new user for your server.
By default Postgres is set up with a user postgres with no password. To create a new user you can use the command:
psql -U postgres -c "CREATE ROLE newuser WITH PASSWORD 'newpassword' LOGIN NOSUPERUSER NOINHERIT NOCREATEDB NOCREATEROLE;"
In this case you log in as the user postgres to create a new user newuser. CREATE ROLE creates a new role; the LOGIN attribute allows that role to log in, using the PASSWORD defined in quotes. Here we are using very restrictive settings for the new user: they cannot bypass other restrictions, create a new database, or create other roles.
The R implementation for most databases will require you to connect to your database using your username and password, through a host and port. By default Postgres is installed on localhost at port 5432.
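Before moving on you can check that the server is reachable from R. A minimal sketch, assuming a default local installation and the built-in postgres maintenance database (adjust the user and password to match your setup):

library(DBI)
library(RPostgreSQL)

# Connect to the default maintenance database just to confirm the server is up.
test <- dbConnect(RPostgreSQL::PostgreSQL(), user = "postgres",
                  host = "localhost", port = 5432, dbname = "postgres")
dbGetQuery(test, "SELECT version();")  # returns the server version string
dbDisconnect(test)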
Creating a new Database

We now need to create a database on the Postgres server. For the purposes of this exercise we will create a database called deepdive:
psql -U postgres -c "CREATE DATABASE geodeepdive;"
If you are using a GUI you can create new roles and new databases through it instead. Having created the database, you should now see it listed in the GUI's browser window.
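If you prefer to stay in R, the same statement can be issued through DBI. A minimal sketch, assuming you can already connect as the postgres superuser (this is not part of geodiveR itself):

library(DBI)
library(RPostgreSQL)

# Connect to the maintenance database as the superuser, then create 'deepdive'.
su <- dbConnect(RPostgreSQL::PostgreSQL(), user = "postgres",
                host = "localhost", port = 5432, dbname = "postgres")
res <- dbSendQuery(su, "CREATE DATABASE deepdive;")
dbClearResult(res)
dbDisconnect(su)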
Loading geodeepdive using R

When you connect to DeepDive for the first time you will obtain two files: one is a JSON-formatted bibliography file, the other is the Stanford NLP output file. Presuming you have set up a new Postgres database (here as a local server), we can load the database and connect. I am using the default postgres user and connecting to a database I created called deepdive.
The R package geodiveR includes a test dataset and associated bibliographic information for 150 papers that can be loaded using data(). These objects are similar to the files you would obtain from GeoDeepDive.
library(geodiveR)
library(RPostgreSQL)

# Connect to a database:
con <- dbConnect(drv = RPostgreSQL::PostgreSQL(),
                 user = "postgres",
                 password = "postgres",
                 host = "localhost",
                 port = "5432",
                 dbname = "deepdive")

# Replace these with text strings pointing to file locations, or JSON files.
data("nlp")
data("publications")

con <- load_dd(con, bib = publications, sent = nlp, clean = TRUE)
After the database is loaded (or connected), we can look to make sure that things have been loaded as we might expect:
summaryGdd(con)
Looking over the database structure you can see that geodiveR creates four different tables: publications contains the unique bibliographic information for each paper within the GDD resource; authors contains the names of all the authors of each publication (a one-to-many table, since there can be many authors for any one paper); links includes all the link types for a paper, for example DOIs or URLs (one-to-many); and sentences provides the full NLP output for each record.
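Because these are ordinary Postgres tables, they can also be queried directly with SQL if you prefer. A minimal sketch using DBI, assuming the object returned by load_dd() is a standard DBI connection (no column names are assumed here):

dbListTables(con)                                      # should include the four tables above
dbGetQuery(con, "SELECT * FROM publications LIMIT 5;") # peek at the bibliographic table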
These tables become the basis for our future analysis.
Our test set contains `r (summaryGdd(con) %>% dplyr::filter(table == "publications"))$rows[1]` publications and `r (summaryGdd(con) %>% dplyr::filter(table == "sentences"))$rows[1]` sentences in total. This can be an overwhelming amount of data, particularly given the complex outputs of the NLP processing.
Much of the GDD workflow relies on feature detection. To begin to understand the data we may wish to view small subsets. First, it might be useful to take a look at some of the tables or fields, to see what the database actually looks like internally:
skim(con, table = 'authors', n = 14)
The skim() function returns a random subset of rows within a table. You can apply it to any table in the database, or to a data.frame. Where columns might be very long (for example the sentences column), the skim() function can be applied to single columns:
skim(con, table = 'sentences', column = 'words', n = 5)
The setseed flag is used when there is a need to establish entirely reproducible workflows, where results must remain identical from one iteration to another. While R supports a seed range of approximately ±2 × 10^9, Postgres only supports seeds between -1 and 1.
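For example, repeating a call with the same seed should return the same rows each time. A small sketch, assuming the setseed argument shown later in this vignette (any value between -1 and 1 will do):

# The same seed returns the same random subset on repeated calls.
skim(con, table = 'sentences', column = 'words', n = 5, setseed = 0.25)
skim(con, table = 'sentences', column = 'words', n = 5, setseed = 0.25)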
With the data placed into the database it is possible to apply queries to it, and possibly to chain them. For example, we may be interested in finding sentences that report geographic coordinates in decimal degrees, and sentences that report age ranges (for example, ages in years BP):
coords <- gddMatch(con, table = 'sentences', col = 'words',
                   pattern = '[\\{,][-]?[1]?[0-9]{1,2}\\.[0-9]{1,}[,]?[NESWnesw],',
                   name = "decdeg", rows = TRUE)

dates <- gddMatch(con, table = 'sentences', col = 'words',
                  pattern = '(\\d+(?:[.]\\d+)*),((?:-{1,2})|(?:to)),(\\d+(?:[.]\\d+)*),([a-zA-Z]+,BP),',
                  name = "dates", rows = TRUE)
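Patterns like these can be sanity-checked locally before they are run against the full sentences table. A small sketch using base R; the token string below is a fabricated example of the comma-separated words format, not a real record:

# The coordinate pattern used above, tested against fabricated token strings.
decdeg <- '[\\{,][-]?[1]?[0-9]{1,2}\\.[0-9]{1,}[,]?[NESWnesw],'
grepl(decdeg, "{-12.345,S,", perl = TRUE)        # TRUE: looks like a decimal-degree coordinate
grepl(decdeg, ",pollen,records,", perl = TRUE)   # FALSE: no coordinate present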
gddMatch() returns subsets of data, managed as gddMatch objects, which are modified list() objects. For example, dates contains the original query (as dates$query) and a vector of TRUE/FALSE values (dates$boolean) indicating whether a particular sentence matched the text pattern.
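These components can be used directly. A small sketch, assuming the element names described above:

dates$query          # the pattern that was used for the match
sum(dates$boolean)   # how many sentences matched the age pattern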
It is possible to apply matches at either the sentence level or the paper level. If data processing is specific to a certain subset of the literature, it may be worth applying a test at the paper level. For example, we may wish to apply our age matching only to papers that include the word "pollen":
pollen <- gddMatch(con, table = "sentences", col = "words", pattern = ",pollen,", name = "pollenQuery", paper = TRUE)
If you want the function to return the explicit matches then you can use the rows flag.
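For example, a sketch of the same query with the matching sentences returned as part of the result (mirroring the rows = TRUE usage shown earlier):

pollenRows <- gddMatch(con, table = "sentences", col = "words",
                       pattern = ",pollen,", name = "pollenQuery",
                       rows = TRUE)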
To examine the data, we can again apply the skim() function:
skim(dates, n = 10, column = 'words', clean = 'replace', setseed = -0.5) %>% DT::datatable()
The package is also intended to combine results. Currently an and() function is implemented, but at the moment it simply binds the results:
combines <- and(coords, dates)
And so we can now put these together to start looking at the objects. We might think of operating row-wise ("I want a sentence with both dates and coordinates"), or publication-wise ("I want a paper that includes both dates and coordinates").
This is work that needs to be done. We also need to think about how this would then be scaled.
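In the meantime, a row-wise combination can be sketched by hand from the boolean vectors described earlier, assuming both queries were run against the same sentences table so that the vectors line up:

# Sentences that matched both the coordinate and the age patterns.
both <- coords$boolean & dates$boolean
sum(both)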