R package for PostgreSQL and Redshift metadata explorations
Imagine the situation when you are given a poorly (or too-well) documented PostgreSQL-stored data to analyze. Datatables lacks the constraints, some are empty or residual, all is a bit messy. The rpostgrexplorer package is to help to navigate the analysis on the sea of the messy data.
devtools::install_github("cezden/rpostgrexplorer")
library(rpostgrexplorer)
control.connection <- db.connection(user="username", password="userpassword",
host="database.server.name", port=5439,
dbname="databasename")
meta.inf <- rpostgrexplorer::load.metadata(control.connection = control.connection, schemaname = "schema1")
head(db.attributes(meta.inf))
head(db.tables(meta.inf))
The data.frame listing attribute names along with the counts of their occurences in the tables:
#without DB type information
head(db.attributes.counts(meta.inf))
#with DB type information (important if some attributes share name but not the type)
head(db.attributes.counts(meta.inf), typeinfo = TRUE)
Description of the attribute consists of: number of instances along with db type the most similar names of other attributes
describe.attribute(meta.inf, "cookieid")
producing:
$instances.count
Source: local data frame [1 x 2]
attname typename tablecount
1 cookieid int8 38
$similar
Source: local data frame [3 x 2]
attname tablecount
1 ccookieid 1
2 cookieiid 1
3 coookieid 1
Selecting all instances of the attributes with some name:
attribute.instances(meta.inf, c("cookieid", "consumerid"))
Selecting all tables containing all attributes from given set:
tables.with.attributes(meta.inf, c("cookieid", "consumerid"))
The contents of the metadata object can be filtered by filtering tables or attributes according to arbitrary criteria using the restrict function. Since the results of the query functions (db.attributes, db.attributes.counts and db.tables) are data.frame objects the restricting mechanism is quite flexible.
library(dplyr)
restriction <- db.attributes.counts(meta.inf) %>% dplyr::filter(tablecount>250)
meta.inf.restricted <- restrict(meta.inf, restriction)
db.attributes.counts(meta.inf.restricted)
library(dplyr)
library(stringi)
atts.id <- db.attributes(meta.inf) %>% dplyr::filter(stringi::stri_endswith_fixed(attname, "id"))
meta.inf.id <- restrict(meta.inf, atts.id)
cookie.sketch <- experimental.db.sketch.attribute(control.connection, meta.inf, "cookieid", "31415")
In order to check the type of the relationship between instances of the attribute you may use
cookieid.atts <- attribute.instances(meta.inf, "cookieid")
cookieid.atts.from <- cookieid.atts[1:3,]
experimental.db.infer.relation.simple(control.connection, cookieid.atts.from, cookie.atts)
Based on its output you can infer the type of relationship (0+ -- 0+, 1 -- 0+, etc) and check their quality.
All the functions with the name starting with "experimental" should be treated as a concepts rather than the complete implementation -- they will probably changed and refined in near future.
Identification of frequently co-occuring attributes (indicating some "domains" in the data, e.g. attributes involved in record audit) can be aided by the experimental.infer.domains and experimental.infer.domains2 functions.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.