README.md

rpostgrexplorer

R package for PostgreSQL and Redshift metadata explorations

Motivation

Imagine the situation when you are given a poorly (or too-well) documented PostgreSQL-stored data to analyze. Datatables lacks the constraints, some are empty or residual, all is a bit messy. The rpostgrexplorer package is to help to navigate the analysis on the sea of the messy data.

Installation

devtools::install_github("cezden/rpostgrexplorer")

Sample session

Initialization

library(rpostgrexplorer)

control.connection <- db.connection(user="username", password="userpassword",
                          host="database.server.name", port=5439, 
                          dbname="databasename")  

meta.inf <- rpostgrexplorer::load.metadata(control.connection = control.connection, schemaname = "schema1")

Querying the metadata object

Listing the attributes and tables

head(db.attributes(meta.inf))
head(db.tables(meta.inf))

The data.frame listing attribute names along with the counts of their occurences in the tables:

#without DB type information
head(db.attributes.counts(meta.inf)) 

#with DB type information (important if some attributes share name but not the type)
head(db.attributes.counts(meta.inf), typeinfo = TRUE) 

Describing an attribute

Description of the attribute consists of: number of instances along with db type the most similar names of other attributes


describe.attribute(meta.inf, "cookieid")

producing:

$instances.count
Source: local data frame [1 x 2]

                attname typename tablecount
1      cookieid             int8         38

$similar
Source: local data frame [3 x 2]

                attname tablecount
1      ccookieid                 1
2      cookieiid                 1
3      coookieid                 1
Tables and attributes search

Selecting all instances of the attributes with some name:

attribute.instances(meta.inf, c("cookieid", "consumerid"))

Selecting all tables containing all attributes from given set:

tables.with.attributes(meta.inf, c("cookieid", "consumerid"))

Filtering the metadata object

The contents of the metadata object can be filtered by filtering tables or attributes according to arbitrary criteria using the restrict function. Since the results of the query functions (db.attributes, db.attributes.counts and db.tables) are data.frame objects the restricting mechanism is quite flexible.

Selecting attributes with more than 100 instances
library(dplyr)
restriction <- db.attributes.counts(meta.inf)  %>% dplyr::filter(tablecount>250)
meta.inf.restricted <- restrict(meta.inf, restriction) 
db.attributes.counts(meta.inf.restricted)

Selecting all attributes ending with "id"
library(dplyr)
library(stringi)
atts.id <- db.attributes(meta.inf)  %>% dplyr::filter(stringi::stri_endswith_fixed(attname, "id"))
meta.inf.id <- restrict(meta.inf, atts.id)

Querying the database based on the metadata object

Selecting data from all tables containing given attribute

cookie.sketch <- experimental.db.sketch.attribute(control.connection, meta.inf, "cookieid", "31415")

Mining the relationships beetween attribute instances

In order to check the type of the relationship between instances of the attribute you may use


cookieid.atts <- attribute.instances(meta.inf, "cookieid")
cookieid.atts.from <- cookieid.atts[1:3,]

experimental.db.infer.relation.simple(control.connection, cookieid.atts.from, cookie.atts)

Based on its output you can infer the type of relationship (0+ -- 0+, 1 -- 0+, etc) and check their quality.

Fluid concepts

All the functions with the name starting with "experimental" should be treated as a concepts rather than the complete implementation -- they will probably changed and refined in near future.

Frequent attribute sets

Identification of frequently co-occuring attributes (indicating some "domains" in the data, e.g. attributes involved in record audit) can be aided by the experimental.infer.domains and experimental.infer.domains2 functions.



cezden/rpostgrexplorer documentation built on May 13, 2019, 3:08 p.m.