The ideadata package utilizes the simple concept of get_* functions, which is named after the table that you want to connect to in the warehouse in one line. Functions like [get_students] are useful because they lower your cognitive load ("What DB and schema do I pull students from and what's the SQL look like") and they ensure that anyone using this package (read: R&A) are always pulling the same data when they start an analysis, save any processing logic.

One big downside: they need to be created one at a time and so if a new set of tables is added to the data warehouse, there can a lag between when those tables are added and when there is a corresponding get_* function in this package.

Introducing get_table()

To mitigate against the time Chris has to develop new get_* functions and the ever evolving nature of data warehouse, ideadata now has a new workhourse function: get_table().

This function does exactly what it says: it get's tables from the warehouse by looking up table location in the MetaData table in the Documentation schema of R&A's SQL Server instance. That table updates frequently and contains a comprehensive view of the data warehouse at any time.

When you load ideadata (i.e., when you run library(ideadat) the package makes a connection to the MetaData table and saves it as a dataframe names warehouse_meta_data that you can access by running data(warehouse_meta_data).

Not that you might need to detach and reload ideadata if you know the warehouse has a new table but get_table() can't find it. This will force the warehouse_meta_data table to update.

Usage

Simple usage

If the table is uniquely named in the data warehouse then the usage is very straightforward. You need to simply the table name to function:

library('ideadata')
#> ℹ Loading ideadata
#> ── Gathering warehouse metadata ─────────────────────────────────────────────── ideadata 3.0.1 ──
#> ℹ Checking credentials ...
#> christopher.haid's Kereberos TGT is current
#> ✓ Success: Warehouse metadata gathered!
#> ℹ The warehouse currently houses:
#>      2,226 tables
#>   in 134 databases
#>   on 17 servers
#> WHEE! ideadata and you are healthily neat!
#library(ideadata)
library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 4.0.2
#> ── Attaching packages ──────────────────────────────────────────────────────── tidyverse 1.3.1 ──
#> ✓ ggplot2 3.3.6     ✓ purrr   0.3.4
#> ✓ tibble  3.1.6     ✓ stringr 1.4.0
#> ✓ tidyr   1.2.0     ✓ forcats 0.5.1
#> ✓ readr   2.1.2
#> Warning: package 'tibble' was built under R version 4.0.2
#> Warning: package 'tidyr' was built under R version 4.0.5
#> Warning: package 'readr' was built under R version 4.0.5
#> Warning: package 'forcats' was built under R version 4.0.2
#> ── Conflicts ─────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> x lubridate::as.difftime() masks base::as.difftime()
#> x lubridate::date()        masks base::date()
#> x dplyr::filter()          masks stats::filter()
#> x lubridate::intersect()   masks base::intersect()
#> x dplyr::lag()             masks stats::lag()
#> x lubridate::setdiff()     masks base::setdiff()
#> x lubridate::union()       masks base::union()

oe_responses <- get_table("OEResponses")
#> christopher.haid's Kereberos TGT is current

head(oe_responses)
#> # Source:   lazy query [?? x 7]
#> # Database: Microsoft SQL Server 15.00.4198[dbo@RGVPDRA-DASQL/GPTW]
#>        ID AcademicYear OEQuestionKey OEResponseKey GroupType Group  Response                     
#>   <int64> <chr>                <int>         <int> <chr>     <chr>  <chr>                        
#> 1       1 2020-2021                5             1 Region    Austin "I would not change a thing;…
#> 2       2 2020-2021                5             2 Region    Austin "As a new teacher I had hear…
#> 3       3 2020-2021                5             3 Region    Austin "Choose locations more caref…
#> 4       4 2020-2021                5             4 Region    Austin "At my previous campus, I wo…
#> 5       5 2020-2021                5             5 Region    Austin "To treat it's employees lik…
#> 6       6 2020-2021                5             6 Region    Austin "More check-ins with manager"

Not too bad.

Less simple usage, or failing informatively

This function will fail if the tables is not uniquely named in the warehouse.

Why?

Because there is no way to know which table you really mean. One solution, if you know the location, is to provide the table name, database name, and schema to the .table_name, database_name, .schema, and .server_name arguments of get_table(). That will locate the right server and make the connection and avoid failure.

But here is something cool: What if you don't know that info? When get_table fails in this scenario it fails informatively. get_table will show you all the tables it's found in the warehouse and give you code you can copy and paste to get the table you want.

Let's say you ask for the Schools table

schools <- get_table("Schools")
#> ! There are 22 tables with that name in our warehouse
#> ℹ You'll need to specify the database and schema name with db target.
#> ✓ Any of these should work:
#>   get_table(.table_name = "Schools", .database_name = "CSIInstruction", .schema = "dbo", .server_name = "1064618-SQLDI")
#>   get_table(.table_name = "Schools", .database_name = "CSIInstructionStage", .schema = "dbo", .server_name = "1064618-SQLDI")
#>   get_table(.table_name = "Schools", .database_name = "IDEAInstruction", .schema = "dbo", .server_name = "1064618-SQLDI")
#>   get_table(.table_name = "Schools", .database_name = "PROD1_2021ss", .schema = "Schools", .server_name = "1065574-SQLPRD1")
#>   get_table(.table_name = "Schools", .database_name = "PROD1_2020ss", .schema = "Schools", .server_name = "1065574-SQLPRD1")
#>   get_table(.table_name = "Schools", .database_name = "PROD1", .schema = "Schools", .server_name = "RGVPDSD-DWPRD1")
#>   get_table(.table_name = "Schools", .database_name = "SRC_Florida_Schools", .schema = "Powerschool", .server_name = "RGVPDSD-DWSRC1")
#>   get_table(.table_name = "Schools", .database_name = "SRC_Louisiana_Schools", .schema = "Powerschool", .server_name = "RGVPDSD-DWSRC1")
#>   get_table(.table_name = "Schools", .database_name = "SRC_Texas_Travis_Schools", .schema = "Skyward", .server_name = "RGVPDSD-DWSRC1")
#>   get_table(.table_name = "Schools", .database_name = "TravisSnapshotData", .schema = "MidLand.PM", .server_name = "RGVPDSD-DWSRC1")
#>   get_table(.table_name = "Schools", .database_name = "SRC_EA", .schema = "dbo", .server_name = "RGVPDSD-DWSRC2")
#>   get_table(.table_name = "Schools", .database_name = "SRC_Ohio_Schools", .schema = "Powerschool", .server_name = "RGVPDSD-DWSRC1")
#>   get_table(.table_name = "Schools", .database_name = "SRC_Florida_Schools_Focus_Dev", .schema = "Focus", .server_name = "RGVPDSD-DWSRC1")
#>   get_table(.table_name = "Schools", .database_name = "SRC_Ohio_Assessments", .schema = "Edcite", .server_name = "RGVPDSD-DWSRC2")
#>   get_table(.table_name = "Schools", .database_name = "SRC_StateAssessments", .schema = "dbo", .server_name = "RGVPDSD-DWSRC2")
#>   get_table(.table_name = "Schools", .database_name = "CSIInstructionDebug", .schema = "dbo", .server_name = "1064618-SQLDI")
#>   get_table(.table_name = "Schools", .database_name = "IDEAInstructionDebug", .schema = "dbo", .server_name = "1064618-SQLDI")
#>   get_table(.table_name = "Schools", .database_name = "IDEAInstructionStage", .schema = "dbo", .server_name = "1064618-SQLDI")
#>   get_table(.table_name = "Schools", .database_name = "PROD1", .schema = "Schools", .server_name = "1065574-SQLPRD1")
#>   get_table(.table_name = "Schools", .database_name = "Dashboard", .schema = "dbo", .server_name = "791150-HQVRA")
#>   get_table(.table_name = "Schools", .database_name = "Dashboard", .schema = "dbo", .server_name = "RGVPDRA-DASQL")
#>   get_table(.table_name = "Schools", .database_name = "SRC_Texas_Schools", .schema = "Powerschool", .server_name = "RGVPDSD-DWSRC1")

Oh no! There are 21 tables named Schools in the warehouse on 17 databases. Yikes.

But look at that error: It's got what you need!

You can copy and paste any one of those lines to get the unique table you are seeking (in this case the Schools table in Schools on Prod1).

schools <- get_table(.table_name = "Schools", .database_name = "PROD1", .schema = "Schools")
#> ! There are 2 tables with that name in our warehouse
#> ℹ You'll need to specify the database and schema name with db target.
#> ✓ Any of these should work:
#>   get_table(.table_name = "Schools", .database_name = "PROD1", .schema = "Schools", .server_name = "RGVPDSD-DWPRD1")
#>   get_table(.table_name = "Schools", .database_name = "PROD1", .schema = "Schools", .server_name = "1065574-SQLPRD1")

head(schools)
#> NULL


idea-analytics/ideadata documentation built on Feb. 1, 2024, 5:40 a.m.