knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(rolap)
Once we developed a star database in R using the rolap
package, in addition to exporting it to exploit it with other tools, we can perform multidimensional queries from R: The rolap
package provides the functionality to define geographic attributes and formulate and execute simple queries on a multidimensional schema that includes them.
The main objective of this document is to show the multidimensional query formulation and execution functionality offered by this package. First, the data model is briefly discussed: the possibility of defining stars and constellations. Then, the geographic attributes definition functionality is also shown. Next, the functions defined to support multidimensional queries are presented. Finally, finish with the conclusions.
Strictly speaking, a star is composed of a fact table and several associated dimension tables. A constellation is made up of several stars that can share dimensions. In the rolap
package they are treated in a unified way under the star_database
class: It is used both to define stars and constellations.
The variable mrs_db
, obtained in the vignette titled Obtaining and transforming flat tables, vignette("v05-flat-table-op")
, contains an object of class star_database
that we will use in the example.
class(mrs_db)
We can see a representation of the tables it contains using the draw_tables()
function, as shown below.
mrs_db |> draw_tables()
We can see that it is a constellation because it contains more than one fact table.
The rolap
package allows us to include layers of geographic information associated with dimension attributes. The objective is to be able to obtain layers of geographic information with the data contained in the multidimensional database.
In the considered case, the table of dimension where contains geographical information in the form of the coordinates (latitude and longitude) of each city. We can define the city
field as a geographic attribute.
mrs_db_geo <- mrs_db |> define_geoattribute( dimension = "where", attribute = "city", from_attribute = c("long", "lat") )
We can also associate the geographic information of a vector layer of points or polygons to an attribute (or set of attributes), using the same function.
mrs_db_geo <- mrs_db_geo |> define_geoattribute( dimension = "where", attribute = "state", from_layer = us_layer_state, by = "STUSPS" )
The vector layer used is defined at the state level. If there is another field at another level of detail in the layer, the same layer can be used to define other attributes of coarser granularity.
mrs_db_geo <- mrs_db_geo |> define_geoattribute( dimension = "where", attribute = "region", from_layer = us_layer_state, by = "DIVISION" )
If there is no field in the layer that allows us to establish the relationship, the information associated with other attributes of the dimension can be used.
mrs_db_geo_2 <- mrs_db_geo |> define_geoattribute( dimension = "where", attribute = "region", from_attribute = "state" )
If there are still unrelated instances of the dimension, the define_geoattribute()
function warns and the instances can be consulted using the check_geoattribute_geometry()
function.
Through these functions we have defined relationships between the attributes and vector layers of geographic information, at the level of detail we need, which we can later take advantage of.
A query is defined on a star_database
object and the result of executing it is another star_database
object.
This section presents the functions available to define queries.
star_query()
From a star_database
object, an empty star_query
object is created where we can select fact measures, dimension attributes and filter dimension rows.
Example:
sq <- mrs_db_geo |> star_query()
At least one fact table with one dimension must be included in each query.
select_fact()
To define the fact table to be consulted, its name is indicated, optionally, a vector of names of selected measures and another of aggregation functions are also indicated. If the name of any of the measures is not indicated, the measure corresponding to the number of rows added will be included, which is always included. If no aggregation function is included, those defined for the measures are considered.
Examples:
sq_1 <- sq |> select_fact( name = "mrs_age", measures = "all_deaths", agg_functions = "MAX" )
The measure is considered with the indicated aggregation function. In addition, the measure corresponding to the number of grouped records that make up the result is automatically included.
sq_2 <- sq |> select_fact(name = "mrs_age", measures = "all_deaths")
The measure is considered with the aggregation function defined in the multidimensional scheme.
sq_3 <- sq |> select_fact(name = "mrs_age")
Only the measure corresponding to the number of grouped records is included.
sq_4 <- sq |> select_fact(name = "mrs_age", measures = "all_deaths") |> select_fact(name = "mrs_cause")
In a query we can select several fact tables, at least we have to select one.
select_dimension()
To include a dimension in a star_query
object, we have to define its name and a subset of the dimension attributes. If only the name of the dimension is indicated, it is considered that all its attributes should be added.
Example:
sq_1 <- sq |> select_dimension(name = "where", attributes = c("city", "state"))
Only the indicated attributes of the dimension will be included.
sq_2 <- sq |> select_dimension(name = "where")
All attributes of the dimension will be included.
filter_dimension()
Allows us to define selection conditions for dimension rows. Conditions can be defined on any attribute of the dimension, not only on attributes selected in the query for the dimension. They can also be defined on unselected dimensions. Filtering is done using the function dplyr::filter()
. Conditions are defined in exactly the same way as in that function.
Example:
sq <- sq |> filter_dimension(name = "when", week <= " 3") |> filter_dimension(name = "where", city == "Bridgeport")
run_query()
Once we have selected the facts, dimensions and defined the conditions on the instances of dimensions, we can execute the query to obtain the result.
The query can be executed on any star_database
object that has in its structure the elements that appear in it. If the star_database
has geographic information associated with it, this will be filtered according to its conditions.
Example:
sq <- star_query(mrs_db_geo) |> select_dimension(name = "where", attributes = c("region", "state")) |> select_dimension(name = "when", attributes = "year") |> select_fact(name = "mrs_age", measures = "all_deaths") |> select_fact(name = "mrs_cause", measures = "all_deaths") |> filter_dimension(name = "when", week <= " 3" & year >= "2010") mrs_db_geo_3 <- mrs_db_geo |> run_query(sq) class(mrs_db_geo_3)
The result of running a query is an object of the star_database
class that meets the conditions defined in the query: Other queries can continue to be defined on this object.
We can see a representation of the tables of the result, as shown below.
mrs_db_geo_3 |> draw_tables()
This section shows an example of how to exploit the result of the multidimensional query.
The first thing we do is transform it into flat tables.
ft <- mrs_db_geo_3 |> as_single_tibble_list() ft_age <- ft[["mrs_age"]]
Below are the rows of one of the result tables.
pander::pandoc.table(ft_age, split.table = Inf)
From the results in the form of flat tables, pivottabler
package can be used to present it in the form of pivot tables.
pt <- pivottabler::qpvt( ft_age, c("=", "region"), c("year"), c("Number of Deaths" = "sum(all_deaths)") ) pt$renderPivot()
If in a star_database
object we have attributes to which geographic information has been associated, we can obtain geographic information layers that include fact and dimension data.
To include only the data we need, using query operations we can previously filter the star_database
object. Once the geographic information layer is obtained, it can also be filtered to select the variables represented.
The following sections show how to obtain a geographic information layer and the operations that we can perform on it.
geolayer
objectFrom a star_database
object with some attribute to which we have associated geographic information (geoattribute), we can obtain a geolayer
object. If there is more than one, we have to indicate the layer granularity geoattribute.
gl_state <- mrs_db_geo_3 |> as_geolayer(attribute = "state")
The geolayer
object is composed of a layer of geographic information at the level of detail of the geoattribute and another where the variables of the recorded data are described.
layer_state <- gl_state |> get_layer() layer_state var_state <- gl_state |> get_variables() var_state
We can generate the geographic information layer so that it contains only the objects for which we have additional information, as we have done.
plot(sf::st_geometry(layer_state)) text( sf::st_coordinates(sf::st_centroid(sf::st_geometry(layer_state))), labels = layer_state$state, pos = 3, cex = 0.5 )
We can also generate it for all objects in the original layer, even if they do not contain information.
layer_state_all <- gl_state |> get_layer(keep_all_variables_na = TRUE) plot(sf::st_shift_longitude(sf::st_geometry(layer_state_all)))
For the variables, in addition to being able to obtain them in tibble
format, we can consult for those whose name we indicate or for all of their meaning, as shown below.
gl_state |> get_variable_description(c("var_01", "var_10")) vd <- gl_state |> get_variable_description() vd[c("var_01", "var_10")]
The variables are a tibble
and we can select them using the dplyr::filter()
function.
var_state_2 <- var_state |> dplyr::filter(year == '2016')
Once the variables we need have been filtered, we can filter the geographic information layer so that it will only contain these variables.
gl_state_2 <- gl_state |> set_variables(var_state_2) layer_state_2 <- gl_state_2 |> get_layer()
plot(sf::st_geometry(layer_state_2)) text( sf::st_coordinates(sf::st_centroid(sf::st_geometry(layer_state_2))), labels = layer_state_2$state, pos = 3, cex = 0.5 )
For example, for each state we are going to represent the percentage of pneumonia and influenza deaths registered, starting in 2010. First of all, we must define the query.
sq_2 <- star_query(mrs_db_geo) |> select_dimension(name = "where", attributes = "state") |> select_fact(name = "mrs_cause", measures = c("pneumonia_and_influenza_deaths", "all_deaths")) |> filter_dimension(name = "when", year >= "2010")
We run the query and get a star_database
object as a result.
mrs_db_geo_3 <- mrs_db_geo |> run_query(sq_2)
We obtain a geolayer
object.
gl_state_3 <- mrs_db_geo_3 |> as_geolayer(attribute = "state")
Finally, we represent it.
gl_state_3 |> get_variable_description() layer <- gl_state_3 |> get_layer() layer$tpc_deaths <- (layer$var_2 / layer$var_1) * 100 plot(layer[, "tpc_deaths"], main = "% pneumonia and influenza")
To maintain the definition of the new variable in the geolayer
object, we can define that this is the new geographic information layer of it.
gl_state_3 <- gl_state_3 |> set_layer(layer)
If we want to treat the geographic information layer with another tool, we can export it along with the variable definition table in GeoPackage format.
f <- gl_state_3 |> as_GeoPackage(dir = tempdir()) sf::st_layers(f)
This document presents some of the querying possibilities that offers the rolap
package. The queries are formulated on an object of class star_database
and the result is another object of the same class on which additional queries can be made.
Queries can be formulated about a star or set of stars or constellation.
We can define attributes that have associated geographic information. If any of these geographic attributes are included in the result of a query, we obtain a geographic information layer in which the result of the query is defined in the form of variables that can be filtered and queried.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.