``` {r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  eval = FALSE
)
```
Dataverse (formerly Common Data Service - CDS) is Microsoft’s unified data storage for all business applications marketed under the Power Apps umbrella. As many companies use Microsoft’s technology stack, Dataverse is an important data source for analysts and data scientists. The goal of this package is to make it dead simple to get your data from (and to) your Dataverse Web API instance and to leverage its rich metadata to automate data analysis tasks. The package offers a suite of pipeable, simple-to-use functions that shield the user from the unmemorizable syntax of OData requests and ensure correct conversion of data types (looking at you, picklists!).

To install, simply run:
``` {r, eval = FALSE}
devtools::install_github("jarekkupisz/cdsR")
```
As you may notice, the package is not on CRAN, there are no runnable examples, and there is no tests folder. The reason is that for the package to work you need to initialize a connection to the [Dataverse’s Web API](https://docs.microsoft.com/en-us/powerapps/developer/data-platform/webapi/overview). This is not currently possible outside your organization’s Microsoft Power Apps tenant and without creating an application user with unique, non-shareable credentials. Despite this, the package is fully featured, documented, and ready for production-grade tasks. I am not affiliated with Microsoft in any way and this is not an official API by any means! However, as I use it daily in my work, I thought that other #rstats souls working with the MS stack might find it very useful.

Why `cdsR`? As the package was created by me months before MS decided to rebrand Common Data Service, I just stayed with the old name. What is more, Dataverse just has no sensible abbreviation which would work nicely with the package’s naming convention.

## Connecting to your Dataverse

If you want to get your data and use other functions in the package, first you need to authenticate. Simply run:

``` {r, eval = FALSE}
library(cdsR)
initialize_cds_token(
  cds_instance_name = "prod", # choose any name you like for a Dataverse instance
  cds_instance_url = "https://yourorg.crm3.dynamics.com",
  tenant_id = "your tenant guid",
  client_id = "your user guid",
  client_secret = "your user pass",
  web_api_version = "9.1" # choose any you like, 9.1 is the default for now
)
```
This function will create a `cds_token` S3 object in the `options()` of your R session. You can set as many tokens as you like, as long as you give each a unique `cds_instance_name`. I typically create one for “prod”, “testing” and “dev” instances. Get all initialized tokens with `getOption("cdsR.tokens")`.

They also refresh automatically! You don’t need to keep calling `initialize_cds_token()`. Just do it once at the start of your work or in your `.Rprofile`. Before each function call the token checks how much time has passed since it got its credentials, and if too much, it fetches fresh ones!
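For example, a minimal sketch of working with more than one token; the sandbox URL and credentials below are placeholders:

```r
# Register a second token for a sandbox instance.
# web_api_version is omitted here, so the 9.1 default applies.
initialize_cds_token(
  cds_instance_name = "testing",
  cds_instance_url = "https://yourorg-sandbox.crm3.dynamics.com",
  tenant_id = "your tenant guid",
  client_id = "your user guid",
  client_secret = "your user pass"
)

# Both tokens are now available for the rest of the session
getOption("cdsR.tokens")
```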
Ready to get some data? If you know your entity’s logical plural name then:
``` {r, eval = FALSE}
download_cds("contacts")
```
Wait, that’s it? The short answer is yes! If you want more features though, follow along. First, how come `download_cds()` knows where to connect? If there is only one cdsR token set, it will use it automatically. That’s just so convenient in interactive sessions where you toy with your data. But what if you have scripts that need to run on multiple Dataverse instances, or you want to write readable, production-ready code? That’s where the `cds_instance_name` argument comes in. A `cds_token` object from `options()` with the specified name will be selected:

```r
download_cds("contacts", cds_instance_name = "prod")
```
The pattern below is something I use all the time if I want to download the same entity from multiple instances:
c("prod", "testing") %>% purrr::map(~ download_cds("contacts", .x))
Okay, but a plain `download_cds("contacts")` call will simply pull all possible columns and records, and typically that’s not what we want. We want a subset of columns, a specific filter applied, or just a few rows to play with the data. Hopefully you want one thing at a time, because writing your own OData HTTP requests is one of those moments that make you hate your job. Fortunately, other arguments of `download_cds()` come to the rescue:
download_cds( "contacts", select = c("fullname", "pa_salesscore", "_owninguser_value"), filter = cds_filter_fun("LastXMonths", "createdon", PropertyValue = 3), order_by = "pa_salesscore desc, fullname", top = 100 )
Whoa, a complex OData query using only understandable R expressions! There are a couple of things to unpack, so let’s dive in:
- The `select`, `filter`, `order_by` and `top` arguments map to OData’s `$select`, `$filter`, `$orderby` and `$top`. You can pass any character vector that would be a valid part of an OData operand. If such a vector has a length longer than 1, it is collapsed into a single OData expression (see `select` above). In the case of `filter` the expressions are pasted with `and`, similarly to dplyr’s `filter`. `top` of course requires a single integer.
- `cdsR` provides a helper `cds_filter_fun()` that typically takes 2 strings – a Web API filter function name and a field on which you wish to subset. For certain conditions you’ll need more than 2; then pass named arguments as the `...` (see the sketch below).
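To make the collapsing rules concrete, here is a hedged sketch; the raw `statecode eq 0` condition and the 6-month window are assumptions made purely for illustration:

```r
download_cds(
  "contacts",
  # a length > 1 select collapses into one $select expression
  select = c("fullname", "createdon"),
  # multiple filter conditions are pasted together with "and"
  filter = c(
    "statecode eq 0",
    cds_filter_fun("LastXMonths", "createdon", PropertyValue = 6)
  ),
  top = 10
)
```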
## `cds_df` – A Bridge Between Dataverse and R

So the result that you get from `download_cds()` looks like a `tibble`, but when you investigate closely you can see some pleasant surprises. This all comes down to the fact that the package strips off the weirdness of some of the OData data types and converts everything into a human- and R-readable format:
- Lookup fields come back as two columns: one with the target record’s `GUID` as a character vector and one with a display attribute of the target entity, suffixed with `.display`. If you want to know which user is a record’s owner, look for an `owninguser` column to get their GUID and `owninguser.display` for their name.
- Date fields are parsed into proper date-times with some `lubridate` magic. Remember that the Dataverse works only in UTC!
- Empty values come back as `NA`s.
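A small sketch of the lookup column pair described above, assuming the entity has the owner lookup mentioned earlier:

```r
# The raw GUID column and its human-readable ".display" companion side by side
download_cds("contacts") %>%
  dplyr::select(fullname, owninguser, owninguser.display)
```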
## `cds_df`’s Dataverse Metadata

Apart from the sensible type conversions that allow you to easily use the data, there is more utility hiding in the data frames returned by `download_cds()`. Actually, I lied to you, as technically what you get is a `cds_df` object, which is a tibble with some add-ons. In the attributes of `cds_df`s you’ll find 3 entries that contain the whole Dataverse metadata of the entity you just downloaded!
There is a helper function `get_cds_md()` which allows you to specify which definition you wish to extract with the `which_md` argument:
- `"entity"`, the default, represents a tibble (an actual one this time!) with MD about the entity itself.
- `"attributes"` gets you a tibble with MD containing all information about the entity’s fields.
- `"picklists"` returns a list of tibbles, one for each enum/picklist type field in the `cds_df`, holding the pairs of keys and labels for each option.

Let’s give this rich metadata a spin. Let’s assume you forgot what the display name is for the entity you are working with:
```r
contacts <- download_cds("contacts")
contacts %>% get_cds_md() %>% dplyr::select(DisplayName)
```
This time you want to select all picklists for your model to run on:
```r
contacts %>%
  select(
    get_cds_md(., "attributes") %>%
      filter(AttributeType == "Picklist") %>%
      pull(LogicalName)
  )
```
As long as the info you need hides somewhere in the Dataverse’s metadata you can use it in your code.
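One more hedged sketch along the same lines, this time with the `"picklists"` metadata; the field name `pa_customertype` is hypothetical and the assumption is that the returned list is keyed by the picklist fields’ logical names:

```r
picklists_md <- contacts %>% get_cds_md("picklists")

names(picklists_md)               # which picklist fields the entity has
picklists_md[["pa_customertype"]] # key/label pairs for one of them (assumed name)
```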
Manipulating data in your session is fun for our analytical minds, but the most potent way we can contribute to our organizations is to return the results of our numerical work to those who will use it. As all Power Apps work on Dataverse, updating records in your instance will benefit all of your business applications at once! I hate the word, but this is a powerful concept from Microsoft.
You’ve just trained a customer churn prediction algorithm and want to help your sales reps by sorting their contact list by a propensity to go away. `update_cds_records()` is your buddy this time. We’ll use the `contacts` entity we downloaded earlier:
```r
contacts %>%
  inner_join(churn_algo_results, by = "contactid") %>%
  select(contactid, ap_churnrisk = prob_to_churn) %>%
  update_cds_records(
    cds_instance_name = "prod",
    fail_safe_entity_name = "contact",
    force_operation = TRUE
  )
```
This call will even give you an ETA for the end of the update job and warn about any errors along the way. This function, though, works a bit differently:
- `cds_instance_name` always has to be specified this time. Because we are launching a potentially destructive operation (and anyone who has tried to recover data from Dataverse got MS hell at its “finest”), the package needs to make sure you know what you are doing.
- `fail_safe_entity_name` serves basically as another defence to save you from dire consequences. This function is easy to call at the end of a pipe and I got used to providing the first argument too quickly. This argument forces you to explicitly type which Dataverse table you want to update, and the function will confirm with the `cds_df`’s metadata that indeed you are good to go.
- `force_operation` is yet another fail-safe. By default it is `FALSE`, and it uses the Dataverse Web API’s optimistic concurrency control to prevent you from updating records that might have been changed since you downloaded them. If you don’t wish to overwrite any changes, you’ll need to include the `@odata.etag` column which is included in every `download_cds()` result (see the sketch below).
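A hedged sketch of that optimistic-concurrency path, reusing the churn example objects from above:

```r
# Keep the @odata.etag column and leave force_operation at its FALSE default,
# so records changed in Dataverse since the download are not overwritten.
contacts %>%
  inner_join(churn_algo_results, by = "contactid") %>%
  select(contactid, `@odata.etag`, ap_churnrisk = prob_to_churn) %>%
  update_cds_records(
    cds_instance_name = "prod",
    fail_safe_entity_name = "contact"
  )
```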
Ok then, calling `update_cds_records()` is simple enough for small requests like the one above. What about a `state` column, though? Picklists, multi-picklists, dates? They all need to be parsed to integer keys or specially formatted strings for the Web API to accept a POST call, right? Well, `update_cds_records()` does everything for you! Simply use the human-readable labels that were returned to you by `download_cds()` and pass the mutated and filtered `cds_df` you are working with to `update_cds_records()`! There are a few quirks though:
- Lookup columns need to be named after the `lookup_relationship_name@odata.bind` pattern and have a character vector with the following template for values: `target_entity_plural_name(GUID_of_the_target_record)`. An example would be:

  ```r
  contacts %>%
    mutate(`ownerid@odata.bind` = "systemusers(2zcaa784-0dd5-dr11-81e4-123c2981575g)")
  ```

- A data frame provided to `update_cds_records()` has to be a `cds_df` with all the metadata in its `attributes()`, otherwise the function errors. `dplyr`’s verbs, unfortunately, get rid of attributes. That’s why `cdsR` registers S3 methods for `mutate`, `*_join`, `select` and everything else that was possible to register so that they preserve the attributes. In my daily work I rarely stumble onto this problem, but if you do, you can use `is.cds_df()` to troubleshoot whether you are still dealing with a `cds_df`.
- If you ever need the integer keys instead of the human-readable labels yourself, you can use `parse_cds_enums(cds_df, to = "key")`.
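Putting the quirks together, a hedged end-to-end sketch; the `pa_customertype` field, its “Partner” label and the owner GUID are placeholders:

```r
contacts %>%
  mutate(
    pa_customertype = "Partner",  # picklist: pass the label, the key is resolved for you
    `ownerid@odata.bind` = "systemusers(2zcaa784-0dd5-dr11-81e4-123c2981575g)"
  ) %>%  # cdsR's mutate method keeps the cds_df metadata attributes intact
  update_cds_records(
    cds_instance_name = "prod",
    fail_safe_entity_name = "contact",
    force_operation = TRUE
  )
```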
Cleaning time! Duplicates and legacy records are a common plague in any CRM-like application. As a data scientist, you’ve just trained an ML model that recognizes low utility entries. Just pipe away:
```r
contacts %>%
  filter(contactid %in% cleaning_algo_results$junk_contacts_guids) %>%
  delete_cds_records("prod", "contact")
```
All the fail-safe features are the same as in `update_cds_records()`. For example, the above code would not remove any rows that were changed after you used `download_cds()` to get your contacts. Please refer to the function’s documentation or the previous section for the details.
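Since the fail-safes mirror `update_cds_records()`, a hedged sketch of keeping only what the delete needs while preserving the concurrency check:

```r
contacts %>%
  filter(contactid %in% cleaning_algo_results$junk_contacts_guids) %>%
  select(contactid, `@odata.etag`) %>%  # keep the etag so rows changed since download are skipped
  delete_cds_records("prod", "contact")
```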
This seems like the least useful operation; however, if you have worked with any CRM-type system in the past, you know that some kind of data migration/integration awaits you. That’s why I decided to include a simple function, `create_cds_records()`. It allows you to create new records from a fresh data frame. This function does not include any of the fancy type conversions of `update_cds_records()`. I suggest leveraging `get_cds_md()` on a `cds_df` representing the Dataverse table you wish to produce records in.
```r
new_contacts <- tribble(
  ~firstname, ~lastname, ~department,
  "Andy",     "Doe",     1,
  "John",     "Smith",   3
)

create_cds_records(
  new_contacts,
  cds_instance_name = "prod",
  target_entity_set_name = "contacts"
)
```
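Following the suggestion above, a hedged sketch of checking how those raw columns are typed before sending them, reusing the metadata helpers from earlier (the field names are just the ones from the example):

```r
download_cds("contacts") %>%
  get_cds_md("attributes") %>%
  filter(LogicalName %in% c("firstname", "lastname", "department")) %>%
  select(LogicalName, AttributeType)
```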