README.md

cdsR - R Interface to Microsoft’s Dataverse (formerly Common Data Service - CDS)

Dataverse (formerly Common Data Service - CDS) is Microsoft’s unified data storage for all business applications marketed under the Power Apps umbrella. As many companies are using Microsoft’s technology stack, Dataverse is an important data source for analysts and data scientists. The goal of this package is to make it dead simple to get your data from (and to) your Dataverse Web API instance and leverage its rich metadata to automate data analysis tasks. The package offers a suite of pipeable, simple to use functions shielding the user from the unmemorizable syntax of OData requests and ensuring correct conversion of data types (looking at you picklists!).

Installation and Usage Disclaimers

To install simply run:

devtools::install_github("jarekkupisz/cdsR")

As you may notice the package is not on CRAN, there are no runnable examples and no tests folder. The reason is that for the package to work you need to initialize a connection to the Dataverse’s Web API. This is not currently possible outside your organization’s Microsoft Power App’s tenant and without creating an application user with unique and not sherable credentials.

Despite this, the package is fully featured, documented, and is ready for production-grade tasks. I am not affiliated with Microsoft in any way and this is not an official API by any means! However, as I use it daily in my work, I thought that other #rstats souls working with MS stack might find it very useful.

Why cdsR? As the package was created by me months before MS decided to rebrand Common Data Service, I just stayed with the old name. What is more, Dataverse just has no sensible abbreviation which would work nicely with the package’s naming convention.

Connecting to your Dataverse

If you want to get your data and use other functions in the package, first you need to authenticate. Simply run:

library(cdsR)
initialize_cds_token(
  cds_instance_name = "prod", #choose any name you like for a Dataverse instance
  cds_instance_url = "https://yourorg.crm3.dynamics.com",
  tenant_id = "your tenant guid",
  client_id = "your user guid",
  client_secret = "your user pass",
  web_api_version = "9.1" #choose any you like, 9.1 is the default for now
)

This function will create a cds_token S3 object in the options() of your R session. You can set as many tokens as you like, as long as you give each a unique cds_instance_name. I typically create one for “prod”, “testing” and “dev” instances. Get all initialized tokens with getOption("cdsR.tokens").

They also automatically refresh! You don’t need to keep calling initialize_cds_token(). Just do it once at the start of your work or in .Rprofile. Before each function call the token checks how much time passed before it got its credentials, and if too much, it will get fresh ones!

Download Data from Your Dataverse as a tibble

Ready to get some data? If you know your entity’s logical plural name then:

download_cds("contacts")

Wait, that’s it? The short answer is yes! If you want more features though follow along.

First, how come download_cds() knows where to connect? If there is only one cdsR token set it will use it automatically. That’s just so convenient in interactive sessions where you toy with your data. But what if you have scripts that need to run on multiple Dataverse instances or you want to write a readable, production-ready code? That’s where the cds_instance_name argument comes in. A cds_token object from options() with the specified name will be selected:

download_cds("contacts", cds_instance_name = "prod")

The pattern below is something I use all the time if I want to download the same entity from multiple instances:

c("prod", "testing") %>% purrr::map(~ download_cds("contacts", .x))

Okay then but this query will simply pull all possible columns and records but typically that’s not what we want. We want either a subset of columns, apply a specific filter or just to download a few rows to play with the data. Hopefully, you want one thing at a time, because writing your own OData HTTP requests is one of those moments that make you hate your job. Fortunately, other arguments of download_cds() come to the rescue:

download_cds(
  "contacts",
  select = c("fullname", "pa_salesscore", "_owninguser_value"),
  filter = cds_filter_fun("LastXMonths", "createdon", PropertyValue = 3),
  order_by = "pa_salesscore desc, fullname",
  top = 100
)

Woah a complex OData query using only understandable R expressions! There’s a couple of things to unpack, so let’s dive in:

cds_df – A Bridge Between Dataverse and R

So the result that you get from download_cds() loos like a tibble but when you investigate closely you can see some pleasant surprises. This all comes down to the fact that the package strips off the weirdness of some of the OData data types and converts everything into human and R readable format:

cds_ds’s Dataverse Metadata

Apart from sensible type conversions that allow you to easily use the data, there is more utility hiding in data frames returned by download_cds(). Actually, I lied to you as technically what you get is a cds_df object, which is a tibble with some addons. In the attributes of cds_dfs’ you’ll find 3 entries that contain the whole Dataverse metadata of the entity you just downloaded!

There is a helper function get_cds_md() which allows you to specify which definition you wish to extract with which_md argument:

Let’s give this rich metadata a spin. Let’s assume you forgot what is the display name for the entity you are working with:

contacts <- download_cds("contacts")
contacts %>% get_cds_md() %>% dplyr::select(DisplayName)

This time you want to select all picklists for your model to run on:

contacts %>%
  select(
    get_cds_md(., "attributes") %>% 
      filter(AttributeType == "Picklist") %>% 
      pull(LogicalName)
  )

As long as the info you need hides somewhere in the Dataverse’s metadata you can use it in your code.

Update Records in your Dataverse

Manipulating data in your session is fun for our analytical minds, but the most potent way we can contribute to our organizations is to return the results of our numerical work to those who will use it. As all Power Apps work on Dataverse, updating records in your instance will benefit all of your business applications at once! I hate the word, but this is a powerful concept from Microsoft.

You’ve just trained a customer churn prediction algorithm and want to help your sales reps by sorting their contact list by a propensity to go away. update_cds_records is your buddy this time. We’ll use contacts entity we downloaded earlier:

contacts %>%
  inner_join(churn_algo_results, by = "contactid") %>%
  select(contactid, ap_churnrisk = prob_to_churn) %>%
  update_cds_records(
    cds_instance_name = "prod",
    fail_safe_entity_name = "contact",
    force_operation = TRUE
  )

This call will even give you an ETA for the end of the update job and warn about any errors no the way. This function though works a bit differently:

Automatic Type Conversion with update_cds_records()

Ok then, simple enough for small requests like the one above. What about state column? Picklists, multipicklists, dates? They all need to be parsed to integer keys or specially formatted strings for the Web API to accept a POST call, right? Well update_cds_recrods() does everything for you! Simply use the human-readable labels that were returned to you with download_cds() and pass the mutated and filtered cds_df you are working with to update_cds_records()! There are a few quirks though:

contacts %>% 
  mutate(
    `ownerid@odata.bind`="systemusers(2zcaa784-0dd5-dr11-81e4-123c2981575g)"
  )

Deleting Records

Cleaning time! Duplicates and legacy records are a common plague in any CRM-like application. As a data scientist, you’ve just trained an ML model that recognizes low utility entries. Just pipe away:

contacts %>%
  filter(contactid %in% cleaning_algo_results$junk_contacts_guids) %>%
  delete_cds_records("prod", "contact")

All the fail-safe features are the same as in the update_cds_recrods(). For example, the above code would not remove any rows that were changed before you used download_cds() to get your contacts. Please refer to the function’s documentation or previous section for the details.

Creating New Records

This seems like the least useful operation, however, if you worked with any CRM-type system in the past you know that some kind of data migration/integration awaits you. That’s why I decided to include a simple function create_cds_recrods(). It allows you to create new records from a fresh data frame. This function does not include any of the fancy type conversions of update_cds_records(). I suggest leveraging get_cds_md() on a cds_df representing a Dataverse table you wish to produce records in.

new_contacts <- tribble(
  ~firstname, ~lastname, ~department,
  "Andy",     "Doe",      1,
  "John",     "Smith",    3 
)
create_cds_records(
  new_contacts,
  cds_instance_name = "prod",
  target_entity_set_name = "contacts"
)


jarekkupisz/cdsR documentation built on Dec. 20, 2021, 9:05 p.m.