The purpose of the package is to enable R users to use MarkLogic without deep knowledge of JavaScript/XQuery or the REST API. It goes beyond ODBC and exposes much of the functionality within MarkLogic.

It uses the REST API built into MarkLogic Server, together with REST extensions.

Currently rfml uses the following extensions:

In addition, there are two libraries installed:

All backend code is developed in JavaScript. Depending on the MarkLogic Server version, either jsearch or cts search functions are used. The code also tries to use existing range indexes where possible; currently only element range indexes are supported.

The package is designed to utilize the MarkLogic server as much as possible by pushing processing back to the database and only keeping information about the data on the client.

## Data handling

Most data within MarkLogic Server is stored as either XML or JSON, which are very flexible data formats that allow a hierarchical structure. However, this type of storage is not always optimal for analysis in R, since most analytical functions expect a tabular format.

rfml overcomes that obstacle by converting data retrieved from MarkLogic Server into a tabular format. It can handle both XML and JSON documents.
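To make the idea of flattening concrete, here is a minimal base R sketch. It is not rfml's implementation (rfml does the conversion on the server); it assumes a small hypothetical nested list standing in for a hierarchical document and uses `unlist()` to collapse it into a single row.

```R
# A hypothetical nested document, represented as an R list (illustration only).
doc <- list(
  orderNumber = "T200",
  lineItems = list(
    list(productNo = 2, quantityOrdered = 21),
    list(productNo = 4, quantityOrdered = 32)
  )
)

# unlist() walks the hierarchy and returns one named value per leaf.
leaves <- unlist(doc)

# Turn the leaves into a single-row data.frame, one column per leaf value.
# Duplicate names are made unique by as.data.frame().
flat <- as.data.frame(as.list(leaves), stringsAsFactors = FALSE)

dim(flat)
```

The column names produced here follow base R's rules, not the rfml naming scheme.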

As an example, consider the following XML document:

```xml
<customerOrder>
  <orderNumber>T200</orderNumber>
  <lineItems>
    <lineItem>
      <orderLineNumber>1</orderLineNumber>
      <productNo>2</productNo>
      <productName>1900s Vintage Bi-Plane</productName>
      <quantityOrdered>21</quantityOrdered>
      <priceEach>33.00</priceEach>
      <lineTotal>693</lineTotal>
    </lineItem>
    <lineItem>
      <orderLineNumber>2</orderLineNumber>
      <productNo>4</productNo>
      <productName>Collectable Wooden Train</productName>
      <quantityOrdered>32</quantityOrdered>
      <priceEach>23.45</priceEach>
      <lineTotal>750.4</lineTotal>
    </lineItem>
  </lineItems>
</customerOrder>
```

When viewed using rfml it will look like this:

| cr1orderNumber | cr1ls1lm11orderLineNumber | cr1ls1lm11productNo | cr1ls1lm11productName | cr1ls1lm11quantityOrdered | cr1ls1lm11priceEach | cr1ls1lm11lineTotal | cr1ls1lm12orderLineNumber | cr1ls1lm12productNo | cr1ls1lm12productName | cr1ls1lm12quantityOrdered | cr1ls1lm12priceEach | cr1ls1lm12lineTotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| T200 | 1 | 2 | 1900s Vintage Bi-Plane | 21 | 33.00 | 693 | 2 | 4 | Collectable Wooden Train | 32 | 23.45 | 750.4 |

Behind the scenes, all data is flattened before it is returned to the R client. Column names are built from the position of the element in the hierarchy: for each parent element, the first and last character of its name followed by a number, all placed before the element name.

For example, `<customerOrder><orderNumber>` becomes `cr1orderNumber`.

This means that values at the same level with the same element name end up in the same column in the result.

Another effect is that there is a risk of getting very wide tables, so a best practice is to keep your documents in MarkLogic Server simple if they are going to be used with rfml.
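The naming rule described above can be sketched in base R. This is an illustration only, not the code rfml runs on the server. The helper names `abbreviate_parent` and `flat_column_name` are hypothetical, and so is the `position` argument: it assumes the extra digit seen for repeated elements (as in `lm11` and `lm12` in the table) is the position of the repeated parent.

```R
# Abbreviate a parent element name: first character + last character + a number.
# Assumption: the number is 1 for every parent, as in the example document.
abbreviate_parent <- function(name, n = 1) {
  paste0(substr(name, 1, 1), substr(name, nchar(name), nchar(name)), n)
}

# Build a flattened column name from the parent path and the leaf element name.
# `position` is a hypothetical argument for the index of a repeated parent.
flat_column_name <- function(parents, element, position = "") {
  prefix <- paste0(vapply(parents, abbreviate_parent, character(1)), collapse = "")
  paste0(prefix, position, element)
}

flat_column_name("customerOrder", "orderNumber")
# [1] "cr1orderNumber"

flat_column_name(c("customerOrder", "lineItems", "lineItem"),
                 "orderLineNumber", position = 2)
# [1] "cr1ls1lm12orderLineNumber"
```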

## Setting up a MarkLogic database to be used with rfml

Before you can use rfml against a MarkLogic Server database, you need to make sure that a REST instance, with a modules database, is defined for it. You can create a REST instance by following http://docs.marklogic.com/guide/rest-dev/service#id_12021

You also need to install the REST extensions and library modules; this is needed once for each database that is going to be used with the package. To do that, you need an administrator user, or a user with the rest-admin role or the following privileges:

The actual installation is done with the ml.init.database function, as in the example below. host is the hostname or IP address of your MarkLogic Server (in this example localhost), and port is the port your REST server listens on (in this example 8000, the default, which uses the Documents database).

```R
library(rfml)
# set up the database to be used with rfml
ml.init.database(host = "localhost", port = "8000", adminuser = "admin", password = "admin")
```
Once the function is done you can start using rfml.

## Using rfml
Before data can be selected, a call to ml.connect is needed. The function verifies that the database is set up correctly and returns a connection object. You can have multiple connections at the same time. host is the hostname or IP address of your MarkLogic Server (in this example localhost), and port is the port your REST server listens on (in this example 8000, the default, which uses the Documents database).
```R
library(rfml)
# create a connection
localConn <- ml.connect(host = "localhost", port = "8000", username = "admin", password = "admin")
```

Once connected, there are multiple ways to select data from the MarkLogic database.

One way is to use a string query in combination with a collection filter. More information about the query syntax can be found at http://docs.marklogic.com/guide/search-dev/search-api#id_41745

```R
# create a ml.data.frame
mlIris <- ml.data.frame(localConn, "setosa", collection = "iris")
```

Collections are the recommended way of organising documents within a MarkLogic database. To list the existing collections:

```R
# list the collections in the database
ml.collections(localConn)
## [[1]]
## [1] "STOCKS"
##
## [[2]]
## [1] "baskets"
##
## [[3]]
## [1] "iris"
##
## [[4]]
## [1] "iris-test1"
##
## [[5]]
## [1] "newIris"
##
## [[6]]
## [1] "rfml"
##
## [[7]]
## [1] "xmlBaskets"
```

To get the structure of the documents within a collection:

```R
# show the name, document count and fields of the iris collection
ml.collection.info(localConn, "iris")
## $name
## [1] "iris"
##
## $nrows
## [1] 150
##
## $fields
##           name         xpath docFormat xmlns
## 1 Sepal.Length /Sepal.Length      JSON
## 2  Sepal.Width  /Sepal.Width      JSON
## 3 Petal.Length /Petal.Length      JSON
## 4  Petal.Width  /Petal.Width      JSON
## 5      Species      /Species      JSON
```

The extracted fields are based on sampling the first 30 results from searching in the collection. nrows is the number of documents in the collection.

A collection name on its own can also be used to create a ml.data.frame:

```R
# create a ml.data.frame
mlIris <- ml.data.frame(localConn, collection = "iris")
```

It is also possible to do field-level filtering. The requirements differ depending on the operator used. The ">", "<", "!=", "<=" and ">=" operators require an Element Range Index on the field used; an index can be created with the ml.add.index function. "==" can be used without an Element Range Index.

```R
# create a ml.data.frame object by filtering on the Species field
mlIris <- ml.data.frame(localConn, fieldFilter = "Species == setosa", collection = "iris")
head(mlIris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.0         3.3          1.4         0.2  setosa
## 2          5.0         3.2          1.2         0.2  setosa
## 3          5.1         3.8          1.6         0.2  setosa
## 4          5.8         4.0          1.2         0.2  setosa
## 5          5.0         3.4          1.5         0.2  setosa
## 6          5.1         3.4          1.5         0.2  setosa
# create a ml.data.frame object by filtering on the Petal.Length field; this requires an Element Range Index
mlIris <- ml.data.frame(localConn, fieldFilter = "Petal.Length > 5", collection = "iris")
head(mlIris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 1          6.9         3.1          5.4         2.1 virginica
## 2          7.2         3.2          6.0         1.8 virginica
## 3          6.3         3.4          5.6         2.4 virginica
## 4          6.5         3.0          5.2         2.0 virginica
## 5          6.5         3.0          5.8         2.2 virginica
## 6          7.2         3.0          5.8         1.6 virginica
```
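The operator rule for field-level filtering can be checked locally before a filter is sent to the server. The helper below is a hypothetical convenience, not part of rfml: it only inspects the filter string, and it assumes the filtered value itself contains no `<`, `>` or `!=` characters.

```R
# Hypothetical helper: TRUE when a fieldFilter expression uses one of the
# operators (>, <, !=, <=, >=) that need an Element Range Index on the field.
needs_range_index <- function(field_filter) {
  # "<" and ">" also cover "<=" and ">="; a plain "==" matches neither branch.
  grepl("!=|[<>]", field_filter)
}

needs_range_index("Species == setosa")
# [1] FALSE
needs_range_index("Petal.Length > 5")
# [1] TRUE
```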

It is also possible to upload data to the MarkLogic database, which returns a ml.data.frame object.

```R
# create a ml.data.frame object based on the iris data set
mlIris <- as.ml.data.frame(localConn, iris, "iris")
```

It is possible to create new columns for the ml.data.frame object. The columns only exist within the object and are not physically created in the source documents in the database.

```R
# create a field based on an existing field
mlIris$newField <- mlIris$Petal.Width

# create a field based on a calculation on existing fields
mlIris$newField2 <- mlIris$Petal.Width + mlIris$Petal.Length

# create a field by adding a constant to an existing field
mlIris$newField3 <- mlIris$Petal.Width + 10

# create a field by applying a function to an existing field
mlIris$abs_width <- abs(mlIris$Petal.Width)
```

The new columns are calculated at runtime when the data is retrieved, and the calculation is done on the server side.

```R
# pull back the first rows of the result, including the previously created fields
head(mlIris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species newField
## 1          5.0         3.3          1.4         0.2     setosa      0.2
## 2          6.9         3.1          5.4         2.1  virginica      2.1
## 3          6.0         2.9          4.5         1.5 versicolor      1.5
## 4          5.0         3.2          1.2         0.2     setosa      0.2
## 5          7.2         3.2          6.0         1.8  virginica      1.8
## 6          5.0         2.0          3.5         1.0 versicolor      1.0
##   newField2 newField3 abs_width
## 1       1.6      10.2       0.2
## 2       7.5      12.1       2.1
## 3       6.0      11.5       1.5
## 4       1.4      10.2       0.2
## 5       7.8      11.8       1.8
## 6       4.5      11.0       1.0
```
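The idea of columns that exist only as definitions until the data is retrieved can be illustrated with a small base R sketch. This is a local analogy only, not how rfml is implemented (rfml evaluates its definitions on the server); `lazy_frame`, `add_field` and `collect` are hypothetical names made up for this example.

```R
# A toy "lazy" frame: source data plus a list of unevaluated column definitions.
lazy_frame <- function(data) list(source = data, derived = list())

# Record a new column definition without computing anything yet.
add_field <- function(lf, name, expr) {
  lf$derived[[name]] <- expr
  lf
}

# Evaluate the stored definitions only when the data is actually retrieved.
collect <- function(lf) {
  out <- lf$source
  for (nm in names(lf$derived)) {
    out[[nm]] <- eval(lf$derived[[nm]], out)
  }
  out
}

lf <- lazy_frame(iris)
lf <- add_field(lf, "newField2", quote(Petal.Width + Petal.Length))

names(lf$source)   # the source data still has only its original columns
head(collect(lf))  # newField2 is computed at this point
```

As in rfml, the source data is never modified; the derived column appears only in the retrieved result.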

You can also extract a selection from a ml.data.frame into a new ml.data.frame.

For example, the following statement selects only the rows where the column 'Species' equals 'setosa', and only the columns 'Sepal.Length' and 'Sepal.Width':

```R
mlIris2 <- mlIris[mlIris$Species == "setosa", c("Sepal.Length", "Sepal.Width")]
head(mlIris2)
##   Sepal.Length Sepal.Width
## 1          5.0         3.3
## 2          5.0         3.2
## 3          5.1         3.8
## 4          5.8         4.0
## 5          5.0         3.4
## 6          5.1         3.4
```

If the data is going to be used with functions not covered by the rfml package, or with other packages, it can be pulled back and returned as a data.frame object.

```R
localDf <- as.data.frame(mlIris)
is.data.frame(localDf)
## [1] TRUE
```

You can also create new documents in MarkLogic based on a ml.data.frame.

```R
# Generate new documents in MarkLogic Server based on the mlIris ml.data.frame object.
newIris <- as.ml.data.frame(x = mlIris, name = "newIris")
head(newIris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species newField
## 1          5.8         2.7          5.1         1.9  virginica      1.9
## 2          5.1         3.5          1.4         0.2     setosa      0.2
## 3          5.8         2.8          5.1         2.4  virginica      2.4
## 4          6.7         3.3          5.7         2.5  virginica      2.5
## 5          5.1         3.7          1.5         0.4     setosa      0.4
## 6          6.1         3.0          4.6         1.4 versicolor      1.4
##   newField2 newField3 abs_width
## 1       7.0      11.9       1.9
## 2       1.6      10.2       0.2
## 3       7.5      12.4       2.4
## 4       8.2      12.5       2.5
## 5       1.9      10.4       0.4
## 6       6.0      11.4       1.4
```

For more examples of how to use the package, see the help.



mstellwa/rfml documentation built on May 23, 2019, 8:16 a.m.