In regisoc/kibior: A Simple Data Management and Sharing Tool

library(knitr)
knitr::opts_chunk$set(include = TRUE, 
                      echo = TRUE, 
                      warning = FALSE, 
                      message = FALSE, 
                      results = "markup", 
                      collapse = TRUE,
                      cache = FALSE,
                      comment = "##")
library(ggplot2)
library(dplyr)
library(stringr)
library(readr)
library(magrittr)
library(kibior)

Welcome to the kibior package introduction vignette!

General notions

As one of the hot topics in science, making our datasets findable, accessible, interoperable and reusable (FAIR principles) brings openness and versioning, and unlocks reproducibility. To support that, great projects such as the biomaRt R package enable fast consumption and easy handling of massive validated data through a small R interface.

Even though main entities such as Ensembl or NCBI make massive amounts of data available, they do not provide a way to store data elsewhere, delegating data handling to research teams. During data analysis, this can be an issue since researchers often need to send intermediary subsets of analyzed data to collaborators. Moreover, it is pretty common now that, when a new database or dataset emerges, a web platform and an API are provided alongside it, allowing easier exploration and querying.

Multiplying the number of life-science research teams worldwide by the ever-growing number of databases and datasets published on widely varying sub-fields results in an even greater number of ways to query heterogeneous life-science data.

Here, we present an easy way to manipulate and share datasets through decentralization. Indeed, kibior seeks to make available a search engine and distributed database system for sharing data easily through the use of Elasticsearch (ES) and Elasticsearch-based architectures such as Kibio.

It is a way to handle large datasets and unlock the possibility to:

pull/download datasets from a local or remote instance of Elasticsearch,
filter, query and search in large amounts of data,
push/store datasets to a local or remote instance of Elasticsearch,
share datasets with collaborators around the world,
perform joins between R in-memory and ES-based datasets,
import and export datasets from and to files,
valid safe-state datasets during pipeline execution,
comply with FAIR-sharing requirements by allowing REST requests on data and metadata from the Elasticsearch API.

Goal of this vignette

The following sections will explain some basic and advanced technical usage of kibior. A second vignette will apply these features to biological applications.

Vocabulary

We will use both Elasticsearch and R vocabulary, which have similar notions:

| R | Elasticsearch |
|-----------------------------|---------------------|
| data(set), tibble, df, etc. | index |
| columns, variables | fields |
| lines, observations | documents |

kibior uses tibbles as main data representation.

Demonstration datasets

Before moving on to the second, separate vignette showing biological dataset examples, we strongly advise reading the basic and advanced usage sections first. In these sections, we will use some datasets taken from other well-known packages, such as dplyr::starwars...

dplyr::starwars[1:5,]

...dplyr::storms...

dplyr::storms[1:5,]

...datasets::iris...

datasets::iris[1:5,]

...and ggplot2::diamonds to show our examples.

ggplot2::diamonds[1:5,]

Vignettes build requirements

In order to build the kibior vignettes properly, you MUST have a running and accessible Elasticsearch instance. The next section (Deploy a single local Elasticsearch instance) will help you deploy a local instance.

The kc instance declared in the following R code section is based on the previously mentioned configuration and will be used in all examples. If you already have configured Elasticsearch instances, change the next $new() call to match your own configuration.

#> ------------------------------------------------------------
#>                            /!\ 
#> Change this declaration for a custom Elasticsearch instance
#> You MUST HAVE a running and accessible Elasticsearch instance.
#>                            /!\ 
#> ------------------------------------------------------------

#>
#> Change the .Renviron file, default variables are: 
#> 
#>  KIBIOR_BUILD_ES_ENDPOINT="elasticsearch"
#>  KIBIOR_BUILD_ES_PORT=9200 
#>  # KIBIOR_BUILD_ES_USERNAME=
#>  # KIBIOR_BUILD_ES_PASSWORD=
#>
#> and will match the following initialization values:
#>
#>  host = "elasticsearch"
#>  port = 9200
#>  user = NULL
#>  pwd = NULL
#>  verbose = FALSE
#>
#> which will bind "kc" to the instance named "docker-cluster"
#>

# get kibior var from ".Renviron" file
dd <- system.file("doc_env", "kibior_build.R", package = "kibior")
source(dd, local = TRUE)
kc <- .kibior_get_instance_from_env()

#> quiet progress bar
kc$quiet_progress <- TRUE

#> kibior vignette instance: kc
kc

The default login/password for Elasticsearch is elastic/changeme.

For the sake of readability, we will use kc as the generic name for our example instance of kibior.

#> preparing data for the vignette

#> remove unwanted from Elasticsearch
delete_if_exists <- function(index_names){
    tryCatch(
        expr = { kc$delete(index_names) },
        error = function(e){  }
    )
}

c(
    "aaa",
    "bbb",
    "ccc",
    "sw", 
    "iris", 
    "starwars", 
    "starwars_alderaan", 
    "starwars_naboo", 
    "starwars_tatooine", 
    "storms", 
    "diamonds", 
    "storm_with_our_id", 
    "storms_file", 
    "storms_file_moved", 
    "storms_zeta_moved", 
    "test_index_single"
) %>% 
    delete_if_exists()

dplyr::storms %>% 
    dplyr::select(name) %>% 
    unique() %>% 
    as.list() %>% 
    .$name %>% 
    tolower() %>%
    paste0("storms_", .) %>% 
    lapply(delete_if_exists)

Deploying an Elasticsearch instance {#deploy-docker}

Before starting, you should know that this step will start an Elasticsearch service and store all data on your machine.

So, you should weigh the quantity of data you will handle in your code against the remaining space on your computer.

Installation with Docker and docker-compose

To use this feature, you will need Docker and docker-compose installed on your system.

To install Docker, simply follow the steps detailed on its website.

If you are on a Linux / Unix-based system, you should also check the post-installation steps, mainly for the Manage Docker as a non-root user step.

To install docker-compose, simply follow the next steps.

Run your own Elasticsearch instance

The following uses docker-compose. You can also use plain docker by passing parameters on the command line, but it is verbose and not really needed here since we want something simple to use.

Copy-paste these lines inside a "single-es.yml" file.

# "single-es.yml" file
version: '2.4'

services:

  elasticsearch:
    # this configuration will run a service called "elasticsearch"
    container_name: elasticsearch

    # the elasticsearch image used will be version 7.5.1
    # but you can use another version, such as 6.8.6
    image: docker.elastic.co/elasticsearch/elasticsearch:7.5.1

    # defines env var
    # last line tells us java will use 512MB
    # if you need more, change it for 2GB, for instance
    # "ES_JAVA_OPTS=-Xms2g -Xmx2g"
    environment:
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"

    # strict limit to 1GB of RAM
    mem_limit: 1g
    memswap_limit: 0

    # lock memory
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536

    # init a local volume named "local_es"
    volumes:
      - local_es:/usr/share/elasticsearch/data

    # export port to access Elasticsearch service from outside docker
    ports:
      - 9200:9200

# volumes declaration
volumes:
    local_es:

Now, run the configuration to launch the service with:

docker-compose -f /your/path/to/single-es.yml up -d

The Elasticsearch service will then be accessible at http://localhost:9200. You can test it in any browser: it should print a short JSON description of the running node and its cluster.

Once done, you can now use this instance as your own data repository with kibior and R:

#> Initiate a remote connection
kc_remote <- kibior$new(host = "something-far", user = "foo", pwd = "bar")

#> Create a new local instance bound to your local Elasticsearch
#> By default, kibior uses the localhost instance on port 9200
kc_local <- kibior$new()

#> you may need to authenticate since Elasticsearch uses auth system
#> the default login/password is "elastic"/"changeme", so
kc_local <- kibior$new(user = "elastic", pwd = "changeme")

You can now use kc_local as your own private instance.

Stop the Elasticsearch service

To stop the service, simply enter the command:

docker-compose -f /your/path/to/single-es.yml down

Vignettes menu {#vignette-menu}

This vignette is organized as a simple tutorial with some examples you can follow to learn the basics of how kibior works:

Basic usage, shows the main methods and simple examples of how to use them.
Advanced usage, details the specifics of the kibior object and its methods, such as attributes and querying syntax.

The last part is the second vignette, illustrating a more biologically-oriented use case with kibior.

Basic usage {#basic-usage}

Here, we will see the main methods (push(), pull(), list(), columns(), keys(), has(), match(), export(), import(), move(), copy()) and public attributes (verbosity) of the kibior class. kibior uses elastic [@ref_elastic] to perform base functions.

Verbosity attributes

By default, kibior comes with three public attributes: $verbose, $quiet_progress and $quiet_results, all initialized to FALSE.

$verbose toggles the printing of additional information, which can be useful to follow every processing step.
$quiet_progress toggles the printing of progress bars. This can be useful for scripts.
$quiet_results toggles the printing of method results. You may want to silence them when you do not need interactive feedback.

To quickly show them, simply print the instance you are using:

kc

Use kc$<attribute-name> <- TRUE/FALSE to toggle verbosity mode on these three attributes.

A new instance of kibior defaults to interactive behavior: progress bars and results are printed immediately, but no additional information is shown.

See Attributes access in the Advanced usage section for all attribute descriptions.

`$push()`: Store a dataset to Elasticsearch

To store data using kc connection:

kc$push(dplyr::storms, "storms")
# or magrittr style
dplyr::starwars %>% kc$push("starwars")

If not already taken, the given index name will be created automatically before receiving data. If already taken, an error is raised.

Important points:

$push() automatically sends data to the Elasticsearch server, which needs unique IDs.

If no ID column is defined, kibior adds a kid column holding a counter and uses it as unique IDs (default).

You can define your own IDs using the id_col parameter, which requires the name of a column whose elements are unique.

$push() expects well-formatted data, mainly in a data.frame or derivative structure such as tibble.

See Push modes in Advanced usage section for more information.
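
Since id_col requires a column whose elements are unique, it can help to check a candidate column locally before pushing. The following is a minimal base R sketch; the helper name is_valid_id_col() is hypothetical and not part of kibior, and rejecting missing values is an extra assumption made here for safety.

```r
# Hypothetical helper (not part of kibior): check that a column could serve
# as an ID column, i.e. it exists, has no missing values (our assumption)
# and contains no duplicates.
is_valid_id_col <- function(data, col) {
    col %in% names(data) &&
        !anyNA(data[[col]]) &&
        anyDuplicated(data[[col]]) == 0
}

# Build a small example table with an explicit row counter.
df <- cbind(row_id = seq_len(nrow(datasets::iris)), datasets::iris)

is_valid_id_col(df, "row_id")   # unique counter
is_valid_id_col(df, "Species")  # repeated values
```

A column that passes this check is a reasonable candidate for the id_col parameter of $push().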

`$pull()`: Download a dataset from Elasticsearch

The $pull() method downloads datasets. It can retrieve all or parts of datasets.

s <- kc$pull("storms")
s %>% names()

Results are stored in a list of tibbles.

s$storms

With this, we can use search patterns to return multiple indices at once.

See Pattern search in Advanced usage section for more information.

`$list()`: List all Elasticsearch indices

#> list all indices
kc$list()

`$columns()`: List all columns of an Elasticsearch index

#> list all columns
kc$columns("storms")

`$count()`: Count the number of elements

#> count all lines
kc$count("storms")

#> count all columns
kc$count("storms", type = "variables")

#> count all indices lines via a pattern
kc$count("s*")

Like $search() and $pull(), this method accepts a query parameter to count the number of hits in your dataset following a query. See Querying in the Advanced usage section for more information.

`$keys()`: List all unique keys of an Elasticsearch index column

#> list all keys on integer column 
kc$keys("storms", "year")

#> list all keys on string column
kc$keys("storms", "status")

You should not use this on columns that represent a continuous range, such as temperatures or coordinates. It will aggregate all possible values, which can take a large amount of time if your dataset is big enough.
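
One way to judge whether a column is a reasonable candidate for $keys() is to count its distinct values on a local sample first. The sketch below uses only base R on datasets::iris, which stands in here for your own data.

```r
# Count distinct values per column of a local data frame.
# Columns with few distinct values behave like categories;
# columns with many distinct values behave like continuous ranges.
distinct_counts <- vapply(
    datasets::iris,
    function(column) length(unique(column)),
    integer(1)
)
distinct_counts
```

Columns with a small count, such as Species, are the kind of column $keys() is meant for.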

`$has()`: Test if an Elasticsearch index exists

#> test presence of an index
kc$has("storms")
kc$has("abcde")

#> test presence of all indices
c("storms", "abcde") %>% kc$has()

`$match()`: Select matching Elasticsearch indices

#> get exact matching indices 
kc$match("storms")
kc$match("abcde")

#> get matching pattern indices
kc$match("s*")

#> get list of mixed pattern and non pattern matching indices
c("s*", "abcde") %>% kc$match()

$match() and $has() differ on some points:

$has() returns TRUE or FALSE for any string passed.
$has() does not accept patterns and only looks if the given strings are in $list().
$match() only returns something if some indices match the given strings.
$match() accepts patterns and unpacks all possible indices matching given strings.
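
These differences can be illustrated with a base R emulation over a plain character vector of index names. This is only a sketch of the semantics listed above, not kibior's implementation: has_sketch(), match_sketch() and the indices vector are hypothetical, and returning NULL when nothing matches is an assumption.

```r
# Hypothetical stand-in for the names returned when listing indices.
indices <- c("starwars", "storms", "diamonds")

# Emulates the described behaviour of $has(): one TRUE/FALSE per name,
# exact comparison only, patterns are not expanded.
has_sketch <- function(index_names) {
    stats::setNames(index_names %in% indices, index_names)
}

# Emulates the described behaviour of $match(): patterns are expanded and
# only existing index names are returned (NULL here when nothing matches).
match_sketch <- function(patterns) {
    hits <- lapply(patterns, function(p) {
        grep(utils::glob2rx(p), indices, value = TRUE)
    })
    hits <- unique(unlist(hits))
    if (length(hits) == 0) NULL else hits
}

has_sketch(c("storms", "abcde", "s*"))
match_sketch(c("s*", "abcde"))
```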

`$export()`: Extract Elasticsearch index content to a file

#> preparing data for exporting
delete_if_exists("storms")
dplyr::storms %>% kc$push("storms")

The $export() method creates a file and exports an in-memory dataset or an Elasticsearch index to it.

#> Create temp files with data
storms_memory_tmp <- tempfile(fileext=".csv")
storms_elastic_tmp <- tempfile(fileext=".csv")

#> export an in-memory dataset to a file
dplyr::storms %>% kc$export(data = ., filepath = storms_memory_tmp)
kc$import(storms_memory_tmp) %>% tibble::as_tibble()

#> export an Elasticsearch index to a file
"storms" %>% kc$export(data = ., filepath = storms_elastic_tmp)
kc$import(storms_elastic_tmp) %>% tibble::as_tibble()

This method can also automatically use zip by adding the file extension.

#> file with zip extension
storms_memory_zip <- tempfile(fileext=".csv.zip")
#> export it
dplyr::storms %>% kc$export(storms_memory_zip)

Note: kibior uses rio [@ref_rio], which can export many more formats. See the rio documentation and the rio::install_formats() function.

`$import()`: Get a file content to a new Elasticsearch index

The $import() method can load a dataset from a file into an in-memory variable, a new Elasticsearch index or both.

#> import data from file
kc$import(filepath = storms_memory_tmp)

#> import data from file and send it to a new 
#> Elasticsearch index, with default configuration
kc$import(filepath = storms_memory_tmp, 
        push_index = "storms_file",
        push_mode = "recreate")
kc$list()

As $export(), it can also read directly from zipped files.

#> import data directly from a zipped file
kc$import(storms_memory_zip)

Note: kibior uses rio [@ref_rio], which can import many more formats. See the rio documentation and the rio::install_formats() function.

The $import() method can natively manage sequence, alignment and feature formats (e.g. fasta, bam, gtf, gff, bed, etc.) since it also wraps Bioconductor library methods such as rtracklayer::import() [@ref_rtracklayer], Biostrings::read*StringSet() [@ref_biostrings] and Rsamtools::scanBam() [@ref_rsamtools].

Dedicated methods are implemented inside kibior (e.g. $import_features() and $import_alignments()), and the generic $import() method tries to open the right format according to the file extension. You can also use the specific methods if the format cannot be guessed by the generic $import() method: $import_sequences(), $import_alignments(), $import_features(), $import_tabular() and $import_json().

`$move()`: Rename an index

The $move() method renames an index. The $copy() method is equivalent to $move(copy = TRUE).

#> move an existing dataset to another index
kc$list()
m <- kc$move("storms_file", "storms_file_moved")
kc$list()

`$copy()`: Copy an index

The $copy() method copies an index to another name. It is a wrapper around $move(copy = TRUE).

#> copy index
m <- kc$copy("storms_file_moved", "storms_file")
kc$list()

`$delete()`: Delete an Elasticsearch index

The $delete() method deletes one or more indices.

#> delete one or multiple indices
c("storms_file", "storms_file_moved") %>% kc$delete()

It can also delete following a pattern.

#> push some subsets with the same prefix
push_storm <- function(storm_name, index_name){
    dplyr::storms %>% 
        filter(name == storm_name) %>% 
        kc$push(index_name)
}
push_storm("Amy", "storms_amy")
push_storm("Doris", "storms_doris")
push_storm("Bess", "storms_bess")

#> list
kc$list()

#> delete following a pattern
kc$delete("storms_*")
kc$list()

`$search()`: Search everything

Elasticsearch is here... You know, For search. As a search engine, it is its main feature.

Using the $search() method, you can search for anything inside part or all of the data indexed by Elasticsearch. If the query parameter contains no restriction, all data will be searched, that is, every index, every column and every keyword.

#> here, we search the exact string "something" everywhere
#> but will find nothing
kc$search(query = "something")

#> we search for the exact string "anita" in "storms" dataset
kc$search("storms", query = "anita")[["storms"]]

#> we search for text containing the substring "am" in "storms" dataset
s <- kc$pull("storms", query = "*am*")[["storms"]]
#> we get 4 storms names
s %>% select(name) %>% unique()

By default, $search() has head mode active, which will return a small subset (default is 5) of the actual complete result to allow quick inspection of data. With $verbose <- TRUE, it will be printed in the result as "Head mode: on". To change the head size, modify the $head_search_size attribute.

To get the full result, you have to use $search(head = FALSE) or, more simply, $pull().

See Querying in Advanced usage section for more information.

`$stats()`: base statistics of columns

Alongside data handling methods, kibior offers descriptive statistical methods. You already know $count(), but here are some others.

The $stats() method is a shortcut to ask for: count, min, max, avg, sum, sum_of_squares, variance, std_deviation, std_deviation_upper (bound), std_deviation_lower (bound).

#> multi-indices, index pattern and multicolumns
kc$stats(c("starwars", "s*"), c("height", "mass"))

#> work also with query and sigma for standard deviation
kc$stats("starwars", c("height", "mass"), sigma = 2.5, query = "homeworld:naboo")

Some important warnings here:

Counts are approximate

Standard Deviation and Bounds require normality

In addition to $count() and $stats(), many other methods exist to perform descriptive analysis: avg, mean, min, max, sum, q1, q2, median, q3 and summary.

`$describe_index()` and `$describe_columns()`: get the description of index and columns

You can ask for description of datasets with these methods.

Important: this feature requires the user that pushed the data to manually add the metadata with $add_description().

Advanced usage {#advanced-usage}

Pattern search {#pattern-search}

Some methods, such as $search() and $pull(), accept the "*" wildcard.

#> consider these two datasets
dplyr::starwars %>% kc$push("starwars", mode = "recreate")
dplyr::storms %>% kc$push("storms", mode = "recreate")

#> We want to search all indices startings with an "s" 
#> We search for words in the "name" field that start with a "d"
#> Both "starwars" and "storms" indices have a "name" field
s <- kc$search("s*", query = "name:d*", head = FALSE)
s %>% names()
s$starwars
s$storms

Attributes access {#attribute-access}

kibior instances are objects: their attributes can be accessed and, for some of them, updated.

| Attribute name | Read-only | Default | Description |
|-------------------------------|-----------|-----------------|-------------------------------------------------------------------------------------------|
| $host | | "localhost" | the Elasticsearch host |
| $port | | 9200 | the Elasticsearch port |
| $user | x | NULL | the Elasticsearch user |
| $pwd | x | NULL | the Elasticsearch password |
| $connection | x | NULL | the Elasticsearch connection object |
| $head_search_size | | 5 | the head size default value |
| $cluster_name | x | When connected | the cluster name if and only if already connected |
| $cluster_status | x | When connected | the cluster status if and only if already connected |
| $nb_documents | x | When connected | the current cluster total number of documents if already connected |
| $version | x | When connected | the Elasticsearch version if and only if already connected |
| $elastic_wait | | 2 | the Elasticsearch wait time for update commands if already connected (in seconds) |
| $valid_joins | x | A vector | the valid joins available in kibior |
| $valid_count_types | x | A vector | the valid count types available (mainly observations = rows, variables = columns) |
| $valid_elastic_metadata_types | x | A vector | the valid Elasticsearch metadata types available |
| $valid_push_modes | x | A vector | the valid push modes available |
| $shard_number | | 1 | the number of allocated primary shards when creating an Elasticsearch index |
| $shard_replicas_number | | 1 | the number of allocated replicas in an Elasticsearch index |
| $default_id_col | | "kid" | the ID column name used when sending data to Elasticsearch if not provided by user |
| $verbose | | FALSE | the verbose mode |
| $quiet_progress | | FALSE | the progress bar printing mode |
| $quiet_results | | FALSE | the method results printing mode |

#> access the current host for the "kc" instance
kc$host
#> modify the head_search threshold
kc$head_search_size <- 10L

Some attributes cannot be modified.

#> error when trying to modify read-only attributes
kc$user <- "nope"

Organizing data for searches

Working alone directly on a massive cluster of servers is an unlikely situation. Moreover, handling large datasets on your own computer or storing all data in your local Elasticsearch repository is generally a bad idea. We tend to handle only what we can afford to, and organize pipelines and software accordingly.

There are multiple strategies to organize data, and our main objective here is to use servers for what they have been built for: doing the CPU- and memory-greedy jobs. In comparison, our personal computers or laptops are then spared the heavy processing load. Putting kibior in this equation helps further, as it is backed by a database and search engine.

As a rule of thumb, subsetting and querying is a good strategy, e.g. splitting on categorical variables.

#> push storms dataset
dplyr::storms %>% 
    kc$push("storms", mode = "recreate")

#> select the first 6 storm names (default of `head()`) and push them
#> in different indices, each name prefixed with "storms_"
dplyr::storms %>% 
    split(dplyr::storms$name) %>% 
    head() %>% 
    purrr::imap(function(data, index_name){ 
        index_name %>% 
            tolower() %>% 
            paste0("storms_", .) %>%
            kc$push(data, .) 
    })
kc$list()

We can then search in all indices whose names start with the prefix "storms_".

#> Within them, we search some minimum winds and pressure
#> results come already filtered by storm names
kc$search("storms_*", 
        query = "wind:>25 && pressure:>30", 
        columns = c("name", "year", "month", "lat", "long", "status"), 
        head = FALSE)

As shown above, we did not push all the data but only some subsets of interest. By selecting and pushing what we need, datasets can be searched and shared immediately after.

If you work in sync with multiple remote collaborators on the same Elasticsearch cluster, that can be a great strategy. For instance, one of your collaborators can add a new dataset that will not change the request, but will enrich the result.

#> added from remote kibior instance 
#> using `tail()` to simulate other data
dplyr::storms %>% 
    split(dplyr::storms$name) %>% 
    tail(2) %>% 
    purrr::imap(function(data, index_name){ 
        index_name %>% 
            tolower() %>% 
            paste0("storms_", .) %>%
            kc$push(data, .) 
    })

We can apply the same request and find some new results.

#> search all, same request as before
s <- kc$search("storms_*", 
            query = "wind:>25 && pressure:>30", 
            columns = c("name", "year", "month", "lat", "long", "status"), 
            head = FALSE)
#> assemble results if needed
do.call(rbind, s)
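
When assembling results, it can be convenient to keep track of which index each row came from. Below is a minimal base R sketch, assuming the result is a named list of data frames sharing the same columns. The list res is a made-up stand-in for a search result, not actual output.

```r
# Made-up stand-in for a multi-index result: a named list of data frames.
res <- list(
    storms_a = data.frame(name = "A", year = 2000L),
    storms_b = data.frame(name = "B", year = 2001L)
)

# Add the index name as a first column, then bind all tables together.
tagged <- Map(
    function(d, index_name) cbind(index = index_name, d),
    res, names(res)
)
combined <- do.call(rbind, unname(tagged))
combined
```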

Querying {#querying}

One of the main features of kibior is to be able to search inside vast amounts of data thanks to Elasticsearch. You can use the search feature with the eponymous $search() method, but also with $pull(), by using the query parameter.

Querying notation

To query specific data, the query parameter of methods such as $count() or $search() requires one string following the Elasticsearch Query String Syntax.

To sum them up, you can search for:

terms,

kc$search("starwars", query = "orange")$starwars

or phrases, with double-quotes.

kc$search("starwars", query = '"Luke Skywalker"')$starwars

To complement, you can apply multiple operators:

boolean operators:
AND (or "&&", double-ampersand),
OR (or "||", double-pipe),
NOT (or "!", exclamation point),
+ (plus) the term MUST be present,
- (minus) the term MUST NOT be present.
grouping: organize boolean operators, ex: "(quick OR brown) AND fox".
field selecting: target a specific column.
Phrases can be searched.

#> rows that have "name" == "Luke Skywalker" 
kc$search("starwars", query = 'name:"Luke Skywalker"')$starwars

Boolean operators can be used.

#> rows that have blue or green eyes
kc$search("starwars", query = 'eye_color:(blue OR green)')$starwars

range notation: using [min TO max] for inclusive or {min TO max} for exclusive.
Can be used as a simple search expression when one side is unbounded:
- n:>=10 is equivalent to n:[10 TO *].
- n:<=10 is equivalent to n:[* TO 10].
- n:>10 is equivalent to n:{10 TO *}.
- n:<10 is equivalent to n:{* TO 10}.
Inclusive threshold.

#> include 160 and 180 values
kc$search("starwars", query = "height:[160 TO 180]")$starwars

Exclusive threshold.

#> exclude 160 and 180 values
kc$search("starwars", query = "height:{160 TO 180}")$starwars

Mixing inclusive and exclusive.

#> exclude 160 but include 180
kc$search("starwars", query = "height:{160 TO 180]")$starwars
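
The bracket conventions above are easy to mistype, so a small helper can assemble the range expression for you. This is a minimal sketch; es_range() is a hypothetical helper and not part of kibior.

```r
# Hypothetical helper (not part of kibior): build a range expression
# using "[" / "]" for inclusive bounds and "{" / "}" for exclusive ones.
# An omitted bound defaults to "*" (unbounded).
es_range <- function(field, min = "*", max = "*",
                     include_min = TRUE, include_max = TRUE) {
    left  <- if (include_min) "[" else "{"
    right <- if (include_max) "]" else "}"
    paste0(field, ":", left, min, " TO ", max, right)
}

es_range("height", 160, 180)                       # "height:[160 TO 180]"
es_range("height", 160, 180, include_min = FALSE)  # "height:{160 TO 180]"
es_range("height", min = 160)                      # "height:[160 TO *]"
```

The resulting string can then be passed as the query parameter of $search() or $pull().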

fuzziness and proximity: use "~" at the end of a term for approximate search.
Default fuzzy factor is 2, meaning "quikc~" and "quikc~2" are identical.
It can be applied to phrases, ex: ""fox quick"~5".

#> fuzzy search for blue/black/brown/... eyes
#> useful when we do not know exactly the content
kc$search("starwars", query = "eye_color:bla~3")$starwars
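
To get an intuition for what the fuzzy factor measures, base R's utils::adist() computes an edit distance between strings locally. This is only an approximation for illustration: adist() uses a generalized Levenshtein distance, whereas Elasticsearch's fuzzy matching relies on the Damerau-Levenshtein distance, which also counts the swap of two adjacent characters as a single edit, so the two can disagree.

```r
# Number of single-character edits (insertion, deletion, substitution)
# needed to turn the misspelled term into the intended one.
d <- drop(utils::adist("quikc", "quick"))
d
```

Here the two swapped letters are counted as two substitutions.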

boosting: use "^" to weight some expressions over others.
Value:
- 0 to 1: decrease boosting.
- Greater than 1: increase boosting.
Boost type:
- terms, ex: quick^2 fox, quick is boosted.
- phrases, ex: "foo bar"^2.
- groups, ex: (foo bar)^4.

#> boost the black eye search but get the blue too
kc$search("starwars", query = "eye_color:(black^2 OR blue)")$starwars

We can now easily build a more complex search query:

#> consider this dataset
ggplot2::diamonds %>% kc$push("diamonds")

#> searching premium or ideal quality of diamonds, 
#> with a price inferior to 10k$, a carat superior to 1.4,
#> a z between 2.2 and 5.4 included, and not colors E or H. 
#> we only want some columns.
kc$search("diamonds", 
        query = "cut:(premium || ideal) 
            && price:<10000 
            && carat:>1.4 
            && z:[2.2 TO 5.4] 
            && -color:(E || H)", 
        columns = c("carat", "color", "depth", "clarity", "price", "z"), 
        head = FALSE)

`$search()` behavior

#> consider this dataset
dplyr::storms %>% kc$push("storms", mode = "recreate")
dplyr::starwars %>% kc$push("starwars", mode = "recreate")

Though Elasticsearch is very powerful as a document-oriented database, it is first and foremost a full-text search engine: without wildcards, a search term is matched against whole words.

#> searching for exact word "dar" but nothing found
kc$search(query = "dar")

With wildcards, keeping only the results from the starwars index:

#> The search is case-insensitive meaning: 
#> Dar == dAr == daR == DAr == ...etc.
kc$search(query = "*Dar*")$starwars

Column selection:

#> searching every word in name that starts with "d"
s <- kc$search("*", 
            query = "name:d*", 
            columns = c("name", "status"))
s %>% names()

#> Empty
s$diamonds 

#> some names, but no status field found
s$starwars

#> complete columns
s$storms

As you can see on the last request, some columns did not match, thus were not returned.

Now a more complex search, directly done by pulling data:

#> We can search premium or ideal quality of diamonds, 
#> with a price inferior to 10k$, a carat superior to 1.4,
#> a z between 2.2 and 5.4 included, not colors E or H,
#> and not from a clarity starting with the string "VS"
#> we only want some columns.
kc$pull("diamonds", 
        query = "cut:(premium || ideal) 
            && price:<10000 
            && carat:>1.4 
            && z:[2.2 TO 5.4] 
            && -color:(E || H)
            && -clarity:VS*", 
        columns = c("carat", "color", "depth", "clarity", "price", "z"))

This was executed on a small dataset of 54k observations and 10 variables. We will try it on a bigger one in the biological example vignette.

`text` and `keyword` querying {#text-querying}

Lastly, we need to see the difference between a keyword and a text field.

Elasticsearch can index text values as two different types: text and keyword. The difference between those two is that:

text columns such as "name" or "skin_color" are broken up into words during indexing, allowing searches on one or more words,

#> search all documents that have at least 
#> one word in the "name" column starting with "L"
kc$pull("starwars", 
        query = "name:L*", 
        columns = "name")$starwars

keyword columns (always added when pushing data with kibior) keep the full text as one string.

#> search all documents whose "name"
#> field starts with "L"
kc$pull("starwars", 
        query = "name.keyword:L*", 
        columns = "name")$starwars

kibior indexes all text values as text AND keyword, so we can use whole-text search (with .keyword tag) AND word-specific (without .keyword tag).

Doing a search for a word starting with a specific prefix in pure R is a bit more annoying:

dplyr::starwars[["name"]] %>%                    #> take the name column data
    lapply(function(x){                          #> for each name
        stringr::str_split(x, " ") %>%           #> split name by space
        unlist(use.names = FALSE) %>%            #> align
        grepl("^L", ., ignore.case = TRUE) %>%   #> search pattern for words starting with "L", ignore case to search also for "^l"
        any()                                    #> TRUE if at least one word match
    }) %>%                                       #> list of logicals
    unlist(use.names = FALSE) %>%                #> flatten it to logical vector to match starwars observations number
    dplyr::starwars[.,] %>%                      #> apply logical filter only on lines that were found
    dplyr::select(name)                          #> select only "name" var

Reserved Elasticsearch characters

Elasticsearch has some reserved characters: + - = && || > < ! ( ) { } [ ] ^ " ~ * ? : \ /

You should remove them from your data before pushing it into Elasticsearch. If that is not possible, or if you retrieve data from someone else that contains reserved characters, you should try to query with a keyword field.
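
When a search term itself contains reserved characters, the Elasticsearch query string syntax also allows escaping them with a leading backslash, except for < and >, which cannot be escaped and have to be removed. The following is a minimal sketch of that idea; escape_es_query() is a hypothetical helper and not part of kibior.

```r
# Hypothetical helper (not part of kibior): make a search term safe to
# embed in a query string.
escape_es_query <- function(x) {
    # "<" and ">" cannot be escaped in the query string syntax: drop them.
    x <- gsub("[<>]", "", x)
    # Prefix every other reserved character with a backslash.
    gsub('([-+=&|!(){}\\[\\]^"~*?:\\\\/])', "\\\\\\1", x, perl = TRUE)
}

escape_es_query("C-3PO")
escape_es_query("name:(a+b)")
```

Note that R prints each backslash doubled, so use cat() to see the string exactly as it would be sent.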

`$push()` details

Define a unique IDs column

When pushing data with default parameters, kibior will define unique IDs for each record (each line of a table) and add them as metadata. You can retrieve them by using $pull(keep_metadata = TRUE).

#> With the storms index
kc$pull("storms", keep_metadata = TRUE)$storms

Metadata columns are mainly prefixed by an underscore. The actual record is embedded into the _source field. Since data have been pushed without specifying an ID column, the _id field that defines Elasticsearch unique IDs reflects the one automatically added by kibior in the data (kid by default). To change the default ID column added by kibior, change the $default_id_col attribute value.

Letting kibior assign IDs guarantees uniqueness, but the generated IDs might not be the most meaningful or practical when updating data.

To change that behavior, you can define your own ID field when calling $push() data by using the id_col parameter.

#> Again, pushing storms, but with our own IDs, for instance, 
#> by adding "aaa" at the beginning of each row number and using it as ID.
data <- storms
ids <- seq_len(nrow(data)) %>% 
  paste("aaa", ., sep="")
data <- cbind(a_new_unique_id = ids, data)
#> the column "a_new_unique_id" will be used as our unique ID
s <- kc$push(data, "storm_with_our_id", id_col = "a_new_unique_id")
#> and see 
s <- kc$pull("storm_with_our_id", 
             columns = "a_new_unique_id",
             keep_metadata = TRUE)$storm_with_our_id
s %>% select(c("_id", "_source.a_new_unique_id"))

Caution here: the columns parameter does not apply to metadata.

#> columns match nothing except actual pushed data columns
kc$pull("storms", keep_metadata = TRUE, columns = c("_id", "_version"))$storms

Push modes {#push-modes}

When pushing data, if the index you are using in $push() already exists, an error will be thrown. This is due to the mode = "check" parameter, which checks whether an index with the name you gave already exists. This is the default option, but it can be changed to "recreate" or "update":

"recreate" will erase the index and write to a fresh one with the same name. Be cautious with this option as you will erase previously written data from that index name.

#> recreate one index, whether it already exists or not
dplyr::starwars %>% kc$push("starwars", mode = "recreate")

"update" will push and update indexed data with corresponding IDs. For this option, you must know which field is the unique ID and send updated documents keyed on it. You do not need to send all data again, just the subset that changed. Sending all data again might be error-prone and can take a lot of time if your dataset is big. Knowing which field is the unique ID also helps a lot and prevents errors.

#> we will change the height of orange-eyed inhabitants of "Naboo"
#> homeworld to 300 and update that subset to the main one.
s <- kc$pull("starwars", 
             query = "eye_color:orange && homeworld:naboo")$starwars
s

#> change the height of those selected to 300
s$height <- 300
s

#> and update the main dataset. Since it is a subset of that dataset, 
#> IDs are the same, which is default "kid" column.
ns <- kc$push(s, "starwars", 
              mode = "update", 
              id_col = "kid")
#> see the result
ns <- kc$pull("starwars", 
              query = "eye_color:orange && homeworld:naboo")$starwars
ns
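To make the semantics of an ID-based update concrete, here is an in-memory analogy in base R. It is only a sketch of the idea (rows of a subset overwrite the rows of the main table that share the same ID), not kibior's implementation, and the small `main` table is invented for illustration.

```r
#> illustrative data only: a tiny table with a "kid"-like ID column
main <- data.frame(
    kid    = c("1", "2", "3"),
    name   = c("a", "b", "c"),
    height = c(100, 150, 200),
    stringsAsFactors = FALSE
)

#> take a subset and modify it, as done with the pulled data above
updated <- main[main$height > 120, ]
updated$height <- 300

#> locate the matching IDs in the main table and overwrite only those rows
rows <- match(updated$kid, main$kid)
main[rows, names(updated)] <- updated
main$height
#> [1] 100 300 300
```

Rows whose ID is absent from the subset are left untouched, which is why sending only the modified subset is enough.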

Comparison with `dplyr` functions

The dplyr package offers simple and effective functions called filter and select to quickly reduce the scope of interest. In the same fashion, kibior uses the Elasticsearch query string syntax, which is very similar to the dplyr syntax (see Querying section). Elasticsearch multiplies the search possibilities by allowing similar usage on multiple indices, or datasets, on multiple remote servers.

Moreover, using $count(), $search() or $pull(), one can use their analogous features:

dplyr::select() with columns parameter,
and dplyr::filter() with query parameter.

Using both of them results in much more powerful search capabilities and much more readable code.

The following sections give some examples of analogous requests.

Similarities

Select some columns:

#> Tidyverse 
s <- dplyr::starwars %>% 
        dplyr::select(name, height, homeworld)

#> kibior
s <- kc$pull("starwars", 
             columns = c("name", "height", "homeworld"))

Filter on strict thresholds:

#> Tidyverse 
s <- dplyr::starwars %>% 
        dplyr::filter(height > 180)

#> kibior
s <- kc$pull("starwars", 
             query = "height:>180")

Filter on soft thresholds:

#> Tidyverse 
s <- dplyr::starwars %>% 
        dplyr::filter(height >= 180)

#> kibior
s <- kc$pull("starwars", 
             query = "height:>=180")
#> or with range notation
s <- kc$pull("starwars", 
             query = "height:[180 TO *]")

Filter on ranges:

#> Tidyverse (use the vectorized `&`, not `&&`, inside filter)
s <- dplyr::starwars %>% 
        dplyr::filter(height >= 180 & height < 300)

#> kibior
s <- kc$pull("starwars", 
             query = "height:[180 TO 300}")
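In the Elasticsearch range notation, a square bracket includes the bound and a curly bracket excludes it, so `[180 TO 300}` means 180 <= height < 300. As a small sketch, the hypothetical helper `es_range()` below (not part of kibior) builds such strings so that the inclusive or exclusive choice is explicit in the R code.

```r
#> hypothetical helper (not a kibior function): build a range query string
#> "[" / "]" include the bound, "{" / "}" exclude it, "*" means unbounded
es_range <- function(field, lower = "*", upper = "*",
                     include_lower = TRUE, include_upper = TRUE) {
    open  <- if (include_lower) "[" else "{"
    close <- if (include_upper) "]" else "}"
    paste0(field, ":", open, lower, " TO ", upper, close)
}

es_range("height", 180, 300, include_upper = FALSE)
#> [1] "height:[180 TO 300}"
es_range("height", 180)
#> [1] "height:[180 TO *]"
```

The resulting string can be passed as the query parameter, for instance kc$pull("starwars", query = es_range("height", 180, 300, include_upper = FALSE)).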

Filter on exact string match for one field:

#> Tidyverse 
s <- dplyr::starwars %>% 
        dplyr::filter(homeworld == "Naboo")

#> kibior
s <- kc$pull("starwars", 
             query = "homeworld:Naboo")

Filter on exact string match with multiple choices on one field:

#> Tidyverse (use the vectorized `|`, not `||`, inside filter)
s <- dplyr::starwars %>% 
        dplyr::filter(homeworld == "Naboo" | homeworld == "Tatooine")
#> or
s <- dplyr::starwars %>% 
        dplyr::filter(homeworld %in% c("Naboo", "Tatooine"))

#> kibior (several ways to do it)
s <- kc$pull("starwars", 
             query = "homeworld:(Naboo || Tatooine)")

Filter on partial string matching:

#> Tidyverse, we have to use `str_detect`
s <- dplyr::starwars %>% 
        dplyr::filter(stringr::str_detect(name, "Luk|Dar"))

#> kibior, nothing else required
s <- kc$pull("starwars", 
             query = "name:(*Luk* || *Dar*)")

Filter over a composition of multiple filters (multiple columns):

#> Tidyverse (use the vectorized `&`, not `&&`, inside filter)
s <- dplyr::starwars %>% 
        dplyr::filter(homeworld == "Naboo" & height > 180)

#> kibior
s <- kc$pull("starwars", 
             query = "homeworld:Naboo && height:>180")

Differences

Even if there are lots of similarities regarding the syntax, Elasticsearch is a powerful search engine, so requests on billions of records are less expensive to run with it. Elasticsearch is also accessible through its API, so numerous people can access it at the same time. This means you can work synchronously with a collaborator who pushes data that you can use immediately afterwards. Moreover, using wildcards, we can search multiple indices at once.

What we can do very easily with Elasticsearch is search everywhere: in every index, in every column, and in every word. Lastly, full-text searches are the big deal. See Text and Keyword querying for more details.

Change tibble column type

kibior will return base types in tibble structures (integer, character, logical, and list) for representing data. If you want to change some columns, use readr::type_convert() after retrieving the dataset.

#> changing the "status" column from string to factor
kc$pull("storms")$storms %>%
    readr::type_convert(
        col_types = readr::cols(
            status = readr::col_factor()))

Compare two instances

If you manage multiple instances, you can easily compare their host:port pairs with the == and != operators.

#> is kc instance equal to kc_two instance?
(kc == kc_two)
#> are kc and kc_two instances different?
(kc != kc_two)

Attach one instance to global environment

If you use only one instance of kibior, you might want to attach it to the global environment. This removes the need to prefix each method call with the instance (in our examples: kc$...).

Though this can be practical in local development with a single instance, we strongly discourage that practice if you intend to share your code. It can cause incorrect behavior during execution in environments with different configurations or multiple instances.

Joins

kibior integrates the dplyr joins: full, left, right, inner, anti, and semi joins.

By using kibior joins, you can apply these joins to in-memory datasets and Elasticsearch-based indices. kibior supports the query parameter when joining, which reduces data retrieval time, but it cannot join on list columns.

#> pushing a subset of data
dplyr::starwars %>% 
    dplyr::filter(homeworld == "Naboo") %>%
    kc$push("starwars_naboo", mode = "recreate")

kc$pull("starwars_naboo")

#> perform an inner join between the in-memory full dataset
#> and the remote subset we have just sent
columns <- c("name", "height", "mass", "gender", "homeworld")
kc$inner_join(dplyr::starwars, "starwars_naboo",
            left_columns = columns,
            right_columns = columns,
            by = c("name", "height", "mass"))

As you can see, kibior adds the suffixes left and right to the data columns.

Moving and copying data from another instance

Apart from moving and copying indices within the same cluster of Elasticsearch instances, the $move() and $copy() methods can do the same with REMOTE instances. The remote Elasticsearch endpoint has to be declared inside your elasticsearch.yml configuration file.

Adding one line to the elasticsearch.yml configuration file to define a server whitelist lets Elasticsearch servers talk to each other. They can then transfer data between them in a much faster and more secure way.

#> config/elasticsearch.yml

...
reindex.remote.whitelist: "otherhost:9200, another:9200, 127.0.10.*:9200, localhost:*"
...

A full description can be found in the Elasticsearch documentation.

After that, kibior will be able to use the from_instance parameter of $move() and $copy().

#> init two ES binding
#> kc_local must be configured
#> we make the assumption that both kc are accessible
kc_local <- kibior$new("es_local")
kc_remote <- kibior$new("es_remote", port = 9205)

#> copy data from kc_remote to kc_local
kc_local$copy(from_index = "remote_index", 
              to_index = "new_copy_of_remote_index_in_local",
              from_instance = kc_remote)

This method allows massive data copying in a much faster way since all data are structured the same.

Known limits

As with any implementation, there are some limits:

Elasticsearch cannot store uppercase field names, thus all column names are forced to lowercase by default when submitted.
Elasticsearch interprets dots in field names as nested values (e.g. "aaa.bbb" is understood as a field "aaa" containing a field "bbb"), which is error-prone with R since variables can be named with dots. To avoid errors when pushing data to Elasticsearch, dots in column names are replaced by underscores.

#> iris column names
datasets::iris %>% names()

#> example with iris dataset
datasets::iris %>% kc$push("iris")

#> get columns of index iris
kc$columns("iris")
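The two renaming rules above (lowercase, then dots to underscores) can be approximated locally to preview the column names before pushing. This is a sketch under that assumption only: `normalize_names()` is a hypothetical helper, not a kibior function, and kibior may apply further changes.

```r
#> hypothetical helper (not a kibior function): lowercase the names,
#> then replace every dot with an underscore
normalize_names <- function(x) {
    gsub(".", "_", tolower(x), fixed = TRUE)
}

normalize_names(names(datasets::iris))
#> [1] "sepal_length" "sepal_width"  "petal_length" "petal_width"  "species"
```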

Elasticsearch limits an index to 1000 columns by default, so pushing a dataset with more than 1000 variables will generate an error. Two solutions: transpose the dataset, or raise the limit in the Elasticsearch configuration.
Elasticsearch handles each document (each line of a table) with a unique ID: a specific "_id" metadata field. What can be confusing here is that metadata are not on the same level as data in Elasticsearch. To be able to update data more easily by accurately targeting document IDs, we add a new unique field (kid by default) when pushing data to Elasticsearch and define it as the unique "_id" field. If you know that one of your columns is unique and can be used as an ID column, you can use the id_col parameter of the $push() method to define this column as the main ID.
The columns parameter does not handle metadata columns.
Elasticsearch is really great for textual and keyword search, but the text needs common delimiters so that it can be split into words. Passing a single, uninterrupted, billions-of-characters-long biomolecular sequence is not a good fit for Elasticsearch and may result in an indexing failure.
$move() and $copy() for remote instances are very sensitive to authentication and security configurations. Some tasks will not be possible due to each organization's security measures. Check with your system administrator.
Joins are not executed server-side (on ES), which means the Elasticsearch data must be downloaded before executing the actual join. Querying and selecting columns with the join parameters left_columns, right_columns, left_query and right_query is important to lower the data transfer payload and speed up execution.
Elasticsearch limits returned results to 10,000 elements per bulk. If you try to set bulk_size > 10000, kibior will downsize it to match the maximum allowed.
The query parameter is a powerful string-based mechanism. Users need to understand that the query parameter sends the whole query to an Elasticsearch instance in one request. If the request is generated from a list of elements such as c("id1", "id2", "id3", ...) %>% paste0(collapse = " || ") %>% kc$search("*", query = .), it can produce a very long string that cannot be entirely passed down to Elasticsearch. One way to counter this issue is to split the element vector into subsets and make multiple calls.
kibior applies some modifications to datasets before sending them to Elasticsearch: it turns all dataset names to lowercase, converts dotted names to underscore-based names, adds the kid column, etc. All these transformations can affect the behavior of the $*_join() methods.
The $keys() method limits by default the number of unique keys found to 1000, since it aggregates a potentially unlimited number of keys, which can happen when calling it on integer or floating-point values. If you want more, change the max_size method parameter.
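The splitting strategy suggested above for very long query strings can be sketched in base R. The helper name `build_or_queries()` and the default chunk size of 100 are assumptions made for illustration; pick a size that suits your Elasticsearch setup.

```r
#> hypothetical helper (not a kibior function): split a vector of elements
#> into chunks and build one " || " query string per chunk
build_or_queries <- function(ids, chunk_size = 100) {
    groups <- split(ids, ceiling(seq_along(ids) / chunk_size))
    vapply(groups, paste0, character(1), collapse = " || ", USE.NAMES = FALSE)
}

queries <- build_or_queries(c("id1", "id2", "id3"), chunk_size = 2)
queries
#> [1] "id1 || id2" "id3"

#> each string can then be sent in its own request (needs a live instance):
#> results <- lapply(queries, function(q) kc$search("*", query = q))
```

Each call returns its own result, so the partial results have to be combined afterwards.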

Tested with

kibior has been tested with these configurations:

| Software      | Version                      |
|---------------|------------------------------|
| Elasticsearch | 6.8, 7.5, 7.8, 7.9           |
| R             | 3.6.1, 4.0.2, 4.0.3          |
| RStudio       | 1.2.5001, build 93, 7b3fe265 |

This vignette has been built using the following session:

Session info

```r
sessionInfo()
```

#> preparing data for the vignette

#> quiet progress bar
kc$quiet_progress <- TRUE
kc$verbose <- FALSE

#> hard remove all indices
kc$list() %>% kc$delete()


regisoc/kibior documentation built on Aug. 15, 2021, 9:51 p.m.

