Table Schema is a simple format to describe tabular data, including field names, types, constraints, missing values, foreign keys, etc.
::: {.callout-info}
In this document we use the terms "package" for Data Package, "resource" for Data Resource, "dialect" for Table Dialect, and "schema" for Table Schema.
:::
Frictionless supports `schema$fields` and `schema$missingValues` to parse data types and missing values when reading Tabular Data Resources. Schema manipulation is limited to extracting a schema from a resource, creating one from a data frame, and providing one back to a resource. Schema metadata is included when writing a package.
`get_schema()` extracts the schema from a resource:

    library(frictionless)
    package <- example_package()

    # Get the Table Schema for the resource "observations"
    schema <- get_schema(package, "observations")
    str(schema)

`read_resource()` uses `schema$fields` to parse the names and data types of the columns in a tabular data file. For example, the third field in the schema (`timestamp`) is defined as a `datetime` type with a specific `format`:

    str(schema$fields[[3]])

`read_resource()` uses that information to correctly parse the data type and to assign the name `timestamp` to the column:

    observations <- read_resource(package, "observations")
    observations$timestamp

The sixth field `life_stage` has an `enum` defined as one of its `constraints`:

    str(schema$fields[[6]])

`read_resource()` uses that information to parse the column as a factor, using `enum` as the factor levels:

    class(observations$life_stage)
    levels(observations$life_stage)

A schema is a list which you can manipulate, but frictionless does not provide functions to do that. Use `{purrr}` or base R instead (see `vignette("frictionless")`). You do not have to start a schema from scratch though: use `get_schema()` (see above) or `create_schema()` instead.
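
As a minimal sketch of such a manipulation in base R, the example below builds a small schema by hand and adds a `description` to one of its fields. The field names `id` and `life_stage` and the description text are hypothetical, chosen only for illustration; the sketch relies solely on a schema being a plain list.

```r
# A hand-built schema with two hypothetical fields (illustration only)
schema <- list(
  fields = list(
    list(name = "id", type = "integer"),
    list(name = "life_stage", type = "string")
  )
)

# Find the position of the field named "life_stage"
field_names <- vapply(schema$fields, function(field) field$name, character(1))
i <- which(field_names == "life_stage")

# Add a description to that field
schema$fields[[i]]$description <- "Life stage of the observed individual."

str(schema$fields[[i]])
```

The same approach should work on a schema returned by `get_schema()` or `create_schema()`, since those are lists with the same structure.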
`create_schema()` creates a schema from a data frame and defines the `name`, `type` (and, for factors, `constraints$enum`) for each field:

    # Create a schema from the built-in dataset "iris"
    iris_schema <- create_schema(iris)
    str(iris_schema)

`add_resource()` lets you include a schema with a resource. If no schema is provided, one is created with `create_schema()`:

    package <- add_resource(
      package,
      resource_name = "iris",
      data = iris,
      schema = iris_schema
    )

`write_package()` writes a package to disk as a `datapackage.json` file. This file includes the metadata of all the resources, including the schema. To directly write a schema to disk, use `jsonlite::write_json()`.
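
A minimal sketch of writing a schema directly to disk with `jsonlite::write_json()`. The schema is built by hand with one hypothetical field, and the file name `schema.json` in a temporary directory is an arbitrary choice for illustration. Setting `auto_unbox = TRUE` is an assumption about the desired output: without it, jsonlite writes length-one vectors such as `"id"` as JSON arrays.

```r
library(jsonlite)

# A hand-built schema with one hypothetical field (illustration only)
schema <- list(
  fields = list(
    list(name = "id", type = "integer")
  )
)

# Write to a temporary file; auto_unbox keeps length-one vectors as JSON scalars
path <- file.path(tempdir(), "schema.json")
write_json(schema, path, pretty = TRUE, auto_unbox = TRUE)

# Read the file back as a nested list to check the round trip
schema_from_disk <- read_json(path)
str(schema_from_disk)
```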
`fields` is required. It is used by `read_resource()` to parse the names and data types of the columns in a tabular data file. `create_schema()` sets `fields` based on information in a data frame. See Field properties implementation for details.
`missingValues` is used by `read_resource()` and defaults to `""`. It is passed to `na` in `readr::read_delim()`. `create_schema()` does not set `missingValues`. `write_package()` converts `NA` values to `""` when writing a data frame to a CSV file. Since this is the default, no `missingValues` property is set.
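
To illustrate what passing `missingValues` to `na` amounts to, here is a small sketch that calls `readr::read_delim()` directly. The inline CSV and the markers `"NA"` and `"-"` are made up for this example; when reading a resource, `read_resource()` handles this step for you.

```r
library(readr)

# Hypothetical missingValues, as they could appear in a schema
missing_values <- c("", "NA", "-")

# Inline CSV data (illustration only), wrapped in I() so readr treats it as literal data
csv <- I("id,count\n1,5\n2,-\n3,NA\n")

# Values matching any of the markers are read as NA
data <- read_delim(csv, delim = ",", na = missing_values, show_col_types = FALSE)
is.na(data$count)
```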
`primaryKey` is ignored by `read_resource()` and not set by `create_schema()`.
`foreignKeys` is ignored by `read_resource()` and not set by `create_schema()`.
`name` is used by `read_resource()` to assign a column name. The vector of names is passed as `col_names` to `readr::read_delim()`, ignoring names provided in the header of the data file. `create_schema()` uses the data frame column name to set `name`.
`type` and (for some types) `format` are used by `read_resource()` to determine the column type. The vector of types is passed as `col_types` to `readr::read_delim()`, which warns if there are parsing issues (inspect with `problems()`). `create_schema()` uses the data frame column type to set `type`. See Field types implementation for details.
`read_resource()` interprets `type` as follows:
field type | column type
--- | ---
`string` | `character` or `factor`
`number` | `double` or `factor`
`integer` | `double` or `factor`
`boolean` | `logical`
`object` | `character`
`array` | `character`
`datetime` | `POSIXct`
`date` | `Date`
`time` | `hms::hms()`
`year` | `Date`
`yearmonth` | `Date`
`duration` | `character`
`geopoint` | `character`
`geojson` | `character`
`any` | `character`
other value | error
undefined | guessed
`create_schema()` sets `type` as follows:
column type | field type
--- | ---
`character` | `string`
`Date` | `date`
`difftime` | `number`
`factor` | `string` with factor levels as `constraints$enum`
`hms::hms()` | `time`
`integer` | `integer`
`logical` | `boolean`
`numeric` | `number`
`POSIXct`/`POSIXlt` | `datetime`
any other type | any
`create_schema()` does not set a `format`, since defaults are used for all types. This is also the case for datetimes, dates and times, since `readr::write_csv()` (used by `write_package()`) formats those as ISO 8601, which is considered the default.
`title` is ignored by `read_resource()` and not set by `create_schema()`.
`description` is ignored by `read_resource()` and not set by `create_schema()`.
`example` is ignored by `read_resource()` and not set by `create_schema()`.
`constraints` is ignored by `read_resource()` and not set by `create_schema()`, except for `constraints$enum`. `read_resource()` uses it to set the column type to `factor`, with `enum` values as factor levels. `create_schema()` does the reverse.
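
The relationship between `constraints$enum` and factor levels can be sketched in base R as follows. This is an illustration of the idea, not the package's internal code; the enum values and the field name `life_stage` are hypothetical.

```r
# Hypothetical enum values and raw column values (illustration only)
enum <- c("adult", "subadult")
values <- c("adult", "adult", "subadult")

# Reading: parse the column as a factor, with the enum as factor levels
life_stage <- factor(values, levels = enum)
levels(life_stage)

# Creating: recover the enum from the factor levels
field <- list(
  name = "life_stage",
  type = "string",
  constraints = list(enum = levels(life_stage))
)
str(field)
```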
`rdfType` is ignored by `read_resource()` and not set by `create_schema()`.
`string` is interpreted as `character`, or as `factor` when `constraints$enum` is defined.

- `format` is ignored.

`number` is interpreted as `double`, or as `factor` when `constraints$enum` is defined.

- `bareNumber` is supported. If `false`, whitespace and non-numeric characters are ignored.
- `decimalChar` (`.` by default) is supported, but as a single value for all number fields. If different values are defined, the most frequently occurring one is selected.
- `groupChar` (undefined by default) is supported, but as a single value for all number fields. If different values are defined, the most frequently occurring one is selected.

`integer` is interpreted as `double` (to avoid issues with big numbers), or as `factor` when `constraints$enum` is defined.

- `bareNumber` is supported. If `false`, whitespace and non-numeric characters are ignored.

`boolean` is interpreted as `logical`.

- `trueValues` that are not defaults are not supported.
- `falseValues` that are not defaults are not supported.

`object` is interpreted as `character`.

`array` is interpreted as `character`.

`datetime` is interpreted as `POSIXct`.

- `format` is supported for the values `default` (ISO datetime), `any` (ISO datetime) and the same patterns as for `date` and `time`. The value `%c` is not supported.

`date` is interpreted as `Date`.

- `format` is supported for the values `default` (ISO date), `any` (guess `ymd`) and Python/C strptime patterns, such as `%a, %d %B %Y` for `Sat, 23 November 2013`. `%x` is interpreted as `%m/%d/%y`. The values `%j`, `%U`, `%w` and `%W` are not supported.

`time` is interpreted as `hms::hms()`.

- `format` is supported for the values `default` (ISO time), `any` (guess `hms`) and Python/C strptime patterns, such as `%I%p%M:%S.%f%z` for `8AM30:00.300+0200`.

`year` is interpreted as `Date` with month and day set to `01`.

`yearmonth` is interpreted as `Date` with day set to `01`.
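
As a rough base R illustration of the resulting values for `year` and `yearmonth` fields (not necessarily how frictionless implements this internally), the raw values below are padded with `01` for the missing month and day. The input values are hypothetical.

```r
# Hypothetical raw values for a year field and a yearmonth field
year_values <- c("2001", "2013")
yearmonth_values <- c("2001-03", "2013-11")

# Pad with month and/or day 01 to obtain full dates
year_dates <- as.Date(paste0(year_values, "-01-01"))
yearmonth_dates <- as.Date(paste0(yearmonth_values, "-01"))

format(year_dates)
format(yearmonth_dates)
```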

`duration` is interpreted as `character`. You can parse these values with `lubridate::duration()`.

`geopoint` is interpreted as `character`.

`geojson` is interpreted as `character`.

`any` is interpreted as `character`.