iAtlas Data Model

The data used in the iAtlas application is tabular data currently stored in a PostgreSQL database. There is a script within the project that can automate getting data into the database. For this script to work effectively, some data conventions must be followed. This document outlines those conventions.

Please see the diagram at the bottom of this document for a visualization.

Enums

The database makes use of several enumerated types. These "enums" are created when the database is built and help ensure consistency in some data fields.

The available enums are:

( 'Amp', 'Del' )

This enum represents the direction of a node. Amp = Amplified, Del = Deleted

( 'Wt', 'Mut' )

This enum represents the status of a gene. Wt = Wild Type, Mut = Mutant

( 'Fraction', 'Count', 'Score', 'Per Megabase', 'Year' )

This enum represents the unit used for the value of the relationship between features.
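
As a rough illustration of how these three value sets could be declared, the sketch below builds PostgreSQL `CREATE TYPE ... AS ENUM` statements from Python. Only `status_enum` echoes a name used elsewhere in this document (STATUS_ENUM); `direction_enum`, `unit_enum`, and the helpers `create_enum_sql` and `is_valid` are hypothetical names chosen for the example and may not match the iAtlas build scripts.

```python
# Hypothetical enum names mapped to the values listed above.
ENUMS = {
    "direction_enum": ("Amp", "Del"),
    "status_enum": ("Wt", "Mut"),
    "unit_enum": ("Fraction", "Count", "Score", "Per Megabase", "Year"),
}


def create_enum_sql(name, values):
    """Return a PostgreSQL statement that declares an enum type."""
    # Double any single quote so each value is a valid SQL string literal.
    quoted = ", ".join("'" + v.replace("'", "''") + "'" for v in values)
    return f"CREATE TYPE {name} AS ENUM ({quoted});"


def is_valid(name, value):
    """Check a candidate value against an enum before loading it."""
    return value in ENUMS[name]


if __name__ == "__main__":
    for enum_name, enum_values in ENUMS.items():
        print(create_enum_sql(enum_name, enum_values))
```

Checking values with something like `is_valid` before loading would surface typos early, although the database enum itself remains the final guard.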

Main Data Tables

The data is organized into several "main" data tables. These tables hold data describing specific entities. The data in a specific table should never be aware of how it is being used in another table.

The main tables are:

Each row describes an individual gene. The genes are primarily identified by Entrez Id and secondarily by HGNC Symbol.

Each row describes an individual dataset.

Each row describes an individual patient.

Each row describes an individual feature.

Each row describes an individual tag.

Tags may be used to associate entities. For example, a number of individual samples may be associated with the "TCGA_Study" tag. Further, a number of tags may themselves be associated with a tag like "TCGA_Study". This allows the entities associated with the initial tags to ultimately be associated with "TCGA_Study" through those tags. Instead of thinking of "groups", we can think of tags. This tagging allows us to create groups and hierarchies by association.
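
The association-through-tags idea can be sketched in a few lines of Python. This is only an illustration under assumed data: the tag names "BRCA" and "LUAD", the pair lists, and the function `all_tags_for` are hypothetical and are not taken from the iAtlas code.

```python
# Hypothetical example data: each pair is (child, parent).
# Tag names other than "TCGA_Study" are made up for this sketch.
SAMPLE_TO_TAG = [("Sample A", "BRCA"), ("Sample B", "LUAD")]
TAG_TO_TAG = [("BRCA", "TCGA_Study"), ("LUAD", "TCGA_Study")]


def all_tags_for(sample, sample_to_tag, tag_to_tag):
    """Return every tag reachable from a sample, directly or via other tags."""
    parents = {}
    for child, parent in tag_to_tag:
        parents.setdefault(child, set()).add(parent)

    # Start from the tags attached directly to the sample.
    pending = [tag for s, tag in sample_to_tag if s == sample]
    found = set()
    while pending:
        tag = pending.pop()
        if tag in found:
            continue  # already visited; also guards against cycles
        found.add(tag)
        pending.extend(parents.get(tag, ()))
    return found
```

Here "Sample A" is tagged only with "BRCA", yet it is also reachable from "TCGA_Study" because "BRCA" is itself tagged with "TCGA_Study".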

Each row describes an individual driver result.

Each row describes an individual copy number result.

Each row describes an individual node. A node is the relationship of a gene with a specific tag to the same gene with a different tag or tags.

Currently, the only value represented is the extracellular network value (ecn_value). Moving forward, other values may be added as new columns.

Each row describes an individual edge. An edge is a relationship between two nodes.

Each row describes a publication.

Sub Tables

There are many sub tables. These tables contain data that is typically related to a main table. Because the information within a sub table would otherwise be duplicated many times across the rows of a main table, it makes sense to keep that data in a separate sub table so that it can be better indexed. This also uses the power of the database itself to ensure consistency of these values and helps prevent duplicates introduced by typos.
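
A minimal sketch of this idea, assuming a hypothetical `category` column on a main table: the repeated text values are moved into a lookup (sub) table and replaced by integer ids. The function `split_sub_table` and the sample rows are illustrative only.

```python
def split_sub_table(rows, column):
    """Replace repeated text in `column` with ids into a lookup table.

    Returns (lookup, new_rows): `lookup` maps each distinct value to an
    integer id, and `new_rows` carries `<column>_id` instead of the text.
    """
    lookup = {}
    new_rows = []
    for row in rows:
        value = row[column]
        # Assign the next id the first time a value is seen.
        value_id = lookup.setdefault(value, len(lookup) + 1)
        new_row = {k: v for k, v in row.items() if k != column}
        new_row[column + "_id"] = value_id
        new_rows.append(new_row)
    return lookup, new_rows


# Hypothetical main-table rows with a repeated text value.
ROWS = [
    {"name": "Gene A", "category": "Kinase"},
    {"name": "Gene B", "category": "Kinase"},
    {"name": "Gene C", "category": "Receptor"},
]
LOOKUP, NEW_ROWS = split_sub_table(ROWS, "category")
```

Each distinct value is stored once in the lookup, so a misspelled variant would show up as an unexpected extra entry rather than hiding among thousands of rows.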

The sub tables are:

Relational (Join) Tables

Much of the data has a one-to-many or many-to-many relationship. Rather than expressing that data in the main tables or sub tables as array structures or similar, which would be challenging (and slow) to access, these relationships are kept in join tables that simply hold the ids of the related data. Because the ids are integers, they can also be indexed efficiently.

The relational (join) tables are:

Each row describes a publication to gene relationship.

Each row describes a dataset to sample relationship.

Each row describes a dataset to tag relationship.

Each row describes a gene to sample relationship.

For example, "Gene A" may be related to "Sample A". That would be one row. "Gene B" may also be related to "Sample A". That would be an additional row. "Gene A" may also be related to "Sample B". That would be yet again another row, and so on.

Each row also holds the RNA Sequence Expression of this relationship.
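
The gene to sample example above can be sketched as a small join table. SQLite (from the Python standard library) stands in for PostgreSQL here so the example runs without a server, and the table names, column names (including `rna_seq_expr`), and expression values are assumptions made for illustration.

```python
import sqlite3

# SQLite stands in for PostgreSQL so the sketch runs without a server.
# Table and column names here are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE genes (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE samples (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE genes_to_samples (
        gene_id INTEGER NOT NULL REFERENCES genes (id),
        sample_id INTEGER NOT NULL REFERENCES samples (id),
        rna_seq_expr REAL,
        PRIMARY KEY (gene_id, sample_id)
    );
    """
)
conn.executemany("INSERT INTO genes VALUES (?, ?)", [(1, "Gene A"), (2, "Gene B")])
conn.executemany("INSERT INTO samples VALUES (?, ?)", [(1, "Sample A"), (2, "Sample B")])
# One row per gene/sample pair, as in the example above (values are made up).
conn.executemany(
    "INSERT INTO genes_to_samples VALUES (?, ?, ?)",
    [(1, 1, 0.5), (2, 1, 1.25), (1, 2, 3.0)],
)


def genes_for_sample(sample_name):
    """Return (gene name, expression) pairs for one sample via the join table."""
    query = """
        SELECT g.name, gs.rna_seq_expr
        FROM genes_to_samples AS gs
        JOIN genes AS g ON g.id = gs.gene_id
        JOIN samples AS s ON s.id = gs.sample_id
        WHERE s.name = ?
        ORDER BY g.name
    """
    return conn.execute(query, (sample_name,)).fetchall()
```

The join table stores only integer ids plus the value that belongs to the pair, so neither the gene row nor the sample row needs to know about the other.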

Each row describes a sample to mutation relationship.

For example, "Sample A" may be related to the mutation "Mutation A". That would be one row. "Sample A" may also be related to the mutation "Mutation B". That would be an additional row. "Sample B" may also be related to the mutation "Mutation A". That would be yet again another row, and so on.

Each row also holds the status of this relationship, being either "Wt" (Wild Type) or "Mut" (Mutant) - (STATUS_ENUM).
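
To illustrate how a restricted status column behaves, the sketch below uses SQLite with a `CHECK` constraint as a stand-in for the PostgreSQL enum, since SQLite has no enum type. The table name `samples_to_mutations`, its columns, and the helper `add_relationship` are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint approximates STATUS_ENUM; PostgreSQL would use the enum type.
conn.execute(
    """
    CREATE TABLE samples_to_mutations (
        sample_id INTEGER NOT NULL,
        mutation_id INTEGER NOT NULL,
        status TEXT NOT NULL CHECK (status IN ('Wt', 'Mut')),
        PRIMARY KEY (sample_id, mutation_id)
    )
    """
)


def add_relationship(sample_id, mutation_id, status):
    """Insert one row; return False if the database rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO samples_to_mutations VALUES (?, ?, ?)",
                (sample_id, mutation_id, status),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

A row with a status outside the allowed values, or a second row for the same sample and mutation pair, is rejected by the database rather than by application code.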

Each row describes a gene to gene type relationship.

For example, "Gene A" may be related to "Type 1". That would be one row. "Gene B" may also be related to "Type 1". That would be an additional row. "Gene A" may also be related to "Type 2". That would be yet again another row, and so on.

Each row describes a tag to tag relationship. This table is for adding semantic information to tags. It can be used to group tags. It is useful to think of it in this way:

For example, "Good Tag" may be related to "Parent Tag". That would be one row. "Great Tag" may also be related to "Parent Tag". That would be an additional row. "Good Tag" may also be related to "Sub Tag". That would be yet again another row, and so on.

Each row describes a sample to tag relationship.

For example, "Sample 1" may be related to "Good Tag". That would be one row. "Sample 2" may also be related to "Good Tag". That would be an additional row. "Sample 1" may also be related to "Sub Tag". That would be yet again another row, and so on.

Each row describes a feature to sample relationship.

For example, "Feature A" may be related to "Sample 1". That would be one row. "Feature B" may also be related to "Sample 1". That would be an additional row. "Feature A" may also be related to "Sample 2". That would be yet again another row, and so on.

Each row describes a node to tag relationship.

For example, "Node 1" may be related to "Good Tag". That would be one row. "Node 2" may also be related to "Good Tag". That would be an additional row. "Node 1" may also be related to "Great Tag". That would be yet again another row, and so on.

Table Fields

The fields in each table represent a specific property of the entities represented by the table.

For example, in the genes table, the fields are properties specifically of a gene. The gene is not aware of the sample it comes from and is not aware of any tag it is associated with. App specific information about the gene should not be contained in the data. If we imagine using this data in many different applications, we see that only data that is specific to the actual gene should be in the gene table.

The following are descriptions of each field in each table. This should be exhaustive.

Data Structure

Information on the data structure can be found in the feather_files folder which contains this README.md markdown file.

Data Model Diagram

iAtlas Database Relationships



CRI-iAtlas/iatlas-data documentation built on July 7, 2020, 2:18 a.m.