Design Guidelines¶
These are guidelines for creating IDSs according to Tetra Data conventions.
These guidelines go beyond the base level of platform requirements which every IDS must follow (such as having an @idsType defined).
The aim of these guidelines is to make IDSs consistent by using a common set of practices and reusable Components, which makes IDSs easier to create and maintain, and makes data easier to consume downstream by making IDS design predictable.
General¶
| # | Rule | Checked by IDS Validator (only for IDS designed by TS) |
|---|---|---|
| 1 | Most of the fields in schema.json should be nullable unless they are defined to be NOT NULL in the Null section or it is confirmed with the vendor that they cannot be null | No |
| 2 | If a field indicates that it's a scientific measure, like "weight", "mass", "volume", use the | No |
| 3 | All scientific units should be standardized where possible; see Units | No |
| 4 | All ID-like fields must be string type: user ID, instrument ID, sample ID, etc. | No |
| 5 | Combine fields that share a common key prefix into one single object by extracting the prefix as the parent key | Yes, warning if not followed |
| 6 | Any IDS field called | No |
| 7 | Make the field name singular if its value is not an array | No |
| 8 | Make the field name plural if its value is an array | No |
| 9 | For fields with a value that is a non-scientific number, you don't need to define it as | No |
| 10 | Spell out acronyms and abbreviations as their full names if possible to avoid ambiguity | No |
| 11 | Annotate fields with | No |
Nullable values and required fields¶
Nullable field definitions are explained here: Nullable fields, and required field definitions are explained here: Required Fields. This section gives best practice recommendations for when to use them.
Terminology for null and required concepts¶
Null (opposite: non-null): a special value which can be assigned to a field to indicate that no value is available for that field.
Missing (opposite: present): when a key-value pair is omitted from the primary data it is missing, otherwise it is present. Across multiple files, if a field is sometimes present and sometimes missing, call it sometimes-missing. Otherwise call it always-present if it’s guaranteed to be present in every file.
Container type (counterpart: atomic type): A container type is an object or array - it can contain more fields. The other types are atomic types: string, number, boolean.
Required (opposite: non-required): the field must be present in every IDS JSON instance, i.e. this field must be included in the "required": [] array in JSON Schema.
Nullable (opposite: not nullable or non-nullable): the value of this field can be null in an IDS JSON instance. Note that a container type cannot be nullable due to TDP platform requirements.
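To make the distinction between missing and null concrete, consider this hypothetical fragment (the field names are illustrative): comment_a is present but null, comment_b is present and non-null, and a field such as comment_c that does not appear at all is missing.

```json
{
  "comment_a": null,
  "comment_b": "stored at 4C"
}
```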
Null and required design guidelines¶
By default, we recommend making IDS fields which have an atomic type (like string, number, boolean) nullable and non-required; and container types (like array, object) non-required.
The benefits of this approach are:
Variations in source data can be handled by omitting fields or using null values. Then a single schema can apply to data from a particular source regardless of the software version or export format being used.
Updating schemas to a new version causes fewer breaking changes to existing data consumers.
When adding a non-required field, data from the previous version of the schema is still compatible with the new schema, meaning old data can easily be loaded and used according to the new schema. If a required field were added in a schema update, then all old data is now invalid and has to be reprocessed before the new schema can be used. For downstream use cases which don’t need the newly added field, making it non-required causes less disruption.
When adding a nullable field, downstream applications which require that field can assume a default value of null when the field is missing from IDS data which hasn't been updated yet, instead of having to wait until all data is updated to include the field.
It makes the fewest assumptions about requirements for downstream use cases.
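This default recommendation can be sketched in draft-07 JSON Schema (the field name is illustrative): the type array makes the field nullable, and leaving it out of the required array makes it non-required.

```json
{
  "type": "object",
  "properties": {
    "operator_name": {
      "type": ["string", "null"]
    }
  },
  "required": []
}
```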
Should a field be required? Recommended starting point: non-required, followed by the logic in this table:

| Is the field always present in the primary data? | Does the field need to be required based on task script logic or as a requirement of every downstream use case? | Outcome in IDS |
|---|---|---|
| Always present | Needs to be required | Make it required, e.g. |
| Always present | No restriction | Leave it as non-required, e.g. |
| Sometimes missing | N/A | Leave it as non-required, e.g. |
Should a field be non-nullable? Recommended starting point: nullable, except for container types (array, object), which must be non-nullable, followed by the logic in this table:

| Is the field always non-null in the primary data? | Does the field need to be non-null based on task script logic or as a requirement of every downstream use case? | Outcome in IDS |
|---|---|---|
| Always non-null | Needs to be non-nullable | Make it non-nullable, e.g. |
| Always non-null | No restriction | Leave it as nullable, e.g. |
| Sometimes null | N/A | Leave it as nullable, e.g. |
How to map data to top-level fields¶
To create an IDS, first you can roughly categorize the raw data into 3 high-level categories: metadata, input, and output. Then for each category, you should decide which top-level field your data belongs to. Below are the recommended top-level fields for each category:
Metadata: describing your workflow
- `projects` is intended to be used as shown here and should contain information which uniquely identifies and describes a particular initiative and the methodologies used to produce the data in a given IDS. This information is typically found in ELN and LIMS applications and allows users to organize data and associate related datasets.
- `systems` is intended to be used as shown here. A single element of `systems` typically contains information about the hardware used to produce the data for a given IDS. The hardware is often an instrument, a component of the instrument (such as a column in a chromatographic instrument), detector information (such as name and vendor), etc. It should contain information regarding the components used, but not how those instruments are configured for the experiment, which is better suited for `methods`.
- `users` is intended to be used as shown here. A single element of `users` should contain information about an individual user associated with a step in an experiment. A step could consist of a run of an instrument, a software login, a user executing a signoff, etc.
- `samples` is intended to be used as shown here. A single element of `samples` typically describes a single sample being tested during an assay or within a specific test or experiment. An assay may require one or more samples. Samples can be split across different assays and can be used to search for data on a specific sample using a specific method, often created with any combination of a compound ID, batch IDs, experiment ID, project ID, and/or descriptors of the applied experimental condition. Each sample in a system should have a unique name/ID within its sample type. The unique sample names can be provided by the user or can be generated by a system, such as a registry in an ELN or LIMS; when you ask the system to generate names, you specify a naming pattern to use.
Input: User defined parameters that describe how an experiment, run, assay, etc. will be executed.
- `methods`: An individual element of `methods` is freeform and up to the user to define. The definition of a method is typically tied to an instrument family, thus any predefined method component will live in ts-ids-components. Generally, a single element will contain information about a protocol and/or the parameters used to execute a run of an experiment or instrument.
- `related_files` is intended to be used as shown here. A single element of `related_files` should contain a reference to a separate file that exists in the TetraScience platform. The related file(s) are often files that are not adequately stored in a JSON format but are related to a given IDS. These files are typically things like images or parquet files that contain large amounts of data. If you need to capture metadata about these files, it's recommended not to extend a `RelatedFile` component and instead store the meta information in a separate field in the IDS. A single `RelatedFile` component should just store the file reference.
Output: The output of the instrument run like measurements or calculated values.
- `results`: An individual element of `results` is freeform and up to the user to define. The definition of a result is typically tied to an instrument family, thus any predefined result component will live in ts-ids-components. Generally, it contains the measurements and/or calculated values of a given run of an experiment, instrument, etc. Note that `results` is not intended to store measurements which are n-dimensional; instead these values should be reserved for `datacubes`.
- `datacubes` is intended to be used as shown here. Like `results`, it generally contains the measured values of a given run of an experiment, instrument, etc. However, `datacubes` is intended to store data which is an n-dimensional matrix, for example chromatography traces and mass spectra. `datacubes` is a special component which is handled by the platform differently from other components; see the product documentation for more details.
Some points to consider when choosing between results and datacubes:

- Do we want to search the data using Elasticsearch? Yes: `results`
- Is the data multidimensional? Yes: `datacubes`
- Is the data meaningful only if it is analyzed all together? Yes: `datacubes`
- Is an individual value meaningful by itself? Yes: `results`
Basic Types¶
Boolean¶
The values True, False, true, false, Yes, No, yes, no, 1, 0 are often modeled as the type boolean.
Of course, you should consider if these values should be interpreted as booleans.
Note that it is common for a field which appears to be a boolean to actually have more than two possible values, for example "True", "False" and "Unknown".
Unless the source data is guaranteed to always be a boolean, it is usually better to use a nullable string type for this set of values.
If additional values need to be supported, a string field can already support them without needing a schema update, avoiding a breaking change.
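As a sketch of this recommendation, a task script can keep such values as nullable strings: normalize the vocabulary it recognizes and pass anything else through unchanged. The mapping below is illustrative, not a platform convention.

```python
from typing import Optional


def normalize_flag(raw: Optional[str]) -> Optional[str]:
    """Normalize boolean-like source values into a nullable string field.

    Recognized values are canonicalized to "true"/"false"; missing or empty
    input becomes null (None); anything else passes through unchanged, so
    new vocabulary like "Unknown" needs no schema update.
    """
    if raw is None or raw.strip() == "":
        return None
    lowered = raw.strip().lower()
    if lowered in {"true", "yes", "1"}:
        return "true"
    if lowered in {"false", "no", "0"}:
        return "false"
    return raw


print(normalize_flag("Yes"))      # -> true
print(normalize_flag("Unknown"))  # -> Unknown
```

Because the field stays a string, the schema tolerates vendor-specific extra values without a breaking change.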
String with a format specification¶
Avoid using format for fields with string type except when the format is guaranteed to be followed by the upstream data source and is a requirement of all downstream use cases.
Adding a format can make the schema fragile to changes in requirements.
It can also make JSON Schema validation vulnerable to a ReDoS (regular expression denial of service) attack in some cases.
IDS metadata and identifiers¶
$id¶
Make sure it follows the same convention as other IDSs.
Each JSON schema should have a $id on the root level that looks like
"$id": "https://ids.tetrascience.com/common/instrument-a/v1.0.0/schema.json"
This can be done in ts-ids-core using the SchemaExtraMetadataType as demonstrated here
$schema¶
Since JSON Schema is itself a JSON file, it’s not always easy to tell when something is JSON Schema or just an arbitrary chunk of JSON.
The $schema keyword is used to declare that something is JSON Schema.
We include it in every schema.json as a top-level metadata field.
"$schema": "http://json-schema.org/draft-07/schema#"
This can be done in ts-ids-core using the SchemaExtraMetadataType as demonstrated here
@idsNamespace, @idsType, @idsVersion¶
It is important for the schema.json and the IDS JSON itself to indicate what the IDS namespace, type, and version are.
This provides a unique identifier for the schema an IDS JSON instance should conform to and determines how it should be served on the platform.
- `@idsType` refers to the schema name, such as `qpcr-thermofisher-viia7`. This often describes a specific instrument, or an instrument family if multiple instruments' data can be modeled with a single IDS.
- `@idsNamespace` defines a realm within the TetraScience platform where only those who have the appropriate permissions can use the artifacts in that realm. For more information on the namespace to choose for your IDS, see the product documentation.
- `@idsVersion` is the version of a given IDS and follows semantic versioning; please see the IDS versioning rules to understand how to version your IDS. It is important to follow the correct versioning rules, as the platform relies on the version to index data into Elasticsearch and serve the data with the correct Athena tables.
These fields are included when using IdsSchema or TetraDataSchema - see Quickstart for an example of defining these fields.
Example
{
"type": "object",
"required": ["@idsNamespace", "@idsType", "@idsVersion"],
"properties": {
"@idsNamespace": {
"type": "string",
"const": "common"
},
"@idsType": {
"type": "string",
"const": "example"
},
"@idsVersion": {
"type": "string",
"const": "v1.0.0"
}
}
}
Schema property descriptions¶
description is a JSON Schema keyword.
It must be a string, and it provides more explanation about the field. The description should consist of full, grammatically correct, formatted sentences (capitalization, periods, etc.).
Some common usages of description are:
Explanation of the meaning of the field which an end-user of the data would understand
What this field corresponds to in the source data system and where it can be found
Give the fully-spelled out version of an abbreviation or an acronym
You can add descriptions to both container and atomic types with ts-ids-core.
The docstring of an IdsElement subclass will become the description of its respective object type when exported to JSON schema.
Describing a field can be done using the description argument to IdsField.
from ts_ids_core.base.ids_element import IdsElement
from ts_ids_core.base.ids_field import IdsField
class Example(IdsElement):
"""Object description"""
foo: str = IdsField(description="Field description.")
Show JSON schema
{
"additionalProperties": false,
"description": "Object description",
"properties": {
"foo": {
"description": "Field description.",
"type": "string"
}
},
"type": "object"
}
id and name¶
Add these standard fields to any object which needs an identifier or a name.
id is typically a string and is machine-readable.
It’s usually a unique value across all the data produced by the same instrument or within a subset of related data produced by that instrument, for example a result ID.
name should be human-readable. It's usually defined by a human and there is no guarantee it will be unique across all data.
IDSs often use a combination of id and name, for example as seen in the Sample component.
Date and Time¶
Use Cases¶
Time will be searchable once it's parsed into the ISO 8601 UTC format (see the Elasticsearch built-in date formats); otherwise we can't guarantee that instrument-native date formats can be understood by Elasticsearch.
Parsing datetimes into the ISO 8601 UTC format removes ambiguity between datetime formats within and across IDSs.
Standardized datetimes allow lexicographic ordering.
Format and parsing¶
Datetimes should be parsed into ISO 8601 format with UTC, e.g. 2021-01-26T23:19:09Z. Use UTC (Zulu) time and convert other time zones to Zulu time.
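With Python's standard library, this conversion can be sketched as follows (the source offset of UTC-05:00 is illustrative):

```python
from datetime import datetime, timedelta, timezone


def to_zulu(dt: datetime) -> str:
    """Convert a timezone-aware datetime to ISO 8601 UTC with a trailing 'Z'."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# A timestamp recorded at 18:19:09 in a UTC-05:00 zone is 23:19:09 in Zulu time.
local = datetime(2021, 1, 26, 18, 19, 9, tzinfo=timezone(timedelta(hours=-5)))
print(to_zulu(local))  # -> 2021-01-26T23:19:09Z
```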
Elasticsearch date detection is currently disabled by TetraScience. This means dynamic mapping is disabled. We do this because we don’t want Elasticsearch indexing to fail if a malformed date format is used. So, we index every string type as keyword type.
- Do not use the JSON Schema format `date-time` in the IDS. If you cannot convert a timestamp to ISO 8601, then we recommend preserving the timestamp in its original, non-standardized format in the IDS JSON instance. We recommend a permissive approach: standardize if you can, but don't fail parsing if you can't, and store the raw value.
- If a datetime string's time is not available, just parse the date; the value will still follow ISO 8601. Vice versa if only the time is available (e.g. `12:00:38.012`).
- If an instrument does not provide the ability to distinguish ambiguous datetimes, such as an offset or timezone, another option is to create a protocol config in which a user can enter the datetime format, which the task script can then use to parse the datetimes provided in the primary data.
In ts-ids-core you can model datetime values using a value-raw value pair to capture the parsed datetimes and the raw value.
Capturing the raw value will allow you to be permissive in your parsing strategy.
from ts_ids_core.base.ids_element import IdsElement
from ts_ids_core.base.ids_field import IdsField
class DatetimePair(IdsElement):
value: str = IdsField(description="datetime value standardized to ISO 8601 UTC.")
raw_value: str = IdsField(description="The raw datetime value provided in the primary data.")
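A permissive parser feeding this value/raw-value pair might look like the following sketch. The KNOWN_FORMATS list and the assumption that source timestamps are in UTC are illustrative; in practice they would come from the instrument's documentation or a protocol config.

```python
from datetime import datetime, timezone
from typing import Optional

# Assumption: formats this hypothetical instrument is known to emit.
KNOWN_FORMATS = ["%m/%d/%Y %H:%M:%S", "%Y-%m-%d %H:%M:%S"]


def parse_datetime_pair(raw: str) -> dict:
    """Standardize a timestamp when a known format matches; otherwise keep
    only the raw value rather than failing the parse."""
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(raw, fmt)
        except ValueError:
            continue
        # Assumption: source timestamps are already in UTC.
        value = parsed.replace(tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        return {"value": value, "raw_value": raw}
    # No format matched: preserve the original string and leave value null.
    return {"value": None, "raw_value": raw}


print(parse_datetime_pair("01/26/2021 23:19:09")["value"])  # -> 2021-01-26T23:19:09Z
```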
Units¶
All units should be converted to their respective QUDT or QUDT-extended unit. If a unit is not available in either of these, then it's recommended to capture the raw unit provided in the primary data and document this along with your IDS.
Use ArbitraryUnit when there should be a unit, but the unit doesn’t mean anything outside of instrument scope. For example, light absorbance.
Use Unitless for quantities which are unitless. For example, ratios or pH values.
Use null when there should be a unit, but it is not available in the raw file.
Example
{
"ph": {
"value": 7.1,
"unit": "Unitless"
},
"weight": {
"value": 2.3,
"unit": "Gram"
},
"numerical_variable_with_unknown_unit": {
"value": 102,
"unit": null
}
}
How to find a QUDT Unit¶
QUDT.org provides semantic specifications for units of measure, quantity kind, dimensions and data types. QUDT extended adds additional units to the QUDT ontology.
One way to browse these ontologies is to use Protege:

1. Download qudt-ext.ttl.
2. Download Protege, an open-source tool to browse ontologies.
3. Load qudt-ext.ttl into Protege and then search for "second".

In the ontology, "second" has the URI http://qudt.org/vocab/unit#SecondTime, thus we use SecondTime as the standard unit name, namely whatever comes after the #.
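Extracting the standard unit name from a QUDT URI amounts to taking the fragment after the `#`, which can be sketched as:

```python
def unit_name_from_uri(uri: str) -> str:
    """Return the URI fragment, used as the standard unit name."""
    return uri.rsplit("#", 1)[-1]


print(unit_name_from_uri("http://qudt.org/vocab/unit#SecondTime"))  # -> SecondTime
```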
Samples fields: batch, set, lot¶
See the component docs here: Samples
These terms are used somewhat interchangeably. However, they mean distinct things to scientists, especially in processes that operate under 21 CFR 210.3.
batch
21 CFR 210.3 definition: a specific quantity of a drug or other material that is intended to have uniform character and quality, within specified limits, and is produced according to a single manufacturing order during the same cycle of manufacture
It is a concept in drug development and is more macro compared to `set`. Specifically, a batch is the amount of material yielded by a single manufacturing operation, with uniform character and quality.
set
A set is a sequence representing a small group of items used in one transaction. An example from scientific usage involves a dissolution or solubility study: the "sample set" is the specific collection of tubes a scientist analyzes to obtain aggregate statistics about the sample (SD, mean/median, variability, etc.).
lot
21 CFR 210.3 definition: a batch, or a specifically identified portion of a batch, having uniform character and quality within specified limits; or, in the case of a drug product produced by continuous process, it is a specific identified amount produced in a unit of time or quantity in a manner that assures its having uniform character and quality within specified limits.
A lot as defined in the CFR is like a "sub-batch": it's an amount of material produced per unit time in a continuous process. Consider baking cookies as an analogy: if you baked a single bowl of batter and got a tray, it would be a `batch` of cookies. However, if you had a conveyor-belt oven continuously producing cookies, then a `lot` would be the amount of cookies produced between 10:15-10:20 AM last Tuesday. Properties would be measured and assigned to each lot to gauge consistency and QC.
Describing relationships in IDSs¶
An array of objects in an IDS can be mapped to a table, where a single object in the array corresponds to a row of the table, and the fields in the objects are the columns. This mapping happens on the Tetra Data Platform to enable querying IDS data with SQL.
If an IDS contains multiple arrays of objects (corresponding to multiple tables) then it can be helpful to explicitly define relationships between the objects in these arrays.
For example, for an IDS containing methods and results it can be useful to understand which element of results relates to which element of methods, or equivalently, the relationship between rows of the results and methods tables.
These relationships can be created in an IDS by using UUID primary keys and foreign keys, which is explained in this section.
UUID primary keys are a form of surrogate key: a key which is created during task script processing and is not taken from source data. This is in contrast with a natural key, which uses values from the source data as the keys for defining relationships. Those natural keys can still exist in the IDS data, but adding surrogate keys creates a single consistent approach for representing relationships across all IDS data, rather than using a different set of natural keys for each relationship. Natural keys are also prone to data integrity problems from source data, such as duplication or missing values, which are avoided by creating surrogate keys.
Define primary and foreign keys fields in an IDS¶
To create an IDS which has primary and foreign keys, see the UUIDForeignKey docs and UUIDPrimaryKey on the same page.
Generating UUIDs in a task script¶
The values assigned to primary and foreign key fields are UUID strings (such as "123e4567-e89b-12d3-a456-426614174000") which must be generated during task script parsing.
These UUIDs need to be deterministically generated by seeding a random generator with the content of the input file and the identity of the task script (namespace, slug and version). This achieves the following:
UUIDs are reproducible
Test code in the task script repo can rely on UUID values in tests, for example in snapshot testing, the snapshot files will not change between test executions.
Reprocessing a file on TDP using the same protocol version leads to the same UUIDs being generated, meaning the content of the IDS JSON output is stable between workflow executions.
UUIDs are unique across different files and across different task scripts
This means that there are no primary key clashes between different files or data generated by different task scripts, meaning these relationships can be used across multiple IDS documents, for example when aggregating multiple IDS JSON documents, or when querying data with SQL.
To generate UUID strings following these guidelines, use the UUID generator in ts-task-script-utils, documented here with examples of how to seed the random generator with the task script identity and file content, and then generate UUID strings.
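As an illustrative sketch only (production task scripts should use the generator from ts-task-script-utils), Python's name-based uuid.uuid5 shows how deriving UUIDs from the task script identity and file content makes them reproducible for the same inputs and distinct across files and task scripts. All names here are hypothetical.

```python
import hashlib
import uuid


def seed_namespace(task_identity: str, file_content: bytes) -> uuid.UUID:
    """Derive a namespace UUID from the task script identity and file content."""
    content_digest = hashlib.sha256(file_content).hexdigest()
    return uuid.uuid5(uuid.NAMESPACE_URL, f"{task_identity}/{content_digest}")


def key_uuid(namespace: uuid.UUID, key_path: str, index: int) -> str:
    """Generate a reproducible UUID string for one primary/foreign key field."""
    return str(uuid.uuid5(namespace, f"{key_path}[{index}]"))


ns = seed_namespace("common/my-task-script:v1.0.0", b"raw file bytes")
print(key_uuid(ns, "methods.pk", 0) == key_uuid(ns, "methods.pk", 0))  # reproducible
print(key_uuid(ns, "methods.pk", 0) != key_uuid(ns, "methods.pk", 1))  # distinct keys
```

The same (task identity, file content) pair always yields the same UUIDs, while a different file or task script yields a different namespace and therefore different, non-clashing keys.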
Naming guidelines¶
A consistent approach to naming makes IDSs containing primary and foreign keys easier to understand when working with multiple IDSs:
- Name primary key fields `pk`. There only needs to be one primary key in any given array of objects, so there is no need for a more specific name.
- Name foreign keys `fk_<object_path>`, where `<object_path>` is the path of the object in the JSON document, including the object name itself. For example, a reference to the object `methods[*].cycles[*]` would be named `fk_methods_cycle` (notice it's `cycle`, not `cycles`). The purpose of this is to make the target of the foreign key quickly and easily recognizable to data users without having to refer back to documentation. Shorter names may be used if they are unambiguous and still make the target clear, such as `fk_cycle`, but this name would be unclear if both `methods[*].cycles[*]` and `runs[*].cycles[*]` existed in the IDS.
- Prefacing the key with `fk` makes it quick to recognize that this is a foreign key field.
- Including the path to the object makes it quick to recognize the target object.
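A hypothetical helper shows the naming convention mechanically. The naive trailing-"s" singularization of the last segment is an assumption; irregular plurals would need manual handling.

```python
def foreign_key_name(object_path: str) -> str:
    """Derive a foreign key field name from a JSON object path,
    e.g. "methods[*].cycles[*]" -> "fk_methods_cycle"."""
    parts = [segment.replace("[*]", "") for segment in object_path.split(".")]
    # Singularize only the final segment (the referenced object itself).
    last = parts[-1]
    parts[-1] = last[:-1] if last.endswith("s") else last
    return "fk_" + "_".join(parts)


print(foreign_key_name("methods[*].cycles[*]"))  # -> fk_methods_cycle
```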
Relational structure guidelines¶
There can be many foreign keys pointing to a single primary key.
For example after defining a primary key in object_a.pk, a foreign key which links to it (fk_object_a) can appear in multiple places elsewhere in the schema.
To make queries simpler to write and understand (for example SQL queries), it can be beneficial to avoid primary/foreign key structures which lead to deeply nested JOINs in queries.
In the example below, users, methods and samples contain primary keys, and they are all tied together in one place with foreign keys in results, meaning that any of those three tables can be related to results with a single JOIN which doesn’t depend on other intermediate tables.
This has similar benefits to the star schema design pattern.
Example structure and query¶
The following is a way to create linked users, methods, samples and results.
users
pk: uuid1
methods
pk: uuid2
samples
pk: uuid3
results
fk_user: uuid1
fk_method: uuid2
fk_sample: uuid3
If these fields are part of an IDS common/example:v1.0.0, then an Athena SQL query to join all of these tables into one could look like:
SELECT
*
FROM
"example_v1_results" results
JOIN "example_v1_users" users ON results.fk_user = users.pk
JOIN "example_v1_methods" methods ON results.fk_method = methods.pk
JOIN "example_v1_samples" samples ON results.fk_sample = samples.pk
LIMIT
100
Example approaches for modeling different cardinalities¶
The above example takes a star schema approach which is a concept within dimensional modeling. Here we will build upon this schema to demonstrate example approaches for modeling common cardinalities, one-to-many and many-to-many.
Using the example above, we can define users, methods, and samples as dimension tables and results as a fact table.
Additionally, let's consider the grain of results to be one record per measurement of the source instrument.
One-to-many¶
This is perhaps the most commonly encountered cardinality.
The configuration and execution of scientific instruments often follows a similar pattern: one method configures the instrument to execute in a specified way, the instrument executes once, and many measurements are made.
In the case of our example model, we will assume our source system follows the same pattern: one method produces many measurements (i.e. there is a one-to-many relationship between methods and results).
Defining the relationship between methods and results is simple:
from typing import ClassVar, List, Literal
from ts_ids_core.annotations import Required, UUIDForeignKey, UUIDPrimaryKey
from ts_ids_core.base.ids_element import IdsElement, SchemaExtraMetadataType
from ts_ids_core.base.ids_field import IdsField
from ts_ids_core.schema import IdsSchema
class Result(IdsElement):
fk_method: UUIDForeignKey = IdsField(
primary_key="/properties/methods/items/properties/pk"
)
value: float
class Method(IdsElement):
pk: UUIDPrimaryKey
name: str
class StarSchema(IdsSchema):
schema_extra_metadata: ClassVar[SchemaExtraMetadataType] = {
"$id": "https://ids.tetrascience.com/common/example/v1.0.0/schema.json",
"$schema": "http://json-schema.org/draft-07/schema#",
}
ids_namespace: Required[Literal["my_namespace"]] = IdsField(
default="common", alias="@idsNamespace"
)
ids_type: Required[Literal["my_unique_ids_name"]] = IdsField(
default="example", alias="@idsType"
)
ids_version: Required[Literal["v1.0.0"]] = IdsField(
default="v1.0.0", alias="@idsVersion"
)
methods: List[Method]
results: List[Result]
Show JSON schema
{
"$id": "https://ids.tetrascience.com/common/example/v1.0.0/schema.json",
"$schema": "http://json-schema.org/draft-07/schema#",
"additionalProperties": false,
"properties": {
"@idsType": {
"const": "my_unique_ids_name",
"type": "string"
},
"@idsVersion": {
"const": "v1.0.0",
"type": "string"
},
"@idsNamespace": {
"const": "my_namespace",
"type": "string"
},
"methods": {
"items": {
"$ref": "#/definitions/Method"
},
"type": "array"
},
"results": {
"items": {
"$ref": "#/definitions/Result"
},
"type": "array"
}
},
"required": [
"@idsType",
"@idsVersion",
"@idsNamespace"
],
"type": "object",
"definitions": {
"Method": {
"additionalProperties": false,
"properties": {
"pk": {
"@primary_key": true,
"type": "string"
},
"name": {
"type": "string"
}
},
"required": [
"pk"
],
"type": "object"
},
"Result": {
"additionalProperties": false,
"properties": {
"fk_method": {
"@foreign_key": "/properties/methods/items/properties/pk",
"type": "string"
},
"value": {
"type": "number"
}
},
"required": [
"fk_method"
],
"type": "object"
}
}
}
All we are doing is defining a primary key in the methods table and a foreign key referencing said primary key in the results table.
The primary thing to ensure is that the relationship is truly one-to-many, so when results is joined to methods, no result rows are duplicated.
If the join results in duplicated result rows, then any aggregations performed on the joined dataset will likely be invalid.
If our assumption that one method produces many measurements is correct, then we can guarantee that any given measurement will only have one corresponding row in the methods table, thus no rows from the results table will be duplicated upon joining.
Many-to-many¶
Many-to-many relationships introduce a problem: if multiple records in a dimension correspond to many records in a fact table, how can you assign foreign keys to an individual record in the fact table that join to all related records in the dimension table?
The answer is to use a bridge table.
A bridge table is a referential intermediate table which contains only the relationships between entities.
In the case of our example model, consider the possibility that an indeterminate number of users produced a given set of measurements.
In this case, there is a many-to-many relationship between users and results.
Building upon our model above, we can create a bridge between results and users in the following way:
from typing import ClassVar, List, Literal
from ts_ids_core.annotations import Required, UUIDForeignKey, UUIDPrimaryKey
from ts_ids_core.base.ids_element import IdsElement, SchemaExtraMetadataType
from ts_ids_core.base.ids_field import IdsField
from ts_ids_core.schema import IdsSchema
class Result(IdsElement):
fk_method: UUIDForeignKey = IdsField(
primary_key="/properties/methods/items/properties/pk"
)
fk_user_group: UUIDForeignKey = IdsField(
primary_key="/properties/user_groups/items/properties/pk"
)
value: float
class Method(IdsElement):
pk: UUIDPrimaryKey
name: str
class UserGroup(IdsElement):
pk: UUIDPrimaryKey
name: str
class UserBridge(IdsElement):
fk_user_group: UUIDForeignKey = IdsField(
primary_key="/properties/user_groups/items/properties/pk"
)
fk_user: UUIDForeignKey = IdsField(
primary_key="/properties/users/items/properties/pk"
)
class User(IdsElement):
pk: UUIDPrimaryKey
name: str
class StarSchema(IdsSchema):
schema_extra_metadata: ClassVar[SchemaExtraMetadataType] = {
"$id": "https://ids.tetrascience.com/common/example/v1.0.0/schema.json",
"$schema": "http://json-schema.org/draft-07/schema#",
}
ids_namespace: Required[Literal["my_namespace"]] = IdsField(
default="common", alias="@idsNamespace"
)
ids_type: Required[Literal["my_unique_ids_name"]] = IdsField(
default="example", alias="@idsType"
)
ids_version: Required[Literal["v1.0.0"]] = IdsField(
default="v1.0.0", alias="@idsVersion"
)
methods: List[Method]
results: List[Result]
user_groups: List[UserGroup]
users_bridge: List[UserBridge]
users: List[User]
Show JSON schema
{
"$id": "https://ids.tetrascience.com/common/example/v1.0.0/schema.json",
"$schema": "http://json-schema.org/draft-07/schema#",
"additionalProperties": false,
"properties": {
"@idsType": {
"const": "my_unique_ids_name",
"type": "string"
},
"@idsVersion": {
"const": "v1.0.0",
"type": "string"
},
"@idsNamespace": {
"const": "my_namespace",
"type": "string"
},
"methods": {
"items": {
"$ref": "#/definitions/Method"
},
"type": "array"
},
"results": {
"items": {
"$ref": "#/definitions/Result"
},
"type": "array"
},
"user_groups": {
"items": {
"$ref": "#/definitions/UserGroup"
},
"type": "array"
},
"users_bridge": {
"items": {
"$ref": "#/definitions/UserBridge"
},
"type": "array"
},
"users": {
"items": {
"$ref": "#/definitions/User"
},
"type": "array"
}
},
"required": [
"@idsType",
"@idsVersion",
"@idsNamespace"
],
"type": "object",
"definitions": {
"Method": {
"additionalProperties": false,
"properties": {
"pk": {
"@primary_key": true,
"type": "string"
},
"name": {
"type": "string"
}
},
"required": [
"pk"
],
"type": "object"
},
"Result": {
"additionalProperties": false,
"properties": {
"fk_method": {
"@foreign_key": "/properties/methods/items/properties/pk",
"type": "string"
},
"fk_user_group": {
"@foreign_key": "/properties/user_groups/items/properties/pk",
"type": "string"
},
"value": {
"type": "number"
}
},
"required": [
"fk_method",
"fk_user_group"
],
"type": "object"
},
"User": {
"additionalProperties": false,
"properties": {
"pk": {
"@primary_key": true,
"type": "string"
},
"name": {
"type": "string"
}
},
"required": [
"pk"
],
"type": "object"
},
"UserBridge": {
"additionalProperties": false,
"properties": {
"fk_user_group": {
"@foreign_key": "/properties/user_groups/items/properties/pk",
"type": "string"
},
"fk_user": {
"@foreign_key": "/properties/users/items/properties/pk",
"type": "string"
}
},
"required": [
"fk_user_group",
"fk_user"
],
"type": "object"
},
"UserGroup": {
"additionalProperties": false,
"properties": {
"pk": {
"@primary_key": true,
"type": "string"
},
"name": {
"type": "string"
}
},
"required": [
"pk"
],
"type": "object"
}
}
}
Notice the creation of two new tables: user_groups and users_bridge.
user_groups exists as a way to bucket all relevant users to a single record; we can assign a single foreign key to a given result record which points to the primary key of user_groups.
users_bridge then contains a foreign key pointing to the user_groups primary key and a foreign key pointing to the users primary key.
Joining results to users can then be done in the following way:
SELECT results.value,
users.name
FROM example_v1_results results
JOIN example_v1_user_groups groups
ON results.fk_user_group = groups.pk
JOIN example_v1_users_bridge bridge
ON bridge.fk_user_group = groups.pk
JOIN example_v1_users users
ON bridge.fk_user = users.pk
This query will return the results values tied to the users associated with each value.
However, note that we run into the same problem mentioned above: a given result row will be duplicated once for each user associated with that result.
To avoid the duplication, you can alter the query to aggregate the users before joining to results.
WITH aggregate_users AS (
SELECT groups.pk,
array_agg(users.name) AS user_group
FROM example_v1_user_groups groups
JOIN example_v1_users_bridge bridge
ON bridge.fk_user_group = groups.pk
JOIN example_v1_users users
ON bridge.fk_user = users.pk
GROUP BY groups.pk
) SELECT results.value,
agg.user_group
FROM example_v1_results results
JOIN aggregate_users agg
ON results.fk_user_group = agg.pk