athena.json¶
To understand the following sections, it’s recommended to familiarize yourself with how an IDS is represented in Athena. See, the product documentation for more details.
Minimum requirements¶
The following definition is the minimum required definition for athena.json in an IDS artifact.
Even if you do not want to customize table creation and create partitions, you must at least copy the contents of the following JSON block into your athena.json file.
{
"root": "root",
"partitions": []
}
Design guidelines¶
Root table¶
As mentioned in the product documentation, each top-level field in an IDS whose definition is not an array of objects will be added as a column in the root table.
Updating the "root" value in athena.json will effectively rename the root table from <idsType>_<idsMajorVersionNumber>_root, but will keep the contents the same.
For example:
{
"root": "injections",
"partitions": []
}
will change the root table to <idsType>_<idsMajorVersionNumber>_injection.
Partition¶
Partitioning can be used to improve Athena query performance.
Data is partitioned based on the value of the chosen field(s) defined in the "partitions" field.
A partition is defined by an object which has a path and a name.
The objects are placed into the "partitions" array in athena.json.
See Partition path and Partition name for specific on these fields.
For example:
If you have a schema.json file defined as
{
"$id": "https://ids.tetrascience.com/my_namespace/demo_ids/v1.0.0/schema.json",
"$schema": "http://json-schema.org/draft-07/schema#",
"additionalProperties": false,
"properties": {
"@idsType": {
"const": "my_unique_ids_name",
"type": "string"
},
"@idsVersion": {
"const": "v1.0.0",
"type": "string"
},
"@idsNamespace": {
"const": "my_namespace",
"type": "string"
},
"log": {
"additionalProperties": false,
"properties": {
"name": {
"type": "string"
},
"year": {
"type": "string"
},
"month": {
"type": "string"
}
},
"type": "object"
}
},
"required": ["@idsType", "@idsVersion", "@idsNamespace"],
"type": "object"
}
Then one possible partition could be log.year and would look like the following in `athena.json
{
"root": "root",
"partitions": [
{
"path": "log.year",
"name": "year"
}
]
}
Ordering¶
Multiple partitions can be created.
For example, you could partition by year and month using the example schema above.
When defining more than one partition, ordering within the partitions array is important.
Place the main partition first, then any sub-partitions later.
For example, partitioning by year and month:
{
"root": "root",
"partitions": [
{
"path": "log.year",
"name": "year"
},
{
"path": "log.month",
"name": "month"
}
]
}
Partition path¶
"path" defines the location of the field to partition by within an IDS file.
Each path should be a path to the IDS field separated by ., which delimits the chain of fields pointing to the partition field.
Partitions can be created only from paths which are available in the root table.
That means any property whose path is not nested inside an array.
Valid path:
log.namewherelogis an object that is not contained within an array definition of any parent field.Invalid path:
logs[*].nameaslogsis an array of objects
Partition name¶
The name defines what the partition column will be called.
Warning
Do not name the partition the same as the normalized value of the IDS path, or any other path in the IDS. You can find more details in Athena table/column/partition name normalization
For example, consider the partition path log.name.
Valid case:
{
"path": "log.name",
"name": "audit_log_name"
}
Invalid case which conflicts with normalized name of the IDS path:
{
"path": "log.name",
"name": "log_name"
}
Choosing of partition¶
There are several factors to consider when choosing a property to be a partition:
How will data be balanced across partitions?
How many values will partition generate (i.e. will this column have high cardinality)?
A good choice for a partition is a property that will split approximately equally between partitions, that will not generate too many partitions and that will be a useful column to filter by in an SQL query.
Good choices: year, project, location
Bad choices: sample_value, created_at
Partition examples¶
For the following examples, consider having a schema.json definition of the following:
{
"$id": "https://ids.tetrascience.com/my_namespace/demo_ids/v1.0.0/schema.json",
"$schema": "http://json-schema.org/draft-07/schema#",
"additionalProperties": false,
"properties": {
"@idsType": {
"const": "demo_ids",
"type": "string"
},
"@idsVersion": {
"const": "v1.0.0",
"type": "string"
},
"@idsNamespace": {
"const": "my_namespace",
"type": "string"
},
"name": {
"type": "string"
},
"class": {
"type": "string"
},
"results": {
"items": {
"additionalProperties": false,
"properties": {
"value": {
"type": "number"
},
"unit": {
"type": "string"
}
},
"type": "object"
},
"type": "array"
}
},
"required": ["@idsType", "@idsVersion", "@idsNamespace"],
"type": "object"
}
Example 1: no partitioning¶
Click to expand
athena.json
{
"root": "root",
"partitions": []
}
A combination of schema.json and athena.json are used to generate a file containing the definitions of the data transformation which occurs when any IDS is published to the TetraScience platform.
This file is called normalized.json and it contains all the information necessary to create Athena tables as well as transform incoming IDS instances which are flattened to CSVs which are the data sources of Athena.
Normalized file location: s3://<athena-bucket>/<orgSlug>/example/v1/normalized.json
Resulting Tables¶
After publishing the IDS to the TetraScience platform the following Athena tables will be created.
uuid |
name |
class |
|---|
uuid |
parent_uuid |
value |
unit |
|---|
Populating the tables¶
When the following IDS instance is produced by a protocol and written to the platform, the instance will be flattened and written to a CSV.
{
"name": "Experiment #1",
"class": "distillation",
"results": [
{
"value": 1829,
"unit": "s"
},
{
"value": 1.23,
"unit": "l"
}
]
}
The IDS instance will be stored at a file location, for example /folder/test/0.json.
All files will be found under s3://<athena-bucket>/<orgSlug>/ and the CSVs will be generated as
/example/v1/root/foldertest0.json.csv.gz/example/v1/results/foldertest0.json.csv.gz/metadata/v1/foldertest0.json.csv.gz
Querying the tables¶
Query:
SELECT * FROM demo_ids_v1_root;
Returns:
uuid |
name |
class |
|---|---|---|
4be5d31a-587c-4974-8653-728f167df201 |
Experiment #1 |
distillation |
Query:
SELECT * FROM demo_ids_v1_results;
returns:
uuid |
parent_uuid |
value |
unit |
|---|---|---|---|
254b74ab-9269-4a9b-8049-9c2a9fa9722f |
4be5d31a-587c-4974-8653-728f167df201 |
1829 |
s |
c3098033-fdac-4125-a7f1-8f362e67e9ff |
4be5d31a-587c-4974-8653-728f167df201 |
1.23 |
l |
Example 2: With Partitioning¶
Click to expand
athena.json
{
"root": "root",
"partitions": [
{
"name": "experiment_class",
"path": "class"
}
]
}
Resulting tables¶
After publishing the IDS to the TetraScience platform the following Athena tables will be created.
uuid |
name |
class |
experiment_class |
|---|
uuid |
parent_uuid |
value |
unit |
experiment_class |
|---|
Note
experiment_class is a virtual column with type String
Populating the tables¶
When the following IDS instance is produced by a protocol and written to the platform, the instance will be flattened and written to a CSV.
{
"name": "Experiment #1",
"class": "distillation",
"results": [
{
"value": 1829,
"unit": "s"
},
{
"value": 1.23,
"unit": "l"
}
]
}
The IDS instance will be stored at a file location, for example /folder/test/0.json.
All files will be found under s3://<athena-bucket>/<orgSlug>/ and the CSVs will be generated as
/example/v2/experiments/experiment_class=distillation/foldertest0.json.csv.gz/example/v2/results/experiment_class=distillation/foldertest0.json.csv.gz/metadata/v1/foldertest0.json.csv.gz
Querying the tables¶
Query:
SELECT * FROM demo_ids_v1_root;
returns:
uuid |
name |
class |
experiment_class |
|---|---|---|---|
2be5d31a-587c-4974-8653-728f167df201 |
Experiment #1 |
distillation |
distillation |
Query:
SELECT * FROM demo_ids_v1_results;
returns:
uuid |
parent_uuid |
value |
unit |
experiment_class |
|---|---|---|---|---|
a54b74ab-9269-4a9b-8049-9c2a9fa9722f |
2be5d31a-587c-4974-8653-728f167df201 |
1829 |
s |
distillation |
a3098033-fdac-4125-a7f1-8f362e67e9ff |
2be5d31a-587c-4974-8653-728f167df201 |
1.23 |
l |
distillation |
Athena table/column/partition name normalization¶
Athena has restrictions on what tables/columns/partitions can be named:
lowercase letter and number
no special characters, except underscores
_
Because IDS fields can potentially include uppercase and special chars we need to sanitize the name to conform to Athena specifications. The following transformation rules are applied:
convert to lowercase
replace all special characters with an underscore
remove repeated underscores
remove leading underscores
Example 1:
IDS field:
_Weird-partition!@nameAthena column name:
weird_partition_name
Example 2:
IDS field:
@fileIdAthena column name:
fileid
Example 3:
IDS field:
project.nameAthena column partition:
project_name
Athena value sanitization¶
Make sure you escape: line breaks, return characters, escape characters, and commas in the values.
Avoid using array of string literal that contains “,”
Example
["A,B", "C"] will be serialized to A,B,C