athena.json¶

To understand the following sections, it’s recommended to familiarize yourself with how an IDS is represented in Athena. See, the product documentation for more details.

Minimum requirements¶

The following definition is the minimum required definition for athena.json in an IDS artifact. Even if you do not want to customize table creation and create partitions, you must at least copy the contents of the following JSON block into your athena.json file.

{
  "root": "root",
  "partitions": []
}

Design guidelines¶

Root table¶

As mentioned in the product documentation, each top-level field in an IDS whose definition is not an array of objects will be added as a column in the root table. Updating the "root" value in athena.json will effectively rename the root table from <idsType>_<idsMajorVersionNumber>_root, but will keep the contents the same.

For example:

{
  "root": "injections",
  "partitions": []
}

will change the root table to <idsType>_<idsMajorVersionNumber>_injection.

Partition¶

Partitioning can be used to improve Athena query performance. Data is partitioned based on the value of the chosen field(s) defined in the "partitions" field. A partition is defined by an object which has a path and a name. The objects are placed into the "partitions" array in athena.json. See Partition path and Partition name for specific on these fields.

For example:

If you have a schema.json file defined as

{
  "$id": "https://ids.tetrascience.com/my_namespace/demo_ids/v1.0.0/schema.json",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "additionalProperties": false,
  "properties": {
    "@idsType": {
      "const": "my_unique_ids_name",
      "type": "string"
    },
    "@idsVersion": {
      "const": "v1.0.0",
      "type": "string"
    },
    "@idsNamespace": {
      "const": "my_namespace",
      "type": "string"
    },
    "log": {
      "additionalProperties": false,
      "properties": {
        "name": {
          "type": "string"
        },
        "year": {
          "type": "string"
        },
        "month": {
          "type": "string"
        }
      },
      "type": "object"
    }
  },
  "required": ["@idsType", "@idsVersion", "@idsNamespace"],
  "type": "object"
}

Then one possible partition could be log.year and would look like the following in `athena.json

{
  "root": "root",
  "partitions": [
    {
      "path": "log.year",
      "name": "year"
    }
  ]
}

Ordering¶

Multiple partitions can be created. For example, you could partition by year and month using the example schema above. When defining more than one partition, ordering within the partitions array is important. Place the main partition first, then any sub-partitions later.

For example, partitioning by year and month:

{
  "root": "root",
  "partitions": [
    {
      "path": "log.year",
      "name": "year"
    },
    {
      "path": "log.month",
      "name": "month"
    }
  ]
}

Partition path¶

"path" defines the location of the field to partition by within an IDS file. Each path should be a path to the IDS field separated by ., which delimits the chain of fields pointing to the partition field. Partitions can be created only from paths which are available in the root table. That means any property whose path is not nested inside an array.

Valid path: log.name where log is an object that is not contained within an array definition of any parent field.
Invalid path: logs[*].name as logs is an array of objects

Partition name¶

The name defines what the partition column will be called.

Warning

Do not name the partition the same as the normalized value of the IDS path, or any other path in the IDS. You can find more details in Athena table/column/partition name normalization

For example, consider the partition path log.name.

Valid case:

{
  "path": "log.name",
  "name": "audit_log_name"
}

Invalid case which conflicts with normalized name of the IDS path:

{
  "path": "log.name",
  "name": "log_name"
}

Choosing of partition¶

There are several factors to consider when choosing a property to be a partition:

How will data be balanced across partitions?
How many values will partition generate (i.e. will this column have high cardinality)?

A good choice for a partition is a property that will split approximately equally between partitions, that will not generate too many partitions and that will be a useful column to filter by in an SQL query.

Good choices: year, project, location
Bad choices: sample_value, created_at

Partition examples¶

For the following examples, consider having a schema.json definition of the following:

{
  "$id": "https://ids.tetrascience.com/my_namespace/demo_ids/v1.0.0/schema.json",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "additionalProperties": false,
  "properties": {
    "@idsType": {
      "const": "demo_ids",
      "type": "string"
    },
    "@idsVersion": {
      "const": "v1.0.0",
      "type": "string"
    },
    "@idsNamespace": {
      "const": "my_namespace",
      "type": "string"
    },
    "name": {
      "type": "string"
    },
    "class": {
      "type": "string"
    },
    "results": {
      "items": {
        "additionalProperties": false,
        "properties": {
          "value": {
            "type": "number"
          },
          "unit": {
            "type": "string"
          }
        },
        "type": "object"
      },
      "type": "array"
    }
  },
  "required": ["@idsType", "@idsVersion", "@idsNamespace"],
  "type": "object"
}

Example 1: no partitioning¶

Click to expand

athena.json

{
  "root": "root",
  "partitions": []
}

A combination of schema.json and athena.json are used to generate a file containing the definitions of the data transformation which occurs when any IDS is published to the TetraScience platform. This file is called normalized.json and it contains all the information necessary to create Athena tables as well as transform incoming IDS instances which are flattened to CSVs which are the data sources of Athena.

Normalized file location: s3://<athena-bucket>/<orgSlug>/example/v1/normalized.json

Resulting Tables¶

After publishing the IDS to the TetraScience platform the following Athena tables will be created.

**demo_ids_v1_root**¶
uuid	name	class

**demo_ids_v1_results**¶
uuid	parent_uuid	value	unit

Populating the tables¶

When the following IDS instance is produced by a protocol and written to the platform, the instance will be flattened and written to a CSV.

{
  "name": "Experiment #1",
  "class": "distillation",
  "results": [
    {
      "value": 1829,
      "unit": "s"
    },
    {
      "value": 1.23,
      "unit": "l"
    }
  ]
}

The IDS instance will be stored at a file location, for example /folder/test/0.json.

All files will be found under s3://<athena-bucket>/<orgSlug>/ and the CSVs will be generated as

/example/v1/root/foldertest0.json.csv.gz
/example/v1/results/foldertest0.json.csv.gz
/metadata/v1/foldertest0.json.csv.gz

Querying the tables¶

Query:

SELECT * FROM demo_ids_v1_root;

Returns:

uuid	name	class
4be5d31a-587c-4974-8653-728f167df201	Experiment #1	distillation

Query:

SELECT * FROM demo_ids_v1_results;

returns:

uuid	parent_uuid	value	unit
254b74ab-9269-4a9b-8049-9c2a9fa9722f	4be5d31a-587c-4974-8653-728f167df201	1829	s
c3098033-fdac-4125-a7f1-8f362e67e9ff	4be5d31a-587c-4974-8653-728f167df201	1.23	l

Example 2: With Partitioning¶

Click to expand

athena.json

{
  "root": "root",
  "partitions": [
    {
      "name": "experiment_class",
      "path": "class"
    }
  ]
}

Resulting tables¶

After publishing the IDS to the TetraScience platform the following Athena tables will be created.

**demo_ids_v1_root**¶
uuid	name	class	experiment_class

**demo_ids_v1_results**¶
uuid	parent_uuid	value	unit	experiment_class

Note

experiment_class is a virtual column with type String

Populating the tables¶

When the following IDS instance is produced by a protocol and written to the platform, the instance will be flattened and written to a CSV.

{
  "name": "Experiment #1",
  "class": "distillation",
  "results": [
    {
      "value": 1829,
      "unit": "s"
    },
    {
      "value": 1.23,
      "unit": "l"
    }
  ]
}

The IDS instance will be stored at a file location, for example /folder/test/0.json.

All files will be found under s3://<athena-bucket>/<orgSlug>/ and the CSVs will be generated as

/example/v2/experiments/experiment_class=distillation/foldertest0.json.csv.gz
/example/v2/results/experiment_class=distillation/foldertest0.json.csv.gz
/metadata/v1/foldertest0.json.csv.gz

Querying the tables¶

Query:

SELECT * FROM demo_ids_v1_root;

returns:

uuid	name	class	experiment_class
2be5d31a-587c-4974-8653-728f167df201	Experiment #1	distillation	distillation

Query:

SELECT * FROM demo_ids_v1_results;

returns:

uuid	parent_uuid	value	unit	experiment_class
a54b74ab-9269-4a9b-8049-9c2a9fa9722f	2be5d31a-587c-4974-8653-728f167df201	1829	s	distillation
a3098033-fdac-4125-a7f1-8f362e67e9ff	2be5d31a-587c-4974-8653-728f167df201	1.23	l	distillation

Athena table/column/partition name normalization¶

Athena has restrictions on what tables/columns/partitions can be named:

lowercase letter and number
no special characters, except underscores _

Because IDS fields can potentially include uppercase and special chars we need to sanitize the name to conform to Athena specifications. The following transformation rules are applied:

convert to lowercase
replace all special characters with an underscore
remove repeated underscores
remove leading underscores

Example 1:

IDS field: _Weird-partition!@name
Athena column name: weird_partition_name

Example 2:

IDS field: @fileId
Athena column name: fileid

Example 3:

IDS field: project.name
Athena column partition: project_name

Athena value sanitization¶

Make sure you escape: line breaks, return characters, escape characters, and commas in the values.

Avoid using array of string literal that contains “,”

Example

["A,B", "C"] will be serialized to A,B,C