Generate programmatic IDS from JSON schema

Summary

This page describes how to migrate from a JSON Schema definition of an IDS to a programmatic IDS definition using ts-ids-core, via a code generation tool.

This tool is exposed as the CLI command ts-ids import-schema. For further information on parameters, run ts-ids import-schema --help.

The generated programmatic IDS code is not perfect and needs to be corrected by hand after generation. This is described in the section Manually Correct the Auto-generated Code.

Installation

Using the ts-ids import-schema command requires an extra dependency to be installed, which can be done like this:

pip install "ts-ids-core[import-schema]"

Or an equivalent using a package manager, such as poetry add "ts-ids-core[import-schema]" or pipx install "ts-ids-core[import-schema]".

Trying to run ts-ids import-schema without first installing the extra dependency will result in an error message explaining how to install the extra dependency.

Usage

Convert the JSON Schema IDS to programmatic IDS using the import-schema script. In the environment where ts-ids-core is installed, run:

ts-ids import-schema -i /path/to/schema.json -o ts_ids_foo/schema.py

Now ts_ids_foo/schema.py contains a model which represents the input schema.

ts-ids import-schema is an imperfect tool. Users should manually update the output Python file to reflect the input IDS as explained in the following sections.

Manually Correct the Auto-generated Code

Comparing and aligning the generated programmatic IDS with the IDS JSON Schema field by field can be laborious, especially for a large schema. Instead, take advantage of the fact that ts-ids import-schema, which uses the datamodel-codegen tool under the hood, makes mistakes in predictable ways. Use the following guide to find and correct them in the auto-generated code.

Reuse objects with `$ref` to simplify generated code

After auto-generating the code, it is possible that many of the generated objects will have exactly the same fields but different docstrings, such as this:

class MeasurementTime(IdsElement):
    """
    How long each well was measured for
    """

    value: Required[Optional[float]]
    unit: Required[Optional[str]]


class DelayTime(IdsElement):
    """
    The delay time of the TRF measurement
    """

    value: Required[Optional[float]]
    unit: Required[Optional[str]]


class Action(IdsElement):
    """
    An action that was performed by a step in methods[*].protocol[*]
    """
    measurement_time: MeasurementTime = Field(
        default=None, description="How long each well was measured for"
    )
    delay_time: DelayTime = Field(
        default=None, description="The delay time of the TRF measurement"
    )

The above could be simplified: MeasurementTime and DelayTime have identical fields, and their docstrings only repeat the field descriptions already given in Action. They can be combined into one class, which saves a lot of duplication if this class is used in multiple places:

class ValueUnit(IdsElement):
    """
    A scientific quantity which has a value and a unit
    """

    value: Required[Optional[float]]
    unit: Required[Optional[str]]


class Action(IdsElement):
    """
    An action that was performed by a step in methods[*].protocol[*]
    """
    measurement_time: ValueUnit = Field(
        default=None, description="How long each well was measured for"
    )
    delay_time: ValueUnit = Field(
        default=None, description="The delay time of the TRF measurement"
    )

This could be fixed in the generated code by finding these duplicates and simplifying them, or it could be fixed in the JSON Schema before code generation by moving these reusable objects to definitions and then including them as a $ref in the places they are used. The example below shows this JSON Schema change for an object containing value and unit.

Example: updating JSON Schema to use a shared definition

Before:

{
  "type": "object",
  "properties": {
    "delay_time": {
      "description": "The delay time of the TRF measurement",
      "additionalProperties": false,
      "properties": {
        "value": {
          "type": ["number", "null"]
        },
        "unit": {
          "type": ["string", "null"]
        }
      },
      "required": ["value", "unit"],
      "type": "object"
    },
    "measurement_time": {
      "description": "How long each well was measured for",
      "additionalProperties": false,
      "properties": {
        "value": {
          "type": ["number", "null"]
        },
        "unit": {
          "type": ["string", "null"]
        }
      },
      "required": ["value", "unit"],
      "type": "object"
    }
  }
}

After moving the object containing value and unit to be its own definition:

{
  "type": "object",
  "properties": {
    "delay_time": {
      "description": "The delay time of the TRF measurement",
      "$ref": "#/definitions/ValueUnit"
    },
    "measurement_time": {
      "description": "How long each well was measured for",
      "$ref": "#/definitions/ValueUnit"
    }
  },
  "definitions": {
    "ValueUnit": {
      "additionalProperties": false,
      "properties": {
        "value": {
          "type": ["number", "null"]
        },
        "unit": {
          "type": ["string", "null"]
        }
      },
      "required": ["value", "unit"],
      "type": "object"
    }
  }
}
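In a large schema, candidates for a shared definition can be hard to spot by eye. The following is a minimal sketch, using only the standard library, that groups inline object subschemas which are identical once descriptive keywords are ignored. The names find_duplicate_objects, strip_annotations and ANNOTATION_KEYS are hypothetical helpers written for this illustration; they are not part of ts-ids-core. The sketch only follows properties and items, so it is a starting point rather than a complete schema walker.

```python
import json
from collections import defaultdict

# Hypothetical helper, not part of ts-ids-core.
# Keywords that describe a field rather than define its shape.
ANNOTATION_KEYS = {"description", "title", "$comment", "examples"}


def strip_annotations(node, in_properties=False):
    """Copy a subschema without annotation keywords.

    Keys directly inside a "properties" map are property names, not
    keywords, so they are always kept.
    """
    if isinstance(node, dict):
        return {
            key: strip_annotations(
                value, in_properties=(key == "properties" and not in_properties)
            )
            for key, value in node.items()
            if in_properties or key not in ANNOTATION_KEYS
        }
    if isinstance(node, list):
        return [strip_annotations(item) for item in node]
    return node


def find_duplicate_objects(schema):
    """Return lists of paths whose inline object subschemas have the same shape."""
    groups = defaultdict(list)

    def walk(node, path):
        if not isinstance(node, dict):
            return
        # Skip the root (empty path); only nested objects are candidates.
        if path and node.get("type") == "object" and "properties" in node:
            shape = json.dumps(strip_annotations(node), sort_keys=True)
            groups[shape].append(path)
        properties = node.get("properties")
        if isinstance(properties, dict):
            for name, subschema in properties.items():
                walk(subschema, f"{path}/properties/{name}")
        if isinstance(node.get("items"), dict):
            walk(node["items"], f"{path}/items")

    walk(schema, "")
    return [paths for paths in groups.values() if len(paths) > 1]


def _value_unit(description):
    # Builds the inline value/unit object used in the "Before" example.
    return {
        "description": description,
        "additionalProperties": False,
        "properties": {
            "value": {"type": ["number", "null"]},
            "unit": {"type": ["string", "null"]},
        },
        "required": ["value", "unit"],
        "type": "object",
    }


before = {
    "type": "object",
    "properties": {
        "delay_time": _value_unit("The delay time of the TRF measurement"),
        "measurement_time": _value_unit("How long each well was measured for"),
    },
}

duplicates = find_duplicate_objects(before)
print(duplicates)
```

Each reported group is a set of locations that could be replaced by a single $ref to one entry under definitions, as in the "After" schema above.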

Correct nullable field types

In ts_ids_core, whether a property is required and whether it is nullable are separate concepts, as in JSON Schema. To make this distinction explicit, we use the type annotation ts_ids_core.annotations.Nullable, which is an alias of the standard Python type Optional. Nullable means only that the field value may be null; it doesn’t affect whether the property is required, which is determined exclusively by the ts_ids_core.annotations.Required annotation.

Find and replace all Optional type annotations with Nullable.
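An editor's find-and-replace is usually enough for this step. For scripted use, the following is a minimal sketch of a whole-word rename using only the standard library. The function rename_annotation is a hypothetical helper for this illustration, not part of ts-ids-core.

```python
import re


def rename_annotation(source: str, old: str, new: str) -> str:
    """Replace whole-word occurrences of `old` with `new` in Python source text."""
    # \b keeps longer names such as `OptionalThing` untouched.
    return re.sub(rf"\b{re.escape(old)}\b", new, source)


generated = "value: Required[Optional[float]]\nunit: Required[Optional[str]]\n"
updated = rename_annotation(generated, "Optional", "Nullable")
print(updated)
```

Because this is a plain text substitution, it also rewrites import lines, comments and strings that mention the old name. After running it, check that the imports are correct, for example that Nullable is imported from ts_ids_core.annotations rather than from typing. The same approach could be applied to the other whole-name renames described on this page.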

It’s common to create a reusable JSON Schema definition for a nullable type, such as a nullable string:

{
  "nullable_string": {
    "type": ["string", "null"]
  }
}

In the generated code, this is converted to a root model like this:

class NullableString(RootModel[Optional[str]]):
    root: Optional[str]

And then fields which use this root model may erroneously include an extra Optional, like Optional[NullableString]. To correct this, remove the extra Optional (now replaced by Nullable), leaving just NullableString as the type.

Finally, delete the RootModel class, and use the ts-ids-core annotation instead (the same applies to NullableNumber and NullableBoolean):

from ts_ids_core.annotations import NullableString, NullableBoolean, NullableNumber

Replace `typing.Sequence` with `typing.List`

Convert annotations using typing.Sequence to typing.List.

Replace `Field` with `IdsField` and remove unused defaults

Autogenerated code will include Field for any properties which contain metadata like description, for example: prop: str = Field(..., description="abc"). Find and replace all Fields with the ts-ids-core equivalent IdsField from ts_ids_core.base.ids_field, which is already imported as part of code generation.

Each Field definition has a default value which is usually not needed: code generation adds defaults which weren’t present in the JSON Schema, and these can be removed. For example, replace prop: str = Field(..., description="abc") with prop: str = IdsField(description="abc"), and replace prop: Nullable[str] = Field(default=None, description="abc") with prop: Nullable[str] = IdsField(description="abc").
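To review the generated defaults one by one, it can help to list every Field call first. The following is a minimal sketch using the standard library ast module; find_field_calls is a hypothetical helper for this illustration, not part of ts-ids-core. It only reports locations and does not rewrite anything, because whether a default should be kept depends on the original JSON Schema.

```python
import ast


def find_field_calls(source: str):
    """Return sorted (line number, note) pairs for each `Field(...)` call."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        is_field_call = (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "Field"
        )
        if not is_field_call:
            continue
        notes = []
        # `Field(..., description=...)`: a literal Ellipsis as the first argument.
        first = node.args[0] if node.args else None
        if isinstance(first, ast.Constant) and first.value is Ellipsis:
            notes.append("positional `...`")
        # `Field(default=None, ...)`: a default that may not be in the JSON Schema.
        for keyword in node.keywords:
            value = keyword.value
            if (
                keyword.arg == "default"
                and isinstance(value, ast.Constant)
                and value.value is None
            ):
                notes.append("default=None")
        findings.append((node.lineno, ", ".join(notes) or "no default"))
    return sorted(findings)


# The source is only parsed, never executed, so undefined names are fine here.
generated = '''\
class Example(IdsElement):
    prop: str = Field(..., description="abc")
    other: Optional[str] = Field(default=None, description="abc")
'''

for line_number, note in find_field_calls(generated):
    print(f"line {line_number}: {note}")
```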

Remove no-op classes

IDS JSON Schema often contains duplicate fields. For example, consider the fields “foos” and “bar”:

{
  "type": "object",
  "properties": {
    "foos": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "first_name": {
            "type": "string"
          },
          "last_name": {
            "type": "string"
          }
        },
        "additionalProperties": false
      }
    },
    "bar": {
      "type": "object",
      "properties": {
        "first_name": {
          "type": "string"
        },
        "last_name": {
          "type": "string"
        }
      },
      "additionalProperties": false
    }
  },
  "additionalProperties": false
}

Auto-conversion will result in the following classes:

class Foo(IdsElement):
    first_name: str
    last_name: str

class Bar(Foo):
    pass

class TopLevelClass(IdsElement):
    foos: List[Foo]
    bar: Bar

In this case, it’s recommended to de-duplicate the Foo and Bar classes:

class FullName(IdsElement):
    first_name: str
    last_name: str

class TopLevelClass(IdsElement):
    foos: List[FullName]
    bar: FullName
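In a long generated file, these no-op subclasses can be located automatically. The following is a minimal sketch using the standard library ast module (ast.unparse requires Python 3.9 or later); find_noop_classes is a hypothetical helper for this illustration, not part of ts-ids-core. It only inspects top-level classes.

```python
import ast


def find_noop_classes(source: str):
    """Return (class name, base names) for top-level classes whose body is only `pass`."""
    noops = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.ClassDef) and all(
            isinstance(statement, ast.Pass) for statement in node.body
        ):
            noops.append((node.name, [ast.unparse(base) for base in node.bases]))
    return noops


# The source is only parsed, never executed, so undefined names are fine here.
generated = '''\
class Foo(IdsElement):
    first_name: str
    last_name: str

class Bar(Foo):
    pass

class TopLevelClass(IdsElement):
    foos: List[Foo]
    bar: Bar
'''

noop_classes = find_noop_classes(generated)
print(noop_classes)
```

Each reported class is a candidate for merging with its base class under a single, more descriptive name, as with FullName above.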

Simplify `RootModel` classes

The auto-generated code will often contain RootModel classes (see Pydantic docs). For example, the above JSON Schema example might be converted to:

class Foo(IdsElement):
    first_name: str
    last_name: str

class Foos(RootModel[List[Foo]]):
    root: List[Foo]

class TopLevelClass(IdsElement):
    foos: Foos

In this case, consider simplifying the code by removing the RootModel class (Foos) and using the list type directly.

class Foo(IdsElement):
    first_name: str
    last_name: str

class TopLevelClass(IdsElement):
    foos: List[Foo]

Use common components

To reduce duplicate code and thus improve maintainability, inherit from existing programmatic IDS components where possible.

For example, ValueUnit objects may have been included in the generated code like this:

class ValueUnit(IdsElement):
    value: Required[Optional[float]]
    unit: Required[Optional[str]]

Delete this class and instead import the equivalent class from ts_ids_core, to make use of standard IDS components:

from ts_ids_core.schema.value_unit import ValueUnit

See Components for more about components.

Complete the top-level schema class

The top-level schema class should be updated to inherit from IdsSchema, which includes validation and logic that apply to the whole schema. Users should also add the following metadata fields to the top-level programmatic IDS class using the schema_extra_metadata class variable. Note that schema_extra_metadata is annotated as ClassVar[SchemaExtraMetadataType], so that ts-ids-core treats it as a class attribute rather than an instance attribute.

from typing import ClassVar

from ts_ids_core.base.ids_element import SchemaExtraMetadataType
from ts_ids_core.schema import IdsSchema

class TopLevelSchema(IdsSchema):
    schema_extra_metadata: ClassVar[SchemaExtraMetadataType] = {
        "$id": ...,
        "$schema": "http://json-schema.org/draft-07/schema#"
    }

Use primary and foreign key annotations

If primary or foreign keys were used as schema metadata, they will have been captured as json_schema_extra metadata in the generated code, for example:

pk: str = Field(
    ...,
    json_schema_extra={"@primary_key": True},
)
fk_step: str = Field(
    ...,
    json_schema_extra={
        "@foreign_key": "/properties/methods/items/properties/protocol/items/properties/pk"
    },
)

The above can be converted to use ts-ids-core’s primary and foreign key field annotations; see the documentation of UUIDForeignKey for an example.

Common problems

import-schema may occasionally fail with an error such as:

pydantic.error_wrappers.ValidationError: 2 validation errors for JsonSchemaObject
properties -> datacubes -> items -> properties -> additionalProperties
  value is not a valid dict (type=type_error.dict)
properties -> additionalProperties
  value is not a valid dict (type=type_error.dict)

This can be caused by mistakes in the JSON Schema, or by a schema using features which are not supported by the code generator.

The error message includes the path to each property which caused code generation to fail. Check the schema at each of these paths for any non-compliant JSON Schema or unsupported features, correct it or replace the unsupported feature, and then try to run import-schema again.

A common mistake is to use JSON Schema keywords inside an object’s properties. For example, one of the code generation errors above is caused by setting $.properties.additionalProperties = false. This is most likely not what the developer intended because it defines a property called additionalProperties whose schema is false. The intention was most likely to set it one level above, $.additionalProperties = false, which has the usual meaning of the additionalProperties schema keyword.

Note that the additionalProperties mistake above is actually still valid JSON Schema, which makes it difficult to notice when dealing directly with JSON Schema. The more limited set of features supported by the code generator and programmatic IDS can be useful for identifying and avoiding this kind of mistake.
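This kind of mistake can also be checked for before running import-schema. The following is a minimal sketch using only plain Python; find_suspicious_properties is a hypothetical helper for this illustration, not part of ts-ids-core or the code generator. It reports every entry under properties whose schema is not a JSON object, which is the pattern behind the "value is not a valid dict" errors above. It only follows properties and items.

```python
def find_suspicious_properties(schema, path="$"):
    """Yield paths under `properties` whose value is not a JSON object."""
    if not isinstance(schema, dict):
        return
    properties = schema.get("properties")
    if isinstance(properties, dict):
        for name, subschema in properties.items():
            child_path = f"{path}.properties.{name}"
            if isinstance(subschema, dict):
                yield from find_suspicious_properties(subschema, child_path)
            else:
                # For example `"additionalProperties": false` placed one level too deep.
                yield child_path
    items = schema.get("items")
    if isinstance(items, dict):
        yield from find_suspicious_properties(items, f"{path}.items")


# A schema with the same two mistakes as in the error message above.
schema = {
    "type": "object",
    "properties": {
        "datacubes": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "additionalProperties": False,
                },
            },
        },
        "additionalProperties": False,
    },
}

suspicious = list(find_suspicious_properties(schema))
for found in suspicious:
    print(found)
```

A reported path is not necessarily an error, since a boolean is a valid schema in JSON Schema, but each one is worth checking against what the schema author intended.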