Skip to main contentIBM ST4SD

DSL 2.0 Specification

Use this page to learn about the new Domain Specific Language (DSL 2.0) of ST4SD and how it works.

Here, we are using DSL 2.0, if you need to understand the previous syntax check out the FlowIR docs.

DSL 2.0 is the new (and beta) way to define the computational graphs of ST4SD workflows.

Namespace

In DSL 2.0, a Computational Graph consists of Components which can be grouped under Workflow containers. It also has an Entrypoint which points to the root node of the graph, which is an instance of a Component or Workflow template.

A Namespace is simply a container for the Component, Workflow, and Entrypoint definitions which represent the Computational Graph of one ST4SD workflow.

Below is an example of a Namespace containing a single component that prints the message Hello world to the terminal.

entrypoint:
entry-instance: print
execute:
- target: "<entry-instance>"
args:
message: Hello world
components:
- signature:
name: print

Entrypoint

The Optional Entrypoint serves a single purpose. Describe how to execute root Template instance of the Computational Graph.

Its schema is:

# This executes an instance of $template which is called "<entry-instance>"
entry-instance: $template # name of a Component or Workflow template
execute: # an array with exactly 1 entry
- target: <entry-instance> # which instance of a Template to execute.
# In this scope there is only <entry-instance>
args:
$paramName: $value # one for each parameter of the template that
# the "target" points to

The entry-instance field receives the name of a Template and creates an instance of it called <entry-instance>. The execute field then describes how to “execute” the <entry-instance> i.e. how to populate the arguments of the associated Template.

In execute[].args you:

  • must provide values for any parameters in the child $template which do not have default values
  • may override the value of the parameters in $template which have default values

The Template instance that the entrypoint points to can have special parameters which are data references to paths that are external to the workflow. These parameters must be called input.$filename and they must not have default values in the signature of the Template definition. The entrypoint may not explicitly override the values of said parameters, the runtime system will auto-generate them.

Consider a scenario where the Template that the <entry-instance> step points to has a parameter called input.my-input.db. The runtime will post-process the entrypoint.execute[0].args dictionary to include the following key-value pair:

input.my-input.db: "input/my-input.db"

In Assigning values to parameters we describe in more detail how to assign values to parameters of Template instances in general.

Workflow

A Workflow is a Template that describes how to execute a number of Template instances called steps. It has a signature that consists of a unique name and a parameter list. Each such step can consume the outputs of a sibling step, or the parameters of the parent Workflow.

The outputs of a workflow are its steps. The schema of Workflow is:

signature:
name: $Template # the name of this Workflow Template - must be unique
parameters:
- name: $paramName
# optional default value
default: $value # str, number, or dictionary of {str: str/number}
steps: # which steps to instantiate
$stepName: $Template # for example child: simulation-code
execute: # how to execute the steps - one for each entry of steps

In Assigning values to parameters we describe how to assign values to parameters of Template instances.

Component

A Component describes how to execute a task. Just like a Workflow Template, it has a signature that consists of a name and a parameter list.

The outputs of a Component are the paths under its working directory.

The schema of a Component is:

signature:
name: $Template # the name of this Component Template - must be unique
parameters:
- name: $paramName
# optional default value
default: $value # str, number, or dictionary of {str: str/number}
# All the FlowIR fields, except for stage, name, references, and override
command:
executable: str

The above fields are the same as those in the Component section of the Workflow Specification in FlowIR.

For more information, read our documentation on the basic FlowIR component fields.

Assigning values to parameters

Both Component and Workflow templates are instantiated in the same way: by declaring them as a step and adding an entry to an execute block which assigns values to the Template’s parameters.

The value of a parameter can be a number, string, or a key: value dictionary. The body of a Template can reference its parameters like so %(parameterName)s.

When assigning a value to the parameters of a template via the execute[].args dictionary

In execute[].args you:

  • must provide values for any parameters in the child $template which do not have default values
  • may override the value of the parameters in $template which have default values
  • may use OutputReferences to indicate dependencies to steps (definition follows this bullet list)
  • may use %(parentParameter)s to indicate a dependency to the value that the parent parameter has. In turn that can be a dependency to the output of a Template instance or an input file or it might just be a literal constant
  • may use a $key: $value dictionary to propagate a dictionary-type value. At the moment Template can only reference this kind of parameters to set the value of the command.environment field of Components
  • may use %(input.$filename)sto propagate an input file reference from a parent to a step.
    • Eventually a step must apply a DataReferences :$method to the parameter to indicates it wishes to consume the input file

Environments

The environment that components run in is defined in the command.environment field. If you don’t define anything in this section ST4SD will create a default environment containing all the environment variables of the runtime system process.

Example:

command:
environment:
ENV-VAR1: value/for/env-var1
ENV-VAR2: value/for/env-var2
DEFAULTS: ENV-VAR3:ENV-VAR4

The above defines an environment with 4 environment variables:

  • ENV-VAR1 whose value is value/for/env-var1
  • ENV-VAR2 whose value is value/for/env-var2
  • ENV-VAR3 whose value is inherited from the environment variable ENV-VAR3 of the process running the runtime system
  • ENV-VAR4 whose value is inherited from the environment variable ENV-VAR4 of the process running the runtime system

In the above example, we use the DEFAULTS directive to inherit the values for a list of environment variables from the environment variables of the runtime system process. The value of the special “DEFAULTS” key is a list of environment variable name separated with ”:“.

Want to find out more? Check out our example.

OutputReference

The format of an OutputReference is:

<$stepId>/$optionalPath:$optionalMethod

$stepId is a / separated array of stepNames starting from the scope of the current workflow. For example, the OutputReference <one/child>/file.txt:ref resolves to the absolute path of the file file.txt that the component child produces under the sibling step one which is an instance of a Workflow template. You can find more reference methods in our DataReferences docs.

Example

Here is a simple example which uses one Workflow and one Component template two run 2 tasks.

  • consume-input: prints the contents of an input file called my-input.db
  • consume-sibling: prints the text “my sibling said” followed by stdout of the sibling step <consume-input>
entrypoint:
entry-instance: main
execute:
- target: <entry-instance>
workflows:
- signature:
name: main
parameters:
# special variable with auto-populated value

To try it out, store the above DSL in a file called dsl-params.yaml and run

pip install "st4sd-runtime-core[develop]>=2.5.0"

which installs the command-line-tool elaunch.py, followed by:

echo "hello world" >my-input.db
elaunch.py -i my-input.db --failSafeDelays=no -l40 dsl-params.yaml

Key outputs

All experiments produce files, but not all generated files are equally important. To this end ST4SD has the concept of key-outputs. These are files, and directories, that an experiment produces which the developers of the experiment consider important.

Here is a an example of an experiment with a key-output:

entrypoint:
entry-instance: hello
execute:
- target: <entry-instance>
args:
message: Hello world
output:
- name: greeting
data-in: <entry-instance>:output

The output field in the entrypoint dictionary defines the key-outputs of this experiment:

entrypoint:
# ... other fields ...
output:
- name: greeting
data-in: <entry-instance>:output

This experiment has a single key-output called greeting. The data associated with this key-output is actually the stdout of the <entry-instance> step which is an instance of the hello component. As the experiment finishes producing this key-output the $INSTANCE_DIR/output/output.json file is updated to reflect the state of this experiment.

Here’s an how the output.json file will look like for the above key outputs:

{
"greeting": {
"creationtime": "1725374555.6836693",
"description": "just a friendly greeting",
"filename": "out.stdout",
"filepath": "stages/stage0/entry-instance/out.stdout",
"final": "yes",
"production": "yes",
"type": "",

While the experiment is running, the runtime system asynchronously updates this file with metadata about the generated key-outputs of the experiment. In this example, there is just one key-output called greeting. For more information on key-outputs check out our documentation.

If you are running experiments on the cloud and are instructing the runtime system to register them into the ST4SD datastore you may also use the ST4SD python API to download the key-outputs of your experiment instances.

Interface

Key outputs are not always immediately parseable without deep understanding of their format. To address this, ST4SD supports the interface feature. This feature allows workflow developers to extract measured properties and store them in a CSV file, making the data easier to consume.

Some virtual experiments define interfaces which make it simpler for users to retrieve the input systems and measured properties from executions of that virtual experiment.

The interface of a virtual experiment defines:

  • The specification used to describe input systems it processes e.g. SMILEs for small molecules
  • Instructions to extract the input systems from input data
  • Instructions to extract the values of properties that the virtual experiment computes

Once a virtual experiment has an interface ST4SD can return a pandas.DataFrame containing the properties calculated by instances of the virtual experiment, as well as the ids of the input systems that an instance processed. This functionality is provided via the st4sd-datastore API and the st4sd-runtime-service API. See using a virtual experiment interface for further information.

In this example we will work with a virtual experiment which:

  1. extracts the IDs of its input systems
  2. has 2 key-outputs that correspond to 2 measured properties of the interface
  3. uses builtin hooks to extract the measured properties from the key-outputs

The DSL of the experiment is :

entrypoint:
interface:
description: Counts vowels in words
inputSpec:
namingScheme: words
inputExtractionMethod:
csvColumn:
source:
path: input/words.csv

The interface contains a human readable description of the experiment under entrypoint.interface.description.

entrypoint:
interface:
description: Counts vowels in words

Then, in entrypoint.interface.inputspec it uses the builtin input extraction method csvColumn to extract the ids of the systems it processes:

entrypoint:
interface:
inputSpec:
namingScheme: words
inputExtractionMethod:
csvColumn:
source:
path: input/words.csv
args:

It instructs the method to read the CSV file input/words.csv (i.e. the input file) and treat every row of the CSV as one input system whose identifier lies in the column word.

Following that, it uses the builtin property extraction method csvDataFrame twice to measure its 2 properties Vowels and Letters from the key-outputs vowels and letters respectively.

entrypoint:
interface:
propertiesSpec:
- name: Vowels
propertyExtractionMethod:
csvDataFrame:
source:
keyOutput: vowels
args:

The csvDataFrame property extraction method expects a CSV file which has the columns input-id and ${the property name}. One of the requirements for using a ST4SD interface is that the property names start with a capital letter. One of the requirements of the csvDataFrame is that there should be a column with the same name as the property name that is being extracted. Another is that there should be a column called input-id.

In this example the components happen to produce key-output CSV files which contain a properly named column for the values of properties but instead of using the input-id column they use the column word. To account for this inconsistency, the developers of the workflow use the renameColumns argument of the csvDataFrame property extraction method. Via renameColumns they instruct csvDataFrame to treat the column word as if it were called input-id.

This means that you have to create a CSV file called words.csv and use it as an input for (via the -i arg) to the workflow.

You can find more information on this in the creating an interface documentation. Just keep in mind that this documentation was originally written with the FlowIR syntax in mind.

Differences between DSL 2.0 and FlowIR

There are some differences between DSL 2.0 and FlowIR.

In the current version (0.3.x) of DSL 2.0:

  • we offer support for natural composition of Computational Graphs using Workflow and Component templates
  • the signature field replaces the stage, name, references, and override fields of the component specification in FlowIR
  • settings and inputs flow through parameters, we do not support global/stage environments or variables
  • the fields of components can contain %(parameter)s references as well as component %(variable)s
  • dependencies between components are defined by referencing the output of a producer component in one parameter of the consumer component - DataReferences are reserved for referencing input files only
    • the equivalent of a DataReference for Template instances is an OutputReference
  • data files and manifests
  • key outputs and interface

DSL 2.0 will eventually contain a superset of the FlowIR features. However, the current beta version of DSL 2.0 does not support:

  • FlowIR platforms
  • application-dependencies
    • however, you can use a manifest to implicitly define your application-dependencies