DSL 2.0 Specification
Use this page to learn about the new Domain Specific Language (DSL 2.0) of ST4SD and how it works.
- Namespace
- Entrypoint
- Workflow
- Component
- Assigning values to parameters
- OutputReference
- Example
- Key outputs
- Interface
- Differences between DSL 2.0 and FlowIR
DSL 2.0 is the new (and beta) way to define the computational graphs of ST4SD workflows.
Namespace
In DSL 2.0, a Computational Graph consists of Components which can be grouped under Workflow containers. It also has an Entrypoint which points to the root node of the graph, which is an instance of a Component or Workflow template.
A Namespace is simply a container for the Component, Workflow, and Entrypoint definitions which represent the Computational Graph of one ST4SD workflow.
Below is an example of a Namespace containing a single component that prints the message Hello world to the terminal.
entrypoint:
  entry-instance: print
  execute:
  - target: "<entry-instance>"
    args:
      message: Hello world

components:
- signature:
    name: print
    parameters:
    - name: message
  command:
    executable: echo # for example, echo prints the message to stdout
    arguments: "%(message)s"
Entrypoint
The optional Entrypoint serves a single purpose: it describes how to execute the root Template instance of the Computational Graph.
Its schema is:
# This executes an instance of $template which is called "<entry-instance>"
entry-instance: $template # name of a Component or Workflow template
execute: # an array with exactly 1 entry
- target: <entry-instance> # which instance of a Template to execute.
                           # In this scope there is only <entry-instance>
  args:
    $paramName: $value # one for each parameter of the template that
                       # the "target" points to
The entry-instance field receives the name of a Template and creates an instance of it called <entry-instance>. The execute field then describes how to “execute” the <entry-instance>, i.e. how to populate the arguments of the associated Template.

In execute[].args you:

- must provide values for any parameters in the child $template which do not have default values
- may override the value of the parameters in $template which have default values
The Template instance that the entrypoint points to can have special parameters which are data references to paths that are external to the workflow. These parameters must be called input.$filename and they must not have default values in the signature of the Template definition. The entrypoint may not explicitly override the values of these parameters; the runtime system auto-generates them.
Consider a scenario where the Template that the <entry-instance> step points to has a parameter called input.my-input.db. The runtime will post-process the entrypoint.execute[0].args dictionary to include the following key-value pair:
input.my-input.db: "input/my-input.db"
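For instance, here is a minimal sketch of such an entrypoint. The template name hello-db and its cat command are illustrative assumptions, not part of the specification; the point is that the execute entry does not set input.my-input.db, the runtime injects it:

entrypoint:
  entry-instance: hello-db
  execute:
  - target: <entry-instance> # no args entry for input.my-input.db; the runtime adds it

components:
- signature:
    name: hello-db # illustrative name
    parameters:
    - name: input.my-input.db # no default value, as required for input.$filename parameters
  command:
    executable: cat
    # assumed: appending the :ref DataReference method yields the path to the input file
    arguments: "%(input.my-input.db)s:ref"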
In Assigning values to parameters we describe in more detail how to assign values to parameters of Template instances in general.
Workflow
A Workflow is a Template that describes how to execute a number of Template instances called steps. It has a signature that consists of a unique name and a parameter list. Each step can consume the outputs of a sibling step, or the parameters of the parent Workflow. The outputs of a workflow are its steps. The schema of a Workflow is:
signature:
  name: $Template # the name of this Workflow Template - must be unique
  parameters:
  - name: $paramName
    # optional default value
    default: $value # str, number, or dictionary of {str: str/number}
steps: # which steps to instantiate
  $stepName: $Template # for example child: simulation-code
execute: # how to execute the steps - one for each entry of steps
- target: <$stepName> # which step instance to execute
  args:
    $paramName: $value # assign values to the parameters of the step's Template
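As a minimal sketch of how this schema is used in practice (the template, step, and parameter names below are illustrative assumptions), a Workflow that forwards one of its parameters to a step and feeds the stdout of that step into a sibling could look like:

workflows:
- signature:
    name: example-wf
    parameters:
    - name: greeting
      default: hello
  steps:
    first: make-message   # instance of a Component template called make-message
    second: print-message # instance of a Component template called print-message
  execute:
  - target: <first>
    args:
      word: "%(greeting)s" # forward the parent Workflow parameter
  - target: <second>
    args:
      message: "<first>:output" # OutputReference to the stdout of the sibling step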
In Assigning values to parameters we describe how to assign values to parameters of Template instances.
Component
A Component describes how to execute a task. Just like a Workflow Template, it has a signature that consists of a name and a parameter list.
The outputs of a Component are the paths under its working directory.
The schema of a Component is:
signature:
  name: $Template # the name of this Component Template - must be unique
  parameters:
  - name: $paramName
    # optional default value
    default: $value # str, number, or dictionary of {str: str/number}

# All the FlowIR fields, except for stage, name, references, and override
command:
  executable: str
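A minimal sketch of a Component Template follows; the name, parameter, and command are illustrative assumptions:

components:
- signature:
    name: print-message
    parameters:
    - name: message
      default: hello world # parameter with a default value; callers may override it
  command:
    executable: echo
    arguments: "%(message)s" # the body references the parameter via %(parameterName)s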
The above fields are the same as those in the Component section of the Workflow Specification in FlowIR.
For more information, read our documentation on the basic FlowIR component fields.
Assigning values to parameters
Both Component and Workflow templates are instantiated in the same way: by declaring them as a step and adding an entry to an execute block which assigns values to the Template’s parameters. The value of a parameter can be a number, string, or a key: value dictionary. The body of a Template can reference its parameters like so: %(parameterName)s.
When assigning values to the parameters of a template via the execute[].args dictionary you:

- must provide values for any parameters in the child $template which do not have default values
- may override the value of the parameters in $template which have default values
- may use OutputReferences to indicate dependencies on steps (the definition is in the OutputReference section below)
- may use %(parentParameter)s to indicate a dependency on the value of a parent parameter. In turn, that value can be a dependency on the output of a Template instance or on an input file, or it might just be a literal constant
- may use a $key: $value dictionary to propagate a dictionary-type value. At the moment a Template can only reference this kind of parameter to set the value of the command.environment field of Components
- may use %(input.$filename)s to propagate an input file reference from a parent to a step
  - eventually a step must apply a DataReference :$method to the parameter to indicate that it wishes to consume the input file

A short sketch exercising several of these rules is shown below.
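All template, step, and parameter names in this sketch are illustrative assumptions:

workflows:
- signature:
    name: outer
    parameters:
    - name: input.my-input.db # input file propagated from the parent/entrypoint
    - name: repeats
      default: 1
  steps:
    produce: producer
    consume: consumer
  execute:
  - target: <produce>
    args:
      # apply a DataReference :$method so that the step consumes the input file
      database: "%(input.my-input.db)s:ref"
      # override a parameter that has a default value, using a parent parameter
      repeats: "%(repeats)s"
  - target: <consume>
    args:
      # OutputReference: depend on the stdout of the sibling step
      data: "<produce>:output"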
Environments
The environment that components run in is defined in the command.environment
field. If you don’t define anything in this section ST4SD will create a default environment containing all the environment variables of the runtime system process.
Example:
command:
  environment:
    ENV-VAR1: value/for/env-var1
    ENV-VAR2: value/for/env-var2
    DEFAULTS: ENV-VAR3:ENV-VAR4
The above defines an environment with 4 environment variables:
- ENV-VAR1 whose value is value/for/env-var1
- ENV-VAR2 whose value is value/for/env-var2
- ENV-VAR3 whose value is inherited from the environment variable ENV-VAR3 of the process running the runtime system
- ENV-VAR4 whose value is inherited from the environment variable ENV-VAR4 of the process running the runtime system
In the above example, we use the DEFAULTS directive to inherit the values of a list of environment variables from the environment variables of the runtime system process. The value of the special “DEFAULTS” key is a list of environment variable names separated with “:”.
Want to find out more? Check out our example.
OutputReference
The format of an OutputReference is:
<$stepId>/$optionalPath:$optionalMethod
$stepId is a /-separated array of stepNames starting from the scope of the current workflow. For example, the OutputReference <one/child>/file.txt:ref resolves to the absolute path of the file file.txt that the component child produces under the sibling step one, which is an instance of a Workflow template. You can find more reference methods in our DataReferences docs.
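For example (the step and argument names are illustrative), such a reference typically appears as the value of an argument in an execute entry:

execute:
- target: <two>
  args:
    # absolute path of file.txt produced by the component "child"
    # inside the sibling Workflow-instance step "one"
    data: "<one/child>/file.txt:ref"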
Example
Here is a simple example which uses one Workflow and one Component template to run 2 tasks.

- consume-input: prints the contents of an input file called my-input.db
- consume-sibling: prints the text “my sibling said” followed by the stdout of the sibling step <consume-input>
entrypoint:
  entry-instance: main
  execute:
  - target: <entry-instance>

workflows:
- signature:
    name: main
    parameters:
    # special variable with auto-populated value
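The listing above is abbreviated. Below is a hedged sketch of what a complete version could look like; it assumes a single echo Component template shared by both steps, and that the :output method resolves an input-file reference to the file contents, so the original example may differ in its details:

entrypoint:
  entry-instance: main
  execute:
  - target: <entry-instance>

workflows:
- signature:
    name: main
    parameters:
    # special parameter with auto-populated value (see Entrypoint)
    - name: input.my-input.db
  steps:
    consume-input: echo   # both steps instantiate the same Component template
    consume-sibling: echo
  execute:
  - target: <consume-input>
    args:
      # assumed: the :output method yields the contents of the input file
      message: "%(input.my-input.db)s:output"
  - target: <consume-sibling>
    args:
      prefix: my sibling said
      # OutputReference to the stdout of the sibling step
      message: "<consume-input>:output"

components:
- signature:
    name: echo
    parameters:
    - name: prefix
      default: ""
    - name: message
  command:
    executable: echo
    arguments: "%(prefix)s %(message)s"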
To try it out, store the above DSL in a file called dsl-params.yaml and run
pip install "st4sd-runtime-core[develop]>=2.5.0"
which installs the command-line tool elaunch.py, followed by:
echo "hello world" >my-input.dbelaunch.py -i my-input.db --failSafeDelays=no -l40 dsl-params.yaml
Key outputs
All experiments produce files, but not all generated files are equally important. To this end, ST4SD has the concept of key-outputs. These are files and directories that an experiment produces which the developers of the experiment consider important.
Here is an example of an experiment with a key-output:
entrypoint:
  entry-instance: hello
  execute:
  - target: <entry-instance>
    args:
      message: Hello world
  output:
  - name: greeting
    data-in: <entry-instance>:output

components:
- signature:
    name: hello
    parameters:
    - name: message
  command:
    executable: echo # for example, echo the message to stdout
    arguments: "%(message)s"
The output field in the entrypoint dictionary defines the key-outputs of this experiment:
entrypoint:
  # ... other fields ...
  output:
  - name: greeting
    data-in: <entry-instance>:output
This experiment has a single key-output called greeting. The data associated with this key-output is actually the stdout of the <entry-instance> step, which is an instance of the hello component. As the experiment finishes producing this key-output, the $INSTANCE_DIR/output/output.json file is updated to reflect the state of this experiment.
Here is how the output.json file will look for the above key-outputs:
{"greeting": {"creationtime": "1725374555.6836693","description": "just a friendly greeting","filename": "out.stdout","filepath": "stages/stage0/entry-instance/out.stdout","final": "yes","production": "yes","type": "",
While the experiment is running, the runtime system asynchronously updates this file with metadata about the generated key-outputs of the experiment. In this example, there is just one key-output called greeting. For more information on key-outputs check out our documentation.
If you are running experiments on the cloud and are instructing the runtime system to register them into the ST4SD datastore you may also use the ST4SD python API to download the key-outputs of your experiment instances.
Interface
Key outputs are not always immediately parseable without deep understanding of their format. To address this, ST4SD supports the interface feature. This feature allows workflow developers to extract measured properties and store them in a CSV file, making the data easier to consume.
Some virtual experiments define interfaces which make it simpler for users to retrieve the input systems and measured properties from executions of that virtual experiment.
The interface of a virtual experiment defines:

- The specification used to describe input systems it processes, e.g. SMILES for small molecules
- Instructions to extract the input systems from input data
- Instructions to extract the values of properties that the virtual experiment computes
Once a virtual experiment has an interface, ST4SD can return a pandas.DataFrame containing the properties calculated by instances of the virtual experiment, as well as the ids of the input systems that an instance processed. This functionality is provided via the st4sd-datastore API and the st4sd-runtime-service API. See using a virtual experiment interface for further information.
In this example we will work with a virtual experiment which:
- extracts the IDs of its input systems
- has 2 key-outputs that correspond to 2 measured properties of the interface
- uses builtin hooks to extract the measured properties from the key-outputs
The DSL of the experiment is:
entrypoint:
  interface:
    description: Counts vowels in words
    inputSpec:
      namingScheme: words
      inputExtractionMethod:
        csvColumn:
          source:
            path: input/words.csv
The interface contains a human-readable description of the experiment under entrypoint.interface.description.
entrypoint:
  interface:
    description: Counts vowels in words
Then, in entrypoint.interface.inputSpec it uses the builtin input extraction method csvColumn to extract the ids of the systems it processes:
entrypoint:
  interface:
    inputSpec:
      namingScheme: words
      inputExtractionMethod:
        csvColumn:
          source:
            path: input/words.csv
          args:
            column: word # the CSV column holding the input-system identifiers
It instructs the method to read the CSV file input/words.csv (i.e. the input file) and treat every row of the CSV as one input system whose identifier lies in the column word.
Following that, it uses the builtin property extraction method csvDataFrame twice to measure its 2 properties Vowels and Letters from the key-outputs vowels and letters respectively.
entrypoint:
  interface:
    propertiesSpec:
    - name: Vowels
      propertyExtractionMethod:
        csvDataFrame:
          source:
            keyOutput: vowels
          args:
            renameColumns: # treat the column "word" as if it were "input-id"
              word: input-id
    # the Letters property is defined in the same way, using keyOutput: letters
The csvDataFrame property extraction method expects a CSV file which has the columns input-id and ${the property name}. One of the requirements for using an ST4SD interface is that the property names start with a capital letter. One of the requirements of csvDataFrame is that there should be a column with the same name as the property that is being extracted. Another is that there should be a column called input-id.
In this example the components happen to produce key-output CSV files which contain a properly named column for the values of properties, but instead of using the input-id column they use the column word. To account for this inconsistency, the developers of the workflow use the renameColumns argument of the csvDataFrame property extraction method. Via renameColumns they instruct csvDataFrame to treat the column word as if it were called input-id.
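For illustration only (the rows below are made-up values, not output of the real experiment), a key-output CSV for the Vowels property in this setup might therefore look like:

word,Vowels
hello,2
world,1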
Because the inputSpec reads the file input/words.csv, you have to create a CSV file called words.csv and use it as an input to the workflow (via the -i argument).
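For example, a minimal words.csv (the entries are illustrative) only needs the word column that the interface reads:

word
hello
world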
You can find more information on this in the creating an interface documentation. Just keep in mind that this documentation was originally written with the FlowIR syntax in mind.
Differences between DSL 2.0 and FlowIR
There are some differences between DSL 2.0 and FlowIR.
In the current version (0.3.x) of DSL 2.0:
- we offer support for natural composition of Computational Graphs using Workflow and Component templates
- the signature field replaces the stage, name, references, and override fields of the component specification in FlowIR
- settings and inputs flow through parameters; we do not support global/stage environments or variables
- the fields of components can contain %(parameter)s references as well as component %(variable)s references
- dependencies between components are defined by referencing the output of a producer component in one parameter of the consumer component
- DataReferences are reserved for referencing input files only
- the equivalent of a DataReference for Template instances is an OutputReference
- data files and manifests
- key outputs and interface
DSL 2.0 will eventually contain a superset of the FlowIR features. However, the current beta version of DSL 2.0 does not support:
- FlowIR platforms
- application-dependencies
- however, you can use a manifest to implicitly define your application-dependencies