DSL 2.0 Specification
Use this page to learn about the new Domain Specific Language (DSL 2.0) of ST4SD and how it works.
- Namespace
- Entrypoint
- Workflow
- Component
- Assigning values to parameters
- OutputReference
- Example
- Key outputs
- Interface
- Differences between DSL 2.0 and FlowIR
DSL 2.0 is the new (and beta) way to define the computational graphs of ST4SD workflows.
Namespace
In DSL 2.0, a Computational Graph consists of Components which can be grouped under Workflow containers. It also has an Entrypoint which points to the root node of the graph, which is an instance of a Component or Workflow template.
A Namespace is simply a container for the Component, Workflow, and Entrypoint definitions which represent the Computational Graph of one ST4SD workflow.
Below is an example of a Namespace containing a single component that prints the message Hello world to the terminal.
entrypoint:
  entry-instance: print
  execute:
  - target: "<entry-instance>"
    args:
      message: Hello world
components:
- signature:
    name: print
Entrypoint
The optional Entrypoint serves a single purpose: it describes how to execute the root Template instance of the Computational Graph.
Its schema is:
# This executes an instance of $template which is called "<entry-instance>"
entry-instance: $template # name of a Component or Workflow template
execute: # an array with exactly 1 entry
- target: <entry-instance> # which instance of a Template to execute.
                           # In this scope there is only <entry-instance>
  args:
    $paramName: $value # one for each parameter of the template that
                       # the "target" points to
The entry-instance field receives the name of a Template and creates an instance of it called <entry-instance>.
The execute field then describes how to “execute” the <entry-instance> i.e. how to populate the arguments of the associated Template.
In execute[].args you:
- must provide values for any parameters in the child $template which do not have default values
- may override the value of the parameters in $template which have default values
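The two rules above can be sketched in a few lines of Python. This is only an illustration: the function name resolve_args and the shape of its inputs are hypothetical, the rejection of unknown parameter names is an assumption, and the special input.$filename parameters are not handled here.

```python
def resolve_args(parameters, args):
    """Merge an execute[].args dictionary with template defaults.

    parameters: list of {"name": ..., "default": ...} dictionaries that
                mirrors the signature.parameters list of a template
                ("default" is optional).
    args:       the execute[].args dictionary.
    Hypothetical helper, not part of the ST4SD API.
    """
    known = {param["name"] for param in parameters}
    unknown = sorted(set(args) - known)
    if unknown:
        # Assumption: values for undeclared parameters are an error.
        raise ValueError(f"unknown parameters: {unknown}")

    resolved = {}
    for param in parameters:
        name = param["name"]
        if name in args:
            resolved[name] = args[name]        # explicit value or override
        elif "default" in param:
            resolved[name] = param["default"]  # fall back to the default
        else:
            raise ValueError(f"missing value for parameter {name!r}")
    return resolved


params = [{"name": "message"}, {"name": "repeat", "default": 1}]
print(resolve_args(params, {"message": "Hello world"}))
```

Here message has no default, so it must appear in args, while repeat keeps its default unless args overrides it.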
The Template instance that the entrypoint points to can have special parameters which are data references to paths that are external to the workflow.
These parameters must be called input.$filename and they must not have default values in the signature of the Template definition.
The entrypoint may not explicitly override the values of said parameters because the runtime system auto-generates them.
Consider a scenario where the Template that the <entry-instance> step points to has a parameter called input.my-input.db.
The runtime will post-process the entrypoint.execute[0].args dictionary to include the following key-value pair:
input.my-input.db: "input/my-input.db"
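A minimal Python sketch of this post-processing step follows. The function name add_input_args is hypothetical, and raising an error when the entrypoint already sets an input parameter is an assumption based on the rule that the entrypoint may not override these values.

```python
def add_input_args(parameter_names, args):
    """Return a copy of args with auto-generated input.$filename entries.

    parameter_names: names of the parameters in the template signature.
    args:            the entrypoint.execute[0].args dictionary.
    Hypothetical helper, not part of the ST4SD API.
    """
    prefix = "input."
    completed = dict(args)
    for name in parameter_names:
        if not name.startswith(prefix):
            continue
        if name in args:
            # Assumption: an explicit value for an input parameter is rejected.
            raise ValueError(f"{name!r} must not be set in the entrypoint")
        # input.my-input.db -> input/my-input.db
        completed[name] = "input/" + name[len(prefix):]
    return completed


print(add_input_args(["input.my-input.db", "message"], {"message": "hi"}))
```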
In Assigning values to parameters we describe in more detail how to assign values to parameters of Template instances in general.
Workflow
A Workflow is a Template that describes how to execute a number of Template instances called steps.
It has a signature that consists of a unique name and a parameter list.
Each such step can consume the outputs of a sibling step, or the parameters of the parent Workflow.
The outputs of a workflow are its steps. The schema of Workflow is:
signature:
  name: $Template # the name of this Workflow Template - must be unique
  parameters:
  - name: $paramName
    # optional default value
    default: $value # str, number, or dictionary of {str: str/number}
steps: # which steps to instantiate
  $stepName: $Template # for example child: simulation-code
execute: # how to execute the steps - one for each entry of steps
In Assigning values to parameters we describe how to assign values to parameters of Template instances.
Component
A Component describes how to execute a task.
Just like a Workflow Template, it has a signature that consists of a name and a parameter list.
The outputs of a Component are the paths under its working directory.
The schema of a Component is:
signature:
  name: $Template # the name of this Component Template - must be unique
  parameters:
  - name: $paramName
    # optional default value
    default: $value # str, number, or dictionary of {str: str/number}
# All the FlowIR fields, except for stage, name, references, and override
command:
  executable: str
The above fields are the same as those in the Component section of the Workflow Specification in FlowIR.
For more information, read our documentation on the basic FlowIR component fields.
Assigning values to parameters
Both Component and Workflow templates are instantiated in the same way:
by declaring them as a step and adding an entry to an execute block which assigns values to the Template’s parameters.
The value of a parameter can be a number, string, or a key: value dictionary.
The body of a Template can reference its parameters like so %(parameterName)s.
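Python's printf-style string formatting happens to use the same %(name)s syntax, so it gives a quick feel for how such references expand. This is only an analogy: whether the runtime resolves references this way is an assumption, and the expand helper is hypothetical.

```python
def expand(text, parameters):
    """Replace every %(name)s in text with the value of that parameter.

    Raises KeyError if text references a parameter that is not defined.
    Note that a literal "%" in text would have to be written as "%%".
    """
    return text % parameters


arguments = "echo %(message)s"
print(expand(arguments, {"message": "Hello world"}))
```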
When assigning values to the parameters of a Template via the execute[].args dictionary, you:
- must provide values for any parameters in the child $template which do not have default values
- may override the value of the parameters in $template which have default values
- may use OutputReferences to indicate dependencies on steps (see the OutputReference section)
- may use %(parentParameter)s to indicate a dependency on the value that the parent parameter has. In turn, that value can be a dependency on the output of a Template instance, an input file, or just a literal constant
- may use a $key: $value dictionary to propagate a dictionary-type value. At the moment, a Template can only reference this kind of parameter to set the value of the command.environment field of Components
- may use %(input.$filename)s to propagate an input file reference from a parent to a step.
  - Eventually a step must apply a DataReferences :$method to the parameter to indicate that it wishes to consume the input file
Environments
The environment that components run in is defined in the command.environment field. If you don’t define anything in this section, ST4SD will create a default environment containing all the environment variables of the runtime system process.
Example:
command:
  environment:
    ENV-VAR1: value/for/env-var1
    ENV-VAR2: value/for/env-var2
    DEFAULTS: ENV-VAR3:ENV-VAR4
The above defines an environment with 4 environment variables:
- ENV-VAR1 whose value is value/for/env-var1
- ENV-VAR2 whose value is value/for/env-var2
- ENV-VAR3 whose value is inherited from the environment variable ENV-VAR3 of the process running the runtime system
- ENV-VAR4 whose value is inherited from the environment variable ENV-VAR4 of the process running the runtime system
In the above example, we use the DEFAULTS directive to inherit the values for a list of environment variables from the environment variables of the runtime system process. The value of the special "DEFAULTS" key is a list of environment variable names separated with ":".
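The behaviour described in this section can be sketched in Python as follows. The function build_environment is hypothetical, and two details are assumptions rather than documented behaviour: variables listed in DEFAULTS that are missing from the runtime environment are skipped, and explicitly defined variables take precedence over inherited ones.

```python
def build_environment(spec, runtime_env):
    """Build the environment of a component from command.environment.

    spec:        the command.environment dictionary (may be empty or None).
    runtime_env: environment variables of the runtime system process.
    Hypothetical helper, not part of the ST4SD API.
    """
    if not spec:
        # Nothing defined: default to the full runtime environment.
        return dict(runtime_env)

    env = {}
    # DEFAULTS is a ":"-separated list of variable names to inherit.
    for name in str(spec.get("DEFAULTS", "")).split(":"):
        if name and name in runtime_env:  # assumption: skip missing names
            env[name] = runtime_env[name]
    # Assumption: explicit values win over inherited ones.
    for key, value in spec.items():
        if key != "DEFAULTS":
            env[key] = str(value)
    return env


spec = {
    "ENV-VAR1": "value/for/env-var1",
    "ENV-VAR2": "value/for/env-var2",
    "DEFAULTS": "ENV-VAR3:ENV-VAR4",
}
runtime = {"ENV-VAR3": "three", "ENV-VAR4": "four", "UNRELATED": "x"}
print(build_environment(spec, runtime))
```

With these inputs the result contains the four variables listed above and omits UNRELATED, because it is neither defined explicitly nor named in DEFAULTS.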
Want to find out more? Check out our example.
OutputReference
The format of an OutputReference is:
<$stepId>/$optionalPath:$optionalMethod
$stepId is a / separated array of stepNames starting from the scope of the current workflow. For example, the OutputReference <one/child>/file.txt:ref resolves to the absolute path of the file file.txt that the component child produces under the sibling step one which is an instance of a Workflow template. You can find more reference methods in our DataReferences docs.
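To make the format concrete, here is a small Python sketch that splits an OutputReference into its parts. The parser and its names are hypothetical and only follow the format given above; the actual grammar accepted by the runtime may be stricter or more permissive.

```python
import re
from typing import List, NamedTuple, Optional


class OutputReference(NamedTuple):
    steps: List[str]       # stepNames, from the scope of the current workflow
    path: Optional[str]    # optional path under the step's outputs
    method: Optional[str]  # optional reference method, e.g. "ref"


# <$stepId>/$optionalPath:$optionalMethod
_PATTERN = re.compile(
    r"^<(?P<step_id>[^<>]+)>"      # <$stepId>
    r"(?:/(?P<path>[^:]*))?"       # /$optionalPath
    r"(?::(?P<method>[^:/]+))?$"   # :$optionalMethod
)


def parse_output_reference(text):
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not an OutputReference: {text!r}")
    return OutputReference(
        steps=match.group("step_id").split("/"),
        path=match.group("path") or None,
        method=match.group("method"),
    )


print(parse_output_reference("<one/child>/file.txt:ref"))
```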
Example
Here is a simple example which uses one Workflow and one Component template to run 2 tasks.
- consume-input: prints the contents of an input file called my-input.db
- consume-sibling: prints the text “my sibling said” followed by the stdout of the sibling step <consume-input>
entrypoint:
  entry-instance: main
  execute:
  - target: <entry-instance>
workflows:
- signature:
    name: main
    parameters:
    # special variable with auto-populated value
To try it out, store the above DSL in a file called dsl-params.yaml and run
pip install "st4sd-runtime-core[develop]>=2.5.1"
which installs the command-line tool elaunch.py, followed by:
echo "hello world" >my-input.db
elaunch.py -i my-input.db --failSafeDelays=no -l40 dsl-params.yaml
Key outputs
All experiments produce files, but not all generated files are equally important. To this end ST4SD has the concept of key-outputs. These are files, and directories, that an experiment produces which the developers of the experiment consider important.
Here is an example of an experiment with a key-output:
entrypoint:
  entry-instance: hello
  execute:
  - target: <entry-instance>
    args:
      message: Hello world
  output:
  - name: greeting
    data-in: <entry-instance>:output
The output field in the entrypoint dictionary defines the key-outputs of this experiment:
entrypoint:
  # ... other fields ...
  output:
  - name: greeting
    data-in: <entry-instance>:output
This experiment has a single key-output called greeting. The data associated with this key-output is actually the stdout of the <entry-instance> step which is an instance of the hello component. As the experiment finishes producing this key-output the $INSTANCE_DIR/output/output.json file is updated to reflect the state of this experiment.
Here is what the output.json file looks like for the above key-outputs:
{
  "greeting": {
    "creationtime": "1725374555.6836693",
    "description": "just a friendly greeting",
    "filename": "out.stdout",
    "filepath": "stages/stage0/entry-instance/out.stdout",
    "final": "yes",
    "production": "yes",
    "type": "",
While the experiment is running, the runtime system asynchronously updates this file with metadata about the generated key-outputs of the experiment. In this example, there is just one key-output called greeting. For more information on key-outputs check out our documentation.
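As a rough illustration, the following Python sketch reads output.json and maps each key-output name to the location of its file. The helper names are hypothetical, and treating filepath as relative to $INSTANCE_DIR is an assumption inferred from the example above.

```python
import json
from pathlib import Path


def key_output_paths(metadata, instance_dir):
    """Map each key-output name to the path of its file.

    metadata:     the parsed contents of output.json.
    instance_dir: the $INSTANCE_DIR of the experiment instance.
    Assumption: "filepath" is relative to instance_dir.
    """
    root = Path(instance_dir)
    return {name: root / entry["filepath"] for name, entry in metadata.items()}


def load_key_outputs(instance_dir):
    """Read $INSTANCE_DIR/output/output.json and resolve its file paths."""
    path = Path(instance_dir) / "output" / "output.json"
    with open(path) as handle:
        return key_output_paths(json.load(handle), instance_dir)
```

Because the runtime system updates the file while the experiment is running, a caller may need to call load_key_outputs again to pick up key-outputs that were produced later.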
If you are running experiments on the cloud and are instructing the runtime system to register them into the ST4SD datastore you may also use the ST4SD python API to download the key-outputs of your experiment instances.
Interface
Key outputs are not always immediately parseable without deep understanding of their format. To address this, ST4SD supports the interface feature. This feature allows workflow developers to extract measured properties and store them in a CSV file, making the data easier to consume.
Some virtual experiments define interfaces which make it simpler for users to retrieve the input systems and measured properties from executions of that virtual experiment.
The interface of a virtual experiment defines:
- The specification used to describe the input systems it processes, e.g. SMILES for small molecules
- Instructions to extract the input systems from input data
- Instructions to extract the values of properties that the virtual experiment computes
Once a virtual experiment has an interface ST4SD can return a pandas.DataFrame containing the properties calculated by instances of the virtual experiment, as well as the ids of the input systems that an instance processed. This functionality is provided via the st4sd-datastore API and the st4sd-runtime-service API. See using a virtual experiment interface for further information.
In this example we will work with a virtual experiment which:
- extracts the IDs of its input systems
- has 2 key-outputs that correspond to 2 measured properties of the interface
- uses builtin hooks to extract the measured properties from the key-outputs
The DSL of the experiment is:
entrypoint:
  interface:
    description: Counts vowels in words
    inputSpec:
      namingScheme: words
      inputExtractionMethod:
        csvColumn:
          source:
            path: input/words.csv
The interface contains a human readable description of the experiment under entrypoint.interface.description.
entrypoint:
  interface:
    description: Counts vowels in words
Then, in entrypoint.interface.inputSpec it uses the builtin input extraction method csvColumn to extract the ids of the systems it processes:
entrypoint:
  interface:
    inputSpec:
      namingScheme: words
      inputExtractionMethod:
        csvColumn:
          source:
            path: input/words.csv
          args:
It instructs the method to read the CSV file input/words.csv (i.e. the input file) and treat every row of the CSV as one input system whose identifier lies in the column word.
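Conceptually, this kind of extraction amounts to reading one column of a CSV file. The sketch below uses Python's standard csv module; the function extract_input_ids is hypothetical and is not the implementation of csvColumn.

```python
import csv
import io


def extract_input_ids(csv_text, column):
    """Return the identifier of every input system, one per CSV row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if column not in (reader.fieldnames or []):
        raise KeyError(f"CSV has no column called {column!r}")
    return [row[column] for row in reader]


# Hypothetical contents of input/words.csv
words_csv = "word\nhello\nworld\n"
print(extract_input_ids(words_csv, "word"))
```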
Following that, it uses the builtin property extraction method csvDataFrame twice to measure its 2 properties Vowels and Letters from the key-outputs vowels and letters respectively.
entrypoint:
  interface:
    propertiesSpec:
    - name: Vowels
      propertyExtractionMethod:
        csvDataFrame:
          source:
            keyOutput: vowels
          args:
The csvDataFrame property extraction method expects a CSV file with two columns: one called input-id and one named exactly after the property being extracted (${the property name}). In addition, ST4SD interfaces require property names to start with a capital letter.
In this example the components happen to produce key-output CSV files which contain a properly named column for the values of properties but instead of using the input-id column they use the column word. To account for this inconsistency, the developers of the workflow use the renameColumns argument of the csvDataFrame property extraction method. Via renameColumns they instruct csvDataFrame to treat the column word as if it were called input-id.
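The effect of renameColumns, together with the column requirements of csvDataFrame, can be illustrated with a short Python sketch. It uses the standard csv module instead of pandas to stay dependency-free; the function read_property and the sample CSV contents are hypothetical.

```python
import csv
import io


def read_property(csv_text, property_name, rename_columns=None):
    """Return {input-id: value} for one property of a key-output CSV.

    rename_columns maps existing column names to new ones, mimicking the
    renameColumns argument described above. Values are kept as strings.
    """
    if not property_name[:1].isupper():
        raise ValueError("property names must start with a capital letter")

    rename_columns = rename_columns or {}
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        {rename_columns.get(key, key): value for key, value in row.items()}
        for row in reader
    ]
    columns = [rename_columns.get(name, name) for name in (reader.fieldnames or [])]
    for required in ("input-id", property_name):
        if required not in columns:
            raise ValueError(f"missing required column {required!r}")
    return {row["input-id"]: row[property_name] for row in rows}


# Hypothetical contents of the "vowels" key-output
vowels_csv = "word,Vowels\nhello,2\nworld,1\n"
print(read_property(vowels_csv, "Vowels", {"word": "input-id"}))
```

Without the rename_columns argument the same call fails, because the file has no column called input-id.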
This means that you have to create a CSV file called words.csv and provide it as an input to the workflow (via the -i argument).
You can find more information on this in the creating an interface documentation. Just keep in mind that this documentation was originally written with the FlowIR syntax in mind.
Differences between DSL 2.0 and FlowIR
There are some differences between DSL 2.0 and FlowIR.
In the current version (0.3.x) of DSL 2.0:
- we offer support for natural composition of Computational Graphs using Workflow and Component templates
- the signature field replaces the stage, name, references, and override fields of the component specification in FlowIR
- settings and inputs flow through parameters, we do not support global/stage environments or variables
- the fields of components can contain %(parameter)s references as well as component %(variable)s
- dependencies between components are defined by referencing the output of a producer component in one parameter of the consumer component
- DataReferences are reserved for referencing input files only
- the equivalent of a DataReference for Template instances is an OutputReference
- data files and manifests
- key outputs and interface
DSL 2.0 will eventually contain a superset of the FlowIR features. However, the current beta version of DSL 2.0 does not support:
- FlowIR platforms
- application-dependencies
- however, you can use a manifest to implicitly define your application-dependencies