FlowIR Specification
Use this page to learn what FlowIR elements are there and how they work.
- Component
- DataReference
- Environments
- Variables
- Blueprints
- Platforms
- FlowIR Scopes
- FlowIR options/variable inheritance sequence
- Key-outputs
- Interface and Properties
Component
The component element describes a step in the workflow. The definition for a component is (some fields omitted):
stage: integer greater or equal to 0 (optional - defaults to 0)name: the name of the component (must be unique in the same stage)command:executable: strarguments: strenvironment: (null, str)references:- <reference:str>workflowAttributes:
Override
Components have an override
key that allows overriding their definition based on the active platform. The definition of the override field is:
override:<platform name:str>:<component-field> #Any top-level field (with the sub-keys to be overriden) except name, stage command and reference
The main reason to do this is change variables
,workflowAttributes
resourceManager
or resourceRequest
, based on e.g. GPU or CPU deployment (see [platform][#platforms]).
Only the key/values specified are changed or added. Existing key/values that aren’t specified remain with their base values. For example:
...resourceRequest:numberThreads: 16threadsPerCore: 1memory: 100 MBioverride:bigmem:resourceRequest:memory: 1GBi
In this case on platform bigmem
this component would still ask for 16 cores but with an increased memory request.
Description of basic FlowIR component fields
- stage: integer greater or equal to 0 (optional - defaults to 0)
- name: the name of the component (must be unique in the same stage)
- command:
- executable: path to executable. It can be absolute, relative to the instance directory by prefixing the path with
bin/
or<application>/
. It can also be just the name of a binary. If the path is not absolute thest4sd-runtime
will look for the executable in the folders specified under$PATH
. - arguments: arguments to binary
- environment: Name of the
environment
to use. The definition will be searched in the top levelenvironments
field of FlowIR. - expandArguments: one of [“double-quote”, “none”] (default is “double-quote”). When set to “double-quote” ST4SD envelops commandline in double-quotes and perform bash expansion by feeding the resulting string to
echo
before using them to submit the task to the backend.
- executable: path to executable. It can be absolute, relative to the instance directory by prefixing the path with
- references: Each string in this list is a string representation of a DataReference to either a reference to a file, a folder, or a component. References to components and files produced by a component (i.e. under the working directory of a component) indicate a data dependency which the
st4sd-runtime
respects when scheduling the tasks for components. There are severalreference methods
, but these are the most commonly used::output
: thest4sd-runtime
will replace<component name>:output
references with thestdout
output of the referenced component:ref
: thest4sd-runtime
will replace<component name or file>:ref
references with the absolute path to the component or file on the filesystem.:copy
: thest4sd-runtime
will copy the file referenced by this DataReference into the working directory of the component which includes this reference. This DataReference method cannot be part of thecommand.arguments
field.
- workflowAttributes:
- replicate: If set to a positive number
N
thest4sd-runtime
will replicate this component and its downstream treeN
times (seeaggregate
below before you use this option). - aggregate: If this option is set to
True
and the component belongs in the downstream sub-tree of areplicate
component thest4sd-runtime
will stop replicating just before theaggregate
component. Each reference of theaggregate
component to thereplicate
component will be expanded toN
references (one for each upstream replicated component).
- replicate: If set to a positive number
- resourceRequest: provides hints to the backend about the resource requirements of this component tasks
- numberProcesses: integer, defaults to 1
- numberThreads: float, defaults to 1 (e.g. on kubernetes you can ask for half a thread)
- ranksPerNode: integer, defaults to 1
- threadsPerCore: integer, defaults to 1
- memory: In Bytes or as Mi/Gi (e.g. 128Mi, 16Gi)
- gpus: Only used by tasks that use the kubernetes or lsf backend
- resourceManager:
- config:
- backend: Which backend to use. Valid options are:
- local (default option)
- kubernetes
- docker (also supports other docker-line runtimes like podman)
- lsf
- walltime: Maximum execution time of a single task for this component (in minutes). This option is only valid for
kubernetes
andlsf
backends. The defaults is60
(one hour).
- backend: Which backend to use. Valid options are:
- kubernetes: Options to use when the
kubernetes
backend is selected for this component- image: which image to use
- gracePeriod: Kubernetes waits
gracePeriod
seconds between asking a container to terminate and forcing it to terminate. This applies to tasks that use the Kubernetes backend and their execution time exceedsresourceManager.config.walltime
minutes. - qos: One of “guaranteed” (default), “burstable”, “besteffort”. See Kubernetes documentation for the definition of Quality Of Service (QoS) classes.
- docker: Options to use when the
docker
backend is selected for this component. Supports otherdocker
-like runtimes via the elaunch.py parameter--dockerExecutableOverride
- image: which image to use
- imagePullPolicy: one of [“Always”, “Never”, “IfNotPresent”], default is “Always”
- lsf: Options to use when the
lsf
backend is selected for this component- queue: Name of queue to submit jobs to.
- resourceString: A LSF request string e.g.
"rusage[ngpus_physical=4.00] select[(v100&&infiniband)]"
- config:
- variables: A
key: value
collection of variables; can either override those defined in platform or introduce new ones. In both cases the value specified here is visible to this component only. See FlowIR options/variable inheritance sequence for details on how scope layering/inheritance functions in FlowIR.
Defining components
Components are placed inside the components
array:
components:- stage: intname: str<component-core>override:<platform name:str>:<component-core>
Components must have a unique (stage, name)
tuple. Here’s an extract from the sum-numbers
example:
components:# ...- stage: 1name: PartialSumcommand:executable: "bin/sum.py"arguments: "ExtractRow:output"references: ["ExtractRow:output"]
DataReference
DataReference
is the way to define references to data in FlowIR.
Components define their dependencies to other components in the graph and data external to the graph (e.g. input
, data
, and application-dependencies
which are custom directories that experiments may bundle) using DataReference
s.
Key-outputs also use DataReference
s.
A DataReference
can have two forms: an absolute
and a relative
representation. The latter is syntax sugar for the former.
Absolute representation of DataReference
stage<Index>.<producerName>:/<fileRef>:<method>
The DataReference
points to either a component
in the graph or a directory
in the root of the instance directory.
stage<Index>.
: is the stage of the producer. This is only valid forDataReference
s that point to components.Index
should be an integer greater than 0.producerName
: Either the name of a producercomponent
, or the name of thedirectory
in the root of the instance directory. The directories include all directories thatST4SD
an experiment instance directory contains (e.g.input
plus directories found in a standalone project such asdata
,conf
,hooks
,application-dependencies
, etc).</fileRef>
: Optional path, relative to the root directory of theproducer
. When omitted defaults to/
.method
: One ofref
,output
,copy
,link
. Themethod
determines how ST4SD interprets theDataReference
.ref
: TheDataReference
expands to the absolute path of the referenced file/directoryoutput
: TheDataReference
expands to thecontents
of the referenced file. If the reference is to a component with thefileref
”/” then theDataReference
is rewritten to point to the file containing the most recentstdout
of the component.copy
: TheDataReference
does not expand to anything. If acomponent
definition contains such aDataReference
in itsreferences
field, then the runtime will copy the referenced path inside the root directory of the component’s task right before the execution of the task.link
: Similar tocopy
above. The difference is that instead of copying the referenced path, the runtime will create a link to the referenced path.
Relative representation of a DataReference
The relative
representation of a DataReference
is just syntax sugar for the absolute
representation. The DataReference
can omit the stage<Index>
part.
- If the relative
DataReference
is in thereferences
field of acomponent
then theIndex
is the same as thecomponent.stage
field. - If the relative
DataReference
is in akey-output
definition then thekey-output
should also contain thestages
field. Seekey-output
documentation for more details.
Environments
The environment that components run in is defined within the environments
section of the FlowIR YAML. If you don’t define anything in this section ST4SD will create a default environment.
Example:
environments:<platform-name>:myDefinedEnvironment:ENV-VAR1: value/for/env-var1ENV-VAR2: value/for/env-var2
A component uses a defined environment by setting command.environment
to the environment name. For example:
components:- name: myComponentcommand:executable: app.exeenvironment: myDefinedEnvironment
You can set command.environment
to "none"
to instruct ST4SD to only inject a couple of auto-generated environment variables. Note, backends that ST4SD uses e.g. k8s, docker, lsf, may add env-vars afterwards.
If command.environment
is not explicitly set, the st4sd-runtime
will default to using a built-in, environment called environment
. This contains the environment from which elaunch.py
was run.
You can override the definition of environment
if you wish, for example:
environments:default:environment:ENV-VAR1: sensible/default/for/env-var1ENV-VAR2: sensible/default/for/env-var2ENV-VAR3: sensible/default/for/env-var3
The runtime always injects a couple of variables to the environments of components (INSTANCE_DIR
, FLOW_EXPERIMENT_NAME
, and FLOW_RUN_ID
).
For more information, see our environment resolution rules.
Variables
The variables
field follows the format below:
variables:<platform name:str>:Optional(global):<variable name:str>: <value: str, int, bool, float>Optional(stages):<stage index: int>:<variable name:str>: <value: str, int, bool, float>
Variables are grouped under a platform, and can either be global or stage-specific. This example uses the following variables definition:
variables:default:global:numberOfPoints: 3stages:2:addToSum: 10artifactory:
Using Variables
You refer to variables in FlowIR with the syntax %($VARIABLE_NAME)s
.
FlowIR supports using variables to define:
- values of fields
- values of other variables
For example:
variables:default:global:salutation: "hello"subject: "world"message: "%(salutation)s %(subject)s"components:- name: hello-message
Here we use the value of the message
variable in the arguments of the hello-message
component.
The value assigned to the message
variable itself uses two other variables, salutation
and subject
.
Variables can contain space separated arrays
You can also treat a variable as an array of space separated items.
Here’s you can reference the <index>
-th entry of a <variable>
:
%(<variable>)s[<Index>]
Examples:
%(names)s[0]
: This resolves to the 1st entry in thenames
array.%(names)s[%(index)s]
: Indices may be variables too!
variables:default:global:# All variables are strings in FlowIRnames: Ann Bob# Even those that look like a numberpopulation: 2components:
Blueprints
ST4SD supports defining default options for (a) all components and/or (b) for components that belong in a specific stage, via the blueprint
top-level field:
blueprint:<platform name:str>:Optional(global):<component options>Optional(stages):<stage index:int>:<component options>
This example defines the blueprint for 2 platforms. It specifies the default options when using the 2 platforms (setting values for resourceManager, resourceRequest for all components when artifactory
is the chosen platform) and specializes components in stage 1 when using the artifactory
platform (increase their memory request)
blueprint:default:global:command:environment: environmentartifactory:global:resourceRequest:memory: 100Mi
Platforms
A platform is a named collection of blueprints, variables, overrides and environments.
You define the named platforms using the top-level platform
array
platforms:- bigmem- nvidia-gpu
When you run a workflow you specify the platform by name. Then the relevant sections of blueprints, variables, overrides and environments will become active.
Platforms are designed to assist in implementing generic components which are specialized for different purposes when specifying different platforms. This is particularly useful when working with packages that can utilize various kinds of HPC resources (e.g. a cluster fitted with LSF, a kubernetes installation, etc). For example, a component can be configured to utilize a certain amount of GPUs when it targets platform A but exclusively use CPUs on platform B.
In the sum-numbers example there exist 2 platforms: default
, and artifactory
. The default
platform leads to components executing as vanilla Operating System. Whereas, the artifactory
platform configures the workflow for execution on kubernetes.
default platform
The default
platform is special: The st4sd-runtime
fills in missing fields of the default
blueprint. See, this platform is intended to act as the base
layer for workflow environments, and component variables/options. When an option/variable/environment is defined within the default
platform it is automatically inherited by all other platforms (unless they explicitly override said option/variable/environment); read the FlowIR options/variable inheritance sequence section for more information on the options/variable layering aspect of ST4SD platforms.
In this example, the default
platform defines two variable (a global, and one that is only visible for components in stage 2), the special environment environment
, and a global blueprint which sets the default value of the command.environment
options for all components. See environments for more information about environments.
artifactory platform
The artifactory
platform overrides the default value (from 10
to -5
) for the stage 2 variable addToSum
, defines default options for all components which instruct the st4sd-runtime
to utilize the kubernetes
backend, and overrides the environment
environment. Moreover, it serves as an example on how to use the layering system of ST4SD to specialize the components which belong in a particular stage. Specifically, the artifactory
platform configures components belonging in stage 1 to use 150Mi
of memory instead of 100Mi
and 0.1
CPU-units instead of 0.25
.
FlowIR Scopes
The st4sd-runtime
supports nested scopes:
- global (i.e. visible to all components)
- visible to components within a specific stage
- visible to just one component
These scopes are layered in a specific order by the st4sd-runtime
.
FlowIR options/variable inheritance sequence
This is the full order of inheritance for component options.
- Builtin
st4sd-runtime
blueprint - Default
global
blueprint - Default
stage
blueprint - Platform
global
blueprint - Platform
stage
blueprint - Component definition
- Resolve interpreter option which may affect command.executable and command.arguments
Inheritance for variables
works in the same spirit (it’s effectively the same order of steps but without steps: 1
and 7
).
In the case of environments
, the st4sd-runtime
follows the rules below:
- If the environment is not set then the environment contains the default environment called “environment”. If the default environment is unset, then the default environment is the active shell environment.
- If the name is the literal string “none” then the environment contains {}
- Otherwise the
st4sd-runtime
uses the definition for the environment name from the selected platform. If there is no definition in the active platform thest4sd-runtime
falls back to thedefault
platform. - If an environment defines a
DEFAULTS
key then that key is expected to have the formatVAR1:VAR2:VAR3...
. Other options in the environment could reference the aforementioned vars using the$VAR
and${VAR}
notation and these options will be resolved using their matching keys in the default environment.- Any $VAR and ${VAR} references not matched by
DEFAULTS
keys will be resolved using the active shell(workflow launch environment). - If a variable is defined in
DEFAULTS
but there is no value for it in the default environment then treat it as if it was never in theDEFAULTS
option in the first place. This means that references to it will remain as is. The system that is executing the component’s task will resolve such environment variables just in time.
- Any $VAR and ${VAR} references not matched by
- The runtime injects a couple of variables to the environment (
INSTANCE_DIR
,FLOW_EXPERIMENT_NAME
, andFLOW_RUN_ID
).
Default options
The careful reader will notice that the default
platform does not contain an option for resourceManager.config.backend
. How does the st4sd-runtime
decide which backend to use?
Recall that the st4sd-runtime
injects default values for the default.global
blueprint which are then inherited by all components. The default value for resourceManager.config.backend
is local
which instructs the st4sd-runtime
to spawn component tasks as vanilla operating system processes. You can find a detailed list of the ST4SD default values in the ST4SD documentation.
Key-outputs
Key-Outputs are named DataReferences
for FlowIR virtual experiments and OutputReferences for DSL 2.0 virtual experiments. The key-outputs point to important paths which the virtual experiment produced.
Key-outputs for experiments written in DSL 2.0
Example that creates a key-output called OptimisationResults
which points to the file energies.csv
. This file is created by a component template instance called ExtractEnergies
which is a step of a workflow that the entry-instance of the experiment points to:
output:- name: OptimisationResultsdata-in: <entry-instance/ExtractEnergies>/energies.csv:refdescription: homo/lumo resultstype: csv
Above, output
is a list nested under the entrypoint
dictionary in DSL 2.0. The value of each entry is a dictionary with the following schema:
name: the unique name of the key outputdata-in: "an OutputReference to the output of an instance of a component template"# Optional fieldsdescription: "A human readable description of the file"type: "e.g. csv, pdf, etc - this only used to label key-output"
Key-outputs for experiments written in FlowIR
Example that creates a key-output called OptimisationResults
which points to the file energies.csv
. This file is created by a component called ExtractEnergies
which is in stage 1 of the experiment:
output:OptimisationResults:data-in: stage1.ExtractEnergies/energies.csv:refdescription: homo/lumo resultstype: csv
Above, output
is a top-level dictionary in FlowIR the keys of the output
dictionary are the names of the related key-outputs
. The value of each key is a dictionary with the following schema:
data-in: "a DataReference for FlowIR"# Optional fieldsdescription: "A human readable description of the file"type: "e.g. csv, pdf, etc - this only used to label key-output"stages:- stage0- stage1
Interface and Properties
The interface
of a virtual experiment (e.g. workflow) defines:
- The specification used to describe
input
systems it processes e.g. SMILEs for small molecules - Instructions to extract the
input
systems from input data - Instructions to extract the values of
properties
that the virtual experiment computes