FlowIR Specification
Use this page to learn which elements FlowIR provides and how they work.
- Component
- DataReference
- Environments
- Variables
- Blueprints
- Platforms
- FlowIR Scopes
- FlowIR options/variable inheritance sequence
- Key-outputs
- Interface and Properties
- Application dependencies
Component
The component element describes a step in the workflow. The definition for a component is (some fields omitted):
stage: integer greater or equal to 0 (optional - defaults to 0)
name: the name of the component (must be unique in the same stage)
command:
  executable: str
  arguments: str
  environment: (null, str)
references:
  - <reference:str>
workflowAttributes:
Override
Components have an override key that allows overriding their definition based on the active platform. The definition of the override field is:
override:
  <platform name:str>:
    <component-field> # Any top-level field (with the sub-keys to be overridden) except name, stage, command and references
The main reason to do this is to change variables, workflowAttributes, resourceManager or resourceRequest based on, e.g., GPU or CPU deployment (see Platforms).
Only the key/values specified are changed or added. Existing key/values that aren’t specified remain with their base values. For example:
...
resourceRequest:
  numberThreads: 16
  threadsPerCore: 1
  memory: 100 MBi
override:
  bigmem:
    resourceRequest:
      memory: 1GBi
In this case on platform bigmem this component would still ask for 16 cores but with an increased memory request.
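The merge behaviour described above can be modelled as a recursive dictionary merge. The following is a minimal Python sketch, not the st4sd-runtime implementation; the function name apply_override is hypothetical and the component is represented as a plain dictionary.

```python
import copy


def apply_override(component, platform):
    """Return a copy of `component` with the overrides for `platform` merged in.

    Hypothetical helper: only the keys named in the override change,
    every other key keeps its base value.
    """
    def merge(base, patch):
        for key, value in patch.items():
            if isinstance(value, dict) and isinstance(base.get(key), dict):
                merge(base[key], value)  # descend into nested sections
            else:
                base[key] = copy.deepcopy(value)  # replace or add a leaf value
        return base

    # Work on a copy that no longer carries the override section itself.
    resolved = copy.deepcopy(
        {key: value for key, value in component.items() if key != "override"}
    )
    patch = component.get("override", {}).get(platform, {})
    return merge(resolved, patch)


component = {
    "resourceRequest": {"numberThreads": 16, "threadsPerCore": 1, "memory": "100 MBi"},
    "override": {"bigmem": {"resourceRequest": {"memory": "1GBi"}}},
}

on_bigmem = apply_override(component, "bigmem")
print(on_bigmem["resourceRequest"])
# {'numberThreads': 16, 'threadsPerCore': 1, 'memory': '1GBi'}
```

On any platform without an override entry the component keeps its base definition unchanged.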
Description of basic FlowIR component fields
- stage: integer greater or equal to 0 (optional - defaults to 0)
- name: the name of the component (must be unique in the same stage)
- command:
  - executable: path to executable. It can be absolute, or relative to the instance directory by prefixing the path with bin/ or <application>/. It can also be just the name of a binary. If the path is not absolute the st4sd-runtime will look for the executable in the folders specified under $PATH.
  - arguments: arguments to binary
  - environment: Name of the environment to use. The definition will be searched in the top level environments field of FlowIR.
  - expandArguments: one of [“double-quote”, “none”] (default is “double-quote”). When set to “double-quote” ST4SD envelops the command line in double-quotes and performs bash expansion by feeding the resulting string to echo before using it to submit the task to the backend.
- references: Each string in this list is a string representation of a DataReference to a file, a folder, or a component. References to components and files produced by a component (i.e. under the working directory of a component) indicate a data dependency which the st4sd-runtime respects when scheduling the tasks for components. There are several reference methods, but these are the most commonly used:
  - :output: the st4sd-runtime will replace <component name>:output references with the stdout output of the referenced component
  - :ref: the st4sd-runtime will replace <component name or file>:ref references with the absolute path to the component or file on the filesystem.
  - :copy: the st4sd-runtime will copy the file referenced by this DataReference into the working directory of the component which includes this reference. This DataReference method cannot be part of the command.arguments field.
- workflowAttributes:
  - replicate: If set to a positive number N the st4sd-runtime will replicate this component and its downstream tree N times (see aggregate below before you use this option).
  - aggregate: If this option is set to True and the component belongs in the downstream sub-tree of a replicate component the st4sd-runtime will stop replicating just before the aggregate component. Each reference of the aggregate component to the replicate component will be expanded to N references (one for each upstream replicated component).
- resourceRequest: provides hints to the backend about the resource requirements of this component’s tasks
  - numberProcesses: integer, defaults to 1
  - numberThreads: float, defaults to 1 (e.g. on kubernetes you can ask for half a thread)
  - ranksPerNode: integer, defaults to 1
  - threadsPerCore: integer, defaults to 1
  - memory: In Bytes or as Mi/Gi (e.g. 128Mi, 16Gi)
  - gpus: Only used by tasks that use the kubernetes or lsf backend
- resourceManager:
  - config:
    - backend: Which backend to use. Valid options are:
      - local (default option)
      - kubernetes
      - docker (also supports other docker-like runtimes like podman)
      - lsf
    - walltime: Maximum execution time of a single task for this component (in minutes). This option is only valid for the kubernetes and lsf backends. The default is 60 (one hour).
  - kubernetes: Options to use when the kubernetes backend is selected for this component
    - image: which image to use
    - gracePeriod: Kubernetes waits gracePeriod seconds between asking a container to terminate and forcing it to terminate. This applies to tasks that use the Kubernetes backend and whose execution time exceeds resourceManager.config.walltime minutes.
    - qos: One of “guaranteed” (default), “burstable”, “besteffort”. See Kubernetes documentation for the definition of Quality Of Service (QoS) classes.
  - docker: Options to use when the docker backend is selected for this component. Supports other docker-like runtimes via the elaunch.py parameter --dockerExecutableOverride
    - image: which image to use
    - imagePullPolicy: one of [“Always”, “Never”, “IfNotPresent”], default is “Always”
  - lsf: Options to use when the lsf backend is selected for this component
    - queue: Name of queue to submit jobs to.
    - resourceString: An LSF request string e.g. "rusage[ngpus_physical=4.00] select[(v100&&infiniband)]"
- variables: A key: value collection of variables; can either override those defined in platform or introduce new ones. In both cases the value specified here is visible to this component only. See FlowIR options/variable inheritance sequence for details on how scope layering/inheritance functions in FlowIR.
Defining components
Components are placed inside the components array:
components:
  - stage: int
    name: str
    <component-core>
    override:
      <platform name:str>:
        <component-core>
Components must have a unique (stage, name) tuple. Here’s an extract from the sum-numbers example:
components:
  # ...
  - stage: 1
    name: PartialSum
    command:
      executable: "bin/sum.py"
      arguments: "ExtractRow:output"
    references: ["ExtractRow:output"]
DataReference
DataReference is the way to define references to data in FlowIR.
Components define their dependencies to other components in the graph and data external to the graph (e.g. input, data, and application-dependencies which are custom directories that experiments may bundle) using DataReferences.
Key-outputs also use DataReferences.
A DataReference can have two forms: an absolute and a relative representation. The latter is syntax sugar for the former.
Absolute representation of DataReference
stage<Index>.<producerName>/<fileRef>:<method>
The DataReference points to either a component in the graph or a directory in the root of the instance directory.
- stage<Index>.: is the stage of the producer. This is only valid for DataReferences that point to components. Index should be an integer greater than or equal to 0.
- producerName: Either the name of a producer component, or the name of a directory in the root of the instance directory. The directories include all directories that an ST4SD experiment instance directory contains (e.g. input plus directories found in a standalone project such as data, conf, hooks, application-dependencies, etc).
- /<fileRef>: Optional path, relative to the root directory of the producer. When omitted defaults to /.
- method: One of ref, output, copy, link. The method determines how ST4SD interprets the DataReference.
  - ref: The DataReference expands to the absolute path of the referenced file/directory
  - output: The DataReference expands to the contents of the referenced file. If the reference is to a component with the fileref “/” then the DataReference is rewritten to point to the file containing the most recent stdout of the component.
  - copy: The DataReference does not expand to anything. If a component definition contains such a DataReference in its references field, then the runtime will copy the referenced path inside the root directory of the component’s task right before the execution of the task.
  - link: Similar to copy above. The difference is that instead of copying the referenced path, the runtime will create a link to the referenced path.
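To make the parts of a DataReference concrete, here is a small Python sketch that splits a reference string into stage, producer, file reference and method. It assumes the layout used by references elsewhere on this page, such as stage1.ExtractEnergies/energies.csv:ref and ExtractRow:output. The names DataReference and parse_reference are hypothetical and the regular expression is an illustration, not the parser the st4sd-runtime uses.

```python
import re
from typing import NamedTuple, Optional


class DataReference(NamedTuple):
    stage: Optional[int]  # None when the stage<Index>. prefix is absent
    producer: str         # component name or directory name
    file_ref: str         # path relative to the producer, "/" when omitted
    method: str           # ref, output, copy or link


# Assumed layout: [stage<Index>.]<producerName>[/<fileRef>]:<method>
_PATTERN = re.compile(
    r"^(?:stage(?P<stage>\d+)\.)?"
    r"(?P<producer>[^/:]+)"
    r"(?P<file_ref>/[^:]*)?"
    r":(?P<method>ref|output|copy|link)$"
)


def parse_reference(text):
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a DataReference: {text!r}")
    stage = match.group("stage")
    return DataReference(
        stage=int(stage) if stage is not None else None,
        producer=match.group("producer"),
        file_ref=match.group("file_ref") or "/",
        method=match.group("method"),
    )


print(parse_reference("stage1.ExtractEnergies/energies.csv:ref"))
print(parse_reference("ExtractRow:output"))
```

The second call has neither a stage prefix nor a file reference, so the sketch reports a stage of None and the default file reference "/".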
Relative representation of a DataReference
The relative representation of a DataReference is just syntax sugar for the absolute representation. The DataReference can omit the stage<Index> part.
- If the relative DataReference is in the references field of a component then the Index is the same as the component.stage field.
- If the relative DataReference is in a key-output definition then the key-output should also contain the stages field. See key-output documentation for more details.
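The first rule can be sketched in Python as follows. The helper to_absolute is hypothetical, and so is the way it tells components apart from directories (a set of known component names). The file input/numbers.csv is made up for illustration.

```python
import re


def to_absolute(reference, stage, component_names):
    """Expand a relative DataReference found in a component's references field."""
    if re.match(r"stage\d+\.", reference):
        return reference  # already in the absolute form
    producer = re.split(r"[/:]", reference, maxsplit=1)[0]
    if producer not in component_names:
        # Assumption: anything else is a directory (e.g. input, data),
        # for which the stage prefix does not apply.
        return reference
    return f"stage{stage}.{reference}"


component = {
    "stage": 1,
    "name": "PartialSum",
    "references": ["ExtractRow:output", "input/numbers.csv:copy"],
}

absolute = [
    to_absolute(ref, component["stage"], {"ExtractRow", "PartialSum"})
    for ref in component["references"]
]
print(absolute)
# ['stage1.ExtractRow:output', 'input/numbers.csv:copy']
```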
Environments
The environment that components run in is defined within the environments section of the FlowIR YAML. If you don’t define anything in this section ST4SD will create a default environment containing all the environment variables of the runtime system process.
Example:
environments:
  <platform-name>:
    myDefinedEnvironment:
      ENV-VAR1: value/for/env-var1
      ENV-VAR2: value/for/env-var2
      DEFAULTS: ENV-VAR3:ENV-VAR4
The above defines an environment with 4 environment variables:
- ENV-VAR1 whose value is value/for/env-var1
- ENV-VAR2 whose value is value/for/env-var2
- ENV-VAR3 whose value is inherited from the environment variable ENV-VAR3 of the process running the runtime system
- ENV-VAR4 whose value is inherited from the environment variable ENV-VAR4 of the process running the runtime system
In the above example, we use the DEFAULTS directive to inherit the values for a list of environment variables from the environment variables of the runtime system process. The value of the special “DEFAULTS” key is a list of environment variable names separated by “:”.
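As a rough model of the DEFAULTS directive, the Python sketch below builds the final set of variables from an environment definition and the environment of the runtime process. The function build_environment and the sample values in runtime_env are hypothetical. The sketch also assumes that an explicitly defined variable wins over a DEFAULTS entry with the same name, which this page does not specify.

```python
def build_environment(definition, process_env):
    """Combine explicit values with variables inherited through DEFAULTS."""
    env = {name: value for name, value in definition.items() if name != "DEFAULTS"}
    # DEFAULTS is a ":"-separated list of variable names.
    for name in filter(None, definition.get("DEFAULTS", "").split(":")):
        # Assumption: explicit values win; names missing from the
        # runtime process environment are skipped.
        if name in process_env and name not in env:
            env[name] = process_env[name]
    return env


definition = {
    "ENV-VAR1": "value/for/env-var1",
    "ENV-VAR2": "value/for/env-var2",
    "DEFAULTS": "ENV-VAR3:ENV-VAR4",
}
# Stand-in for the environment of the process running the runtime system.
runtime_env = {"ENV-VAR3": "from-runtime-3", "ENV-VAR4": "from-runtime-4", "HOME": "/home/me"}

print(build_environment(definition, runtime_env))
```

The result holds the four variables listed above; HOME is not inherited because DEFAULTS does not name it.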
A component uses a defined environment by setting command.environment to the environment name. For example:
components:
  - name: myComponent
    command:
      executable: app.exe
      environment: myDefinedEnvironment
You can set command.environment to "none" to instruct ST4SD to only inject a couple of auto-generated environment variables. Note that the backends ST4SD uses (e.g. k8s, docker, lsf) may add environment variables afterwards.
If command.environment is not explicitly set, the st4sd-runtime will default to using a built-in environment called environment. This contains the environment from which elaunch.py was run.
You can override the definition of environment if you wish, for example:
environments:
  default:
    environment:
      ENV-VAR1: sensible/default/for/env-var1
      ENV-VAR2: sensible/default/for/env-var2
      ENV-VAR3: sensible/default/for/env-var3
The runtime always injects a couple of variables to the environments of components (INSTANCE_DIR, FLOW_EXPERIMENT_NAME, and FLOW_RUN_ID).
For more information, see our environment resolution rules.
Variables
The variables field follows the format below:
variables:
  <platform name:str>:
    Optional(global):
      <variable name:str>: <value: str, int, bool, float>
    Optional(stages):
      <stage index: int>:
        <variable name:str>: <value: str, int, bool, float>
Variables are grouped under a platform, and can either be global or stage-specific. This example uses the following variables definition:
variables:
  default:
    global:
      numberOfPoints: 3
    stages:
      2:
        addToSum: 10
  artifactory:
Using Variables
You refer to variables in FlowIR with the syntax %($VARIABLE_NAME)s.
FlowIR supports using variables to define:
- values of fields
- values of other variables
For example:
variables:
  default:
    global:
      salutation: "hello"
      subject: "world"
      message: "%(salutation)s %(subject)s"
components:
  - name: hello-message
Here we use the value of the message variable in the arguments of the hello-message component.
The value assigned to the message variable itself uses two other variables, salutation and subject.
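The %(name)s syntax happens to match Python's printf-style formatting with a mapping, so the nested substitution can be imitated in a few lines of Python. This is only a sketch of the idea; resolve is a hypothetical helper and the arguments string at the end is an assumed example, not the actual definition of the hello-message component.

```python
def resolve(variables, max_passes=10):
    """Substitute variables inside other variables until nothing changes."""
    resolved = {name: str(value) for name, value in variables.items()}
    for _ in range(max_passes):  # guard against endless substitution
        updated = {name: value % resolved for name, value in resolved.items()}
        if updated == resolved:
            break
        resolved = updated
    return resolved


variables = {
    "salutation": "hello",
    "subject": "world",
    "message": "%(salutation)s %(subject)s",
}

resolved = resolve(variables)
print(resolved["message"])  # hello world

# A field value (hypothetical arguments string) that uses the message variable.
arguments = "%(message)s" % resolved
print(arguments)  # hello world
```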
Variables can contain space separated arrays
You can also treat a variable as an array of space separated items.
Here’s how you can reference the <index>-th entry of a <variable>:
%(<variable>)s[<Index>]
Examples:
- %(names)s[0]: This resolves to the 1st entry in the names array.
- %(names)s[%(index)s]: Indices may be variables too!
variables:
  default:
    global:
      # All variables are strings in FlowIR
      names: Ann Bob
      # Even those that look like a number
      population: 2
components:
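The indexing rule can be imitated in Python: substitute the index first, split the variable on whitespace, then pick the entry. The function expand and the extra variable index are hypothetical additions for this sketch, which does not claim to reproduce the runtime's parser.

```python
import re

# Matches %(<variable>)s[<Index>], where <Index> may itself be %(other)s.
_INDEXED = re.compile(r"%\((?P<name>[^)]+)\)s\[(?P<index>[^\]]+)\]")


def expand(text, variables):
    values = {name: str(value) for name, value in variables.items()}

    def pick(match):
        index = int(match.group("index") % values)  # the index may be a variable
        return values[match.group("name")].split()[index]

    text = _INDEXED.sub(pick, text)  # handle array lookups first
    return text % values             # then plain %(name)s references


variables = {"names": "Ann Bob", "population": 2, "index": 1}

print(expand("%(names)s[0]", variables))          # Ann
print(expand("%(names)s[%(index)s]", variables))  # Bob
print(expand("%(population)s", variables))        # 2
```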
Blueprints
ST4SD supports defining default options for (a) all components and/or (b) for components that belong in a specific stage, via the blueprint top-level field:
blueprint:
  <platform name:str>:
    Optional(global):
      <component options>
    Optional(stages):
      <stage index:int>:
        <component options>
This example defines the blueprint for 2 platforms. It specifies the default options when using the 2 platforms (setting values for resourceManager and resourceRequest for all components when artifactory is the chosen platform) and specializes components in stage 1 when using the artifactory platform (increasing their memory request).
blueprint:
  default:
    global:
      command:
        environment: environment
  artifactory:
    global:
      resourceRequest:
        memory: 100Mi
Platforms
A platform is a named collection of blueprints, variables, overrides and environments.
You define the named platforms using the top-level platforms array:
platforms:
  - bigmem
  - nvidia-gpu
When you run a workflow you specify the platform by name. Then the relevant sections of blueprints, variables, overrides and environments will become active.
Platforms are designed to assist in implementing generic components which are specialized for different purposes when specifying different platforms. This is particularly useful when working with packages that can utilize various kinds of HPC resources (e.g. a cluster fitted with LSF, a kubernetes installation, etc). For example, a component can be configured to utilize a certain amount of GPUs when it targets platform A but exclusively use CPUs on platform B.
In the sum-numbers example there exist 2 platforms: default and artifactory. The default platform leads to components executing as vanilla operating system processes, whereas the artifactory platform configures the workflow for execution on kubernetes.
default platform
The default platform is special: the st4sd-runtime fills in missing fields of the default blueprint. This platform is intended to act as the base layer for workflow environments and component variables/options. When an option/variable/environment is defined within the default platform it is automatically inherited by all other platforms (unless they explicitly override said option/variable/environment); read the FlowIR options/variable inheritance sequence section for more information on the options/variable layering aspect of ST4SD platforms.
In this example, the default platform defines two variables (a global one, and one that is only visible to components in stage 2), the special environment environment, and a global blueprint which sets the default value of the command.environment option for all components. See environments for more information about environments.
artifactory platform
The artifactory platform overrides the default value (from 10 to -5) for the stage 2 variable addToSum, defines default options for all components which instruct the st4sd-runtime to utilize the kubernetes backend, and overrides the environment environment. Moreover, it serves as an example on how to use the layering system of ST4SD to specialize the components which belong in a particular stage. Specifically, the artifactory platform configures components belonging in stage 1 to use 150Mi of memory instead of 100Mi and 0.1 CPU-units instead of 0.25.
FlowIR Scopes
The st4sd-runtime supports nested scopes:
- global (i.e. visible to all components)
- visible to components within a specific stage
- visible to just one component
These scopes are layered in a specific order by the st4sd-runtime.
FlowIR options/variable inheritance sequence
This is the full order of inheritance for component options.
- Builtin st4sd-runtime blueprint
- Default global blueprint
- Default stage blueprint
- Platform global blueprint
- Platform stage blueprint
- Component definition
- Resolve interpreter option which may affect command.executable and command.arguments
Inheritance for variables works in the same spirit: it’s effectively the same order of steps but without steps 1 and 7 (the builtin blueprint and the interpreter resolution).
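For variables, the layering can be pictured as successive dictionary updates in which later layers win. The Python sketch below models the five remaining steps. The function layer_variables is hypothetical, and the artifactory value of addToSum (-5) is taken from the description of the artifactory platform on this page.

```python
def layer_variables(variables, platform, stage, component_variables):
    """Apply the variable layers in order; later layers override earlier ones."""
    default = variables.get("default") or {}
    active = variables.get(platform) or {}
    layers = [
        default.get("global") or {},                   # default global
        (default.get("stages") or {}).get(stage, {}),  # default stage
        active.get("global") or {},                    # platform global
        (active.get("stages") or {}).get(stage, {}),   # platform stage
        component_variables,                           # component definition
    ]
    resolved = {}
    for layer in layers:
        resolved.update(layer)
    return resolved


variables = {
    "default": {"global": {"numberOfPoints": 3}, "stages": {2: {"addToSum": 10}}},
    "artifactory": {"stages": {2: {"addToSum": -5}}},
}

print(layer_variables(variables, "default", 2, {}))
# {'numberOfPoints': 3, 'addToSum': 10}
print(layer_variables(variables, "artifactory", 2, {}))
# {'numberOfPoints': 3, 'addToSum': -5}
print(layer_variables(variables, "artifactory", 1, {"numberOfPoints": 7}))
# {'numberOfPoints': 7}
```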
In the case of environments, the st4sd-runtime follows the rules below:
- If the environment is not set then the environment contains the default environment called “environment”. If the default environment is unset, then the default environment is the active shell environment.
- If the name is the literal string “none” then the environment contains {}
- Otherwise the st4sd-runtime uses the definition for the environment name from the selected platform. If there is no definition in the active platform the st4sd-runtime falls back to the default platform.
- If an environment defines a DEFAULTS key then that key is expected to have the format VAR1:VAR2:VAR3.... Other options in the environment could reference the aforementioned vars using the $VAR and ${VAR} notation and these options will be resolved using their matching keys in the default environment.
  - Any $VAR and ${VAR} references not matched by DEFAULTS keys will be resolved using the active shell (workflow launch environment).
  - If a variable is defined in DEFAULTS but there is no value for it in the default environment then treat it as if it was never in the DEFAULTS option in the first place. This means that references to it will remain as is. The system that is executing the component’s task will resolve such environment variables just in time.
- The runtime injects a couple of variables to the environment (INSTANCE_DIR, FLOW_EXPERIMENT_NAME, and FLOW_RUN_ID).
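The first three rules (which definition gets picked) can be sketched in Python as shown below. The function select_environment is hypothetical, the artifactory value is made up, and raising KeyError for an unknown name is an assumption of this sketch. DEFAULTS handling and the injected variables are left out.

```python
def select_environment(name, environments, platform, shell_env):
    """Pick the environment definition for a component."""
    if name == "none":
        return {}                 # the literal string "none" means an empty environment
    if name is None:
        name = "environment"      # unset falls back to the default environment
    for candidate in (platform, "default"):
        definition = (environments.get(candidate) or {}).get(name)
        if definition is not None:
            return dict(definition)
    if name == "environment":
        return dict(shell_env)    # no definition: use the active shell environment
    raise KeyError(name)          # assumption: unknown names are an error


environments = {
    "default": {"environment": {"ENV-VAR1": "sensible/default/for/env-var1"}},
    "artifactory": {"environment": {"ENV-VAR1": "artifactory/value"}},  # made-up value
}
shell_env = {"PATH": "/usr/bin"}

print(select_environment(None, environments, "artifactory", shell_env))
# {'ENV-VAR1': 'artifactory/value'}
print(select_environment(None, environments, "bigmem", shell_env))
# {'ENV-VAR1': 'sensible/default/for/env-var1'}
print(select_environment("none", environments, "bigmem", shell_env))
# {}
print(select_environment(None, {}, "bigmem", shell_env))
# {'PATH': '/usr/bin'}
```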
Default options
The careful reader will notice that the default platform does not contain an option for resourceManager.config.backend. How does the st4sd-runtime decide which backend to use?
Recall that the st4sd-runtime injects default values for the default.global blueprint which are then inherited by all components. The default value for resourceManager.config.backend is local which instructs the st4sd-runtime to spawn component tasks as vanilla operating system processes. You can find a detailed list of the ST4SD default values in the ST4SD documentation.
Key-outputs
Key-Outputs are named DataReferences for FlowIR virtual experiments and OutputReferences for DSL 2.0 virtual experiments. The key-outputs point to important paths which the virtual experiment produced.
Key-outputs for experiments written in DSL 2.0
Example that creates a key-output called OptimisationResults which points to the file energies.csv. This file is created by a component template instance called ExtractEnergies which is a step of a workflow that the entry-instance of the experiment points to:
output:
  - name: OptimisationResults
    data-in: <entry-instance/ExtractEnergies>/energies.csv:ref
    description: homo/lumo results
    type: csv
Above, output is a list nested under the entrypoint dictionary in DSL 2.0. The value of each entry is a dictionary with the following schema:
name: the unique name of the key output
data-in: "an OutputReference to the output of an instance of a component template"
# Optional fields
description: "A human readable description of the file"
type: "e.g. csv, pdf, etc - this is only used to label the key-output"
Key-outputs for experiments written in FlowIR
Example that creates a key-output called OptimisationResults which points to the file energies.csv. This file is created by a component called ExtractEnergies which is in stage 1 of the experiment:
output:
  OptimisationResults:
    data-in: stage1.ExtractEnergies/energies.csv:ref
    description: homo/lumo results
    type: csv
Above, output is a top-level dictionary in FlowIR. The keys of the output dictionary are the names of the key-outputs. The value of each key is a dictionary with the following schema:
data-in: "a DataReference for FlowIR"
# Optional fields
description: "A human readable description of the file"
type: "e.g. csv, pdf, etc - this is only used to label the key-output"
stages:
  - stage0
  - stage1
Interface and Properties
The interface of a virtual experiment (e.g. workflow) defines:
- The specification used to describe the input systems it processes e.g. SMILES for small molecules
- Instructions to extract the input systems from input data
- Instructions to extract the values of properties that the virtual experiment computes
You can find more information about writing an interface here and a tutorial on how to use an interface here
Application dependencies
Application dependencies are directories that appear in the root directory of your virtual experiment instance. The data source for these dependencies is specified at the point of launching your virtual experiment using the --applicationDependencySource=$appDepName:/path/to/source command-line argument of elaunch.py.
You can use an application dependency in your workflows in the same way that you use data and input files, by utilizing a DataReference.
To define application dependencies in your virtual experiment, use the top-level field application-dependencies in your configuration file. The following example illustrates how to define application dependencies for different platforms:
application-dependencies:
  default:
    - foo # an application dependency called foo
  custom-platform:
    - bar
In this example, when you execute the experiment using the default platform, a directory called foo will be created. If you switch to the custom-platform, a directory for a different application dependency called bar will be created instead. Note that platforms that do not override their application-dependencies will inherit them from the default platform.