info icon

Use this page to learn what FlowIR elements are there and how they work.

Here, we are introducing the old workflow syntax FlowIR, if you need to understand the new workflow specification syntax check out the DSL 2.0 docs.

Component
DataReference
Environments
Variables
Blueprints
Platforms
FlowIR Scopes
FlowIR options/variable inheritance sequence
Key-outputs
Interface and Properties
Application dependencies

Component

The component element describes a step in the workflow. The definition for a component is (some fields omitted):

stage: integer greater or equal to 0 (optional - defaults to 0)
name: the name of the component (must be unique in the same stage)
command:
  executable: str
  arguments: str
  environment: (null, str)
references:
  - <reference:str>
workflowAttributes:
Copy to clipboard

Override

Components have an override key that allows overriding their definition based on the active platform. The definition of the override field is:

override:
  <platform name:str>:
    <component-field> #Any top-level field (with the sub-keys to be overriden) except name, stage  command and reference
Copy to clipboard

The main reason to do this is change variables ,workflowAttributes resourceManager or resourceRequest, based on e.g. GPU or CPU deployment (see [platform][#platforms]).

Only the key/values specified are changed or added. Existing key/values that aren’t specified remain with their base values. For example:

...
resourceRequest:
  numberThreads: 16
  threadsPerCore: 1
  memory: 100 MBi
override:
   bigmem:
     resourceRequest:
       memory: 1GBi
Copy to clipboard

In this case on platform bigmem this component would still ask for 16 cores but with an increased memory request.

Description of basic FlowIR component fields

stage: integer greater or equal to 0 (optional - defaults to 0)
name: the name of the component (must be unique in the same stage)
command:
- executable: path to executable. It can be absolute, relative to the instance directory by prefixing the path with bin/ or <application>/. It can also be just the name of a binary. If the path is not absolute the st4sd-runtime will look for the executable in the folders specified under $PATH.
- arguments: arguments to binary
- environment: Name of the environment to use. The definition will be searched in the top level environments field of FlowIR.
- expandArguments: one of [“double-quote”, “none”] (default is “double-quote”). When set to “double-quote” ST4SD envelops commandline in double-quotes and perform bash expansion by feeding the resulting string to echo before using them to submit the task to the backend.
references: Each string in this list is a string representation of a DataReference to either a reference to a file, a folder, or a component. References to components and files produced by a component (i.e. under the working directory of a component) indicate a data dependency which the st4sd-runtime respects when scheduling the tasks for components. There are several reference methods, but these are the most commonly used:
- :output: the st4sd-runtime will replace <component name>:output references with the stdout output of the referenced component
- :ref: the st4sd-runtime will replace <component name or file>:ref references with the absolute path to the component or file on the filesystem.
- :copy: the st4sd-runtime will copy the file referenced by this DataReference into the working directory of the component which includes this reference. This DataReference method cannot be part of the command.arguments field.
workflowAttributes:
- replicate: If set to a positive number N the st4sd-runtime will replicate this component and its downstream tree N times (see aggregate below before you use this option).
- aggregate: If this option is set to True and the component belongs in the downstream sub-tree of a replicate component the st4sd-runtime will stop replicating just before the aggregate component. Each reference of the aggregate component to the replicate component will be expanded to N references (one for each upstream replicated component).
resourceRequest: provides hints to the backend about the resource requirements of this component tasks
- numberProcesses: integer, defaults to 1
- numberThreads: float, defaults to 1 (e.g. on kubernetes you can ask for half a thread)
- ranksPerNode: integer, defaults to 1
- threadsPerCore: integer, defaults to 1
- memory: In Bytes or as Mi/Gi (e.g. 128Mi, 16Gi)
- gpus: Only used by tasks that use the kubernetes or lsf backend
resourceManager:
- config:
  - backend: Which backend to use. Valid options are:
    - local (default option)
    - kubernetes
    - docker (also supports other docker-line runtimes like podman)
    - lsf
  - walltime: Maximum execution time of a single task for this component (in minutes). This option is only valid for kubernetes and lsf backends. The defaults is 60 (one hour).
- kubernetes: Options to use when the kubernetes backend is selected for this component
  - image: which image to use
  - gracePeriod: Kubernetes waits gracePeriod seconds between asking a container to terminate and forcing it to terminate. This applies to tasks that use the Kubernetes backend and their execution time exceeds resourceManager.config.walltime minutes.
  - qos: One of “guaranteed” (default), “burstable”, “besteffort”. See Kubernetes documentation for the definition of Quality Of Service (QoS) classes.
- docker: Options to use when the docker backend is selected for this component. Supports other docker-like runtimes via the elaunch.py parameter --dockerExecutableOverride
  - image: which image to use
  - imagePullPolicy: one of [“Always”, “Never”, “IfNotPresent”], default is “Always”
- lsf: Options to use when the lsf backend is selected for this component
  - queue: Name of queue to submit jobs to.
  - resourceString: A LSF request string e.g. "rusage[ngpus_physical=4.00] select[(v100&&infiniband)]"
variables: A key: value collection of variables; can either override those defined in platform or introduce new ones. In both cases the value specified here is visible to this component only. See FlowIR options/variable inheritance sequence for details on how scope layering/inheritance functions in FlowIR.

Defining components

Components are placed inside the components array:

components:
- stage: int
  name: str
  <component-core>
  override:
    <platform name:str>:
      <component-core>
Copy to clipboard

Components must have a unique (stage, name) tuple. Here’s an extract from the sum-numbers example:

components:
# ...
- stage: 1
  name: PartialSum
  command:
    executable: "bin/sum.py"
    arguments: "ExtractRow:output"
  references: ["ExtractRow:output"]
Copy to clipboard

DataReference

DataReference is the way to define references to data in FlowIR.

Components define their dependencies to other components in the graph and data external to the graph (e.g. input, data, and application-dependencies which are custom directories that experiments may bundle) using DataReferences. Key-outputs also use DataReferences.

A DataReference can have two forms: an absolute and a relative representation. The latter is syntax sugar for the former.

Absolute representation of DataReference

stage<Index>.<producerName>:/<fileRef>:<method>
Copy to clipboard

The DataReference points to either a component in the graph or a directory in the root of the instance directory.

stage<Index>.: is the stage of the producer. This is only valid for DataReferences that point to components. Index should be an integer greater than 0.
producerName: Either the name of a producer component, or the name of the directory in the root of the instance directory. The directories include all directories that ST4SD an experiment instance directory contains (e.g. input plus directories found in a standalone project such as data, conf, hooks, application-dependencies, etc).
</fileRef>: Optional path, relative to the root directory of the producer. When omitted defaults to /.
method: One of ref, output, copy, link. The method determines how ST4SD interprets the DataReference.
- ref: The DataReference expands to the absolute path of the referenced file/directory
- output: The DataReference expands to the contents of the referenced file. If the reference is to a component with the fileref ”/” then the DataReference is rewritten to point to the file containing the most recent stdout of the component.
- copy: The DataReference does not expand to anything. If a component definition contains such a DataReference in its references field, then the runtime will copy the referenced path inside the root directory of the component’s task right before the execution of the task.
- link: Similar to copy above. The difference is that instead of copying the referenced path, the runtime will create a link to the referenced path.

Relative representation of a DataReference

The relative representation of a DataReference is just syntax sugar for the absolute representation. The DataReference can omit the stage<Index> part.

If the relative DataReference is in the references field of a component then the Index is the same as the component.stage field.
If the relative DataReference is in a key-output definition then the key-output should also contain the stages field. See key-output documentation for more details.

Environments

The environment that components run in is defined within the environments section of the FlowIR YAML. If you don’t define anything in this section ST4SD will create a default environment containing all the environment variables of the runtime system process.

Example:

environments:
  <platform-name>:
    myDefinedEnvironment:
      ENV-VAR1: value/for/env-var1
      ENV-VAR2: value/for/env-var2
      DEFAULTS: ENV-VAR3:ENV-VAR4
Copy to clipboard

The above defines an environment with 4 environment variables:

ENV-VAR1 whose value is value/for/env-var1
ENV-VAR2 whose value is value/for/env-var2
ENV-VAR3 whose value is inherited from the environment variable ENV-VAR3 of the process running the runtime system
ENV-VAR4 whose value is inherited from the environment variable ENV-VAR4 of the process running the runtime system

In the above example, we use the DEFAULTS directive to inherit the values for a list of environment variables from the environment variables of the runtime system process. The value of the special “DEFAULTS” key is a list of environment variable name separated with ”:“.

A component uses a defined environment by setting command.environment to the environment name. For example:

components:
  - name: myComponent
    command:
      executable: app.exe
      environment: myDefinedEnvironment
Copy to clipboard

You can set command.environment to "none" to instruct ST4SD to only inject a couple of auto-generated environment variables. Note, backends that ST4SD uses e.g. k8s, docker, lsf, may add env-vars afterwards.

If command.environment is not explicitly set, the st4sd-runtime will default to using a built-in, environment called environment. This contains the environment from which elaunch.py was run.

You can override the definition of environment if you wish, for example:

environments:
  default:
    environment:
      ENV-VAR1: sensible/default/for/env-var1
      ENV-VAR2: sensible/default/for/env-var2
      ENV-VAR3: sensible/default/for/env-var3
Copy to clipboard

The runtime always injects a couple of variables to the environments of components (INSTANCE_DIR, FLOW_EXPERIMENT_NAME, and FLOW_RUN_ID).

For more information, see our environment resolution rules.

Variables

The variables field follows the format below:

variables:
  <platform name:str>:
    Optional(global):
      <variable name:str>: <value: str, int, bool, float>
    Optional(stages):
      <stage index: int>:
        <variable name:str>: <value: str, int, bool, float>
Copy to clipboard

Variables are grouped under a platform, and can either be global or stage-specific. This example uses the following variables definition:

variables:
  default:
    global:
      numberOfPoints: 3
    stages:
      2:
        addToSum: 10

  artifactory:
Copy to clipboard

Using Variables

You refer to variables in FlowIR with the syntax %($VARIABLE_NAME)s.

FlowIR supports using variables to define:

values of fields
values of other variables

For example:

variables:
  default:
    global:
      salutation: "hello"
      subject: "world"
      message: "%(salutation)s %(subject)s"

components:
- name: hello-message
Copy to clipboard

Here we use the value of the message variable in the arguments of the hello-message component. The value assigned to the message variable itself uses two other variables, salutation and subject.

The first character of the value for a YAML field cannot be % so remember to enclose fields that contain variables references in quotes.

Variables can contain space separated arrays

You can also treat a variable as an array of space separated items. Here’s you can reference the <index>-th entry of a <variable>:

%(<variable>)s[<Index>]

The 1st entry is at index 0.

Examples:

%(names)s[0]: This resolves to the 1st entry in the names array.
%(names)s[%(index)s]: Indices may be variables too!

variables:
  default:
    global:
      # All variables are strings in FlowIR
      names: Ann Bob
      # Even those that look like a number
      population: 2

components:
Copy to clipboard

Variables are all strings in FlowIR. If ST4SD expects a field to have a certain type then it will coerce the value that variable references resolve to into the appropriate type. In the example above workflowAttributes.replicate expects an integer value. ST4SD will convert the value of the variable population population to the integer value 2.

Blueprints

ST4SD supports defining default options for (a) all components and/or (b) for components that belong in a specific stage, via the blueprint top-level field:

blueprint:
  <platform name:str>:
    Optional(global):
      <component options>
    Optional(stages):
      <stage index:int>:
        <component options>
Copy to clipboard

This example defines the blueprint for 2 platforms. It specifies the default options when using the 2 platforms (setting values for resourceManager, resourceRequest for all components when artifactory is the chosen platform) and specializes components in stage 1 when using the artifactory platform (increase their memory request)

blueprint:
  default:
    global:
      command:
        environment: environment
  artifactory:
    global:
      resourceRequest:
        memory: 100Mi
Copy to clipboard

Platforms

A platform is a named collection of blueprints, variables, overrides and environments.

You define the named platforms using the top-level platform array

platforms:
   - bigmem
   - nvidia-gpu
Copy to clipboard

When you run a workflow you specify the platform by name. Then the relevant sections of blueprints, variables, overrides and environments will become active.

Platforms are designed to assist in implementing generic components which are specialized for different purposes when specifying different platforms. This is particularly useful when working with packages that can utilize various kinds of HPC resources (e.g. a cluster fitted with LSF, a kubernetes installation, etc). For example, a component can be configured to utilize a certain amount of GPUs when it targets platform A but exclusively use CPUs on platform B.

In the sum-numbers example there exist 2 platforms: default, and artifactory. The default platform leads to components executing as vanilla Operating System. Whereas, the artifactory platform configures the workflow for execution on kubernetes.

default platform

The default platform is special: The st4sd-runtime fills in missing fields of the default blueprint. See, this platform is intended to act as the base layer for workflow environments, and component variables/options. When an option/variable/environment is defined within the default platform it is automatically inherited by all other platforms (unless they explicitly override said option/variable/environment); read the FlowIR options/variable inheritance sequence section for more information on the options/variable layering aspect of ST4SD platforms.

In this example, the default platform defines two variable (a global, and one that is only visible for components in stage 2), the special environment environment, and a global blueprint which sets the default value of the command.environment options for all components. See environments for more information about environments.

artifactory platform

The artifactory platform overrides the default value (from 10 to -5) for the stage 2 variable addToSum, defines default options for all components which instruct the st4sd-runtime to utilize the kubernetes backend, and overrides the environment environment. Moreover, it serves as an example on how to use the layering system of ST4SD to specialize the components which belong in a particular stage. Specifically, the artifactory platform configures components belonging in stage 1 to use 150Mi of memory instead of 100Mi and 0.1 CPU-units instead of 0.25.

FlowIR Scopes

The st4sd-runtime supports nested scopes:

global (i.e. visible to all components)
visible to components within a specific stage
visible to just one component

These scopes are layered in a specific order by the st4sd-runtime.

FlowIR options/variable inheritance sequence

This is the full order of inheritance for component options.

Builtin st4sd-runtime blueprint
Default global blueprint
Default stage blueprint
Platform global blueprint
Platform stage blueprint
Component definition
Resolve interpreter option which may affect command.executable and command.arguments

Inheritance for variables works in the same spirit (it’s effectively the same order of steps but without steps: 1 and 7).

In the case of environments, the st4sd-runtime follows the rules below:

If the environment is not set then the environment contains the default environment called “environment”. If the default environment is unset, then the default environment is the active shell environment.
If the name is the literal string “none” then the environment contains {}
Otherwise the st4sd-runtime uses the definition for the environment name from the selected platform. If there is no definition in the active platform the st4sd-runtime falls back to the default platform.
If an environment defines a DEFAULTS key then that key is expected to have the format VAR1:VAR2:VAR3.... Other options in the environment could reference the aforementioned vars using the $VAR and ${VAR} notation and these options will be resolved using their matching keys in the default environment.
1. Any $VAR and ${VAR} references not matched by DEFAULTS keys will be resolved using the active shell(workflow launch environment).
2. If a variable is defined in DEFAULTS but there is no value for it in the default environment then treat it as if it was never in the DEFAULTS option in the first place. This means that references to it will remain as is. The system that is executing the component’s task will resolve such environment variables just in time.
The runtime injects a couple of variables to the environment (INSTANCE_DIR, FLOW_EXPERIMENT_NAME, and FLOW_RUN_ID).

Tip: You can use the ccommand.py utility to get a list of all FlowIR details for a particular component of a workflow package using the -f (i.e. --flowirReport) option. Use --env to also view the contents of the component environment. Try targeting different platforms via the -p argument.

Default options

The careful reader will notice that the default platform does not contain an option for resourceManager.config.backend. How does the st4sd-runtime decide which backend to use?

Recall that the st4sd-runtime injects default values for the default.global blueprint which are then inherited by all components. The default value for resourceManager.config.backend is local which instructs the st4sd-runtime to spawn component tasks as vanilla operating system processes. You can find a detailed list of the ST4SD default values in the ST4SD documentation.

Key-outputs

Key-Outputs are named DataReferences for FlowIR virtual experiments and OutputReferences for DSL 2.0 virtual experiments. The key-outputs point to important paths which the virtual experiment produced.

Key-outputs for experiments written in DSL 2.0

Example that creates a key-output called OptimisationResults which points to the file energies.csv. This file is created by a component template instance called ExtractEnergies which is a step of a workflow that the entry-instance of the experiment points to:

output:
  - name: OptimisationResults
    data-in: <entry-instance/ExtractEnergies>/energies.csv:ref
    description: homo/lumo results
    type: csv
Copy to clipboard

Above, output is a list nested under the entrypoint dictionary in DSL 2.0. The value of each entry is a dictionary with the following schema:

name: the unique name of the key output
data-in: "an OutputReference to the output of an instance of a component template"

# Optional fields
description: "A human readable description of the file"
type: "e.g. csv, pdf, etc - this only used to label key-output"
Copy to clipboard

Key-outputs for experiments written in FlowIR

Example that creates a key-output called OptimisationResults which points to the file energies.csv. This file is created by a component called ExtractEnergies which is in stage 1 of the experiment:

output:
  OptimisationResults:
    data-in: stage1.ExtractEnergies/energies.csv:ref
    description: homo/lumo results
    type: csv
Copy to clipboard

Above, output is a top-level dictionary in FlowIR the keys of the output dictionary are the names of the related key-outputs. The value of each key is a dictionary with the following schema:

data-in: "a DataReference for FlowIR"

# Optional fields
description: "A human readable description of the file"
type: "e.g. csv, pdf, etc - this only used to label key-output"
stages:
- stage0
- stage1
Copy to clipboard

The DataReference in "data-in" can use one of the following reference methods: :ref, :copy, or :output. Here, :copy is just an alias for :ref (i.e paths are not copied). Finally, :output is an alias to out.stdout:ref if the DataReference does not have a /fileRef. Otherwise, :output becomes an alias to /fileRef:ref.

If "data-in" does not contain a "stage$index." prefix then you can list stages to use as the "stage" prefix. If you provide multiple stages then later stages will override the path.

Interface and Properties

The interface of a virtual experiment (e.g. workflow) defines:

The specification used to describe input systems it processes e.g. SMILEs for small molecules
Instructions to extract the input systems from input data
Instructions to extract the values of properties that the virtual experiment computes

You can find more information about writing an interface here and a tutorial on how to use an interface here

Application dependencies

Application dependencies are directories that appear in the root directory of your virtual experiment instance. The data source for these dependencies is specified at the point of launching your virtual experiment using the --applicationDependencySource=$appDepName:/path/to/source command-line argument of elaunch.py.

You can use an application dependency in your workflows in the same way that you use data and input files, by utilizing a DataReference.

To define application dependencies in your virtual experiment, use the top-level field application-dependencies in your configuration file. The following example illustrates how to define application dependencies for different platforms:

application-dependencies:
  default:
    - foo    # an application dependency called foo
  custom-platform:
    - bar
Copy to clipboard

In this example, when you execute the experiment using the default platform, a directory called foo will be created. If you switch to the custom-platform, a directory for a different application dependency called bar will be created instead. Note that platforms that do not override their application-dependencies will inherit them from the default platform.

The application-dependencies field is used to specify the application dependencies for each platform, allowing you to customize the dependencies based on the platform being used.

Edit this page on GitHub

ST4SD Core: Best Practices

ST4SD Cloud: Getting Started

FlowIR Specification