Skip to main contentIBM ST4SD

Workflow Specification

Use this page to learn what FlowIR elements are there and how they work.

Component

The component element describes a step in the workflow. The definition for a component is (some fields omitted):

stage: integer greater or equal to 0 (optional - defaults to 0)
name: the name of the component (must be unique in the same stage)
command:
executable: str
arguments: str
environment: (null, str)
references:
- <reference:str>
workflowAttributes:

Override

Components have an override key that allows overriding their definition based on the active platform. The definition of the override field is:

override:
<platform name:str>:
<component-field> #Any top-level field (with the sub-keys to be overriden) except name, stage command and reference

The main reason to do this is change variables ,workflowAttributes resourceManager or resourceRequest, based on e.g. GPU or CPU deployment (see [platform][#platforms]).

Only the key/values specified are changed or added. Existing key/values that aren’t specified remain with their base values. For example:

...
resourceRequest:
numberThreads: 16
threadsPerCore: 1
memory: 100 MBi
override:
bigmem:
resourceRequest:
memory: 1GBi

In this case on platform bigmem this component would still ask for 16 cores but with an increased memory request.

Description of basic FlowIR component fields

  • stage: integer greater or equal to 0 (optional - defaults to 0)
  • name: the name of the component (must be unique in the same stage)
  • command:
    • executable: path to executable. It can be absolute, relative to the instance directory by prefixing the path with bin/ or <application>/. It can also be just the name of a binary. If the path is not absolute the st4sd-runtime will look for the executable in the folders specified under $PATH.
    • arguments: arguments to binary
    • environment: Name of the environment to use. The definition will be searched in the top level environments field of FlowIR.
    • expandArguments: one of [“double-quote”, “none”] (default is “double-quote”). When set to “double-quote” ST4SD envelops commandline in double-quotes and perform bash expansion by feeding the resulting string to echo before using them to submit the task to the backned.
  • references: Each string in this list is a string representation of a DataReference to either a reference to a file, a folder, or a component. References to components and files produced by a component (i.e. under the working directory of a component) indicate a data dependency which the st4sd-runtime respects when scheduling the tasks for components. There are several reference methods, but these are the most commonly used:
    • :output: the st4sd-runtime will replace <component name>:output references with the stdout output of the referenced component
    • :ref: the st4sd-runtime will replace <component name or file>:ref references with the absolute path to the component or file on the filesystem.
    • :copy: the st4sd-runtime will copy the file referenced by this DataReference into the working directory of the component which includes this reference. This DataReference method cannot be part of the command.arguments field.
  • workflowAttributes:
    • replicate: If set to a positive number N the st4sd-runtime will replicate this component and its downstream tree N times (see aggregate below before you use this option).
    • aggregate: If this option is set to True and the component belongs in the downstream sub-tree of a replicate component the st4sd-runtime will stop replicating just before the aggregate component. Each reference of the aggregate component to the replicate component will be expanded to N references (one for each upstream replicated component).
  • resourceRequest: provides hints to the backend about the resource requirements of this component tasks
    • numberProcesses
    • numberThreads
    • ranksPerNode
    • threadsPerCore
    • memory: In Bytes or as Mi/Gi (e.g. 128Mi, 16Gi)
    • gpus: Only used by tasks that execute on the Kubernetes or LSF backend
  • resourceManager:
    • config:
      • backend: Which backend to use. Valid options are:
        • local (default option)
        • kubernetes
        • docker (also supports other docker-line runtimes like podman)
        • lsf
      • walltime: Maximum execution time of a single task for this component (in minutes). This option is only valid for kubernetes and lsf backends. The defaults is 60 (one hour).
    • kubernetes: Options to use when the kubernetes backend is selected for this component
      • image: which image to use
      • gracePeriod: Kubernetes waits gracePeriod seconds between asking a container to terminate and forcing it to terminate. This applies to tasks that use the Kubernetes backend and their execution time exceeds resourceManager.config.walltime minutes.
      • qos: One of “guaranteed” (default), “burstable”, “besteffort”. See Kubernetes documentation for the definition of Quality Of Service (QoS) classes.
    • docker: Options to use when the docker backend is selected for this component. Supports other docker-like runtimes via the elaunch.py parameter --dockerExecutableOverride
      • image: which image to use
      • imagePullPolicy: one of [“Always”, “Never”, “IfNotPresent”], default is “Always”
    • lsf: Options to use when the lsf backend is selected for this component
      • queue: Name of queue to submit jobs to.
      • resourceString: A LSF request string e.g. "rusage[ngpus_physical=4.00] select[(v100&&infiniband)]"
  • variables: A key: value collection of variables; can either override those defined in platform or introduce new ones. In both cases the value specified here is visible to this component only. See FlowIR options/variable inheritance sequence for details on how scope layering/inheritance functions in FlowIR.

Defining components

Components are placed inside the components array:

components:
- stage: int
name: str
<component-core>
override:
<platform name:str>:
<component-core>

Components must have a unique (stage, name) tuple. Here’s an extract from the sum-numbers example:

components:
# ...
- stage: 1
name: PartialSum
command:
executable: "bin/sum.py"
arguments: "ExtractRow:output"
references: ["ExtractRow:output"]

DataReference

DataReference is the way to define references to data in FlowIR.

Components define their dependencies to other components in the graph and data external to the graph (e.g. input, data, and application-dependencies) using DataReferences. Key-outputs also use DataReferences.

A DataReference can have two forms: an absolute and a relative representation. The latter is syntax sugar for the former.

Absolute representation of DataReference

stage<Index>.<producerName>:/<fileRef>:<method>

The DataReference points to either a component in the graph or a directory in the root of the instance directory.

  • stage<Index>.: is the stage of the producer. This is only valid for DataReferences that point to components. Index should be an integer greater than 0.
  • producerName: Either the name of a producer component, or the name of the directory in the root of the instance directory. The directories include all application-dependencies and directories that ST4SD manages (e.g. input, data, conf, hooks, etc).
  • </fileRef>: Optional path, relative to the root directory of the producer. When omitted defaults to /.
  • method: One of ref, output, copy, link. The method determines how ST4SD interprets the DataReference.
    • ref: The DataReference expands to the absolute path of the referenced file/directory
    • output: The DataReference expands to the contents of the referenced file. If the reference is to a component with the fileref ”/” then the DataReference is rewritten to point to the file containing the most recent stdout of the component.
    • copy: The DataReference does not expand to anything. If a component definition contains such a DataReference in its references field, then the runtime will copy the referenced path inside the root directory of the component’s task right before the execution of the task.
    • link: Similar to copy above. The difference is that instead of copying the referenced path, the runtime will create a link to the referenced path.

Relative representation of a DataReference

The relative representation of a DataReference is just syntax sugar for the absolute representation. The DataReference can omit the stage<Index> part.

  • If the relative DataReference is in the references field of a component then the Index is the same as the component.stage field.
  • If the relative DataReference is in a key-output definition then the key-output should also contain the stages field. See key-output documentation for more details.

Environments

The environment that components run in is defined within the environments section of the FlowIR YAML. If you don’t define anything in this section ST4SD will create a default environment.

Example:

environments:
<platform-name>:
myDefinedEnvironment:
ENV-VAR1: value/for/env-var1
ENV-VAR2: value/for/env-var2

A component uses a defined environment by setting command.environment to the environment name. For example:

components:
- name: myComponent
command:
executable: app.exe
environment: myDefinedEnvironment

You can set command.environment to "none" to instruct ST4SD to only inject a couple of auto-generated environment variables. Note, backends that ST4SD uses e.g. k8s, docker, lsf, may add env-vars afterwards.

If command.environment is not explicitly set, the st4sd-runtime will default to using a built-in, environment called environment. This contains the environment from which elaunch.py was run.

You can override the definition of environment if you wish, for example:

environments:
default:
environment:
ENV-VAR1: sensible/default/for/env-var1
ENV-VAR2: sensible/default/for/env-var2
ENV-VAR3: sensible/default/for/env-var3

The runtime always injects a couple of variables to the environments of components (INSTANCE_DIR, FLOW_EXPERIMENT_NAME, and FLOW_RUN_ID).

For more information, see our environment resolution rules.

Variables

The variables field follows the format below:

variables:
<platform name:str>:
Optional(global):
<variable name:str>: <value: str, int, bool, float>
Optional(stages):
<stage index: int>:
<variable name:str>: <value: str, int, bool, float>

Variables are grouped under a platform, and can either be global or stage-specific. This example uses the following variables definition:

variables:
default:
global:
numberOfPoints: 3
stages:
2:
addToSum: 10
artifactory:

Using Variables

You refer to variables in FlowIR with the syntax %($VARIABLE_NAME)s.

FlowIR supports using variables to define:

  • values of fields
  • values of other variables

For example:

variables:
default:
global:
salutation: "hello"
subject: "world"
message: "%(salutation)s %(subject)s"
components:
- name: hello-message

Here we use the value of the message variable in the arguments of the hello-message component. The value assigned to the message variable itself uses two other variables, salutation and subject.

The first character of the value for a YAML field cannot be % so remember to enclose fields that contain variables references in quotes.

Variables can contain space separated arrays

You can also treat a variable as an array of space separated items. Here’s you can reference the <index>-th entry of a <variable>:

%(<variable>)s[<Index>]

The 1st entry is at index 0.

Examples:

  • %(names)s[0]: This resolves to the 1st entry in the names array.
  • %(names)s[%(index)s]: Indices may be variables too!
variables:
default:
global:
# All variables are strings in FlowIR
names: Ann Bob
# Even those that look like a number
population: 2
components:

Variables are all strings in FlowIR. If ST4SD expects a field to have a certain type then it will coerce the value that variable references resolve to into the appropriate type. In the example above workflowAttributes.replicate expects an integer value. ST4SD will convert the value of the variable population population to the integer value 2.

Blueprints

ST4SD supports defining default options for (a) all components and/or (b) for components that belong in a specific stage, via the blueprint top-level field:

blueprint:
<platform name:str>:
Optional(global):
<component options>
Optional(stages):
<stage index:int>:
<component options>

This example defines the blueprint for 2 platforms. It specifies the default options when using the 2 platforms (setting values for resourceManager, resourceRequest for all components when artifactory is the chosen platform) and specializes components in stage 1 when using the artifactory platform (increase their memory request)

blueprint:
default:
global:
command:
environment: environment
artifactory:
global:
resourceRequest:
memory: 100Mi

Platforms

A platform is a named collection of blueprints, variables, overrides and environments.

You define the named platforms using the top-level platform array

platforms:
- bigmem
- nvidia-gpu

When you run a workflow you specify the platform by name. Then the relevant sections of blueprints, variables, overrides and environments will become active.

Platforms are designed to assist in implementing generic components which are specialized for different purposes when specifying different platforms. This is particularly useful when working with packages that can utilize various kinds of HPC resources (e.g. a cluster fitted with LSF, a kubernetes installation, etc). For example, a component can be configured to utilize a certain amount of GPUs when it targets platform A but exclusively use CPUs on platform B.

In the sum-numbers example there exist 2 platforms: default, and artifactory. The default platform leads to components executing as vanilla Operating System. Whereas, the artifactory platform configures the workflow for execution on kubernetes.

default platform

The default platform is special: The st4sd-runtime fills in missing fields of the default blueprint. See, this platform is intended to act as the base layer for workflow environments, and component variables/options. When an option/variable/environment is defined within the default platform it is automatically inherited by all other platforms (unless they explicitly override said option/variable/environment); read the FlowIR options/variable inheritance sequence section for more information on the options/variable layering aspect of ST4SD platforms.

In this example, the default platform defines two variable (a global, and one that is only visible for components in stage 2), the special environment environment, and a global blueprint which sets the default value of the command.environment options for all components. See environments for more information about environments.

artifactory platform

The artifactory platform overrides the default value (from 10 to -5) for the stage 2 variable addToSum, defines default options for all components which instruct the st4sd-runtime to utilize the kubernetes backend, and overrides the environment environment. Moreover, it serves as an example on how to use the layering system of ST4SD to specialize the components which belong in a particular stage. Specifically, the artifactory platform configures components belonging in stage 1 to use 150Mi of memory instead of 100Mi and 0.1 CPU-units instead of 0.25.

FlowIR Scopes

The st4sd-runtime supports nested scopes:

  • global (i.e. visible to all components)
  • visible to components within a specific stage
  • visible to just one component

These scopes are layered in a specific order by the st4sd-runtime.

FlowIR options/variable inheritance sequence

This is the full order of inheritance for component options.

  1. Builtin st4sd-runtime blueprint
  2. Default global blueprint
  3. Default stage blueprint
  4. Platform global blueprint
  5. Platform stage blueprint
  6. Component definition
  7. Resolve interpreter option which may affect command.executable and command.arguments

Inheritance for variables works in the same spirit (it’s effectively the same order of steps but without steps: 1 and 7).

In the case of environments, the st4sd-runtime follows the rules below:

  1. If the environment is not set then the environment contains the default environment called “environment”. If the default environment is unset, then the default environment is the active shell environment.
  2. If the name is the literal string “none” then the environment contains {}
  3. Otherwise the st4sd-runtime uses the definition for the environment name from the selected platform. If there is no definition in the active platform the st4sd-runtime falls back to the default platform.
  4. If an environment defines a DEFAULTS key then that key is expected to have the format VAR1:VAR2:VAR3.... Other options in the environment could reference the aforementioned vars using the $VAR and ${VAR} notation and these options will be resolved using their matching keys in the default environment.
    1. Any $VAR and ${VAR} references not matched by DEFAULTS keys will be resolved using the active shell(workflow launch environment).
    2. If a variable is defined in DEFAULTS but there is no value for it in the default environment then treat it as if it was never in the DEFAULTS option in the first place. This means that references to it will remain as is. The system that is executing the component’s task will resolve such environment variables just in time.
  5. The runtime injects a couple of variables to the environment (INSTANCE_DIR, FLOW_EXPERIMENT_NAME, and FLOW_RUN_ID).

Tip: You can use the ccommand.py utility to get a list of all FlowIR details for a particular component of a workflow package using the -f (i.e. --flowirReport) option. Use --env to also view the contents of the component environment. Try targeting different platforms via the -p argument.

Default options

The careful reader will notice that the default platform does not contain an option for resourceManager.config.backend. How does the st4sd-runtime decide which backend to use?

Recall that the st4sd-runtime injects default values for the default.global blueprint which are then inherited by all components. The default value for resourceManager.config.backend is local which instructs the st4sd-runtime to spawn component tasks as vanilla operating system processes. You can find a detailed list of the ST4SD default values in the ST4SD documentation.

Key-outputs

Key-Outputs are named DataReferences that point to important paths which the virtual experiment produced.

Example:

output:
OptimisationResults:
data-in: stage1.ExtractEnergies/energies.csv:ref
description: homo/lumo results
type: csv

Above, output is a top-level dictionary whose keys are names of key-outputs. Each key points to a dictionary with this schema:

data-in: "a DataReference"
# Optional fields
description: "A human readable description of the file"
type: "e.g. csv, pdf, etc - this only used to label key-output"
stages:
- stage0
- stage1
The DataReference in "data-in" can use one of the following reference methods: :ref, :copy, or :output. Here, :copy is just an alias for :ref (i.e paths are not copied). Finally, :output is an alias to out.stdout:ref if the DataReference does not have a /fileRef. Otherwise, :output becomes an alias to /fileRef:ref.
If "data-in" does not contain a "stage$index." prefix then you can list stages to use as the "stage" prefix. If you provide multiple stages then later stages will override the path.

Interface and Properties

The interface of a virtual experiment (e.g. workflow) defines:

  • The specification used to describe input systems it processes e.g. SMILEs for small molecules
  • Instructions to extract the input systems from input data
  • Instructions to extract the values of properties that the virtual experiment computes

You can find more information about writing an interface here and a tutorial on how to use an interface here