
Project Types

This page describes how to structure, test, and run your ST4SD virtual experiment projects.

ST4SD virtual experiments often consist of a variety of files — such as configuration files, scripts, and restart hooks.

ST4SD supports two ways of structuring projects in order to collect these files together, so that your project can be tested and those files will be present when you run the virtual experiment.

Standard projects are flexible, allowing for multiple virtual experiment definitions to be bundled together and share files, like scripts and restart hooks.

Standalone projects support only a single virtual experiment and are best suited for workflows with many artifacts or with resources that change frequently (i.e., accumulate many commits).

Quick Guide to Selecting a Project Type

To identify the best project type for storing your virtual experiment, find the statement that is the closest match to your situation:

  • I have assets that need to be present when the virtual experiment runs. These assets are shared between multiple experiments and I can choose where to put them.
    • Use a standard project.
    • Recommendation: If you have an internet accessible location for hosting git repositories e.g. GitHub, store the workflows there.
  • I have assets that need to be present when the virtual experiment runs. However, these assets cannot be placed in a git repository due to size restrictions OR they live in a defined COS bucket from which they cannot be moved.
    • Use a standard project but specify the assets location as a runtime application-dependency source - see application dependencies.
    • Recommendation: If you have an internet accessible location for hosting git repositories e.g. GitHub, store the experiments there.
  • I have a production virtual experiment with multiple restart hooks and/or many required configuration files. I need strong version control and automated regression testing.
    • Use a standalone project.

Before beginning

Before diving into the sections below, there are a couple of things to be aware of.

Virtual Experiment Dependencies

The dependencies of a virtual experiment are a set of directories it requires to be present in the top-level of its instance directory when it runs. These directories are in addition to input, stages, and output which are always created.

They are specified in the virtual experiment configuration YAML using the key application-dependencies:

myworkflow.yaml
application-dependencies:
  default: # platform name
  - $DIRECTORY_NAME_ONE
  - $DIRECTORY_NAME_TWO
  - ...

In the configuration file these names can be used in references e.g. $DIRECTORY_NAME_ONE/myconftemplate.dat:ref.

When testing the configuration, the parser checks that all direct references to directories that aren’t called input are listed under application-dependencies; i.e., that the directories you use in the workflow configuration are specified and will be present.

To populate these directories at run time with the correct files we need to specify the dependency sources using one of the following options:

  1. Specifying them in a manifest, if using a standard project.

  2. Explicitly creating them, if using a standalone project.

  3. Specifying them when submitting an experiment via the st4sd-runtime-service.

Each of these options is explained in the sections below.

Use case

In the following sections we will illustrate the two project types with a workflow called myworkflow.yaml that has:

  • Two required configuration files:
    • configuration_template.txt
    • default_model.dat
  • Two restart hooks:
    • component_one_restart.py
    • component_two_restart.py

Standard Project

This type of project:

  • Requires a manifest.

  • Allows storing multiple virtual experiments in the same project.

  • Stores the files required for execution in the project itself.

You can test standard projects using the tool etest.py, available with st4sd-runtime-core. See testing projects for more details.

Virtual experiments belonging to a standard project live under a single root directory (which can have any name e.g., “my-experiments”) and contain at least one configuration file and optionally a related manifest file.

The configuration is a YAML file containing the FlowIR definition of the virtual experiment (more on FlowIR here), while the manifest is a YAML file defining which directories in the project the particular virtual experiment needs and where they will be accessible from when the workflow is running.

The directories specified in the manifest must be under the project root directory.

The manifest and configuration files for an experiment can have any name and be stored in separate directories, although a common pattern is to store them together in one directory beneath the root directory, naming them manifest.yaml and after the workflow (e.g. myworkflow.yaml), respectively.

For running via the st4sd-runtime-service the root directory must be stored on a remotely accessible git repository or a COS bucket. See projects and cloud execution for more details.

Standard projects allow having multiple (manifest+configuration file) pairs in the same project.

Writing a manifest

The content of the manifest file is a YAML dictionary whose keys/values are:

$APPLICATION_DEPENDENCY_NAME: $RELATIVE_PATH_TO_SOURCE_DIRECTORY[:$METHOD]

Where:

  • APPLICATION_DEPENDENCY_NAME is the name you use to refer to the directory containing the files in the virtual experiment configuration. At runtime a directory will be created in the top-level of the workflow instance with this name and populated with contents from the source directory based on METHOD (see below).
  • RELATIVE_PATH_TO_SOURCE_DIRECTORY is the path to the directory under the root directory of the standard project that you want to access when the virtual experiment runs.
  • METHOD defines how you want the directory to be made available. It can be copy or link. If no method is specified, copy is used, as it ensures the data will still be present in the instance directory after the experiment finishes.
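For example, a manifest entry that names an application dependency data, points it at a shared_data directory next to the manifest, and requests the copy method explicitly would look like this:

data: ../shared_data:copy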

Remember: the application dependencies in your manifest file must also be listed in your virtual experiment configuration under the top-level key application-dependencies (see here). This is required for testing as it allows checking that the manifest associates source-folders to the application dependencies the workflow expects.

If you host your standard project on GitHub and your manifest file contains application dependency sources that point to other folders in your workflow package, you should use the copy method. If you use link, the files under the application dependency sources will only be visible to components that use the local backend.

Example Layout

Here is an example of how the use case could be structured using the standard method:

myworkflows/
  workflowOne/
    - myworkflow.yaml
    - manifest.yaml
  shared_data/
    - configuration_template.txt
    - default_model.dat
  hooks/
    - __init__.py
    - component_one_restart.py
    - component_two_restart.py

The manifest file would be:

manifest.yaml
data: ../shared_data
hooks: ../hooks

The configuration file would be:

myworkflow.yaml
application-dependencies:
  default:
  - data
  - hooks

When such a virtual experiment is executed, the hooks will be run automatically and the data files will be available in data/. For example, myworkflow.yaml would use the default_model.dat file via the reference data/default_model.dat:ref.
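As an illustration, here is a minimal sketch of a FlowIR component that consumes this reference (the component name and executable are hypothetical, not part of the use case):

components:
- name: run_model # hypothetical component
  command:
    executable: bin/run_model.sh # hypothetical script in the package
    arguments: data/default_model.dat:ref
  references:
  - data/default_model.dat:ref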

Standalone Project

This type of project:

  • Does not require a manifest or any modification of the virtual experiment configuration.

  • Allows only one virtual experiment per project.

  • Stores resources/artifacts in the project itself.

You can test standalone projects using the tool etest.py, available with st4sd-runtime-core.

With this method the project is dedicated to a single virtual experiment: a single root directory contains all the associated files. The manifest is created automatically from the directories in the root of the project, so the developer does not need to write one.

This project structure ensures that:

  • All the artifacts related to a virtual experiment are kept together in a single version control history.
  • The history of the repository coincides with the history of a single virtual experiment, with no contamination from changes to other virtual experiments.

As such, this method is best suited for complex virtual experiments or ones that require tighter version controls.

Example Layout

Here is an example of how the use case could be structured using the standalone method:

myworkflow/
  conf/
    - flowir_package.yaml
  data/
    - configuration_template.txt
    - default_model.dat
  hooks/
    - __init__.py
    - component_one_restart.py
    - component_two_restart.py

When such a virtual experiment is executed, the hooks will be run automatically and the data files will be available in data/. For example, the reference data/default_model.dat:ref is valid.

See sum-numbers for a simple example of a virtual experiment defined in this way that you can also run.

Testing Projects

Virtual experiments defined in any of the ways described above can be tested using the tool etest.py available in st4sd-runtime-core. See here for instructions on how to install it locally (the command will be available system-wide).

If you are in JupyterLab, etest.py will be available when you open a terminal session. This allows you to clone the repository containing your workflow package into JupyterLab and test it there.

Standalone projects

In the case of standalone projects, simply cd to the experiment folder and execute:

etest.py --notestExecutables
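For example, using the standalone example layout above:

cd myworkflow/
etest.py --notestExecutables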

Standard projects

Since standard projects can contain multiple virtual experiments, you need to tell etest.py which configuration file and manifest to test:

etest.py --manifest=$PATH_TO_MANIFEST $PATH_TO_CONFIGURATION_FILE
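For example, to test the standard project example layout above from its root directory myworkflows/:

cd myworkflows/
etest.py --manifest=workflowOne/manifest.yaml workflowOne/myworkflow.yaml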

Projects and Cloud Execution

In ST4SD 2.0 we will add a method to quickly test virtual experiments without needing to store them in git or object-store, or to manually create a parameterised package definition.

To execute virtual experiments on OpenShift/Kubernetes (independently from their project type) they need to be placed in an accessible location.

There are two options:

  1. A remote git repository (e.g., GitHub).
  2. A Cloud Object Store (COS) bucket.

The storage location is normally decided at the start of development, with updates being pushed regularly as changes are made.

Storing in Git

It is strongly advised to use a git repository, at least for source code management. To create a local repository, type the following in the top level of your project:

git init
git add .
git commit -m "initial commit"

To then push the project to a remote git repository use:

git remote add origin $REMOTE_GIT_REPO
git push -u origin main

Storing in COS

Create a COS bucket as described in using Cloud Object Store and upload/copy your project directory to the created bucket.
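COS buckets are S3-compatible, so one way to upload the project is with the AWS CLI. This is a sketch, assuming the CLI is configured with your COS credentials; the endpoint and bucket name are placeholders:

# Sync the project directory to the bucket via the S3-compatible API
aws --endpoint-url $COS_ENDPOINT s3 sync myworkflow/ s3://$BUCKET_NAME/myworkflow/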

Running the Virtual Experiments in the Project

Once a project has been stored remotely there are two steps to run the experiments it defines:

  1. Create a parameterised virtual experiment package that points to the project’s storage location and add it to your registry.
  2. Launch the experiment via the st4sd-runtime-service REST API.

Providing external data to experiments

If dependencies, like configuration files, cannot be placed in the virtual experiment project, they can be supplied at runtime. This, however, requires users to know in advance which dependencies the experiment expects.

The first two options (the manifest of a standard project, and the directories of a standalone project) have already been explained. To use the third option, add the following fields when submitting your workflow to run:

experimentConfiguration = {
    # Experiment input options
    "volumes": [
        {
            "type": {"dataset": "$DATA_SET_NAME"},
            "applicationDependency": "$APPLICATION_DEPENDENCY_NAME",
            "subPath": "$PATH_TO_APPLICATION_DEPENDENCY_SOURCE_RELATIVE_TO_TOP_LEVEL_OF_BUCKET"
        }
    ]
}

The Datashim framework needs to be installed to supply application dependencies at runtime. Datashim itself requires OpenShift v4.X.

Example Layout

To use this method, create one or more COS buckets containing the workflow’s dependency sources. The example use case could be packaged using the following layout in a single bucket:

mybucket/
  data/
    - configuration_template.txt
    - default_model.dat
  hooks/
    - __init__.py
    - component_one_restart.py
    - component_two_restart.py

Next, create a Dataset for the bucket - we’ll call it my-workflow-deps. For this step, follow the instructions here. This is a one-time action.
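For reference, a Dataset backed by a COS bucket typically looks like the following sketch (based on the Datashim documentation; the credentials and endpoint are placeholders):

apiVersion: com.ie.ibm.hpsys/v1alpha1
kind: Dataset
metadata:
  name: my-workflow-deps
spec:
  local:
    type: "COS"
    accessKeyID: "$ACCESS_KEY_ID"
    secretAccessKey: "$SECRET_ACCESS_KEY"
    endpoint: "$COS_ENDPOINT"
    bucket: "mybucket"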

Now when launching the workflow use:

experimentConfiguration = {
    # Experiment input options
    "volumes": [
        {
            "type": {"dataset": "my-workflow-deps"},
            "applicationDependency": "hooks",
            "subPath": "hooks/"
        },
        {
            "type": {"dataset": "my-workflow-deps"},
            "applicationDependency": "data",
            "subPath": "data/"
        }
    ]
}
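You would then pass this payload when launching the experiment. A minimal sketch, assuming the ExperimentRestAPI client from st4sd-runtime-core; the service URL, token, and experiment identifier are placeholders:

import experiment.service.db

# Connect to the st4sd-runtime-service REST API (URL and token are placeholders)
api = experiment.service.db.ExperimentRestAPI(
    "https://$ST4SD_SERVICE_URL", cc_auth_token="$AUTH_TOKEN"
)

# Launch the experiment; the volumes in the payload populate the data/ and
# hooks/ application dependencies from the my-workflow-deps Dataset
rest_uid = api.api_experiment_start("myworkflow", payload=experimentConfiguration)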