Project Types
This page describes how to structure, test, and run your ST4SD virtual experiment projects.
- Quick guide to selecting a project type
- Before beginning
- Standard project
- Standalone project
- Testing projects
- Projects and cloud execution
- Providing external data to experiments
ST4SD virtual experiments often consist of a variety of files — such as configuration files, scripts, and restart hooks.
ST4SD supports two ways of structuring projects in order to collect these files together, so that your project can be tested and those files will be present when you run the virtual experiment.
Standard projects are flexible, allowing for multiple virtual experiment definitions to be bundled together and share files, like scripts and restart hooks.
Standalone projects only support a single virtual experiment and are best suited for workflows with many artifacts or resources that are actively changing (i.e., they have multiple commits).
Quick Guide to Selecting a Project Type
To identify the best project type for storing your virtual experiment, find the statement that is the closest match to your situation:
- I have assets that need to be present when the virtual experiment runs. These assets are shared between multiple experiments and I can choose where to put them.
- Use a standard project.
- Recommendation: If you have an internet-accessible location for hosting git repositories, e.g. GitHub, store the workflows there.
- I have assets that need to be present when the virtual experiment runs. However, these assets cannot be placed in a git repo due to size restrictions OR they live in a specific COS bucket from which they cannot be moved.
- Use a standard project but specify the assets location as a runtime application-dependency source - see application dependencies.
- Recommendation: If you have an internet-accessible location for hosting git repositories, e.g. GitHub, store the experiments there.
- I have a production virtual experiment with multiple restart hooks and/or many required configuration files. I need to have strong version control and also automated regression testing.
- Use a standalone project.
Before beginning
Before diving into the sections below, there are a couple of things to be aware of.
Virtual Experiment Dependencies
The dependencies of a virtual experiment are a set of directories it requires to be present in the top-level of its instance directory when it runs.
These directories are in addition to `input`, `stages`, and `output`, which are always created.
They are specified in the virtual experiment configuration YAML using the key `application-dependencies`:
```yaml
# myworkflow.yaml
application-dependencies:
  default: # platform name
    - $DIRECTORY_NAME_ONE
    - $DIRECTORY_NAME_TWO
    - ...
```
In the configuration file these names can be used in references, e.g. `$DIRECTORY_NAME_ONE/myconftemplate.dat:ref`.
When testing the configuration the parser checks that all direct references to directories that aren't called `input` are listed under `application-dependencies`; i.e., that the directories you are using in the workflow configuration are specified and will be present.
To populate these directories at run time with the correct files, we need to specify the dependency sources using one of the following options:
- Specifying them in a manifest, if using a standard project.
- Explicitly creating them, if using a standalone project.
- Specifying them when submitting an experiment via the `st4sd-runtime-service`.
Each of these options is explained in the sections below.
Use case
In the following sections we will illustrate the two project types with a workflow called `myworkflow.yaml` that has:
- Two required configuration files:
  - `configuration_template.txt`
  - `default_model.dat`
- Two restart hooks:
  - `component_one_restart.py`
  - `component_two_restart.py`
Standard Project
Virtual experiments belonging to a standard project live under a single root directory (which can have any name, e.g. "my-experiments"), and each experiment consists of at least one configuration file and, optionally, a related manifest file.
The configuration is a YAML file containing the FlowIR definition of the virtual experiment (more on FlowIR here), while the manifest is a YAML file defining which directories in the project the particular virtual experiment needs and where they will be accessible from when the workflow is running.
The manifest and configuration files for an experiment can have any name and be stored in separate directories, although a common pattern is to store them together in one directory beneath the root directory and call them `manifest.yaml` and the same name as the workflow (e.g. `myworkflow.yaml`), respectively.
Writing a manifest
The content of the manifest file is a YAML dictionary whose keys/values are:
```
$APPLICATION_DEPENDENCY_NAME: $RELATIVE_PATH_TO_SOURCE_DIRECTORY[:$METHOD]
```
Where:
- `APPLICATION_DEPENDENCY_NAME` is the name you use to refer to the directory containing the files in the virtual experiment configuration. At runtime a directory with this name will be created in the top-level of the workflow instance and populated with contents from the source directory based on `METHOD` (see below).
- `RELATIVE_PATH_TO_SOURCE_DIRECTORY` is the path, under the root directory of the standard project, to the directory that you want to access when the virtual experiment runs.
- `METHOD` defines how you want the directory to be made available. It can be `copy` or `link`. If no method is specified, `copy` is used, as it ensures the data will be present in the instance when it finishes.
Example Layout
Here is an example of how the use case could be structured using the standard method:
```
myworkflows/
  workflowOne/
    - myworkflow.yaml
    - manifest.yaml
  shared_data/
    - configuration_template.txt
    - default_model.dat
  hooks/
    - __init__.py
    - component_one_restart.py
    - component_two_restart.py
```
The manifest file would be:
```yaml
# manifest.yaml
data: ../shared_data
hooks: ../hooks
```
The configuration file would be:
```yaml
# myworkflow.yaml
application-dependencies:
  default:
    - data
    - hooks
```
When such a virtual experiment is executed the `hooks` will be automatically run and the data files will be available in `data/`.
For example, the path to the `default_model.dat` file would be used in `myworkflow.yaml` via the reference `data/default_model.dat:ref`.
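Since the manifest format is this simple, it can be handy to sanity-check one before pushing a project. The following is an illustrative sketch, not an ST4SD tool; it assumes PyYAML is installed and uses the example layout above:

```python
# Illustrative sanity check (not part of ST4SD): parse a manifest and verify
# that each application-dependency source directory exists and that the
# optional :$METHOD suffix is one of the supported values.
import os
import yaml  # assumes PyYAML is installed

manifest_path = "myworkflows/workflowOne/manifest.yaml"
with open(manifest_path) as f:
    manifest = yaml.safe_load(f)

base_dir = os.path.dirname(manifest_path)
for name, entry in manifest.items():
    path, _, method = str(entry).partition(":")
    method = method or "copy"  # copy is the default method
    if method not in ("copy", "link"):
        raise ValueError(f"{name}: unknown method {method!r}")
    source = os.path.normpath(os.path.join(base_dir, path))
    status = "OK" if os.path.isdir(source) else "MISSING"
    print(f"{name} -> {source} ({method}): {status}")
```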
Standalone Project
With this method the project is dedicated to a single virtual experiment, with a single root directory containing all the associated files. The manifest is created automatically from all the directories in the root of the project, so the developer does not need to write one.
This project structure ensures that:
- All the artifacts related to a virtual experiment are kept together in a single version control history.
- The history of the repository coincides with the history of a single virtual experiment, with no contamination from changes to other virtual experiments.
As such, this method is best suited for complex virtual experiments or ones that require tighter version controls.
Example Layout
Here is an example of how the use case could be structured using the standalone method:
```
myworkflow/
  conf/
    - flowir_package.yaml
  data/
    - configuration_template.txt
    - default_model.dat
  hooks/
    - __init__.py
    - component_one_restart.py
    - component_two_restart.py
```
When such a virtual experiment is executed the `hooks` will be automatically run and the data files will be available in `data/`.
For example, the following reference is valid: `data/default_model.dat:ref`.
See sum-numbers for a simple example of a virtual experiment defined in this way that you can also run.
Testing Projects
Virtual experiments defined in any of the ways described above can be tested using the tool `etest.py`, available in `st4sd-runtime-core`.
See here for instructions on how to install it locally (the command will be available system-wide).
Standalone projects
In the case of standalone projects, simply `cd` to the experiment folder and execute:

```bash
etest.py --notestExecutables
```
Standard projects
Since standard projects can contain multiple virtual experiments, testing requires pointing the tool at the specific configuration and manifest to test:

```bash
etest.py --manifest=$PATH_TO_MANIFEST $PATH_TO_CONFIGURATION_FILE
```
Projects and Cloud Execution
To execute virtual experiments on OpenShift/Kubernetes (independently of their project type), they need to be placed in an accessible location.
There are two options:
- A remote git repository (e.g., GitHub).
- A Cloud Object Store (COS) bucket.
Storing in Git
It is strongly advised to use a git repository, at least for source code management. To create a local repository, type the following in the top level of your project:

```bash
git init
git add .
git commit *
```
To then push the project to a remote git repository use:
```bash
git remote add origin $REMOTE_GIT_REPO
git push -u origin main
```
Storing in COS
Create a COS bucket as described in using Cloud Object Store and upload/copy your project directory to the created bucket.
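If you prefer to script the upload, COS exposes an S3-compatible API. Below is a minimal, hypothetical sketch using boto3 with HMAC credentials; the endpoint, credentials, bucket, and directory names are placeholders:

```python
# Hypothetical helper: upload a project directory to a COS bucket via the
# S3-compatible API. Endpoint and credentials below are placeholders.
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.us-east.cloud-object-storage.appdomain.cloud",
    aws_access_key_id="$ACCESS_KEY",
    aws_secret_access_key="$SECRET_KEY",
)

def upload_dir(local_dir: str, bucket: str) -> None:
    """Mirror local_dir into the bucket, preserving relative paths as keys."""
    for root, _, files in os.walk(local_dir):
        for name in files:
            path = os.path.join(root, name)
            key = os.path.relpath(path, local_dir)
            s3.upload_file(path, bucket, key)

upload_dir("myworkflow", "mybucket")
```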
Running the Virtual Experiments in the Project
Once a project has been stored remotely there are two steps to run the experiments it defines:
1. Register it with your ST4SD instance, e.g. by creating an entry for it via the `st4sd-runtime-service`.
2. Run it using the ST4SD Python API or command line tools.
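As a rough illustration of both steps, here is a hypothetical sketch using the `ExperimentRestAPI` client from `st4sd-runtime-core`; the URL, token, repository, and package details are placeholders, and the exact payload schema is described in the ST4SD API documentation:

```python
# Hypothetical sketch: register and start an experiment with the
# st4sd-runtime-core Python client. All identifiers are placeholders.
from experiment.service.db import ExperimentRestAPI

api = ExperimentRestAPI("https://$ST4SD_URL", cc_auth_token="$AUTH_TOKEN")

# Step 1: register the experiment, pointing at the remote git repository
package_definition = {
    "base": {
        "packages": [
            {
                "name": "main",
                "source": {
                    "git": {
                        "location": {
                            "url": "https://github.com/$ORG/$REPO.git",
                            "branch": "main",
                        }
                    }
                },
                "config": {
                    "path": "workflowOne/myworkflow.yaml",
                    "manifestPath": "workflowOne/manifest.yaml",
                },
            }
        ]
    },
    "metadata": {"package": {"name": "myworkflow"}},
}
api.api_experiment_push(package_definition)

# Step 2: start an instance and poll its status
rest_uid = api.api_experiment_start("myworkflow", payload={})
print(api.api_rest_uid_status(rest_uid))
```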
Providing external data to experiments
The first two options for providing dependency sources, a manifest and explicitly created directories, have already been explained above. To use the third option, supplying them via the `st4sd-runtime-service`, you add the following fields when submitting your workflow to run:
```python
experimentConfiguration = {
    # Experiment input options
    "volumes": [
        {
            "type": {"dataset": "$DATA_SET_NAME"},
            "applicationDependency": "$APPLICATION_DEPENDENCY_NAME",
            "subPath": "$PATH_TO_APPLICATION_DEPENDENCY_SOURCE_RELATIVE_TO_TOP_LEVEL_OF_BUCKET",
        }
    ]
}
```
Example Layout
To use this method create one or more COS buckets containing the workflow's dependency sources. The example use case could be packaged using the following layout in a single bucket:
```
mybucket/
  data/
    - configuration_template.txt
    - default_model.dat
  hooks/
    - __init__.py
    - component_one_restart.py
    - component_two_restart.py
```
Next, create a `Dataset` for the bucket - we'll call it `my-workflow-deps`.
For this step follow the instructions here.
This is a one-time action.
Now when launching the workflow use:

```python
experimentConfiguration = {
    # Experiment input options
    "volumes": [
        {
            "type": {"dataset": "my-workflow-deps"},
            "applicationDependency": "hooks",
            "subPath": "hooks/",
        },
        {
            "type": {"dataset": "my-workflow-deps"},
            "applicationDependency": "data",
            "subPath": "data/",
        },
    ]
}
```
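This dictionary is then passed as the launch payload. For instance, reusing the hypothetical `api` client from the sketch in the previous section:

```python
# Hypothetical usage, reusing the `api` client from the earlier sketch
rest_uid = api.api_experiment_start("myworkflow", payload=experimentConfiguration)
```

The volumes are mounted before the experiment starts, so the `hooks/` and `data/` application dependencies are populated from the `my-workflow-deps` Dataset at run time.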