The elaunch.py command line tool
The elaunch.py command line tool executes and monitors virtual experiments. You can use it to run experiments on your laptop or on a High Performance Computing cluster. When you submit a virtual experiment via the ST4SD API it is executed by elaunch.py.
- Running experiments with elaunch.py
- Checking if your experiment worked
- What is the output of my experiment ?
- What is the status of my experiment ?
- Troubleshooting
- How do I select an execution platform ?
- How to restart an experiment ?
Install elaunch.py
If you haven’t already, install the st4sd-runtime-core python package:
pip install "st4sd-runtime-core[develop]"
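Once the package is installed you can sanity-check that the launcher is available on your PATH. This is a quick sketch; it assumes the standard --help behaviour of command line tools to print the list of elaunch.py options:

# Confirm that elaunch.py is installed and on your PATH
which elaunch.py
elaunch.py --help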
Running experiments with elaunch.py
With elaunch.py you can run experiments - sets of files describing computational workflows - given their path: simply run elaunch.py <path to experiment>.
Note: running experiments locally with elaunch.py requires either Linux or macOS. Windows users should use a Virtual Machine (e.g. VirtualBox) or the Windows Subsystem for Linux (WSL).
Some experiments, like the example below, also require you to install docker on your machine.
For example, you can run the workflow nanopore-geometry-experiment like so:

# Get the directory containing the virtual experiment and cd into it
git clone https://github.com/st4sd/nanopore-geometry-experiment.git
cd nanopore-geometry-experiment

# Run elaunch.py specifying certain files in the directory created above
elaunch.py -i docker-example/cif_files.dat -l 40 --nostamp \
    --applicationDependencySource="nanopore-database=cif:copy" \
    nanopore-geometry-experiment.package
The experiment should take about 5 minutes to complete. The -l 40 option keeps the log printouts to the bare minimum, so don’t worry if the command is silent for a few minutes. When the experiment completes expect to see something similar to the following text on your terminal:
completed-on=2024-03-15 14:39:29.909105
cost=0
created-on=2024-03-15 14:38:04.402969
current-stage=stage3
exit-status=Success
experiment-state=finished
stage-progress=1.0
stage-state=finished
stages=['stage0', 'stage1', 'stage2', 'stage3']
Running an experiment creates a directory which contains the outputs. In this example we set the --nostamp argument, which instructs ST4SD not to include a timestamp in the directory it creates for the experiment. As such it creates the directory nanopore-geometry-experiment.instance. If you run the same command a second time elaunch.py will complain that the experiment instance already exists. Either remove the --nostamp argument, or delete the directory and retry elaunch.py. See Section What is the output of my experiment? for more information.
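If you prefer to keep --nostamp, a minimal sketch of cleaning up and re-running, based on the command shown earlier:

# Delete the previous instance directory, then re-run the experiment
rm -r nanopore-geometry-experiment.instance
elaunch.py -i docker-example/cif_files.dat -l 40 --nostamp \
    --applicationDependencySource="nanopore-database=cif:copy" \
    nanopore-geometry-experiment.package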
Experiment project types
Experiments can be packaged in two different ways. One way is the standalone project, which is what the example above uses. This type of project supports only a single virtual experiment and is best suited for workflows with many artifacts or resources that are actively changing (i.e. they have multiple commits).
Standalone projects contain:
- a conf directory with the experiment definition files
- (optional) a data directory with data files that the workflow steps can reference and the users may override at execution time
- (optional) additional custom directories that the workflow developers include for the workflow steps to reference
Another type is the standard project. These are flexible, allowing multiple virtual experiment definitions to be bundled together and share files, like scripts and restart hooks. They consist of:
- a YAML file that contains the experiment definition
- (optional) a manifest YAML file listing the directories that the virtual experiment needs and where they will be accessible from when it is running
Providing input files
Experiments typically require inputs to function properly. To view the inputs an experiment expects, use the command einputs.py <path to experiment>. Refer to the documentation of the experiment you’re trying to run to find out more about the necessary inputs.
To pass inputs to your experiment, use the -i ${path to input file} option of elaunch.py. In the example above we provide the input file cif_files.dat, which is located in the directory docker-example.
If you want to use an input file whose name is not the same as the one the experiment expects, you must map them explicitly with --input $local_path:$input_name. For example, to use the contents of the file /tmp/my-file.dat as the input file cif_files.dat above you would specify --input /tmp/my-file.dat:cif_files.dat.
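Putting this together with the earlier example, a sketch of such an invocation might look like this (the local file path /tmp/my-file.dat is just a placeholder):

# Map the local file /tmp/my-file.dat onto the input name cif_files.dat
elaunch.py --input /tmp/my-file.dat:cif_files.dat -l 40 --nostamp \
    --applicationDependencySource="nanopore-database=cif:copy" \
    nanopore-geometry-experiment.package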
Setting configuration options
Experiments may also come with configuration options that you can optionally override. We call these options variables and you can use einputs.py to get the list of variables (and their default values) for an experiment.
For example, here is the relevant section from the output of einputs.py nanopore-geometry-experiment.package:
optional:
  variables:
    global:
      numberOfNanopores: 1
      probeRadius_A: 1.4
      zeo_memory: 2Gi
Typically the experiment documentation explains what these variables control. To configure their values, put together a variables file using the format:

global:
  parameterName: value
Use this variables file with your experiment by specifying the elaunch.py argument -a ${path to variables file}. Take care when formatting the variables.yaml file; it should follow the indentation and syntax of YAML files.
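As an illustration, here is a minimal sketch that overrides the numberOfNanopores variable from the einputs.py output above (the value 2 and the file name my-variables.yaml are arbitrary choices):

# Write a variables file that overrides one of the experiment's variables
cat > my-variables.yaml <<'EOF'
global:
  numberOfNanopores: 2
EOF

# Pass the variables file to elaunch.py with -a
elaunch.py -i docker-example/cif_files.dat -a my-variables.yaml -l 40 --nostamp \
    --applicationDependencySource="nanopore-database=cif:copy" \
    nanopore-geometry-experiment.package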
Checking if your experiment worked
If the experiment works, elaunch.py prints exit-status=Success before it terminates and then exits with return code 0. You can find more information about the status of your experiment in the file ${package_name}-${timestamp}.instance/output/status.txt. For example, here’s a status.txt for a successful run of an experiment:
completed-on=2024-03-15 14:39:29.909105
cost=0
created-on=2024-03-15 14:38:04.402969
current-stage=stage3
exit-status=Success
experiment-state=finished
stage-progress=1.0
stage-state=finished
stages=['stage0', 'stage1', 'stage2', 'stage3']
If the experiment fails you will see the line exit-status=Failed in the logs of elaunch.py and it will exit with a return code other than 0. If the experiment failed after the instance directory was created you will see this information in the output/status.txt file too. Common reasons for failures are invalid syntax, missing input files, or requesting a compute resource that is not available. For more information on dealing with these errors see our Troubleshooting section.
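Because elaunch.py communicates success through its return code, you can script around it. This is a minimal sketch, assuming the same example experiment as above:

# Run the experiment and report the outcome based on the return code
if elaunch.py -i docker-example/cif_files.dat -l 40 --nostamp \
       --applicationDependencySource="nanopore-database=cif:copy" \
       nanopore-geometry-experiment.package; then
    echo "Experiment succeeded"
else
    # If the instance directory was created, status.txt records why the run failed
    grep exit-status nanopore-geometry-experiment.instance/output/status.txt
fi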
What is the output of my experiment ?
All outputs of the experiment are placed in the experiment instance directory. By default, this directory is ${package-name}-${timestamp}.instance and you will find it under the directory you were in when you ran elaunch.py. If you specify the --nostamp argument then elaunch.py omits the -${timestamp} part.
The experiment instance directory contains several nested directories, of which the most noteworthy are output and stages. Here is the full list of directories and their description:
- stages: contains one directory per stage of your experiment. Each stage directory contains one directory for each of the working directories of the components in that stage. Components store any files they produce, as well as text they print to the terminal, under their working directory
- output: contains the runtime logs and files with metadata about the outputs and status of your experiment
- inputs: contains the input files you provided, including any variable files
- data: (optional) contains files that the workflow definition bundles and the workflow steps can reference. Users may optionally override those files when they launch an experiment
- conf: contains the experiment definition
The output directory
The output directory contains the following files:
- experiment.log: the logs of the elaunch.py process
- status.txt: the final status of the experiment (see the status printout above for an example)
- status_details.json: similar to the above but easier to consume programmatically
- output.txt: contains metadata about the key files that your experiment produces, i.e. key-outputs. This file gets updated whenever one of your tasks produces one of these named files. It contains information such as their path relative to the root of the instance directory, modification time, etc.
- output.json: similar to the above but easier to consume programmatically
- properties.csv: (optional) if your experiment defines its interface, this file contains the measured properties of your experiment
- input-ids.json: (optional) if your experiment defines its interface, this file contains an array with the input ids that your experiment processed
- additional_input_data.json: (optional) if your experiment defines its interface, this file contains a dictionary whose keys are input ids and whose values are additional input data (e.g. absolute paths) associated with the corresponding input id
The stages directory
A virtual experiment is a computational workflow that executes tasks. Task outputs are organized under the stages directory like so: stages/stage${index}/${task-name}. To find out which tasks are in your experiment, read the experiment definition or look at the file structure of the stages directory.
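For example, a sketch of how you might browse the task outputs of the instance created earlier (the component names you see depend on the experiment):

# List the working directories of the tasks that ran in stage 0
ls nanopore-geometry-experiment.instance/stages/stage0/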
Components specify which stage they belong to and by default they are all part of stage 0. Generally, stages help you create logical groups of components. They do not really play a role in scheduling decisions, except for some special cases which are outside the scope of this document.
Understanding an experiment’s execution requirements
The experiment documentation should explain what is required to execute it. An experiment contains a set of tasks, and elaunch.py submits each task to the backend that the task selects. This means that if the machine on which you run elaunch.py does not support the backend that a task selects, elaunch.py cannot run that task.
How to run elaunch with LSF ?
Some experiments can launch tasks using the IBM Spectrum LSF batch scheduler. If an experiment supports execution on LSF it should say so in its documentation and explain how to launch using it.
To launch an experiment that supports LSF you also need to install the official lsf-python-api python module:

. /path/to/profile.lsf
git clone https://github.com/IBMSpectrumComputing/lsf-python-api.git
cd lsf-python-api
python3 setup.py build
python3 setup.py install
Check the homepage of lsf-python-api for more information.
How to override experiment configuration data files
Experiments may optionally bundle data files which you may override. The experiment documentation should explain what these files are and what your options are for overriding them. Additionally, einputs.py displays the names of the data files that an experiment references.
Store outputs to S3
Experiments may optionally upload their key-outputs to S3 after termination. You can instruct elaunch.py to upload these files to S3 using the --s3StoreToURI parameter. When using this parameter, you must also specify exactly one of the parameters --s3AuthWithEnvVars or --s3AuthBearer64.
Example:
export bucket="a-bucket"
export path_in_bucket="optional/path"
export S3_ACCESS_KEY_ID="s3 access key id"
export S3_SECRET_ACCESS_KEY="s3 secret access key"
export S3_END_POINT="s3 end point"

elaunch.py --s3StoreToURI s3://${bucket}/${path_in_bucket} \
    --s3AuthWithEnvVars path/to/experiment
When --s3StoreToURI is set, after the experiment terminates elaunch.py will start uploading the key-outputs to the S3 bucket you provided, under the specified ${path_in_bucket}. elaunch.py replaces occurrences of the %(instanceDir)s literal in --s3StoreToURI with the name of the experiment instance. For example, you can use this to store the key-outputs of multiple workflow instances in the same bucket.
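Building on the substitution rule above, a sketch of such a URI might look like this (the experiments/ prefix is an arbitrary choice):

# Each instance's key-outputs end up under a prefix named after the instance directory
elaunch.py --s3StoreToURI "s3://${bucket}/experiments/%(instanceDir)s" \
    --s3AuthWithEnvVars path/to/experiment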
Alternatively, you can base64-encode the JSON representation of the dictionary {"S3_ACCESS_KEY_ID": "val", "S3_SECRET_ACCESS_KEY": "val", "S3_END_POINT": "val"} and use the --s3AuthBearer64 parameter instead:
export bucket="a-bucket"
export path_in_bucket="optional/path"
export json="{\"S3_ACCESS_KEY_ID\": \"val\", \"S3_SECRET_ACCESS_KEY\": \"val\", \"S3_END_POINT\": \"val\"}"
export s3_auth=`echo "${json}" | base64`

elaunch.py --s3StoreToURI s3://${bucket}/${path_in_bucket} \
    --s3AuthBearer64 ${s3_auth} path/to/experiment
What is the status of my experiment ?
The elaunch.py script periodically stores information about the status of your experiment instance under its $instanceDir directory. You can use einspect.py to see the current status of tasks in your experiment instance.
Here is an example output of running einspect.py after a sum-numbers experiment terminates:
cd sum-numbers-2024-03-15T143804.402969.instance
einspect.py -f all

WARNING MainThread root : <module> 2024-03-15 14:39:50,782: No instance given - checking if inside one
========== STAGE 0 ==========
Components using engine-type: engine
You may also see a summary of your status in the $instanceDir/output/status.txt file:

completed-on=2024-03-15 14:39:29.909105
cost=0
created-on=2024-03-15 14:38:04.402969
current-stage=stage3
exit-status=Success
experiment-state=finished
stage-progress=1.0
stage-state=finished
stages=['stage0', 'stage1', 'stage2', 'stage3']
The current status of your experiment is the value of exit-status.
Troubleshooting
If the exit-status of your experiment instance is Failed then this means that at least one of your components was unable to terminate successfully. You can find the name of the component that caused the experiment to fail in the status file and printout.
Here is an example:
completed-on=2024-05-23 09:44:03.757679
cost=0
created-on=2024-05-23 09:42:19.223491
current-stage=stage1
error-description=Stage 1 failed. Reason:\n3 jobs failed unexpectedly.\nJob: stage1.PartialSum0. Returncode 1. Reason KnownIssue\nJob: stage1.PartialSum2. Returncode 1. Reason KnownIssue\nJob: stage1.PartialSum1. Returncode 1. Reason KnownIssue\n
exit-status=Failed
experiment-state=finished
stage-progress=0.5
stage-state=failed
The error reports that multiple components failed: stage1.PartialSum0, stage1.PartialSum1, stage1.PartialSum2.
You may also get a full view of the state of the experiment by using the einspect.py -f all tool.
========== STAGE 0 ==========
Components using engine-type: engine
reference, state, backend, isWaitingOnOutput, engineExitReason, lastTaskRunTime, lastTaskRunState
stage0.GenerateInput, finished, local, True, Success, 0:00:00.358677, finished

========== STAGE 1 ==========
Components using engine-type: engine
After you spot a Failed component, try looking at the files it produced, including its stdout and stderr (for some backends both streams get fed into stdout). Recall that you can find these files under $INSTANCE_DIR/stages/stage<stage index>/<component name>/. Look for the out.stdout and out.stderr files.
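For example, a sketch of how you might inspect one of the failed components reported above (the instance directory path is a placeholder for your own):

# Inspect the stdout and stderr of the failed stage1.PartialSum0 component
INSTANCE_DIR=path/to/your/experiment.instance
cat "${INSTANCE_DIR}/stages/stage1/PartialSum0/out.stdout"
cat "${INSTANCE_DIR}/stages/stage1/PartialSum0/out.stderr"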
Sometimes, a component fails because one of its predecessors (direct or indirect) produced unexpected output. To find the predecessors of a component, look at $INSTANCE_DIR/conf/flowir_instance.yaml, locate the component you are investigating, and then follow its predecessors by looking at the references of the component. You can then investigate the output files and stdout/stderr of those components to see if you can spot why the downstream component failed.
Advanced experiments may also use Restart Hooks to customize the restart logic of components. Additionally, a restart hook may print logs to the terminal using a logger that is associated with the execution engine of the component it controls. These logs will contain the text eng.${lowercase of the component reference}. For example, the logger of the component Foo in stage 0 will contain the text eng.stage0.foo in its logs. You can find the logs of the restart hooks in the terminal output of the elaunch.py process (which is also archived under $INSTANCE_DIR/output/experiment.log).
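A quick way to pull those lines out of the archived log, sticking with the hypothetical Foo component mentioned above:

# Find restart-hook log lines for the component Foo in stage 0
grep "eng.stage0.foo" "${INSTANCE_DIR}/output/experiment.log"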
How do I select an execution platform ?
Often, workflows support multiple execution environments such as Cloud (e.g. Kubernetes/OpenShift), HPC, or even personal devices like laptops. ST4SD uses the concept of an execution platform to help workflow developers define how their workflows should execute under different execution environments. Platforms make it possible to implement generic components that are specialized for different purposes when different platforms are selected. This is particularly useful when working with packages that can utilize various kinds of HPC resources (e.g. a cluster fitted with LSF, a Kubernetes installation, etc.). For example, a component can be configured to utilize a certain number of GPUs when it targets platform A but exclusively use CPUs on platform B. You can find more information about platforms in our docs.
Use einputs.py <path to package> to find a list of the platforms that an experiment supports. The experiment documentation should explain the requirements for executing the experiment with any of the platforms. To select a platform use the --platform command-line argument of elaunch.py. If you don’t provide the --platform parameter then elaunch.py selects the platform called default, which is the default platform of experiments.
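For example, a sketch of selecting a platform for the earlier experiment (the platform name openshift is hypothetical; use one of the platforms that einputs.py reports):

# List the platforms the experiment supports
einputs.py nanopore-geometry-experiment.package

# Launch the experiment on one of those platforms
elaunch.py --platform openshift -i docker-example/cif_files.dat -l 40 --nostamp \
    --applicationDependencySource="nanopore-database=cif:copy" \
    nanopore-geometry-experiment.package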
How to restart an experiment ?
Sometimes it’s useful to restart a previously completed experiment instance instead of starting a brand new instance. For example, you can modify a script that a component in the instance used and then restart all components from a specific stage index onwards.
To restart an existing instance from a given stage index use elaunch.py --restart <stageindex> ... path/to/dir.instance. All components from stage <stageindex> onwards will be restarted. This means that elaunch.py will run the logic of their restart hook and may re-run a component depending on the output of the restart method. Use the --noRestartHooks option with --restart to skip the restart hook logic and re-run the components. Find out more about restart hooks in our docs.
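As an illustration, a minimal sketch of restarting the example instance from stage 1 onwards (the stage index and instance path are placeholders for your own):

# Restart all components from stage 1 onwards, running their restart hooks
elaunch.py --restart 1 nanopore-geometry-experiment.instance

# Same, but skip the restart hooks and simply re-run the components
elaunch.py --restart 1 --noRestartHooks nanopore-geometry-experiment.instance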
Learn more
Write experiments
Get an introduction to writing virtual experiments with ST4SD Core.
Exploring the Registry UI
Learn about all the features of our web interface for browsing and examining virtual experiment packages and runs. You can visit the ST4SD Global Registry for a first look.
No Code, No Fuss creation of Experiments
Use an interactive Build Canvas and a Graph Library to create and modify experiments straight from your browser.