Memoization
ST4SD can automatically reuse results of previous calculations rather than executing them again. This feature is called memoization.
Memoization can dramatically increase efficiency as many virtual experiments internally perform the same calculations on input systems, even if they measure different properties.
An example is molecule geometry optimization via Density Functional Theory. With memoization, ST4SD can recognise when an experiment instance is about to perform a geometry optimization that has already been completed for the same molecule, even if it was executed in a different type of experiment. Instead of running this potentially expensive calculation again, ST4SD uses the existing results.
How to use memoization
To use memoization, add the following option to the payload you pass to api.api_experiment_start() when starting a virtual experiment:
payload['additionalOptions'] = ['--useMemoization=true']
When this option is specified, the st4sd-runtime checks, before executing each component in the virtual experiment, whether that component has executed before. If it has, the runtime uses the already computed results rather than executing the component.
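A minimal sketch of building such a payload is shown here. The helper `enable_memoization` is hypothetical and not part of the ST4SD API; it only adds the flag shown above to `additionalOptions` without duplicating it. The commented-out call at the end assumes you already have an `api` object and an experiment identifier, and that the start call takes the identifier followed by the payload.

```python
import copy

MEMOIZATION_FLAG = "--useMemoization=true"


def enable_memoization(payload):
    """Return a copy of payload with the memoization flag in additionalOptions.

    Hypothetical helper for illustration, not part of the ST4SD API.
    """
    updated = copy.deepcopy(payload)
    options = updated.setdefault("additionalOptions", [])
    if MEMOIZATION_FLAG not in options:
        options.append(MEMOIZATION_FLAG)
    return updated


payload = enable_memoization({})
print(payload)

# Assumed usage, with `api` and `experiment_id` defined elsewhere:
# api.api_experiment_start(experiment_id, payload)
```

Copying the payload rather than mutating it keeps the caller's original dictionary unchanged, which is convenient when the same base payload is reused with and without memoization.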
The key concept users need to understand is how the st4sd-runtime decides when components are the same. This is covered in the next sections.
How the runtime decides if a component has run before
Objective
The objective of the ST4SD memoization system is, given a ready-to-execute component, to reuse the result of a previously executed component if that component had:
- The same program version
- The same command line arguments
- The same input data
Additional Requirements
- A match should be found even if the previous execution was on a different architecture or if the resource request was different (number of CPUs, or CPU vs GPU).
- Matching should be supported even if the inputs contain inconsequential environment-specific or time-specific information (e.g. timestamps) that makes them not strictly equivalent.
- Memoization should be significantly faster than executing the component. In particular, we want to avoid directly comparing the contents of directories with many large input files (e.g. simulation outputs 10-100 GB in size), which could be computationally expensive.
- The system should be transparent i.e. developers do not need to perform additional development tasks to enable it.
Method
ST4SD uses a fast and transparent method to generate a high-fidelity memoization key for each component. If the runtime finds a previously run component has a matching key then that component’s results are re-used.
Briefly, for all nodes the memoization key includes information on
- The command line
- The container image name (if using a container)
- Hash of directly referenced input files
- For a reference to component directories: The memoization key of the producer node, followed by the relative path of the reference to the working directory of the component
Note: Resource requests are not included in the memoization key. This satisfies the first additional requirement.
Note: The contents of component directories are not hashed if the directory as a whole is referenced. This directly fulfils the third additional requirement. It also fulfils the second additional requirement: if files contain inconsequential differences such as timestamps, you can reference the directory instead of the files.
Our algorithm provides high fidelity even without inspecting the contents of directories because
- In this case a component’s memoization key includes the keys of the component’s producer nodes.
- The input files referenced by source nodes are hashed.
This means if a component matches a previously executed one, then they are part of the same execution chain and the inputs to the source of the chain are identical.
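The chaining idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the st4sd-runtime algorithm: the function name `memoization_key`, its parameters, and the use of SHA-256 over a JSON record are all choices made for this example, as are the command lines and image name.

```python
import hashlib
import json


def file_hash(content: bytes) -> str:
    """Hash the contents of a directly referenced input file."""
    return hashlib.sha256(content).hexdigest()


def memoization_key(command_line, image=None, input_files=(), producer_refs=()):
    """Illustrative key, NOT the st4sd-runtime implementation.

    input_files: iterable of bytes, contents of directly referenced files.
    producer_refs: iterable of (producer_key, relative_path) pairs, one per
        reference into a component directory.
    Resource requests are deliberately not a parameter.
    """
    record = {
        "command_line": command_line,
        "image": image,
        "files": sorted(file_hash(c) for c in input_files),
        "producers": sorted([key, path] for key, path in producer_refs),
    }
    encoded = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


# Source nodes: their directly referenced input files are hashed.
source_a = memoization_key("prepare mol.xyz", "chem:1.0", input_files=[b"molecule A"])
source_b = memoization_key("prepare mol.xyz", "chem:1.0", input_files=[b"molecule A"])
source_c = memoization_key("prepare mol.xyz", "chem:1.0", input_files=[b"molecule B"])

# Consumers reference the whole producer directory (relative path ""),
# so only the producer's key enters their own key.
consumer_a = memoization_key("optimize", "chem:1.0", producer_refs=[(source_a, "")])
consumer_b = memoization_key("optimize", "chem:1.0", producer_refs=[(source_b, "")])
consumer_c = memoization_key("optimize", "chem:1.0", producer_refs=[(source_c, "")])

print(consumer_a == consumer_b)  # True: same chain, identical source inputs
print(consumer_a == consumer_c)  # False: the source input differs
```

No directory contents are read when computing the consumer keys, yet a change to the source input still propagates down the chain through the producer key.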
You can read more details of our method in this paper.
Potential Vulnerabilities
We call cases where our memoization key does not correctly identify reuse opportunities, with respect to the stated objectives, vulnerabilities. They cause either false negatives (previous results were not reused when they could have been) or false positives (previous results were reused when they should not have been). False positives are of greater concern.
- False-Positive: If the code of a program changes (e.g. a version change) without any indication of the change, results produced by previous executions will be identified as equivalent.
  - To avoid this, developers should ensure that their virtual experiment indicates changes in code versions, e.g. by modifying a version tag in the container name or adding one to the program name.
  - If you use the full SHA digest for images this cannot happen. However, this would not allow memoization with multi-arch images.
- False-Positive: If a component consumes non-component directories and the contents of those directories change while no other inputs change, two runs of the component will still match.
  - Developers should therefore avoid references to non-component directories.
- False-Negative: If a component consumes the same information (e.g. a molecule's SMILES string) from one of its producer components, but references the producer directory rather than a specific file in that directory, previous calculations it has performed on this molecule may not be reused.
  - This is because the memoization key of the producer changes when its inputs change.
  - For example, if the same molecule appears in two different lists provided to the producer, the producer's memoization key will be different for each list. This in turn leads to a different memoization key for the consumer and hence no match.
- False-Negative: If a component can itself process multiple systems (e.g. molecules), then passing it a subset of molecules it has processed before will not produce a match.
  - This is because the input to the component has changed, so its output will be different, even though it partially overlaps with previous runs.
  - This situation lies outside the bounds of ST4SD's memoization objective. If developers want memoization to function in this case, they need to split the inputs in the workflow.
  - Technically, this requires knowledge of how the specific component functions, which runs counter to ST4SD's black-box treatment of components and its aim of transparent memoization. See this paper for more.
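The last false negative, and the effect of splitting the inputs, can be illustrated with a toy key that hashes a component's command line together with its full input. `toy_key` is a stand-in written for this example and is not the runtime's key function; the command line and molecule list are likewise placeholders.

```python
import hashlib


def toy_key(command_line, inputs):
    """Toy stand-in for a memoization key: hash of command line plus inputs."""
    text = command_line + "\n" + "\n".join(inputs)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


previous_run = ["CCO", "CCN", "c1ccccc1"]
new_run = ["CCO", "CCN"]  # subset of the molecules processed before

# One component processing the whole list: the keys differ, so nothing is reused.
batch_match = toy_key("optimize", previous_run) == toy_key("optimize", new_run)

# One component instance per molecule: every molecule in the new run has a
# key that the previous run already produced.
previous_keys = {toy_key("optimize", [smiles]) for smiles in previous_run}
split_matches = [toy_key("optimize", [smiles]) in previous_keys for smiles in new_run]

print(batch_match)    # False
print(split_matches)  # [True, True]
```

Splitting is a workflow design decision: it trades one large component for many small ones so that each molecule gets its own key.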
How memoization is implemented
The memoization mechanism of ST4SD is built on top of the st4sd-datastore microservices.
At runtime the st4sd-runtime-core always generates a memoization key for each component in the virtual experiment it executes. This happens regardless of whether memoization is on.
By default every virtual experiment you run is registered with the st4sd-datastore. Part of the information stored for each component in the run is its generated memoization key.
When memoization is switched on, after generating the memoization key for a to-be-executed component, the st4sd-runtime-core queries the st4sd-datastore to see if any previous component has the same key. If one does, st4sd-runtime-core fetches the output of the previous component from the st4sd-datastore and places it in the working directory of the to-be-executed component. It then skips execution of that component and moves on to the components that depend on it.
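The control flow described above can be sketched as follows. This is a simplified model under stated assumptions, not the st4sd-runtime-core code: a plain dictionary stands in for the st4sd-datastore, and `run_component` and its parameters are hypothetical names introduced for this example.

```python
def run_component(key, execute, datastore, use_memoization=True):
    """Sketch of the lookup-then-execute flow.

    key: the memoization key generated for the to-be-executed component.
    execute: callable that runs the component and returns its outputs.
    datastore: dict mapping memoization key -> outputs, standing in for the
        st4sd-datastore (an assumption; the real service is a microservice).
    Returns a tuple (outputs, reused).
    """
    if use_memoization and key in datastore:
        # A previous component has the same key: reuse its outputs, skip execution.
        return datastore[key], True
    outputs = execute()
    # The key is recorded whether or not memoization is switched on.
    datastore[key] = outputs
    return outputs, False


datastore = {}
calls = []


def expensive_step():
    calls.append("ran")
    return {"out.txt": "placeholder result"}


first = run_component("key-1", expensive_step, datastore)
second = run_component("key-1", expensive_step, datastore)
print(first[1], second[1], len(calls))  # False True 1
```

Because keys are stored even when memoization is off, a run that did not use memoization can still serve as a source of results for later runs that do.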