Writing a Virtual Experiment Interface
Use this page to learn how to write a virtual experiment interface.
A core concept in ST4SD is a virtual experiment. This is a computational workflow that takes as input one or more systems of a given type, and produces as output values of properties of those systems.
This document explains how ST4SD developers can describe the inputs and output properties of their virtual experiments via an interface.
The interface of a virtual experiment defines:
- The specification used to describe the input systems it processes e.g. SMILES for small molecules
- Instructions to extract the input systems from input data
- Instructions to extract the values of properties that the virtual experiment computes
Once a virtual experiment has an interface, ST4SD can return a pandas.DataFrame containing the properties calculated by instances of the virtual experiment, as well as the ids of the input systems that each instance processed. This functionality is provided via the st4sd-datastore API and the st4sd-runtime-service API. See using a virtual experiment interface for further information.
Interface Definition
An interface is an optional top-level FlowIR key which describes the inputs and properties of a virtual experiment, as well as how to extract their values. For experiments using DSL, the output and interface fields are direct children of the entrypoint field instead. You can find an example here.
The general scheme of an interface is
    interface:
      description: #A description of the virtual experiment. Optional
      inputSpec:
        namingScheme: #The scheme/specification used to define your inputs e.g. SMILES
        inputExtractionMethod:
          $INPUT_EXTRACTION_METHOD_NAME: #The name of an input extraction method - see "Input Extraction Method" section for possibilities
            source: #Optional source method used to provide input to the extraction method. See the "Source Methods" section for potential values.
              ...
            args: #Optional arguments for the extraction method
The 2 main fields are:
- interface.inputSpec: A dictionary that describes the inputs of the virtual experiment and how to extract them
- interface.propertiesSpec: An array of dictionaries (one per property) that describes how to extract the values of the property
Within both fields the developer defines extraction methods which tell ST4SD how to extract values that the virtual experiment reads (input ids) and writes (property values).
- See input extraction methods for details on choices for that field
- See property extraction methods for details on choices for that field
Both input extraction methods and property extraction methods can have 2 sub-fields, source and args, which may be optional. If the source method is present, it must be one of the options outlined in source methods.
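The structure described above can be checked mechanically. The following is a minimal sketch and not part of ST4SD: `check_interface` and `_check_method` are hypothetical helpers that take an interface already parsed into a Python dictionary and report structural problems. The sketch assumes that the extraction and source method names listed on this page are the complete set, and that `inputSpec` is required.

```python
from typing import Any, Dict, List

# Method names documented on this page; treating them as the complete set is an assumption.
INPUT_METHODS = {"csvColumn", "hookGetInputIds"}
PROPERTY_METHODS = {"csvDataFrame", "hookGetProperties"}
SOURCE_METHODS = {"path", "keyOutput"}


def _check_method(where: str, methods: Dict[str, Any], allowed: set) -> List[str]:
    """Check one inputExtractionMethod or propertyExtractionMethod dictionary."""
    problems = []
    for name, options in (methods or {}).items():
        if name not in allowed:
            problems.append(f"{where}: unknown extraction method '{name}'")
        source = (options or {}).get("source") or {}
        for source_name in source:
            if source_name not in SOURCE_METHODS:
                problems.append(f"{where}.{name}: unknown source method '{source_name}'")
    if not methods:
        problems.append(f"{where}: no extraction method defined")
    return problems


def check_interface(interface: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: return a list of structural problems (empty if none)."""
    problems = []
    input_spec = interface.get("inputSpec")
    if input_spec is None:
        problems.append("inputSpec: missing")
    else:
        problems += _check_method(
            "inputSpec", input_spec.get("inputExtractionMethod"), INPUT_METHODS)
    for idx, prop in enumerate(interface.get("propertiesSpec") or []):
        where = f"propertiesSpec[{idx}]"
        if "name" not in prop:
            problems.append(f"{where}: missing name")
        problems += _check_method(
            where, prop.get("propertyExtractionMethod"), PROPERTY_METHODS)
    return problems


example = {
    "inputSpec": {
        "namingScheme": "SMILES",
        "inputExtractionMethod": {
            "csvColumn": {"source": {"path": "input/input_smiles.csv"},
                          "args": {"column": "SMILES"}}},
    },
    "propertiesSpec": [
        {"name": "band-gap",
         "propertyExtractionMethod": {
             "csvDataFrame": {"source": {"keyOutput": "FinalEnergies"}}}},
    ],
}
print(check_interface(example))  # an empty list means no problems were found
```

Returning a list of messages rather than raising on the first problem lets a developer see every issue in one pass.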
Input Extraction Methods
Input extraction methods are used by ST4SD to retrieve a list of the input system ids.
csvColumn
Use the csvColumn extraction method if the input ids of your experiment are defined in a column of an input CSV file which has column headers.
Options
    source:
      path: #The path SOURCE-METHOD. See source-methods for more
    args:
      column: #The name of the column in the CSV file containing the ids (the column header)
Example
    interface:
      inputSpec:
        namingScheme: 'SMILES'
        inputExtractionMethod:
          csvColumn:
            source:
              path: 'input/input_smiles.csv'
            args:
              column: "SMILES"
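Conceptually, csvColumn reads the values found under one column header. The sketch below illustrates that idea with the Python standard library only. It is not ST4SD's implementation: `read_id_column` is a hypothetical function, the CSV contents are made up, and the comma delimiter is an assumption.

```python
import csv
import io
from typing import List


def read_id_column(csv_text: str, column: str, delimiter: str = ",") -> List[str]:
    """Hypothetical stand-in for csvColumn: return the values under one column header."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    if column not in (reader.fieldnames or []):
        raise KeyError(f"column '{column}' not found in header {reader.fieldnames}")
    return [row[column] for row in reader]


# Illustrative contents for input/input_smiles.csv (made-up data).
csv_text = "SMILES,label\nCCO,ethanol\nO,water\n"
print(read_id_column(csv_text, "SMILES"))
```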
hookGetInputIds
Use hookGetInputIds when you want to provide your own python function for getting the input ids.
    from typing import Dict, List

    def get_input_ids(input_id_file: str, variables: Dict[str, str]) -> List[str]:
        '''
        Params:
            input_id_file (str): The path to the location of the file that contains input ids of the input systems. This comes from the `source.path` option in the interface YAML.
            variables (dict): A dictionary of the global and user variables passed to the virtual experiment instance
        Returns:
            A list of strings each of which is the id of an input system
        '''
Options
    source:
      path: #A path relative to the root directory of the virtual experiment instance. It points to the CSV file that contains the `input-ids`.
Example
    interface:
      inputSpec:
        namingScheme: 'SMILES'
        inputExtractionMethod:
          hookGetInputIds:
            source:
              path: 'input/input_smiles.csv'
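As an illustration of what such a hook might contain, the sketch below follows the `get_input_ids` signature given above and handles a case csvColumn does not cover: a file with no header row and one id per line. The file layout, the de-duplication, and the sample ids are assumptions made for this example.

```python
import os
import tempfile
from typing import Dict, List


def get_input_ids(input_id_file: str, variables: Dict[str, str]) -> List[str]:
    """Return one id per non-empty line of a headerless file, without duplicates."""
    ids: List[str] = []
    with open(input_id_file, encoding="utf-8") as handle:
        for line in handle:
            candidate = line.strip()
            # Skip blank lines and ids that were already seen, keeping first-seen order.
            if candidate and candidate not in ids:
                ids.append(candidate)
    return ids


# Try the hook on a throwaway file (made-up ids).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "input_smiles.csv")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("CCO\n\nO\nCCO\n")
    found_ids = get_input_ids(path, variables={})
print(found_ids)
```

The `variables` argument is unused here. A hook could read it to pick a column name or delimiter at run time.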
Property Extraction Methods
Property extraction methods conceptually produce a properties table which contains at least 2 columns: (input-id, $propertyName), where $propertyName is the name of the property in the propertiesSpec element using the extraction method. Note: in practice $propertyName will be transformed to lowercase.
csvDataFrame
Use this method if
- There is a single CSV file from which to extract the values of a particular property for all inputs
- The properties are stored in a column of this CSV file
- The input ids are stored in a column of this CSV file
Note:
The table created by this method must have column headers input-id and $PROPERTYNAME. The csvDataFrame property extractor can change the column names to these correct values using the renameColumns option (see Example).
Options
    source:
      $SOURCE_METHOD_NAME # Name of the source methods and its options. See below.
    args:
      renameColumns: #Optional: Dictionary whose keys are column names in the CSV file and values are the names to rename the associated key columns. Output column names are implicitly converted to `lowercase`
      `${name}: ${value}`: #(Optional) Arguments to the `pandas.read_csv()` method. The default arguments are `engine="python"` and `sep=None`.
Example
    propertiesSpec:
      - name: 'band-gap'
        propertyExtractionMethod:
          csvDataFrame:
            source:
              keyOutput: 'FinalEnergies'
            args:
              renameColumns:
                SMILE: "input-id"
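The effect of these options can be approximated directly with pandas. The sketch below is an illustration, not ST4SD's code: the CSV contents and the `Band-Gap` header are made up, and applying the rename before lowercasing is an assumption about ordering.

```python
import io

import pandas as pd

# Made-up contents standing in for the file behind the `FinalEnergies` key-output.
csv_text = "SMILE;Band-Gap\nCCO;7.1\nO;9.8\n"

# Defaults documented above: the python engine with sep=None infers the delimiter.
table = pd.read_csv(io.StringIO(csv_text), engine="python", sep=None)

# Apply renameColumns from the example, then lowercase every column name.
table = table.rename(columns={"SMILE": "input-id"})
table.columns = [name.lower() for name in table.columns]

print(list(table.columns))
```

After these steps the table has the two column headers the method requires: `input-id` and the lowercased property name.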
hookGetProperties
Use hookGetProperties when you want to provide your own python function for getting the property values.
    from typing import Dict

    import pandas

    def get_properties(property_name: str, property_output_file: str, input_id_file: str, variables: Dict[str, str]) -> pandas.DataFrame:
        '''
        Params:
            property_name (str): The name of the property the function should return the values of.
            property_output_file (str): The path to the file containing the properties
            input_id_file (str): The path to the file containing the input_ids
            variables (dict): A dictionary of the global and user variables passed to the virtual experiment instance
        Returns:
            A pandas.DataFrame containing the values of the property
        '''
If hookGetProperties is defined as the propertyExtractionMethod for property idx, the values passed to the parameters of this function are determined as follows:
- property_name: The value of interface.propertiesSpec[idx].name
- property_output_file: The value returned by the interface.propertiesSpec[idx].propertyExtractionMethod.hookGetProperties.source method
- input_id_file: The value of interface.inputSpec.inputExtractionMethod.$METHOD.source
Note: The column headers in the returned pandas DataFrame will be converted to lowercase by ST4SD.
Options
    hookGetProperties:
      source: #A source method - see below for details
Example
    propertiesSpec:
      - name: 'band-gap'
        propertyExtractionMethod:
          hookGetProperties:
            source:
              keyOutput: 'FinalEnergies'
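To make the hook contract concrete, here is a minimal sketch of a `get_properties` function with the signature given above. The layout of the output file (a comma-separated CSV with columns `SMILE` and `gap_ev`) and its values are assumptions invented for this example. Only the shape of the returned table, an `input-id` column plus a column named after the property, comes from this page.

```python
import os
import tempfile
from typing import Dict

import pandas as pd


def get_properties(property_name: str, property_output_file: str,
                   input_id_file: str, variables: Dict[str, str]) -> pd.DataFrame:
    """Return a table with the columns `input-id` and `property_name`."""
    # Assumption: the output file is a CSV with columns `SMILE` and `gap_ev`.
    raw = pd.read_csv(property_output_file)
    return pd.DataFrame({
        "input-id": raw["SMILE"],
        property_name: raw["gap_ev"],
    })


# Try the hook on a throwaway file (made-up values).
with tempfile.TemporaryDirectory() as tmp:
    output_file = os.path.join(tmp, "energies.csv")
    with open(output_file, "w", encoding="utf-8") as handle:
        handle.write("SMILE,gap_ev\nCCO,7.1\nO,9.8\n")
    result = get_properties("band-gap", output_file, input_id_file="", variables={})
print(result)
```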
Source methods
Source methods provide different ways of specifying the source file path that is used by input or property extraction methods.
path
Use this method if you know the full path of the source file.
Options
path: $PATH #A path relative to the root directory of the virtual experiment instance. It points to the CSV file that contains the `input-ids`.
Example
    propertyExtractionMethod:
      hookGetProperties:
        source:
          path: "stages/stage1/EnergiesExtraction/energies.csv"
keyOutput
Use this method if the properties are in a key-output of the experiment. This method avoids having to know the path to the file (which could change if storage methods change).
Options
    # The name of a key-output in the experiment.
    # These are keys of the top-level FlowIR field `output`.
    keyOutput: $KEYOUTPUT
Example
    propertyExtractionMethod:
      hookGetProperties:
        source:
          keyOutput: "FinalEnergies"
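The difference between the two source methods can be sketched as a small dispatcher. This is an illustration only: `resolve_source` is a hypothetical helper, and the `key_output_paths` mapping (key-output name to file path) stands in for whatever ST4SD does internally to locate a key-output, which this page does not describe.

```python
import os
from typing import Dict


def resolve_source(source: Dict[str, str], instance_dir: str,
                   key_output_paths: Dict[str, str]) -> str:
    """Hypothetical helper: turn a source method dictionary into a file path."""
    if "path" in source:
        # `path` is relative to the root directory of the instance.
        return os.path.join(instance_dir, source["path"])
    if "keyOutput" in source:
        name = source["keyOutput"]
        if name not in key_output_paths:
            raise KeyError(f"unknown key-output '{name}'")
        return key_output_paths[name]
    raise ValueError(f"unsupported source method(s): {sorted(source)}")


# Made-up instance directory and key-output location, for illustration only.
instance_dir = os.path.join("tmp", "my-experiment.instance")
known_outputs = {"FinalEnergies": os.path.join(instance_dir, "energies.csv")}

print(resolve_source({"path": "stages/stage1/EnergiesExtraction/energies.csv"},
                     instance_dir, known_outputs))
print(resolve_source({"keyOutput": "FinalEnergies"}, instance_dir, known_outputs))
```

With `keyOutput`, the interface names the output and leaves the file location to the lookup, so the interface does not need editing if that location changes.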
Example
In this example we have a simple virtual experiment that counts vowels and letters in strings. Here is the FlowIR definition:
    output:
      vowels:
        data-in: stage0.count-vowels/vowels.csv:ref
      letters:
        data-in: stage0.count-letters/letters.csv:ref
    components:
      - name: count-vowels
        references:
Here is an input words.csv file:
    word;
    hello;
    awesome;
    world;
When we process the above input file with this workflow we get 2 outputs:
The output vowels contains the CSV file:
    a;e;i;o;u;word;vowels
    0;1;0;1;0;hello;2
    1;2;0;1;0;awesome;4
    0;0;0;1;0;world;1
The output letters contains the CSV file:
    word;letters
    hello;5
    awesome;7
    world;5
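The two output tables can be reproduced with a few lines of Python, which may help when checking an interface against known values. This is a stand-in for the workflow's components, whose actual implementation is not shown on this page. The function names `vowel_rows` and `letter_rows` are hypothetical.

```python
from typing import List

VOWELS = "aeiou"


def vowel_rows(words: List[str]) -> List[str]:
    """Build the `;`-delimited vowels table: per-vowel counts, the word, the total."""
    rows = ["a;e;i;o;u;word;vowels"]
    for word in words:
        counts = [word.lower().count(vowel) for vowel in VOWELS]
        rows.append(";".join(str(c) for c in counts) + f";{word};{sum(counts)}")
    return rows


def letter_rows(words: List[str]) -> List[str]:
    """Build the `;`-delimited letters table: the word and its length."""
    return ["word;letters"] + [f"{word};{len(word)}" for word in words]


words = ["hello", "awesome", "world"]
print("\n".join(vowel_rows(words)))
print("\n".join(letter_rows(words)))
```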
Interface
An interface to this experiment is shown below. This interface uses the csvColumn input extraction method and the csvDataFrame property extraction method. These methods mean the developer does not have to write any other code.
    interface:
      description: Counts vowels in words
      inputSpec:
        namingScheme: words
        inputExtractionMethod:
          csvColumn:
            source:
              path: input/words.csv
            args:
Run Details
Adding the interface definition will cause instances of the virtual experiment to generate 2 new files:
- ${INSTANCE_DIR}/output/properties.csv: This is a ;-delimited CSV file that contains the properties columns produced by each property defined in propertiesSpec.
- ${INSTANCE_DIR}/outputs/input-ids.json: A JSON file that contains an array of strings. Each string is the id of an input system.
For the above example we would get the following in ${INSTANCE_DIR}/output/properties.csv:
    input-id;vowels;letters
    hello;2;5
    awesome;4;7
    world;1;5
The input ids file (${INSTANCE_DIR}/outputs/input-ids.json) looks like this:
["hello","awesome","world"]
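The relationship between the two key-outputs and these two generated files can be sketched as a join on the input id. The code below is an illustration using only the standard library. It assumes ST4SD combines the per-property tables by matching on input-id, and `property_column` and `merge_properties` are hypothetical helpers.

```python
import csv
import io
import json
from typing import Dict, List

vowels_csv = "a;e;i;o;u;word;vowels\n0;1;0;1;0;hello;2\n1;2;0;1;0;awesome;4\n0;0;0;1;0;world;1\n"
letters_csv = "word;letters\nhello;5\nawesome;7\nworld;5\n"


def property_column(csv_text: str, id_column: str, name: str) -> Dict[str, str]:
    """Map each input id to the value of one property column."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=";")
    return {row[id_column]: row[name] for row in reader}


def merge_properties(input_ids: List[str], columns: Dict[str, Dict[str, str]]) -> str:
    """Join the per-property columns on input-id and render a `;`-delimited table."""
    lines = [";".join(["input-id", *columns])]
    for input_id in input_ids:
        values = [columns[name].get(input_id, "") for name in columns]
        lines.append(";".join([input_id, *values]))
    return "\n".join(lines)


input_ids = ["hello", "awesome", "world"]
columns = {
    "vowels": property_column(vowels_csv, "word", "vowels"),
    "letters": property_column(letters_csv, "word", "letters"),
}
properties_csv = merge_properties(input_ids, columns)
print(properties_csv)
print(json.dumps(input_ids, separators=(",", ":")))
```

In this sketch an input id that has no value for a property gets an empty cell rather than being dropped. How ST4SD treats that case is not stated on this page.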