Specification Files

All specification files are written in YAML. Some examples are given below.

Operation Chains

To define an operation chain, you specify a YAML dictionary with three entries:

input_path:

path to the summary of input datasets, relative to The Data Directory (storage)

runs:

number of repetitions for each operation to handle random effects

(optional, default: 1)

operations:

list of operation specification files, executed in the given order

The specification files should be in the operations subfolder of your specs folder.

Example of an operation chain specification file

# An example of an operation chain specification file.
# The input used for the operation chain is specified as the value of
# "input_path", and the sequence of operations as value of "operations"
# In this example, first the operation specified in 
# "example_operation1.yaml" would be executed directly on example_data.
# The result would act as input for the operation
# specified in "example_operation2.yaml". The output of
# this operation would be the output of the whole operation chain.
# *runs* specifies how often the whole procedure is repeated with different
# random seeds. This is useful when randomized components,
# such as cross-validation splitters, are involved.

input_path : "example_data"
runs: 1

operations:
   -
       example_operation1.yaml
   -
       example_operation2.yaml
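
Assuming this chain is stored as, say, example_chain.yaml in the operation_chains subfolder of your specs folder (the file name is hypothetical), it can be executed via the command-line interface. The flag name below is an assumption about launch.py; check python launch.py --help for the exact options of your version:

python launch.py --operation_chain example_chain.yaml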

Operations

To define an operation, you specify a YAML dictionary with these entries:

type:

name of the operation type you want to use (e.g. node_chain, merge, mmlf, statistic, shuffle)

input_path:

path to the summary of input datasets relative to the data folder

This parameter is irrelevant for operations in an operation chain.

backend:

overrides the default backend given in the command-line call (e.g. "mcore", "serial")

runs:

number of repetitions for this operation to handle random effects

This parameter is irrelevant for operations in an operation chain.

(optional, default: 1)

...:

Each operation type has its own additional parameters. For details, have a look at the documentation of your specific operation type or at the corresponding example file in the operations subfolder of the pySPACE specification folder.

Examples of operation specification files

Example of a NodeChainOperation:

# An example of a *node_chain* specification file.
# The specified input is the value of the entry with the
# key "input_path"; the templates are the value of "templates".
# The template is parametrized with two parameters called
# "__LOWER_CUTOFF__" and "__UPPER_CUTOFF__". Optionally, some "constraints"
# on the allowed parameter combinations can be defined. For instance,
# the constraint "__LOWER_CUTOFF__ < __UPPER_CUTOFF__" prevents the
# combination where both __LOWER_CUTOFF__ and __UPPER_CUTOFF__ are 2.0
# from being tested. One result dataset is created for each dataset of
# the input summary and each combination of the given parameter values
# that fulfills all constraints. This result dataset
# consists of the results of 10 independent runs of the
# instantiated template applied to the respective input dataset.
# Each such run is an independent process.

# The optional parameter "backend" allows overriding the backend
# specification provided via the command-line. This is useful if the
# operation is part of a chain and different operations of the chain
# should not be executed on the same backend.

type: node_chain

input_path: "example_data"
templates : ["example_flow.yaml"]
backend: "local"
parameter_ranges :
    __LOWER_CUTOFF__ : [0.1, 1.0, 2.0]
    __UPPER_CUTOFF__ : [2.0, 4.0]
constraints:
    - "__LOWER_CUTOFF__ < __UPPER_CUTOFF__"


runs : 10
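
The entry "templates" refers to a node chain specification file in which the parameters occur as placeholders. Such a file is typically a YAML list of processing nodes. The following sketch only illustrates this idea; the node names and the parameter name are assumptions, not the content of the actual example_flow.yaml:

-
    node : Time_Series_Source
-
    node : Band_Pass_Filter
    parameters :
        pass_band : [__LOWER_CUTOFF__, __UPPER_CUTOFF__]
-
    node : Nil_Sink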

Example of a WekaClassificationOperation:

# An example of a WekaClassificationOperation specification file.
# The specified input is the value of the entry with the key
# "input_path", the weka template is "classification". The available
# templates are stored in specs/operations/weka_templates/.

type: weka_classification
input_path: "tutorial_data"
template: classification

# The specific classifiers to be used within the operation can be specified
# using the keyword "parameter_settings". The example below would compare four
# different parametrizations of a linear SVM (complexity either 0.1 or 1.0 and
# weight for class 2 either 1.0 or 2.0). Note that 'libsvm_lin' is an
# abbreviation which must be defined in abbreviations.yaml.
# parameter_settings:
#     -
#         classifier: 'libsvm_lin'
#         # "ir_class_index": index of the class performance metrics are
#         # calculated for; index begins with 1
#         ir_class_index: 1
#         complexity: 0.1
#         w0: 1.0
#         w1: 1.0
#     -
#         classifier: 'libsvm_lin'
#         ir_class_index: 1
#         complexity: 1.0
#         w0: 1.0
#         w1: 1.0
#     -
#         classifier: 'libsvm_lin'
#         ir_class_index: 1
#         complexity: 0.1
#         w0: 1.0
#         w1: 2.0
#     -
#         classifier: 'libsvm_lin'
#         ir_class_index: 1
#         complexity: 1.0
#         w0: 1.0
#         w1: 2.0

# As an alternative to specific parameter settings, one can also specify
# ranges for each parameter. This is indicated by using "parameter_ranges"
# instead of "parameter_settings". *parameter_ranges* are automatically
# converted into *parameter_settings* by creating the cross product of all
# parameter ranges. The parameter_ranges in the comment below
# result in the same parameter_settings as the ones given above.
#
# parameter_ranges :
#     complexity : [0.1, 1.0]
#     w0 : [1.0]
#     w1 : [1.0, 2.0 ]
#     ir_class_index: [1]
#     classifier: ['libsvm_lin']
parameter_ranges :
    complexity : [0.1, 0.5, 1.0, 5.0, 10.0]
    w0 : [1.0]
    w1 : [1.0]
    ir_class_index: [1]
    classifier: ['libsvm_lin']
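
Simpler operation types often get by with the general keys alone. As an illustration, a minimal specification for a merge operation (which combines the datasets of an input summary) might look like the following sketch; whether your version requires additional parameters should be checked in the documentation of the merge operation:

type: merge
input_path: "example_data"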

The Configuration File

The pySPACE configuration file is mainly used in the command-line interface.

All relevant general parameters for the execution of pySPACE are specified here. The most important parameters are storage and spec_dir. For debugging, you may want to change the logging levels. Note that when using the command-line interface, it is a good idea to activate the default serial backend.
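
A typical call could then look like the following sketch; the flag names are assumptions about launch.py and should be verified with python launch.py --help:

python launch.py --configuration config.yaml --operation example_operation.yaml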

To find out all the defaults and possibilities, have a look at the default configuration file:

---
# This is the standard default configuration file.
# Each possible parameter is mentioned here together with its default value.
# Normally the defaults are quite useful and you won't have to change anything,
# especially when using the pySPACEcenter default configuration file.

# ===================
# = Main Parameters =
# ===================
# These parameters are the most important for pySPACE.
# The others are only relevant for special components.

# The directory from which data is loaded and to which it is stored.
# Specifying this directory is very important.
# Default: $home_dir/pySPACEcenter/storage
storage:    ~/pySPACEcenter/storage
# The directory in which the configuration/specification files for operations, 
# operation chains, WEKA and pySPACE related options are stored.
# Default: $home_dir/pySPACEcenter/specs
spec_dir:    ~/pySPACEcenter/specs/
# The minimum level a log message must have to be printed to stdout.
# Levels are based on the Python logging package;
# possible levels are logging.{DEBUG, INFO, WARNING, ERROR, CRITICAL, FATAL}.
# When using backends like the loadl backend, stdout is redirected to a file.
# If you get too much output, just use 'logging.CRITICAL'.
console_log_level:      logging.WARNING
# The minimum level a log message must have to be written to the operation
# log file. This file can then be found in your current result folder.
# Be careful: the file can get quite large when using DEBUG or INFO.
file_log_level:         logging.INFO
# The Python path that should be used during the experiment.
# Paths normally available in Python do not have to be mentioned.
# Setting paths here is especially useful for including alternative
# libraries, since the paths given here get priority.
# Default: empty list
#python_path:
#        - /usr/lib/python2.5/site-packages
#        - /usr/lib/python2.5/lib-dynload/
#        - /usr/lib/python2.5
#        - /var/lib/python-support/python2.5/
#        - /usr/lib/python2.5/lib-tk/

# =========================
# = Node specific options =
# =========================

# If you want to have your own nodes outside the normal pySPACE structure,
# this parameter lists external folders which are also scanned for nodes.
# Furthermore, the corresponding path is added to the local system path.
# Note that duplicate node names are still forbidden and crash the software.
# See: pySPACE.missions.nodes.external
# external_nodes: ["~/pySPACEcenter/external_nodes"]

# ==============================
# = Operation specific options =
# ==============================

# WEKA operation
# The java class path used for WEKA
# weka_class_path:        ~/weka-3-6-0/weka.jar:/home/user/weka-additional

# ============================
# = Backend specific options =
# ============================

# ===Local===

# Number of CPUs used for parallelization.
# By default, the total number of available CPUs is used.
# pool_size : 1

# ===LoadLeveler===
# Specify parameters for the loadl backend for the cluster.
#
# Class name of your submitted jobs. Default is 'general'.
# Depending on the class and the configuration of the cluster,
# jobs with a more important class name get a higher priority.
job_class: general # one of [ critical, general, longterm]
# Maximal memory one process will use. This should be known by the
# scheduler so that it can decide whether or not to start more jobs. The
# value only affects the scheduling (LoadLeveler) but not the system, so
# nothing will happen to your jobs if they exceed the specified value.
# The default is the available memory divided by the number of CPUs of one
# blade (3250mb). If you do not expect such large memory usage, decrease the value.
consumable_memory: 3250mb # number and unit (gb,mb,kb)
# Maximal number of CPUs one job needs. This should be known by the
# scheduler so that it can decide whether or not to start more jobs. The
# value only affects the scheduling (LoadLeveler) but not the system, so
# nothing will happen to your jobs if they exceed the specified value.
# The default is 1.
consumable_cpus: 1
# Optionally, specify which nodes are used for the computation, e.g.
# anodes: (Machine == "anode05.dfki.uni-bremen.de") || (Machine == "anode02.dfki.uni-bremen.de")


first_call : True # Internal parameter for the first call of the software, triggering a welcome screen with detailed information. It should remain on the last line!

After running python setup.py, this file should be located at ~/pySPACEcenter/config.yaml.

Note

By default, the configuration file is searched for in the folder ~/pySPACEcenter/, but you can manually specify the location of the configuration directory in your shell or in the shell configuration file (.bash_profile, .bashrc, ...) using

export PYSPACE_CONF_DIR=<myconfdir>

In your IDE you would have to add this variable to your environment variables.
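
For bash, the setting can be made persistent by appending the export line to the shell configuration file; the directory name below is only a placeholder:

echo 'export PYSPACE_CONF_DIR=$HOME/my_pyspace_configs' >> ~/.bashrc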