Writing new nodes

The first step to implementing a new node in pySPACE is to design the algorithm of the node. This is normally a pen-and-paper task which requires the designer to:

  1. plan the desired algorithm
  2. implement the algorithm using the simplified ‘pySPACE’ structure.

This process is highly simplified by the inheritance from BaseNode. The new node can thus choose to inherit a set of common methods and overwrite their content with the algorithm specific content.

In the following tutorial, we will give a general overview of the methods available for overwriting in pySPACE and discuss good-coding practices in pySPACE. The tutorial itself is accompanied by a hands-on programming tutorial which is available at templates, on the documentation of the base_node as well as by the unit testing tutorial and the documentation of the GenericTestCase.

pySPACE architecture

The first line in each module/class/function docstring should be short, in imperative, without stop and explain, what it does. You should not copy it from elsewhere, because if you overwrite existing routines, they should be different and such differ in documentation. If you really can not help copying code, you should at least document, where you took it from.

Detailed information on documentation guidelines can be found here. Before writing your own node, you should have a look at it, because every node must be documented! You should also have a look at the documentation of the nodes package.

Note

Some parts of the guidelines, will be copied to this place, because of their importance.

Naming conventions

The nodes package supports automatic loading of nodes. The easiest access then is via the node chains in the node chain operation. Therefore, your newly implemented node needs a name. To be detected as node, your node has to end with `Node`. Node names are written in CamelCase. Automatic node names are the class name and the class name without the ending Node. If you want to have extra names or backward compatibility after changing names, you can define or extend a dictionary with the name _NODE_MAPPING at the end of each module implementation. As keys you use the new names and the value is the corresponding class name. Be careful to use meaningful names or abbreviations, which are not already in use.

When the documentation is generated, the available names are added automatically to the documentation of each node and additionally complete lists are generated to give information on available nodes and their functionality.

Base nodes should always include “Base” in their class name and if they are not directly callable they should not end with “Node”.

Finding the Module

Before implementing your own node, you should first find out, where to put it. The nodes has several modules of algorithms and you should check their documentation to find out, where your node belongs to.

As the next step, you should find out, if there is already a fitting module in there, which fits to your algorithm or needs only small change in documentation, to be fitting.

If there is no module, you have to open up your own new one.

Warning

Be careful, when creating new modules. The module name should be meaningful. The module should include some basic documentation and most importantly describe a general concept of a group of algorithms and not repeat the documentation of your new class. The integration of node modules and packages into the documentation structure happens automatically.

Note

When the documentation is generated, at the end of each module documentation, a summary of its functions and classes is generated. You should not do this manually.

Finding the Class Name

The class name is written with the above mentioned conventions. It is recommended that, before you start building a new class, you check whether there is already a class that already implements your desired algorithm. This does not only prevent redundancy but might also save you a lot of time. It might also happen that one of the pySPACE classes implements a very similar version of your algorithm and as such needs only small changes in order to suit your needs.

If you do not manage to find a class that implements your desired algorithms, you should check if your algorithm is developed for a very special purpose but could be generalized. In that case, please consider implementing the generalized version of the algorithm.

As a last step, you should decide on a short but meaningful algorithm name, which underlines the main characteristics of your algorithm and differentiates it from the existing ones.

Base Nodes

For some highly sophisticated types of nodes you can find a corresponding base node in the package. The visualization package is the best example in this case. These nodes define a special interface for their algorithms and you will only have to implement some special functions for these nodes, which can be found in their documentation.

Currently the default is, that your node is not inheriting from any special generalizing base node but only from the basic BaseNode. When used in a processing chain, this node is by default just forwarding the data, but implements all functionality a node needs.

Basic node structure

The Main Principle

Every node, which is no base node should inherit from the base node.

from pySPACE.missions.nodes.base_node import BaseNode

Implementing a node now is nothing more than a process of carefully overwriting the default methods of the BaseNode. Depending on the complexity of your algorithm, this might be very easy or a bit more complicated.

Note

The templates module contains some hands-on examples of newly implemented nodes. You should also consult it before building new nodes.

The Processed Data Types - Input and Output

For processing only, special input and output types are required which are subclasses of numpy arrays. All currently available types can be found in pySPACE.resources.data_types.

The currently available data types are: TimeSeries, FeatureVector and PredictionVector. Once you have decided what type of input and output your node has, it is recommended to hardcode these possible input and output types into the node itself.

For input this can be done easily by setting the input_types variable to whatever input your node accepts, e.g.

input_types = ["TimeSeries", "FeatureVector"]

For the output type, the procedure is fulfilled by overwriting the get_output_type() function, e.g.:

def get_output_type(self, input_type, as_string=True):
    return self.string_to_class("TimeSeries")

Warning

Since nodes are called via YAML syntax almost automatically, it is of the utmost importance that the input and output types of the nodes are mentioned inside the code. If the input/output types are not correctly set, it might happen that inconsistencies occur when the node is connected to a node chain and the wrong input/output is given to a node.

General Concept of a Node

The discussion of how to build a node in pySPACE starts from the basic theoretical constructs of machine learning. What this entails is that, through pySPACE, on can implement different types of nodes that make use of:

  • supervised or unsupervised learning algorithms,
  • trainable or non-trainable algorithms, or
  • other extension options.

The BaseNode was built such that it provides dummy methods for all algorithms. For a full list of the available functions that can be overwritten for an algorithms specific needs, please consult the Base node documentation.

A general remark is that nodes are built such that they offer the possibility of being connected one after the other inside a node_chain.

../_images/node1.png

Integration of Nodes in a node_chain

As mentioned above, the nodes can be connected to one another by means of a node_chain. The node chain can be described by the following whereby input data is sent through a collection of processing algorithms, each of which brings certain modifications to the original data.

../_images/node_chain1.png

Usage of node chains

After the node chain has been constructed, it can be used in 2 different ways and namely for:

  • benchmark execution, i.e. the test data set is static and saved to disk
  • live execution, i.e. the node receives live data from a transmitter.
../_images/launch_live1.png

Visualization of splitter Nodes

Another possible use of the new node is as part of a SubNodeChain which is executed in parallel on the initial dataset.

../_images/splitter1.png

YAML syntax

pySPACE was built such that its components could be called through YAML syntax. Thus, when building a new node, one must take into account that the parameters passed to the __init__ of the new node will be read inside a YAML call to the node.

In order to show a theoretical usage of the node, it is of the utmost importance that, once the node has been written, an Exemplary Call is appended to its documentation. This not only ensures that future users have an example of how to use the node but also serves as the basis for testing the functionality of the node as shall be explained later in Unit testing.

Exemplary calls

An exemplary call to a node is best explained by the means of an example

-
    node : BLDA_Classifier
    parameters :
        class_labels : ["Target","Standard"]

In the above, one can see the two components that must be part of each Exemplary Call and namely the name of the node, which is made available under the node parameters and the initialization parameters that the node takes which are made available under a structured YAML dictionary that bears the name of parameters.

The construct might seem oversimplified but that is exactly the point of using Exemplary Calls. The user should be able to easily call the nodes and not be cluttered by excessively long Python syntax.

Node chain

The above principle is then implemented when a node chain operation is called. A detailed explanation of this process can be found in the dedicated YAML tutorial.

Unit testing

The exemplary calls are important also due to their necessity in the automated unit testing framework. The testing framework that pySPACE comes with is built on the Exemplary Call of each node. The testing framework checks the node for the existence of documentation and looks for the Exemplary Call. In order to test the node, the framework then attempts to run the Exemplary Call and reports the results.

Another important aspect of exemplary calls is that they can be easily used to check the output of the node given a specific input. This makes the testing of new nodes an extremely accessible process that requires the user to just customize the already available scripts.

More on unit testing can be found out in the unit testing tutorial.

Good coding practices

PEP8

Before programming a node or overwriting a method, have a look at its documentation and the documentation in the BaseNode, because otherwise you might get unwanted side effects. Furthermore, you should have a look at existing code and mainly follow the PEP 8 coding style

Further Minor Information

  • Randomization should be done be setting a fixed seed with the help of self.run_number to enable reproducibility.
  • A special end of a node chain can be a sink node. It is defined by implementing the method get_result_dataset. Its function is to gather all data.
  • The variable self.temp_dir an be used to store some data temporarily, e.g., for security reasons.