Processing Benchmark Data - A First Usage Example

In this tutorial, we process the example data that comes with the software. First we obtain some performance results; then we compare several algorithms.

Before we start

First of all, you need to download and install the software, including running setup.py, so that we can assume you have the default configuration in ~/pySPACEcenter/.

Resources

We are going to process some simple example benchmark data. You can find it in ~/pySPACEcenter/storage/example_summary/. When browsing through the data, you may want to have a look at one of the metadata files, e.g.:

author: David Feess
date: 20110324
storage_format: [csvUnnamed, real]
type: feature_vector
file_name: titanic.csv
label_column: 4
source: 'http://www.cs.toronto.edu/~delve/data/titanic/desc.html'
class_meaning: {"Target":"dead", "Standard":"survived"}
dataset_description: >
    The IDA Version of this dataset
    http://mldata.org/repository/data/viewslug/titanic-ida/
    is somehow broken, as it has only 24 instances.
    
    Therefore, we used the dataset available at
    http://www.cs.toronto.edu/~delve/data/titanic/desc.html
    and transformed as follows:
        
        adult -> 1.0
        child -> -1.0
        
        male -> 1.0
        female -> -1.0
        
        1st -> 4.0
        2nd -> 3.0
        3rd -> 2.0
        crew -> 1.0
        
        survived -> Standard
        dead -> Target

This data already comes with the complete directory structure required for processing with pySPACE.
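
If you want to inspect such metadata programmatically instead of in an editor, a minimal Python sketch using PyYAML could look as follows (the path, including the titanic subdirectory and the file name metadata.yaml, is an assumption; adjust it to whatever you see when browsing the storage):

import os
import yaml  # PyYAML

# hypothetical location of the metadata file; adjust to your storage layout
meta_path = os.path.expanduser(
    "~/pySPACEcenter/storage/example_summary/titanic/metadata.yaml")

with open(meta_path) as f:
    metadata = yaml.safe_load(f)

print(metadata["type"])           # feature_vector
print(metadata["class_meaning"])  # mapping of class labels to their meaning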

First Processing

Having had a look at the data, we now want to apply a classification algorithm. This is done by running an operation:

# concatenation of algorithms
type: node_chain
# path relative to storage
input_path: "example_summary"
runs : 2 # number of repetitions to catch randomness
node_chain: # the algorithms
    -   # load the data
        node: FeatureVectorSourceNode
    -   # random splitting into 60% training, 40% testing
        node: TrainTestSplitter
        parameters :
            train_ratio: 0.6
            random: True
    -   # normalize each feature to have zero mean and unit variance
        node: GaussianFeatureNormalizationNode
    -   # a standard svm classifier (affine version)
        node : SorSvmNode
        parameters :
            complexity : 1
            kernel_type : "LINEAR"
            class_labels : ["Standard","Target"]
            max_iterations : 10
    -   # gather results and calculate performance
        node: PerformanceSinkNode
        parameters:
            ir_class: "Target"

You can start it from the command line directly in the pySPACEcenter by invoking:

python launch.py --operation examples/classification.yaml

Alternatively, if you did not use setup.py to create the required links, change your current directory to pySPACE/run beforehand. You should now get some information on your setup and finally a progress bar. The results can be found at ~/pySPACEcenter/storage/operation_results/CURRENT_DATE_TIME, where the time tag in the folder name corresponds to the start time of your operation. Have a look at the short_result.csv. If you want to browse the result table, start performance_results_analysis.
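
If you prefer a script over the GUI, a minimal sketch for reading the newest short_result.csv with the standard csv module could look like this (it assumes that sorting the timestamped folder names lexicographically gives chronological order; the column names depend on your operation, so just print the whole rows first):

import csv
import glob
import os

# find all result files and pick the one from the newest operation folder
pattern = os.path.expanduser(
    "~/pySPACEcenter/storage/operation_results/*/short_result.csv")
result_file = sorted(glob.glob(pattern))[-1]

with open(result_file) as f:
    for row in csv.DictReader(f):
        print(row)  # one dictionary of metrics per processed node chain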

For faster execution using all cores of your PC, simply change the command to:

python launch.py --mcore --operation examples/classification.yaml

If you now want to compare different algorithms, you can execute the following operation:

# concatenation of algorithms
type: node_chain
# path relative to storage
input_path: "example_summary"
runs : 3 # number of repetitions to catch randomness
parameter_ranges :
    __C__ : [0.05, 1.25]
    __Normalization__ : [GaussianFeatureNormalization,
                        EuclideanFeatureNormalization]
node_chain: # the algorithms
    -   # load the data
        node: FeatureVectorSourceNode
    -   # random splitting: 40% training, 60% testing
        node: TrainTestSplitter
        parameters :
            train_ratio: 0.4
            random: True
    -   # normalize each feature
        node: ${__Normalization__}
    -   # standard svm classifier (affine version)
        node : SorSvmNode
        parameters :
            complexity : ${__C__}
            kernel_type : "LINEAR"
            class_labels : ["Standard","Target"]
            max_iterations : 10
    -   # gather results and calculate performance
        node: PerformanceSinkNode
        parameters:
            ir_class: "Target"
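
Before running it, note what parameter_ranges does: every value of __C__ is combined with every value of __Normalization__, and each combination is repeated for every run. The following plain-Python sketch, independent of pySPACE, illustrates the expansion:

from itertools import product

complexities = [0.05, 1.25]
normalizations = ["GaussianFeatureNormalization",
                  "EuclideanFeatureNormalization"]
runs = 3

# cross product of the parameter ranges
combinations = list(product(complexities, normalizations))
print(len(combinations))         # 4 parameter settings
print(len(combinations) * runs)  # 12 node chains per dataset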

This is done with the command:

python launch.py --mcore --operation examples/bench.yaml

When you now browse through the results as described above, you will see many more parameters.

Which parameter combination was the best? A complexity of 0.05 with GaussianFeatureNormalization?
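
Rather than guessing, you can let a short script answer the question. The sketch below assumes that the parameter values appear as columns named __C__ and __Normalization__ in short_result.csv and that a metric column such as Balanced_accuracy exists; check the header of your own file for the actual names:

import csv
import glob
import os
from collections import defaultdict

pattern = os.path.expanduser(
    "~/pySPACEcenter/storage/operation_results/*/short_result.csv")
result_file = sorted(glob.glob(pattern))[-1]

# collect the metric values of all runs per parameter combination
scores = defaultdict(list)
with open(result_file) as f:
    for row in csv.DictReader(f):
        key = (row["__C__"], row["__Normalization__"])
        scores[key].append(float(row["Balanced_accuracy"]))

# average over the runs to smooth out the random train-test splits
best = max(scores, key=lambda k: sum(scores[k]) / len(scores[k]))
print(best, sum(scores[best]) / len(scores[best]))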

Using the List of all Nodes, you can now check the available algorithms and their parameters, and play around with the specification files and change them. They can be found at ~/pySPACEcenter/specs/operations/examples/.