Processing Benchmark Data - A First Usage Example¶
In this tutorial, we process the example data that comes with the software. The first step is to obtain some performance results; the next step is to compare several algorithms.
Before we start¶
First of all, you need to Download and install the software, including running setup.py, so that we can assume you have the default configuration in ~/pySPACEcenter/.
Resources¶
We are going to process some simple example benchmark data.
You can have a look at it in ~/pySPACEcenter/storage/example_summary/.
When browsing through the data, you may have a look at one metadata file,
e.g.:
author: David Feess
date: 20110324
storage_format: [csvUnnamed, real]
type: feature_vector
file_name: titanic.csv
label_column: 4
source: 'http://www.cs.toronto.edu/~delve/data/titanic/desc.html'
class_meaning: {"Target": "dead", "Standard": "survived"}
dataset_description: >
    The IDA Version of this dataset
    http://mldata.org/repository/data/viewslug/titanic-ida/
    is somehow broken, as it has only 24 instances.
    Therefore, we used the dataset available at
    http://www.cs.toronto.edu/~delve/data/titanic/desc.html
    and transformed it as follows:
    adult -> 1.0
    child -> -1.0
    male -> 1.0
    female -> -1.0
    1st -> 4.0
    2nd -> 3.0
    3rd -> 2.0
    crew -> 1.0
    survived -> Standard
    dead -> Target
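The categorical-to-numeric mapping described above can be sketched in plain Python. This is a hypothetical illustration only; the titanic.csv shipped with the example summary is already converted, so you do not need to run it:

```python
# Hypothetical sketch of the value mapping listed in the metadata.
# The actual titanic.csv in the example summary is already converted.
MAPPING = {
    "adult": 1.0, "child": -1.0,
    "male": 1.0, "female": -1.0,
    "1st": 4.0, "2nd": 3.0, "3rd": 2.0, "crew": 1.0,
    "survived": "Standard", "dead": "Target",
}

def convert_row(row):
    """Replace each categorical value by its numeric (or label) encoding."""
    return [MAPPING.get(value, value) for value in row]

print(convert_row(["adult", "male", "3rd", "survived"]))
# -> [1.0, 1.0, 2.0, 'Standard']
```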
This data already comes with the whole structure for processing with pySPACE.
First Processing¶
After having a look at the data, we now want to apply a classification algorithm. This is done by applying an operation:
# concatenation of algorithms
type: node_chain
# path relative to storage
input_path: "example_summary"
runs: 2 # number of repetitions to catch randomness
node_chain: # the algorithms
    - # load the data
      node: FeatureVectorSourceNode
    - # random splitting into 60% training, 40% testing
      node: TrainTestSplitter
      parameters:
          train_ratio: 0.6
          random: True
    - # normalize each feature to have zero mean and unit variance
      node: GaussianFeatureNormalizationNode
    - # a standard SVM classifier (affine version)
      node: SorSvmNode
      parameters:
          complexity: 1
          kernel_type: "LINEAR"
          class_labels: ["Standard", "Target"]
          max_iterations: 10
    - # gather results and calculate performance
      node: PerformanceSinkNode
      parameters:
          ir_class: "Target"
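Conceptually, the GaussianFeatureNormalizationNode brings every feature to zero mean and unit variance, with the statistics estimated on the training data only. A minimal standalone sketch of that idea (not pySPACE code):

```python
import math

def gaussian_normalize(train, test):
    """Z-score each feature: subtract the mean and divide by the standard
    deviation, both estimated on the training data only (a simplified
    sketch of a Gaussian feature normalization step)."""
    n_features = len(train[0])
    means = [sum(row[i] for row in train) / len(train)
             for i in range(n_features)]
    stds = []
    for i in range(n_features):
        var = sum((row[i] - means[i]) ** 2 for row in train) / len(train)
        stds.append(math.sqrt(var) or 1.0)  # guard against zero variance
    scale = lambda rows: [[(row[i] - means[i]) / stds[i]
                           for i in range(n_features)] for row in rows]
    return scale(train), scale(test)

train = [[1.0, 4.0], [3.0, 4.0]]
test = [[2.0, 4.0]]
norm_train, norm_test = gaussian_normalize(train, test)
print(norm_train)  # -> [[-1.0, 0.0], [1.0, 0.0]]
print(norm_test)   # -> [[0.0, 0.0]]
```

Estimating the statistics on the training split only is important: using the test data would leak information into the classifier.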
You can start it from the command line, directly in the pySPACEcenter, by invoking:
python launch.py --operation examples/classification.yaml
Alternatively, change your current directory to pySPACE/run beforehand if you did not use setup.py to create the required links.
Now you should get some information on your setup and finally a progress bar.
The result can now be found at ~/pySPACEcenter/storage/operation_results/CURRENT_DATE_TIME, where the time tag in the folder name corresponds to the start time of your algorithm.
You may have a look at the short_result.csv. If you want to browse the results table, start performance_results_analysis.
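If you prefer scripted access over the graphical tool, the CSV can also be read with the standard library. The column names below are hypothetical; adapt them to the actual header of your short_result.csv:

```python
import csv
import io

# Hypothetical excerpt of a results table; the column names in your
# short_result.csv may differ, so adjust them accordingly.
example = """__Run__,Balanced_accuracy
0,0.71
1,0.69
"""

rows = list(csv.DictReader(io.StringIO(example)))
# Average a metric column over the repeated runs.
mean_acc = sum(float(r["Balanced_accuracy"]) for r in rows) / len(rows)
print(round(mean_acc, 3))  # -> 0.7
```

For a real results file, replace the io.StringIO(...) with open(...) on the CSV path.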
For faster execution using all cores of your PC, simply change the command to:
python launch.py --mcore --operation examples/classification.yaml
If you now want to compare different algorithms you can execute the following operation:
# concatenation of algorithms
type: node_chain
# path relative to storage
input_path: "example_summary"
runs: 3 # number of repetitions to catch randomness
parameter_ranges:
    __C__: [0.05, 1.25]
    __Normalization__: [GaussianFeatureNormalization,
                        EuclideanFeatureNormalization]
node_chain: # the algorithms
    - # load the data
      node: FeatureVectorSourceNode
    - # random splitting: 40% training, 60% testing
      node: TrainTestSplitter
      parameters:
          train_ratio: 0.4
          random: True
    - # normalize each feature
      node: ${__Normalization__}
    - # standard SVM classifier (affine version)
      node: SorSvmNode
      parameters:
          complexity: ${__C__}
          kernel_type: "LINEAR"
          class_labels: ["Standard", "Target"]
          max_iterations: 10
    - # gather results and calculate performance
      node: PerformanceSinkNode
      parameters:
          ir_class: "Target"
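The parameter_ranges section makes the operation evaluate the cross product of all listed values, each combination repeated runs times. The resulting number of executions can be sketched as:

```python
import itertools

# Sketch of how parameter_ranges expands into a grid: every combination
# of the listed values is evaluated, each repeated `runs` times.
parameter_ranges = {
    "__C__": [0.05, 1.25],
    "__Normalization__": ["GaussianFeatureNormalization",
                          "EuclideanFeatureNormalization"],
}
runs = 3

names = sorted(parameter_ranges)
grid = list(itertools.product(*(parameter_ranges[n] for n in names)))
print(len(grid))         # -> 4 parameter combinations
print(len(grid) * runs)  # -> 12 node-chain executions in total
```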
This is done with the command:
python launch.py --mcore --operation examples/bench.yaml
When you now browse through the results as described above, you will see many more possible parameters.
Which parameter combination was the best? A complexity of 0.05 with GaussianFeatureNormalization?
Using the List of all Nodes, you can now check the available algorithms and their parameters, and play around with the specification files and change them. They should be found at ~/pySPACEcenter/specs/operations/examples/.