A regression tutorial in pySPACE

Introduction

Like most machine learning software, pySPACE can perform regression on selected data sets. This introductory tutorial shows how a simple regression task can be run through the pySPACE suite.

Preparing the data

For the purpose of this tutorial, we will use a dataset from the UCI repository, namely the Wine Quality Dataset. For this dataset, the task is rather simple: based on the 11 physicochemical input features and the accompanying quality score, train a regression algorithm that can predict the quality of a new wine sample.

Before we can perform the regression, we must store the dataset in a layout that pySPACE can process. We thus go to the pySPACEcenter directory (which was initialized when the software was first installed) and make a new folder inside the storage folder. Let's call the new folder winedata and create two subfolders inside it, namely red and white. Your directory structure should now look like this:

pySPACEcenter\
        ...\
        storage\
                ...\
                winedata\
                        red\
                        white\
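
The same folders can also be created from Python instead of by hand. The following is a minimal sketch; the function name create_wine_folders is hypothetical, and the location of your pySPACEcenter has to be passed in explicitly because it depends on your installation.

```python
from pathlib import Path


def create_wine_folders(center_dir):
    """Create storage/winedata/red and storage/winedata/white under center_dir."""
    created = []
    for color in ("red", "white"):
        folder = Path(center_dir) / "storage" / "winedata" / color
        # parents=True builds missing parent folders,
        # exist_ok=True makes it harmless to run the function twice.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created

# Example call (the path is an assumption, adjust it to your setup):
# create_wine_folders(Path.home() / "pySPACEcenter")
```

The function returns the two folder paths, so the result can be printed to confirm where the data is expected.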

At this point, you are ready to download the datasets. Go to the Wine Quality data directory and save each of the two *.csv files as data.csv in the corresponding directory. Thus, the original winequality-red.csv is now red/data.csv, and likewise for the white wine. Your directory structure should now look like:

pySPACEcenter\
        ...\
        storage\
                ...\
                winedata\
                        red\
                                data.csv
                        white\
                                data.csv
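
Before going further, it can be worth confirming that each saved file really has the expected shape: a header row followed by rows with 11 features and one quality score. The sketch below uses only the Python standard library; the function name inspect_wine_csv is hypothetical, and the semicolon default matches the delimiter entered in the md_creator session of this tutorial.

```python
import csv


def inspect_wine_csv(path, delimiter=";"):
    """Return (column_names, number_of_data_rows) for a delimited file with a header."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader)  # first row holds the column names
        n_rows = sum(1 for row in reader if row)  # ignore blank lines
    return header, n_rows

# Example call (the path is an assumption, adjust it to your setup):
# header, n_rows = inspect_wine_csv("storage/winedata/red/data.csv")
# print(len(header), "columns,", n_rows, "samples")
```

For the wine files, 12 columns would be expected, with the quality score as the last one.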

The final step in preparing the dataset is to generate the metadata.yaml files. This can be done either by hand or with an automatic script, namely md_creator. If you want to use the automatic script, first go to your data directory (either the red or the white one) and then type:

$ python <PATH_TO_PYSPACE>/pySPACE/run/scripts/md_creator.py

The script will then ask you a couple of questions about the format of your data, namely:

Running meta data creator ...

Please enter the name of the file. --> data.csv
Please enter the storage_format of the data.
 one of arff, csv (csv with header), csvUnnamed (csv without header)--> csv
Please enter csv delimiter/separator.
(e.g. ','  , ' ' , ';' or '     ' for a tab, default:',')-->;
Please enter all rows that can be ignored, separated by comma or range.
 eg. [1, 2, 3] or [1-3] -->
Please enter all columns that can be ignored, separated by comma or range.
The first row gets number 1.eg. [1, 2, 3] or [1-3] -->
Please enter the column that contains the label. -1 for last column
 --> -1

Meta data file metadata.yaml written.

At this point, there is a new metadata file in your red or white directory, which should read:

author: <YOUR_NAME>
date: <TODAY>
type: feature_vector
ignored_rows: []
storage_format: [csv, real]
file_name: data.csv
label_column: -1
ignored_columns: []
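
If you prefer to write the file by hand, the listing above can be reproduced with a few lines of Python. This is only a sketch: the function name write_metadata and the ISO date format are assumptions, and the keys are copied from the listing above rather than from a pySPACE specification. If md_creator writes additional keys for your data, prefer the generated file.

```python
import datetime
from pathlib import Path


def write_metadata(data_dir, author, file_name="data.csv"):
    """Write a metadata.yaml containing the same keys as the listing above."""
    lines = [
        f"author: {author}",
        f"date: {datetime.date.today().isoformat()}",  # date format is an assumption
        "type: feature_vector",
        "ignored_rows: []",
        "storage_format: [csv, real]",
        f"file_name: {file_name}",
        "label_column: -1",  # -1 selects the last column (the quality score)
        "ignored_columns: []",
    ]
    target = Path(data_dir) / "metadata.yaml"
    target.write_text("\n".join(lines) + "\n")
    return target

# Example call (the path is an assumption, adjust it to your setup):
# write_metadata("storage/winedata/red", author="Your Name")
```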

The final version of your directory structure (for the purposes of this tutorial) should therefore be:

pySPACEcenter\
        ...\
        storage\
                ...\
                winedata\
                        red\
                                data.csv
                                metadata.yaml
                        white\
                                data.csv
                                metadata.yaml

Building the node chain

Now that we have nicely organized data, we can start doing something with it. The following example is based on nodes that directly wrap scikit-learn algorithms. Therefore, in order to run the following node chain, you need to have scikit-learn installed.

We plan to do the following to our dataset:

  1. Preprocessing (so that the data is nicely formatted,
    not too high-dimensional and normalized)
  2. Regression, here with a k-nearest-neighbors regressor (extending this
    to another regression model is a matter of changing a couple of lines
    in the definition of the node chain)
  3. Analyze the results (and implicitly see how well our
    algorithm has performed)

In order to do all of the above, we need to define a node_chain in YAML format. For that, go to the pySPACEcenter/specs/operations/ directory and open a new file in your favorite text editor. Save this new file as winedata.yaml. The first lines of the file should say what the file represents, i.e. a node_chain, and where to look for the input data (relative to the pySPACEcenter/storage/ directory). If you have followed the above steps for saving your input, the first lines of your winedata.yaml file should read:

type: node_chain
runs: 1
input_path: "winedata"

Next up is the content of the node chain itself. Whenever a node chain is defined, it must start with a source node. In our case, we will use FeatureVectorSourceNode, since we want to cast our data into pySPACE's feature vector data type (pySPACE.resources.data_types.feature_vector). Your winedata.yaml file should now look like:

type: node_chain
runs: 1
input_path: "winedata"
node_chain:
    -
        node: FeatureVectorSourceNode

Now that the node chain has a source node, we can start the preprocessing. Since we want to keep things simple in this tutorial, we will just add two failsafe nodes (in case the initial data contains NaN or int values), namely NaN2NumberNode and Int2FloatNode. We will then split our data set into training and test data using TrainTestSplitterNode and normalize the values using OutlierFeatureNormalizationNode. Translating this into YAML directives yields:

type: node_chain
runs: 1
input_path: "winedata"
node_chain:
    -
        node: FeatureVectorSourceNode
    -
        node: NaN2Number
    -
        node: Int2Float
    -
        node: TrainTestSplitter
        parameters:
            train_ratio: 0.7
            random: False
    -
        node: OutlierFeatureNormalization

Good. Now our data is well behaved and we can perform regression on it. For this purpose, we will pick a regressor from scikit-learn, namely KNeighborsRegressor, which appears in the node chain as KNeighborsRegressorSklearnNode. While this node might not be the optimal choice, it is a well-behaved and well-understood method and therefore suitable for the purposes of this tutorial.
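
For readers unfamiliar with the method, k-nearest-neighbors regression predicts the target of a new sample as the average target of the k closest training samples. The sketch below is a plain-Python illustration of that idea, not the scikit-learn code; the function name knn_predict is hypothetical. The default k=5 with unweighted averaging mirrors scikit-learn's documented defaults for KNeighborsRegressor, and whether the pySPACE node uses exactly these defaults is an assumption.

```python
import math


def knn_predict(train_x, train_y, query, k=5):
    """Predict a target as the unweighted mean of the k nearest training targets."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training samples")
    # Pair each Euclidean distance to the query with the matching target value.
    distances = [
        (math.dist(features, query), target)
        for features, target in zip(train_x, train_y)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_targets = [target for _, target in distances[:k]]
    return sum(nearest_targets) / k

# Example call on toy data with one feature per sample:
# knn_predict([[0.0], [1.0], [2.0], [10.0]], [1.0, 2.0, 3.0, 50.0], [1.0], k=3)
```

Since the prediction depends on distances between feature values, features on very different scales would dominate the result, which is one reason the normalization step comes before the regressor.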

With the regressor chosen, we just have to add a sink node at the end of our node chain. Since we want this sink node to evaluate the performance of our regression node, we will use PerformanceSinkNode. The final version of your winedata.yaml file should look like:

type: node_chain
runs: 1
input_path: "winedata"
node_chain:
    -
        node: FeatureVectorSourceNode
    -
        node: NaN2Number
    -
        node: Int2Float
    -
        node: TrainTestSplitter
        parameters:
            train_ratio: 0.7
            random: False
    -
        node: OutlierFeatureNormalization
    -
        node: KNeighborsRegressorSklearnNode
    -
        node: PerformanceSinkNode
        parameters:
            evaluation_type: "regression"
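
Evaluating a regression means comparing predicted values with the true ones. As a rough illustration, the sketch below computes two common error measures, the mean absolute error and the root mean squared error. Which metrics PerformanceSinkNode actually reports is not stated in this tutorial, so the function regression_errors and its output keys are hypothetical.

```python
import math


def regression_errors(true_values, predictions):
    """Return mean absolute error and root mean squared error for paired values."""
    if not true_values or len(true_values) != len(predictions):
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [pred - true for true, pred in zip(true_values, predictions)]
    mae = sum(abs(r) for r in residuals) / len(residuals)
    rmse = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return {"mean_absolute_error": mae, "root_mean_squared_error": rmse}

# Example call on toy quality scores:
# regression_errors([5, 6, 7], [5, 7, 5])
```

Lower values are better for both measures; the root mean squared error penalizes large individual errors more strongly than the mean absolute error.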

Congratulations! You have just finished writing your first regression node chain in pySPACE! In order to run it, go to your pySPACEcenter and type the following command in the terminal:

$ python launch.py --operation winedata.yaml
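
If the launch fails, a first thing to check is whether every file created in this tutorial is where it should be. The sketch below lists whatever is missing; the function name missing_tutorial_files is hypothetical, and the expected paths are simply the ones used in the steps above.

```python
from pathlib import Path

# Paths relative to pySPACEcenter, as set up in this tutorial.
EXPECTED_FILES = [
    "storage/winedata/red/data.csv",
    "storage/winedata/red/metadata.yaml",
    "storage/winedata/white/data.csv",
    "storage/winedata/white/metadata.yaml",
    "specs/operations/winedata.yaml",
]


def missing_tutorial_files(center_dir):
    """Return the expected tutorial files that do not exist under center_dir."""
    root = Path(center_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]

# Example call (the path is an assumption, adjust it to your setup):
# print(missing_tutorial_files(Path.home() / "pySPACEcenter"))
```

An empty list means all five files are in place.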