A regression tutorial in pySPACE
Introduction
Like most machine learning software, pySPACE can perform regression on selected data sets. In this introductory tutorial, we will look at how some simple regression tasks can be run with pySPACE.
Preparing the data
For the purpose of this tutorial, we will use a dataset from the UCI repository, namely the Wine Quality Dataset. The task is rather simple: based on the 11 physicochemical input features and the accompanying quality score, train a regression algorithm that can predict the quality of a new wine sample.
Before we can perform the regression, we must prepare the dataset so that it can be processed by pySPACE. We thus go to the pySPACEcenter directory (which was initialized when the software was first installed) and make a new folder inside the storage folder. Let’s call our new folder winedata and make two subfolders inside it, namely red and white. Your directory structure should now look like this:
pySPACEcenter\
    ...\
    storage\
        ...\
        winedata\
            red\
            white\
At this point, you are ready to download the datasets. Go to the Wine Quality data directory and save each of the two *.csv files as data.csv in the corresponding directory. Thus, the original winequality-red.csv is now red/data.csv, and likewise for the white wine. Your directory structure should now look like:
pySPACEcenter\
    ...\
    storage\
        ...\
        winedata\
            red\
                data.csv
            white\
                data.csv
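The folder layout above can also be created with a few lines of Python. The following is only a minimal stand-alone sketch (Python 3), not part of pySPACE: the helper name `prepare_wine_folders` is hypothetical, and the temporary directory merely stands in for your actual pySPACEcenter. The two CSV files still have to be saved into the resulting folders as data.csv.

```python
import os
import tempfile


def prepare_wine_folders(pyspace_center):
    """Create storage/winedata/red and storage/winedata/white.

    Returns a dict mapping each wine colour to the path where its
    data.csv is expected. Hypothetical helper, not part of pySPACE.
    """
    targets = {}
    for colour in ("red", "white"):
        folder = os.path.join(pyspace_center, "storage", "winedata", colour)
        os.makedirs(folder, exist_ok=True)  # no error if it already exists
        targets[colour] = os.path.join(folder, "data.csv")
    return targets


if __name__ == "__main__":
    # Assumption: a throw-away directory stands in for your pySPACEcenter.
    demo_center = tempfile.mkdtemp()
    for colour, path in sorted(prepare_wine_folders(demo_center).items()):
        print(colour, "->", path)
```

To use it on a real installation, pass the path of your own pySPACEcenter instead of the temporary directory.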
The final step in preparing the dataset is to generate the metadata.yaml files. This can be done either by hand or with an automatic script, namely md_creator. If you want to use the automatic script, first go to your data directory (either the red or the white one) and then type:
$ python <PATH_TO_PYSPACE>/pySPACE/run/scripts/md_creator.py
The script will then ask you a couple of questions regarding the format of your data, namely:
Running meta data creator ...
Please enter the name of the file. --> data.csv
Please enter the storage_format of the data.
one of arff, csv (csv with header), csvUnnamed (csv without header)--> csv
Please enter csv delimiter/separator.
(e.g. ',' , ' ' , ';' or ' ' for a tab, default:',')-->;
Please enter all rows that can be ignored, separated by comma or range.
eg. [1, 2, 3] or [1-3] -->
Please enter all columns that can be ignored, separated by comma or range.
The first row gets number 1.eg. [1, 2, 3] or [1-3] -->
Please enter the column that contains the label. -1 for last column
--> -1
Meta data file metadata.yaml written.
At this point, there is a new metadata file in your red or white directory, which should read:
author: <YOUR_NAME>
date: <TODAY>
type: feature_vector
ignored_rows: []
storage_format: [csv, real]
file_name: data.csv
label_column: -1
ignored_columns: []
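To see what the answers given to the script mean for the data (a `;` delimiter, a header row, and the label in the last column), the following stand-alone Python 3 sketch parses a CSV excerpt in the same way. It is an illustration only and does not reproduce how pySPACE reads the file; the function name `split_features_and_label` and the three-column sample are made up for this example.

```python
import csv
import io


def split_features_and_label(csv_text, delimiter=";", label_column=-1):
    """Parse CSV text that starts with a header row.

    Returns (feature_names, label_name, feature_rows, labels).
    """
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    names = next(reader)                  # header row with column names
    label_name = names.pop(label_column)  # -1 selects the last column
    feature_rows, labels = [], []
    for record in reader:
        if not record:
            continue                      # skip blank lines
        values = [float(v) for v in record]
        labels.append(values.pop(label_column))
        feature_rows.append(values)
    return names, label_name, feature_rows, labels


# Illustrative excerpt only: made-up values in a reduced, three-column layout.
SAMPLE = '"alcohol";"pH";"quality"\n9.4;3.51;5\n9.8;3.2;6\n'

if __name__ == "__main__":
    names, label_name, rows, labels = split_features_and_label(SAMPLE)
    print("features:", names)
    print("label:", label_name, labels)
```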
The final version of your directory structure (for the purposes of this tutorial) should therefore be:
pySPACEcenter\
    ...\
    storage\
        ...\
        winedata\
            red\
                data.csv
                metadata.yaml
            white\
                data.csv
                metadata.yaml
Building the node chain
Now that we have nicely organized data, we can start doing something with it. The following example is based on nodes that directly wrap scikit-learn algorithms. Therefore, in order to run the following node chain, you need to install scikit-learn.
We plan to do the following to our dataset:
- Preprocessing (so that the data is nicely formatted, not too high-dimensional and normalized)
- Regression with a k-nearest-neighbors regressor (extending this to another regression model is a matter of changing a couple of lines in the definition of the node chain)
- Analyzing the results (and implicitly seeing how well our algorithms have performed)
In order to do all of the above, we need to define a node_chain in YAML format. For that, go to the pySPACEcenter/specs/operations/ directory and open your favorite text editor. Save the new file as winedata.yaml. The first lines of the file should state what the file represents, i.e. a node_chain, and where to look for the input data (relative to the pySPACEcenter/storage/ directory). If you have followed the above steps for saving your input, the first lines of your winedata.yaml file should read:
type: node_chain
runs: 1
input_path: "winedata"
Next up is the content of the node chain itself. Whenever a node chain is defined, it must start with a SourceNode. In our case, we will be using the feature_vector_source, since we want to cast our data into the feature vector type (pySPACE.resources.data_types.feature_vector). Your winedata.yaml file should now look like:
type: node_chain
runs: 1
input_path: "winedata"
node_chain:
    -
        node: FeatureVectorSourceNode
Now that the node chain has a source node, we can start the preprocessing. Since we want to keep things simple for the purpose of this tutorial, we will just add two safeguards (in case the initial data contains int or NaN values) through Int2FloatNode and NaN2NumberNode. We will then split our data set into training and test data using TrainTestSplitterNode and normalize the values using OutlierFeatureNormalizationNode. Translating this into YAML directives yields:
type: node_chain
runs: 1
input_path: "winedata"
node_chain:
    -
        node: FeatureVectorSourceNode
    -
        node: NaN2Number
    -
        node: Int2Float
    -
        node: TrainTestSplitter
        parameters:
            train_ratio: 0.7
            random: False
    -
        node: OutlierFeatureNormalization
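Conceptually, the last two nodes split the samples into a training and a test part and rescale the features using statistics from the training part. The stand-alone Python 3 sketch below illustrates the idea under two loudly stated assumptions: that a non-random split with train_ratio 0.7 takes the first 70% of the samples for training, and that plain min-max scaling is an acceptable stand-in for OutlierFeatureNormalization. Neither is taken from the pySPACE implementation, and all function names are hypothetical.

```python
def train_test_split_ordered(samples, train_ratio=0.7):
    """Split without shuffling: first part for training, rest for testing.

    Assumption: this mimics a non-random split; pySPACE may differ.
    """
    cut = int(round(len(samples) * train_ratio))
    return samples[:cut], samples[cut:]


def fit_min_max(train_rows):
    """Compute per-feature minima and maxima from the training rows only."""
    columns = list(zip(*train_rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, minima, maxima):
    """Rescale each feature to [0, 1] relative to the training range.

    Stand-in for OutlierFeatureNormalization, not its actual algorithm.
    Constant features are mapped to 0.0 to avoid a division by zero.
    """
    scaled = []
    for row in rows:
        scaled.append([
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, minima, maxima)
        ])
    return scaled


if __name__ == "__main__":
    rows = [[float(i), 10.0 * i] for i in range(10)]  # made-up samples
    train, test = train_test_split_ordered(rows, train_ratio=0.7)
    minima, maxima = fit_min_max(train)
    print(len(train), "training rows,", len(test), "test rows")
    print(apply_min_max(test, minima, maxima))
```

Fitting the scaling on the training rows only means that test values can fall outside [0, 1], which is expected.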
Good. Now our data is well behaved and we can perform regression on it. For this purpose, we will pick a regressor node from the sklearn suite, namely KNeighborsRegressor. While this node might not be the optimal choice, it is a well-behaved and well-understood one and therefore suitable for the purposes of this tutorial.
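For readers unfamiliar with the method, k-nearest-neighbors regression predicts the target of a new sample as the average target of its k closest training samples. The following stand-alone sketch (Python 3.8 or later, for `math.dist`) shows that idea in plain Python. It is not the scikit-learn implementation; the function name `knn_predict` is hypothetical, and the default k = 5 with unweighted averaging is chosen to mirror scikit-learn's documented defaults for KNeighborsRegressor.

```python
import math


def knn_predict(train_rows, train_targets, query, k=5):
    """Predict a target as the mean of the k nearest training targets.

    Distances are Euclidean; all neighbours are weighted equally.
    """
    by_distance = sorted(
        (math.dist(row, query), target)
        for row, target in zip(train_rows, train_targets)
    )
    nearest = by_distance[:k]  # fewer than k if the training set is small
    return sum(target for _, target in nearest) / len(nearest)


if __name__ == "__main__":
    # Made-up one-dimensional training data.
    rows = [[0.0], [1.0], [10.0]]
    targets = [1.0, 3.0, 100.0]
    print(knn_predict(rows, targets, [0.5], k=2))
```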
Now that this is done, we just have to add a sink node at the end of our node chain. Since we want this sink node to check the performance of our node chain, we will be using PerformanceSinkNode. The final version of your winedata.yaml file should look like:
type: node_chain
runs: 1
input_path: "winedata"
node_chain:
    -
        node: FeatureVectorSourceNode
    -
        node: NaN2Number
    -
        node: Int2Float
    -
        node: TrainTestSplitter
        parameters:
            train_ratio: 0.7
            random: False
    -
        node: OutlierFeatureNormalization
    -
        node: KNeighborsRegressorSklearnNode
    -
        node: PerformanceSinkNode
        parameters:
            evaluation_type: "regression"
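Evaluating a regression means comparing the predicted quality scores with the true ones on the test data. The stand-alone Python 3 sketch below computes three common error measures as an illustration. The function name and the metric names in the returned dictionary are assumptions made for this example; they are not claimed to match the metrics or column names that PerformanceSinkNode writes.

```python
import math


def regression_metrics(predicted, actual):
    """Return common regression error measures for paired values."""
    errors = [p - a for p, a in zip(predicted, actual)]
    n = len(errors)
    mean_squared_error = sum(e * e for e in errors) / n
    return {
        "mean_absolute_error": sum(abs(e) for e in errors) / n,
        "mean_squared_error": mean_squared_error,
        "root_mean_squared_error": math.sqrt(mean_squared_error),
    }


if __name__ == "__main__":
    # Made-up predictions and true quality scores.
    for name, value in sorted(
        regression_metrics([5.0, 6.0, 7.0], [5.0, 5.0, 9.0]).items()
    ):
        print(name, round(value, 3))
```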
Congratulations! You have just finished writing your first regression node chain in pySPACE! In order to run it, go to your pySPACEcenter and type the following command in the terminal:
$ python launch.py --operation winedata.yaml