cv_splitter

Module: missions.nodes.splitter.cv_splitter

Create splits of the data into train and test data used for cross-validation

Inheritance diagram for pySPACE.missions.nodes.splitter.cv_splitter:

CrossValidationSplitterNode

class pySPACE.missions.nodes.splitter.cv_splitter.CrossValidationSplitterNode(splits=10, stratified=True, random=True, time_dependent=False, stratified_class=None, *args, **kwargs)

Bases: pySPACE.missions.nodes.base_node.BaseNode

Perform (stratified) cross-validation

During benchmarking, n pairs of training and test data are generated, where n is configurable via the parameter splits. The n test datasets are pairwise disjoint. Internally, the available data is partitioned into n pairwise disjoint sets s_1, ..., s_n of approximately equal size (the “splits”). The i-th pair of training and test data is generated by using s_i as test data and the union of the remaining sets as training data.
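
The pairing scheme described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the node's actual implementation: the function name make_cv_pairs and the round-robin assignment of items to folds are assumptions made for the example.

```python
def make_cv_pairs(data, splits):
    """Partition `data` into `splits` disjoint folds and build (train, test) pairs.

    Hypothetical helper for illustration; not part of pySPACE.
    """
    # Deal the items round-robin so that fold sizes differ by at most one.
    folds = [data[i::splits] for i in range(splits)]
    pairs = []
    for i, test in enumerate(folds):
        # The training data is the union of all folds except the i-th one.
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        pairs.append((train, test))
    return pairs


data = list(range(10))
pairs = make_cv_pairs(data, splits=5)
```

Each item appears in exactly one test set, and every training set is the complement of its test set.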

The partitioning is stratified by default, i.e. the splits have the same class ratio as the overall dataset. By default, the partitioning is also based on shuffling the data randomly. In this case, the partitioning of the data into s_1, ..., s_n is determined solely by the run number (used as random seed), yielding the same split for the same run number and different splits for different run numbers.
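
As a rough sketch of how stratification and seeded shuffling can interact, the following example shuffles the indices of each class separately with a generator seeded by the run number and then deals them out to the splits. The function stratified_splits and its per-class round-robin strategy are assumptions for illustration; the node's internal procedure may differ.

```python
import random
from collections import defaultdict


def stratified_splits(labels, splits, run_number):
    """Return `splits` index lists that roughly keep the overall class ratio.

    Hypothetical helper for illustration; not part of pySPACE.
    """
    # Seeding with the run number makes the partition reproducible per run.
    rng = random.Random(run_number)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(splits)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class evenly over all folds.
        for position, index in enumerate(indices):
            folds[position % splits].append(index)
    return folds


labels = ["target"] * 6 + ["standard"] * 4
folds = stratified_splits(labels, splits=2, run_number=0)
```

Calling the function again with the same run_number reproduces the same folds, since the generator starts from the same seed.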

Parameters

splits:

The number of splits created internally. If n data points exist and m splits are created, each of these splits consists of approx. n/m data points.

(optional, default: 10)
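
For a quick feel of the resulting split sizes: when the number of data points is not a multiple of the number of splits, the sizes can only be approximately equal. The helper below is a hypothetical illustration that gives the leftover points to the first splits; how the node actually distributes the remainder is not stated here.

```python
def split_sizes(n_points, n_splits):
    """Sizes of `n_splits` folds over `n_points` items, differing by at most one.

    Hypothetical helper for illustration; not part of pySPACE.
    """
    base, remainder = divmod(n_points, n_splits)
    # The first `remainder` folds receive one extra data point each.
    return [base + 1 if i < remainder else base for i in range(n_splits)]


sizes = split_sizes(103, 10)
```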

stratified:

If True, the cross-validation is stratified, i.e. the overall class ratio is retained in each split as closely as possible.

(optional, default: True)

random:

If true, the order of the data is randomly shuffled.

(optional, default: True)

time_dependent:

If True, splitting is done separately for different (i.e. non-overlapping) time windows to ensure that instances corresponding to the same marker end up in the same split.

Note

Stratification is only allowed here if each marker has exactly one class label.

(optional, default: False)
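
The core idea of keeping instances of the same marker together can be illustrated with a simple group-wise assignment. This sketch assumes that every instance carries a marker identifier and that whole groups are dealt out to the splits in order of first appearance; the function name grouped_splits and this assignment rule are illustrative assumptions, not the node's documented behaviour.

```python
from collections import defaultdict


def grouped_splits(marker_ids, splits):
    """Assign instance indices to folds so that each marker stays in one fold.

    Hypothetical helper for illustration; not part of pySPACE.
    """
    groups = defaultdict(list)
    for index, marker in enumerate(marker_ids):
        groups[marker].append(index)
    folds = [[] for _ in range(splits)]
    # Dicts keep insertion order, so groups are visited in order of appearance.
    for position, indices in enumerate(groups.values()):
        folds[position % splits].extend(indices)
    return folds


marker_ids = ["m1", "m1", "m2", "m2", "m3", "m3", "m4"]
folds = grouped_splits(marker_ids, splits=2)
```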

stratified_class:
 

If time_dependent is True and stratified_class is specified, stratification is only done for the specified class label (a string). Instances of the other class fill up the splits while preserving the time order of the data. This also means that random has no effect here.

(optional, default: None)

Exemplary Call

-
    node : CV_Splitter
    parameters :
          splits : 10
          stratified : True

Author:

Jan Hendrik Metzen (jhm@informatik.uni-bremen.de)

Created:

2008/12/16

POSSIBLE NODE NAMES:
 
  • CrossValidationSplitterNode
  • CrossValidationSplitter
  • CV_Splitter

POSSIBLE INPUT TYPES:
 
  • PredictionVector
  • FeatureVector
  • TimeSeries

Class Components Summary

__hyperparameters
_create_splits(): Create the split of the data for n-fold cross-validation
input_types
is_split_node(): Return whether this is a split node
request_data_for_testing(): Returns the data for testing of subsequent nodes
request_data_for_training(use_test_data): Returns the data for training of subsequent nodes
train_sweep(use_test_data): Performs the actual training of the node.
use_next_split(): Use the next split of the data into training and test data.

__init__(splits=10, stratified=True, random=True, time_dependent=False, stratified_class=None, *args, **kwargs)

is_split_node()

Return whether this is a split node

use_next_split()

Use the next split of the data into training and test data.

Returns True if more splits are available, otherwise False.

This method is useful for benchmarking, where all splits are evaluated one after another.

train_sweep(use_test_data)

Performs the actual training of the node.

Note

Split nodes cannot be trained.

request_data_for_training(use_test_data)

Returns the data for training of subsequent nodes

request_data_for_testing()

Returns the data for testing of subsequent nodes
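
To show how use_next_split(), request_data_for_training() and request_data_for_testing() might be combined in a benchmarking loop, the sketch below uses a toy stand-in class. ToySplitter is hypothetical and only mimics the method names and the return value of use_next_split() documented here; it also assumes that the first split is already active before use_next_split() is called for the first time.

```python
class ToySplitter:
    """Hypothetical stand-in that mimics the documented method names."""

    def __init__(self, data, splits):
        self.folds = [data[i::splits] for i in range(splits)]
        self.current = 0  # assumption: the first split is active initially

    def use_next_split(self):
        # Advance to the next split; report whether one was available.
        if self.current + 1 < len(self.folds):
            self.current += 1
            return True
        return False

    def request_data_for_training(self, use_test_data=False):
        # Everything outside the current fold serves as training data.
        return [x for j, fold in enumerate(self.folds)
                if j != self.current for x in fold]

    def request_data_for_testing(self):
        return self.folds[self.current]


splitter = ToySplitter(list(range(6)), splits=3)
seen_test_sets = []
while True:
    train = splitter.request_data_for_training(use_test_data=False)
    test = splitter.request_data_for_testing()
    seen_test_sets.append(test)
    # Stop once no further split is available.
    if not splitter.use_next_split():
        break
```

The loop processes the current split first and only then asks for the next one, so every split is visited exactly once.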

_create_splits()

Create the split of the data for n-fold cross-validation

__hyperparameters = set([NoOptimizationParameter<input_dim>, NoOptimizationParameter<dtype>, NoOptimizationParameter<output_dim>, NoOptimizationParameter<random>, NoOptimizationParameter<retrain>, NoOptimizationParameter<time_dependent>, NoOptimizationParameter<kwargs_warning>, NoOptimizationParameter<store>, NoOptimizationParameter<stratified>])
input_types = ['PredictionVector', 'FeatureVector', 'TimeSeries']