Module: resources.dataset_defs.feature_vector

Load and store data sets containing Feature Vectors

Inheritance diagram for pySPACE.resources.dataset_defs.feature_vector:



class pySPACE.resources.dataset_defs.feature_vector.FeatureVectorDataset(dataset_md=None, classes_names=[], feature_names=None, num_features=None, **kwargs)[source]

Bases: pySPACE.resources.dataset_defs.base.BaseDataset

Feature vector dataset class

This class is mainly responsible for loading and storing FeatureVector objects to and from the file system. Data can be loaded with a feature_vector_source node and saved with a feature_vector_sink node in a NodeChainOperation.

The constructor expects the argument dataset_md that contains a dictionary with all the meta data. It is normally loaded from the metadata.yaml file.

It is able to load csv, arff and pickle files, where one file always holds one training or test set. The naming conventions are the same as described in TimeSeriesDataset. It is important that a metadata.yaml file exists, giving all the relevant information on the data set, especially the storage format, which can be pickle, arff, csv or csvUnnamed. The last format is only for loading data without a heading and with the labels not in the last column.


See TimeSeriesDataset for name conventions (in the tutorial).



The arff format was introduced to connect pySPACE with weka. So when using weka, you need to choose this file format as the storage format parameter in the preprocessing operation’s spec file.

CSV (comma separated values)

These are tables in a simple text format: each column is separated by a comma and each row by a new line. Normally the first line gives the feature names and one column contains the class labels. Therefore several parameters need to be specified in the metadata.yaml file.

If no collection meta data is available for the input data, the ‘metadata.yaml’ file can be generated with md_creator. Please consider also some important parameters, described in the get_data function.

Preferably the labels are in the last column, which corresponds to label_column being -1 in the metadata.yaml file.

Special CSV Parameters


label_column: Column containing the labels

Normally this column loses its heading. When saving the csv file, the default, -1, is used.

(recommended, default: -1)


ignored_columns: List of the numbers of irrelevant columns, e.g., [1,2,8,42]

After the data is loaded, this parameter becomes obsolete.

(optional, default: [])


ignored_rows: As ‘ignored_columns’, but with ‘column’ replaced by ‘row’ in the description.

(optional, default: [])


separator: Symbol which separates the csv entries

Typically ‘,’ is used, or the tabulator ‘\t’. When storing, ‘,’ is used.

(recommended, default: ‘,’)
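Putting the parameters above together, a metadata.yaml for a csv feature set might look roughly like this. This is only a sketch: the key names label_column, ignored_columns and separator follow the parameter descriptions above, while the remaining entries and values are illustrative and should be checked against the output of md_creator:

```yaml
type: feature_vector
storage_format: csv
label_column: -1      # labels in the last column (default)
ignored_columns: []   # no columns skipped
separator: ","        # symbol separating the csv entries
```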


dataset_md: dictionary containing meta data for the collection to be loaded

The following three parameters contain standard information for a feature vector data set. Normally they are not needed, because a dataset_md is given and real data is loaded, so this information can be taken from the data. Nevertheless these are important entries that should be found in every dataset_md, giving information about the data set.


classes_names: list of the used class labels


feature_names: list of the feature names

The feature names are either determined while loading the data, if available in the respective storage_format, or they are later on set to a default string (e.g., feature_0_0.000sec).


num_features: number of the given features

Class Components Summary

add_sample(sample, label, train[, split, run]) Add a sample to this collection
dump(result_path, name) Dumps this collection into a file.
get_data(run_nr, split_nr, train_test) Loads the data from the feature file of the current input collection depending on the storage_format.
store(result_dir[, s_format]) Stores this collection in the directory result_dir.
__init__(dataset_md=None, classes_names=[], feature_names=None, num_features=None, **kwargs)[source]

Read out the data from the given collection


Note: The main loading concept was copied from the time series collection; it should be checked whether this code can be moved to the upper class.

add_sample(sample, label, train, split=0, run=0)[source]

Add a sample to this collection

Adds the sample sample along with its class label label to this collection.



sample: The respective data sample

label: The label of the data sample

train: If train, this sample has already been used for training

split: The number of the split this sample belongs to

(optional, default: 0)

run: The run number this sample belongs to

(optional, default: 0)
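The bookkeeping behind add_sample can be sketched with a plain dictionary keyed by (run, split, train/test). This is a simplified stand-in to illustrate how samples are grouped, not the actual pySPACE implementation:

```python
from collections import defaultdict

class MiniFeatureCollection:
    """Minimal sketch of the add_sample bookkeeping (not the real class)."""

    def __init__(self):
        # data[(run, split, "train"/"test")] -> list of (sample, label) pairs
        self.data = defaultdict(list)

    def add_sample(self, sample, label, train, split=0, run=0):
        key = (run, split, "train" if train else "test")
        self.data[key].append((sample, label))

collection = MiniFeatureCollection()
collection.add_sample([0.1, 0.2], "Target", train=True)
collection.add_sample([0.3, 0.4], "Standard", train=False, split=1)
```

Each (run, split, train/test) combination keeps its own list, which matches the way the data sets are later stored separately per combination.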

dump(result_path, name)[source]

Dumps this collection into a file.

Dumps (i.e., pickles) this collection object into a bz2-compressed file. In contrast to store, this method stores the whole collection in a single file; no meta data is stored in a YAML file etc.

The method expects the following parameters:
  • result_path The path to the directory in which the pickle file will be written.
  • name The name of the pickle file
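A dump along these lines can be reproduced with the standard library alone. The following sketch mirrors the description above (pickle into one bz2-compressed file in result_path); the ".pickle" file name suffix is an assumption for illustration, not taken from the source:

```python
import bz2
import os
import pickle

def dump(collection, result_path, name):
    """Pickle the whole collection object into one bz2-compressed file."""
    os.makedirs(result_path, exist_ok=True)
    file_path = os.path.join(result_path, name + ".pickle")  # suffix assumed
    with bz2.BZ2File(file_path, "wb") as f:
        pickle.dump(collection, f)
    return file_path

path = dump({"samples": [[0.1, 0.2]], "labels": ["Target"]},
            os.path.join(os.path.expanduser("~"), "demo_dump"), "collection")
with bz2.BZ2File(path, "rb") as f:
    restored = pickle.load(f)
```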
get_data(run_nr, split_nr, train_test)[source]

Loads the data from the feature file of the current input collection depending on the storage_format. Separates the actual vectors from the names and returns both as lists.

The method expects the following


feature_file: the file of feature vectors to be loaded

storage_format: the format in which the feature_file was saved; the first component is one of [‘arff’, ‘real’], [‘csv’, ‘real’], [‘csvUnnamed’, ‘real’] or [‘pickle’, ‘real’]. This information needs to be present in the meta data.

For documentation on arff and pickle files, see the class description (docstring). Pickle files do not need any special loading because they already have the required format.


If no collection meta data is available for the input data, the ‘metadata.yaml’ file can be generated with pySPACE.run.node_chain_scripts.md_creator.

If you created the csv file with pySPACE, you automatically have the standard csv format with the feature names in the first row and the labels in the last column.

If you have a csv table without headings, you have the csvUnnamed format, and the labels can be found in the column specified by ‘label_column’ in your spec file.


Note: The main loading concept was copied from the time series collection; it should be checked whether this code can be moved to the upper class.
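For the standard csv case (feature names in the first row, labels in the last column), the separation of vectors and labels that get_data performs can be sketched as follows. This is a simplified illustration, not the real loader; ignored_columns is treated as a set of zero-based indices here, whereas the parameter description above uses column numbers:

```python
import csv
import io

def load_csv_features(text, label_column=-1, ignored_columns=()):
    """Split a named csv table into feature names, vectors and labels."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    label_idx = label_column % len(header)        # supports -1 indexing
    skip = set(ignored_columns) | {label_idx}     # drop irrelevant columns
    feature_names = [h for i, h in enumerate(header) if i not in skip]
    vectors = [[float(v) for i, v in enumerate(row) if i not in skip]
               for row in body]
    labels = [row[label_idx] for row in body]
    return feature_names, vectors, labels

names, vectors, labels = load_csv_features(
    "f1,f2,label\n0.1,0.2,Target\n0.3,0.4,Standard\n")
```

For the csvUnnamed format, the header handling would be dropped and label_column taken from the spec file instead.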

store(result_dir, s_format=['pickle', 'real'])[source]

Stores this collection in the directory result_dir.

In contrast to dump, this method stores the collection not in a single file but as a whole directory structure with meta information etc. The data sets are stored separately for each run, split and train/test combination.

The method expects the following parameters:
  • result_dir The directory in which the collection will be stored

  • name The prefix of the file names in which the individual data sets are stored. The actual file names are determined by appending suffixes that encode run, split, train/test information. Defaults to “features”.

  • s_format A list with information about the format in which the actual data sets should be stored. The first entry specifies the file format. If it is “arff”, the second entry specifies the attribute format.

    Examples: [“arff”, “real”], [“arff”, “{0,1}”]

    To store the data in comma separated values, use [“csv”, “real”].

    (optional, default: [“pickle”, “real”])
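To illustrate the [“arff”, “real”] storage format, a minimal writer for one train/test set could look like the following. This is a sketch of the general arff file layout weka expects (relation, real attributes, nominal class attribute, data rows), not the actual store method; the relation name and file name are illustrative:

```python
import os
import tempfile

def write_arff(file_path, feature_names, vectors, labels, class_labels):
    """Write one feature set as a minimal arff file with real attributes."""
    with open(file_path, "w") as f:
        f.write("@relation features\n")
        for name in feature_names:
            f.write("@attribute %s real\n" % name)
        # nominal class attribute listing the possible labels
        f.write("@attribute class {%s}\n" % ",".join(class_labels))
        f.write("@data\n")
        for vector, label in zip(vectors, labels):
            f.write(",".join(str(v) for v in vector) + "," + label + "\n")

path = os.path.join(tempfile.gettempdir(), "demo.arff")
write_arff(path, ["f1", "f2"], [[0.1, 0.2], [0.3, 0.4]],
           ["Target", "Standard"], ["Target", "Standard"])
```

For [“arff”, “{0,1}”], the attribute lines would declare the nominal format “{0,1}” instead of “real”.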