HOWTO: Data Handling¶
When starting with pySPACE, the first question that often arises is: “What do I have to do to get my data into this framework, and how can I get the right output?” The documentation on this issue is spread over several places, so this HOWTO gives an overview with the relevant links.
Internal data structures in pySPACE are defined in the resources package.
Here, we distinguish between samples and datasets.
Samples are grouped into datasets, and datasets can be grouped into summaries.
Note
Summaries are not a defined data structure in pySPACE, because a summary is nothing more than a folder with one subfolder for each contained dataset (as depicted in The Data Directory (storage)).
Data Handling for Benchmarking Applications¶
The Summary¶
The easiest way of getting your data into pySPACE is to transform it into a summary. When setup.py is used in the installation process, an example summary is saved on your hard disk. To create your own summary, first create a folder with the summary name in your storage folder.
Note
A summary should only include datasets of the same kind (e.g., only streaming data or only feature vector data). Do not mix kinds; otherwise, the processing will not be compatible with all datasets.
The Dataset¶
As a next step, you need at least one subfolder for the dataset you want to process with pySPACE. This folder needs two components:
- the data file(s),
- and a file called metadata.yaml, which contains all meta information, including the type and storage format of the data.
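The folder layout described above can be sketched in a few lines of Python 3. This is a minimal sketch under stated assumptions, not the layout produced by setup.py: the file name data.csv, the folder names, and the metadata keys and values written here are hypothetical placeholders. The dataset definition documents the keys that are actually expected.

```python
import csv
import os
import tempfile


def create_dataset_folder(storage_dir, summary_name, dataset_name, header, rows):
    """Create <storage>/<summary>/<dataset>/ holding a data file and metadata.yaml."""
    dataset_dir = os.path.join(storage_dir, summary_name, dataset_name)
    os.makedirs(dataset_dir)

    # Component 1: the data file ("data.csv" is a hypothetical file name).
    with open(os.path.join(dataset_dir, "data.csv"), "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

    # Component 2: the metadata file. Keys and values are assumptions made
    # for illustration; consult the dataset definition for the real ones.
    with open(os.path.join(dataset_dir, "metadata.yaml"), "w") as f:
        f.write("type: feature_vector\n")
        f.write("storage_format: csv\n")
    return dataset_dir


# Usage with a temporary directory standing in for the storage folder.
storage = tempfile.mkdtemp()
path = create_dataset_folder(
    storage, "my_summary", "dataset_1",
    ["f1", "f2", "class_label"],
    [[0.1, 0.2, "A"], [0.3, 0.4, "B"]],
)
print(sorted(os.listdir(path)))
```

Further datasets of the same kind would go into sibling folders next to dataset_1 inside my_summary.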
The loading procedure of datasets is defined in dataset_defs.
For reading new data, there are three main types of dataset definitions, for:
- streaming data,
- time series data,
- and feature vector data.
The type name in the metadata file is written with underscores and directly corresponds to the module name. The respective class name is the same but written in camel case notation and with Dataset appended to the name.
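The naming rule can be written down as a small helper. This is only an illustration of the rule as stated here. The function name is hypothetical, and the sketch does not check whether a class of the resulting name actually exists.

```python
def dataset_class_name(type_name):
    """Derive the class name from an underscore-separated type name."""
    # Capitalize every underscore-separated part and append "Dataset".
    camel_case = "".join(part.capitalize() for part in type_name.split("_"))
    return camel_case + "Dataset"


print(dataset_class_name("feature_vector"))
```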
The first two types deliver samples as time series objects and the last one delivers feature vector objects.
As a next step, you need to check whether your data format is supported by the corresponding dataset type. The respective documentation lists the details. All types support the csv format. For feature vector data, the arff format is supported, too. For streaming data, the BrainProducts eeg format, the EDF2 file format, and the EEGLAB set format are supported.
When processing streaming data with a node chain operation, you will also need an additional windower file that specifies how the data stream is segmented into time series objects. This is documented in the respective source node.
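To illustrate what segmenting a stream means, here is a minimal sketch that cuts a stream of samples into fixed-length, possibly overlapping windows. It is not the pySPACE windower and does not read a windower file. The function name and its parameters are assumptions made for this example, and a real windower may, for instance, cut windows relative to markers in the stream.

```python
def fixed_windows(samples, window_length, step):
    """Yield windows of `window_length` samples, one every `step` samples."""
    if not 1 <= step <= window_length:
        raise ValueError("step must be between 1 and window_length")
    buffer = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) == window_length:
            yield list(buffer)
            # Drop the oldest samples so the next window starts `step` later.
            del buffer[:step]


# Ten samples, windows of four samples, a new window every two samples.
windows = list(fixed_windows(range(10), window_length=4, step=2))
for window in windows:
    print(window)
```

Because the function is a generator, it also works on streams whose length is not known in advance.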
If your storage format is supported, you just have to add the corresponding storage_format parameter to your metadata file, as documented in the dataset definition.
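As a quick reference, the formats named in this HOWTO can be collected in a lookup table. The sketch below is an assumption-laden summary, not a pySPACE API. The dictionary keys and format labels are hypothetical shorthand and are not necessarily the values that storage_format expects. The dataset definitions may support more formats than the ones listed here.

```python
# Formats named in this HOWTO only; the labels are hypothetical shorthand.
FORMATS_NAMED_IN_HOWTO = {
    "stream": {"csv", "brainproducts_eeg", "edf2", "eeglab_set"},
    "time_series": {"csv"},
    "feature_vector": {"csv", "arff"},
}


def is_named_as_supported(dataset_type, storage_format):
    """Return True if this HOWTO names the format for the given dataset type."""
    try:
        formats = FORMATS_NAMED_IN_HOWTO[dataset_type]
    except KeyError:
        raise ValueError("unknown dataset type: %r" % dataset_type)
    return storage_format.lower() in formats


print(is_named_as_supported("feature_vector", "arff"))
print(is_named_as_supported("time_series", "arff"))
```

A negative answer only means that the format is not mentioned here, so the documentation of the dataset type remains the authoritative source.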
Case of not supported storage format¶
If your storage format is not supported, there are two possibilities: you can use an external tool that converts your data to a compatible format, or you can integrate the code for loading your format into the dataset definition.
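The first possibility, an external conversion, could look like the following minimal sketch. It assumes a made-up input format (whitespace-separated values, one sample per line, comment lines starting with #) and converts it to csv with the Python 3 standard library. The function name, the input format, and the column names are all hypothetical. The column layout that the csv loader expects is documented in the dataset definition.

```python
import csv
import io


def convert_to_csv(lines, out_file, header):
    """Write whitespace-separated records to `out_file` as csv rows."""
    writer = csv.writer(out_file)
    writer.writerow(header)
    n_rows = 0
    for line in lines:
        line = line.strip()
        # Skip empty lines and comment lines of the made-up input format.
        if not line or line.startswith("#"):
            continue
        writer.writerow(line.split())
        n_rows += 1
    return n_rows


raw = ["# sensor dump", "0.1 0.2 A", "", "0.3 0.4 B"]
out = io.StringIO()
n_converted = convert_to_csv(raw, out, ["f1", "f2", "class_label"])
print(out.getvalue())
```

For real files, an open file object takes the place of the in-memory buffer.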
Defining New Types of Data¶
If you cannot use the existing data types, extra effort is needed, and it is probably a good idea to seek a discussion with the software developers.
- New dataset definitions need to be implemented
- and new data_types, respectively.
- A source node will be required for getting the data into a node chain operation
- and a sink node for getting the data out of a node chain operation.
- Nodes or operations need to be implemented or modified to process this data.
A good example would be the integration of picture or video data into the framework. This data could also be handled as feature vector or time series data, but a special format might be a better choice.
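To make the feature vector option concrete, the following sketch flattens a small grayscale picture into a list of feature names and a list of values. It uses plain Python lists instead of the pySPACE feature vector objects. The function name and the naming scheme for the features are assumptions made for this example.

```python
def image_to_feature_vector(image):
    """Flatten a 2D grid of pixel intensities into (names, values)."""
    names, values = [], []
    for row_index, row in enumerate(image):
        for col_index, pixel in enumerate(row):
            # One feature per pixel, named after its position in the grid.
            names.append("pixel_%d_%d" % (row_index, col_index))
            values.append(float(pixel))
    return names, values


picture = [[0, 255], [128, 64]]
names, values = image_to_feature_vector(picture)
print(names)
print(values)
```

Flattening discards the explicit spatial structure of the picture, which is one reason why a special format might be the better choice.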
Data Handling for Direct Processing in Applications¶
For the application case, the aforementioned hard disk usage is infeasible, and data needs to be forwarded directly to the node chain that processes it, probably using launch_live. Therefore, an iterator is needed that produces objects of the pySPACE data_types. To achieve this, the ExternalGeneratorSourceNode is used.
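The kind of iterator meant here can be sketched as a Python generator. This is a minimal sketch under stated assumptions. StandInSample is a hypothetical placeholder for a real data_types class, the pairing of each object with a label is an assumption, and the way the iterator is handed to the ExternalGeneratorSourceNode is not shown because it depends on that node's interface.

```python
class StandInSample(list):
    """Hypothetical stand-in for an object of the pySPACE data_types."""


def generate_samples(raw_records, convert=StandInSample):
    """Lazily turn (values, label) records into (data object, label) pairs."""
    for values, label in raw_records:
        yield convert(values), label


# The records and labels below are illustrative placeholders.
records = [([1.0, 2.0], "Target"), ([3.0, 4.0], "Standard")]
samples = list(generate_samples(records))
for data, label in samples:
    print(type(data).__name__, data, label)
```

Since the generator is lazy, records can come from any source that is itself an iterator, for example a network connection.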
For demonstration purposes, this functionality is implemented in the live tools. They contain a C++ based streaming software that can access EEG acquisition devices manufactured by BrainProducts (requires a proprietary driver for Windows, see the eegmanager tutorial). The data is sent via TCP/IP and can be unpacked and formatted accordingly by the client side. The received data is then handed over to the windower inside the LiveEegStreamManager. The created windows are then fed into the current node chain using the ExternalGeneratorSourceNode.
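The unpacking step on the client side can be illustrated with the standard struct module. The wire format used below (little-endian 16 bit integers, channels interleaved sample by sample) is purely an assumption for this sketch and is not the protocol of the live tools. The function name is hypothetical as well.

```python
import struct


def unpack_block(payload, n_channels):
    """Unpack interleaved little-endian int16 values into per-sample lists."""
    bytes_per_sample = 2 * n_channels
    if len(payload) % bytes_per_sample != 0:
        raise ValueError("payload does not contain whole samples")
    n_values = len(payload) // 2
    values = struct.unpack("<%dh" % n_values, payload)
    # Regroup the flat value sequence into one list per time point.
    return [list(values[i:i + n_channels]) for i in range(0, n_values, n_channels)]


# Two time points of a three-channel recording, packed as on the assumed wire.
payload = struct.pack("<6h", 1, 2, 3, 4, 5, 6)
block = unpack_block(payload, n_channels=3)
print(block)
```

The resulting per-sample lists are what a windower would then segment into windows.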
To have your own data processed live in pySPACE, you have to replace the LiveEegStreamManager with a class that fits the custom protocol or medium over which your raw data is transmitted. Currently, this involves replacing the use of this class by hand, but a modular architecture for handling different kinds of live data is intended for future releases.