csv_analysis

Module: tools.csv_analysis

Deal with csv files in general, and in particular with repairing them after classification

The functions provided here focus on two issues:

  1. Manipulation of csv files (load, save, change)
  2. Repair csv files after unsuccessful classification, e.g. to be able to perform an analysis operation

Examples

  1. Loading csv file, extracting relevant data, saving new csv file:

    Problem:

A csv file exists, but it is huge and you need only certain values, namely all entries with

    • Parameter __Range__=500
    • Parameter __Start__=100
    • Parameter __SamplingFreq__=25

    Solution:

    import csv_analysis
    data=csv_analysis.csv2dict('results.csv')
    conditions=csv_analysis.empty_dict(data)
    conditions['__Range__'].append('500')
conditions['__Start__'].append('100')
    conditions['__SamplingFreq__'].append('25')
    new_dict=csv_analysis.strip_dict(data, conditions)
    csv_analysis.dict2csv('new_results.csv', new_dict)
    
  2. Build results.csv after classification failure and complement with reconstructed conditions:

    Problem:

    A classification procedure failed or has been aborted. What is needed is a procedure that

    1. builds a results.csv from conditions that were ready
    2. identifies the conditions which were not ready
3. reconstructs missing conditions according to parameters inferable from the path and user-defined default values (e.g. AUC=0.5 and F_measure=0)
4. merges them into the existing results and saves.

    Solution short:

    from pySPACE.tools import csv_analysis
    from pySPACE.resources.dataset_defs.performance_result import PerformanceResultSummary
    mydefaults=dict()
    mydefaults['AUC']=0.5
    mydefaults['F_measure']=0
    PerformanceResultSummary.repair_csv(datapath, default_dict=mydefaults)
    

    Solution long:

    import csv_analysis
    from pySPACE.resources.dataset_defs.performance_result import PerformanceResultSummary
    num_splits=52
    PerformanceResultSummary.merge_performance_results(datapath)
    csv_dict = csv_analysis.csv2dict(datapath + '/results.csv')
    oplist=csv_analysis.check_op_libSVM(datapath)
    failures = csv_analysis.report_failures(oplist, num_splits)
    mydefaults=dict()
    mydefaults['AUC']=0.5
    mydefaults['F_measure']=0
    final_dict=csv_analysis.reconstruct_failures(csv_dict, failures,
                                        num_splits, default_dict=mydefaults)
    csv_analysis.dict2csv(datapath + '/repaired_results.csv', final_dict)
    
Author:Sirko Straube (sirko.straube@dfki.de), Mario Krell, Anett Seeland, David Feess
Created:2010/11/09

Function Summary

csv2dict(filename[, filter_keys, delimiter]) Load a csv file and return content in a dictionary
dict2csv(filename, data_dict[, delimiter]) Write a dictionary to a csv file in a sorted way
empty_dict(old_dict) Return a dictionary of empty lists with exactly the same keys as old_dict
strip_dict(data_dict, cond_dict[, ...]) Return a stripped dictionary according to the conditions specified with cond_dict and invert_mask
merge_dicts(dict1, dict2) Merge two dictionaries into a new one
merge_multiple_dicts(dictlist) Merge multiple dictionaries into a single one
add_key(orig_dict, key_str, key_list) Add a key to the dictionary with as many elements (rows) as other entries
extend_dict(orig_dict, extension_dict[, ...]) Extend one dictionary with another
average_rows(data_dict, key_list[, n, new_n]) Average across all values of the specified columns
parse_data(data_dict) Parse the data of type string to int and float values where possible
check_for_failures(data, num_splits, conditions) Compute a list of conditions for which the classification failed
check_op_libSVM([input_dir, delete_file]) Perform terminal operation to identify possible classification failures on the basis of number of files.
report_failures(oplist, num_splits) Sort output of terminal operation (e.g. performed by check_op_libSVM)
reconstruct_failures(csv_dict, ...[, ...]) Reconstruct classification failures in csv dictionary according to known parameters and default values.

Functions

csv2dict()

pySPACE.tools.csv_analysis.csv2dict(filename, filter_keys=None, delimiter=',', **kwargs)[source]

Load a csv file and return content in a dictionary

The dictionary has n list elements, with n being equal to the number of columns in the csv file. Additional keyword arguments are passed to the reader instance, e.g. a different delimiter than ‘,’ (see csv.reader).

Parameters

filename:Contains the filename as a string.
filter_keys:If a list of filter keys is specified, only the specified keys are kept and all others are discarded
delimiter:The delimiter between columns in the csv. Defaults to ‘,’, as csv stands for comma-separated values, but sometimes different symbols are used.
Author:Sirko Straube (sirko.straube@dfki.de)
Created:2010/11/09
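
The column-per-key layout can be sketched as follows. `csv2dict_sketch` is a hypothetical stand-in, not the library function: it parses an in-memory string instead of opening a file, so the example stays self-contained.

```python
import csv
from io import StringIO

def csv2dict_sketch(csv_text, filter_keys=None, delimiter=','):
    """Sketch of csv2dict's behavior: one list per column, keyed by header."""
    reader = csv.reader(StringIO(csv_text), delimiter=delimiter)
    header = next(reader)
    data = {key: [] for key in header}
    for row in reader:
        for key, value in zip(header, row):
            data[key].append(value)
    if filter_keys is not None:
        # keep only the requested columns, discard all others
        data = {key: data[key] for key in filter_keys}
    return data

table = "__Range__,AUC\n500,0.91\n250,0.87\n"
data = csv2dict_sketch(table)
# values stay strings until parse_data is applied
```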

dict2csv()

pySPACE.tools.csv_analysis.dict2csv(filename, data_dict, delimiter=',')[source]

Write a dictionary to a csv file in a sorted way

The function converts the dictionary into a list of dictionaries, with each entry representing one row in the final csv file. The dictionary can be of the form returned by csv2dict.

The sorting is in alphabetic order, uppercase letters first and variables starting with ‘__’ first.

Parameters
filename:Contains the filename as a string.
data_dict:Dictionary containing data as a dictionary of lists (one list for each column identified by the key).
delimiter:The delimiter between columns in the csv. Defaults to ‘,’, as csv stands for comma-separated values, but sometimes different symbols are used.
Author:Sirko Straube, Mario Krell
Created:2010/11/09
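
The described column ordering can be sketched as follows. `dict2csv_sketch` is a hypothetical illustration that writes to an in-memory buffer rather than a file; putting uppercase before lowercase here simply follows plain ASCII string comparison, which is an assumption about the implementation.

```python
import csv
from io import StringIO

def dict2csv_sketch(data_dict, delimiter=','):
    """Sketch: '__'-prefixed columns first, then ASCII-alphabetical order."""
    keys = sorted(data_dict, key=lambda k: (not k.startswith('__'), k))
    out = StringIO()
    writer = csv.writer(out, delimiter=delimiter)
    writer.writerow(keys)
    # row i collects the i-th element of every column list
    n_rows = len(next(iter(data_dict.values())))
    for i in range(n_rows):
        writer.writerow([data_dict[k][i] for k in keys])
    return out.getvalue()

text = dict2csv_sketch({'accuracy': ['0.9', '0.8'], '__Range__': ['500', '250']})
```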

empty_dict()

pySPACE.tools.csv_analysis.empty_dict(old_dict)[source]

Return a dictionary of empty lists with exactly the same keys as old_dict

Parameters

old_dict:Dictionary of lists (identified by the key).
Author:Sirko Straube
Created:2010/11/09

strip_dict()

pySPACE.tools.csv_analysis.strip_dict(data_dict, cond_dict, invert_mask=False, limit2keys=None)[source]

Return a stripped dictionary according to the conditions specified with cond_dict and invert_mask

This function is useful if only some parameter combinations are of interest. The values of interest can then be stored in cond_dict, and after execution of mynewdict=strip_dict(data_dict, cond_dict) all unnecessary information is eliminated from mynewdict.

Parameters

data_dict:Dictionary of lists (identified by the key). E.g. as returned by csv2dict.
cond_dict:Dictionary containing all keys and values that should be used to strip data_dict. E.g. constructed by empty_dict(data_dict) and subsequent modifications.
invert_mask:optional: If set to False, the cond_dict will be interpreted as positive list, i.e. only values are kept that are specified in cond_dict. If set to True, the cond_dict will be interpreted as negative list, i.e. only values are kept that are NOT specified in cond_dict. default=False
limit2keys:optional: Contains a list of key names (strings) that should be included in the returned dictionary. All other keys (i.e. columns) are skipped. default=None
Author:Sirko Straube
Created:2010/11/09
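
A minimal sketch of the stripping logic, assuming a row is kept when every non-empty condition column matches (and the opposite with invert_mask=True). `strip_dict_sketch` is hypothetical, not the actual implementation.

```python
def strip_dict_sketch(data_dict, cond_dict, invert_mask=False, limit2keys=None):
    """Sketch: filter rows by the condition lists, optionally inverted,
    optionally restricted to the columns named in limit2keys."""
    keys = limit2keys if limit2keys is not None else list(data_dict)
    n_rows = len(next(iter(data_dict.values())))
    result = {key: [] for key in keys}
    for i in range(n_rows):
        # empty condition lists impose no constraint on that column
        match = all(data_dict[k][i] in v for k, v in cond_dict.items() if v)
        if match != invert_mask:
            for key in keys:
                result[key].append(data_dict[key][i])
    return result

data = {'__Range__': ['500', '250', '500'], 'AUC': ['0.9', '0.8', '0.7']}
cond = {'__Range__': ['500'], 'AUC': []}
kept = strip_dict_sketch(data, cond)
```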

merge_dicts()

pySPACE.tools.csv_analysis.merge_dicts(dict1, dict2)[source]

Merge two dictionaries into a new one

Ideally both dictionaries have the same keys and lengths. The merge is performed even if the keys are not identical, but a warning is issued.

Parameters

dict1:the one dictionary
dict2:the other dictionary
Author:Mario Michael Krell
Created:2010/11/09
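
The merge can be sketched as list concatenation per key. How the real function handles non-shared keys is not specified in this documentation, so the empty-list fallback below is an assumption; `merge_dicts_sketch` is hypothetical.

```python
import warnings

def merge_dicts_sketch(dict1, dict2):
    """Sketch: concatenate the column lists; mismatched keys trigger a warning."""
    if set(dict1) != set(dict2):
        warnings.warn("dictionaries do not have identical keys")
    merged = {}
    for key in set(dict1) | set(dict2):
        # assumption: a key missing in one dict contributes nothing
        merged[key] = list(dict1.get(key, [])) + list(dict2.get(key, []))
    return merged

m = merge_dicts_sketch({'AUC': ['0.9'], '__Range__': ['500']},
                       {'AUC': ['0.8'], '__Range__': ['250']})
```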

merge_multiple_dicts()

pySPACE.tools.csv_analysis.merge_multiple_dicts(dictlist)[source]

Merge multiple dictionaries into a single one

This function merges every dictionary in the list into a single one. The merge is performed even if the keys are not identical (or of identical length), but a warning is issued once.

Parameters

dictlist:a list of dictionaries to merge
Author:Sirko Straube
Created:2011/04/20

add_key()

pySPACE.tools.csv_analysis.add_key(orig_dict, key_str, key_list)[source]

Add a key to the dictionary with as many elements (rows) as other entries

When called, this function adds one key to the dictionary (which is equal to adding one column to the csv table). The name of the key is specified in key_str, and the elements are specified in key_list. Note that the latter has to be a list. If key_list has only one element, it is expanded to match the number of rows in the table. If the key already exists, the original dictionary is returned without any modification.

Parameters

orig_dict:the dictionary to modify
key_str:string containing name of the dict key
key_list:either list containing all elements or list with one element which is appended n times
Author:Sirko Straube
Created:2011/04/20
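
The expansion rule can be sketched like this; `add_key_sketch` is a hypothetical illustration, not the library function.

```python
def add_key_sketch(orig_dict, key_str, key_list):
    """Sketch: add a column; a single-element list is repeated to full length."""
    if key_str in orig_dict:
        return orig_dict  # existing keys are left untouched
    n_rows = len(next(iter(orig_dict.values())))
    if len(key_list) == 1:
        key_list = key_list * n_rows  # expand to one value per row
    orig_dict[key_str] = key_list
    return orig_dict

d = {'AUC': ['0.9', '0.8', '0.7']}
add_key_sketch(d, '__Subject__', ['S1'])
```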

extend_dict()

pySPACE.tools.csv_analysis.extend_dict(orig_dict, extension_dict, retain_unique_items=True)[source]

Extend one dictionary with another

Note

This function returns a modified dictionary even if the extension dictionary is completely different (i.e., there is no check whether the extension makes sense; this guarantees maximal functionality).

Parameters

orig_dict:the dictionary to be extended and returned
extension_dict:the dictionary defining the extension
Author:Sirko Straube, Mario Michael Krell
Created:2010/11/09

average_rows()

pySPACE.tools.csv_analysis.average_rows(data_dict, key_list, n=None, new_n=None)[source]

Average across all values of the specified columns

Reduces the number of rows, i.e., the number of values in the lists, by averaging all values of a specific key, e.g., across all splits or subjects.

Note

It is assumed that for two parameters A and B which have a and b different values, the number of rows to average is a*b. If you have certain constraints so that the number of rows to average is not a*b, you have to specify n and new_n explicitly.

Parameters

data_dict:Dictionary as returned by csv2dict.
key_list:List of keys (equals column names in a csv table) over which the average is computed.
n:Number of rows that are averaged. If None it is determined automatically. default=None.
new_n:Number of rows after averaging. If None it is determined automatically. default=None.
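
With an explicit group size n, the averaging can be sketched as follows. `average_rows_sketch` is hypothetical and deliberately simplified: it averages consecutive groups of n rows and, as an assumption, keeps the first value of each group for the non-averaged columns.

```python
def average_rows_sketch(data_dict, key_list, n):
    """Sketch: collapse every n consecutive rows; listed columns become means."""
    new_dict = {}
    for key, values in data_dict.items():
        groups = [values[i:i + n] for i in range(0, len(values), n)]
        if key in key_list:
            new_dict[key] = [sum(g) / len(g) for g in groups]
        else:
            # assumption: a non-averaged column is constant within each group
            new_dict[key] = [g[0] for g in groups]
    return new_dict

# average the AUC of two splits per parameter setting
r = average_rows_sketch({'AUC': [0.8, 0.6, 1.0, 0.0],
                         '__Range__': [500, 500, 250, 250]}, ['AUC'], 2)
```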

parse_data()

pySPACE.tools.csv_analysis.parse_data(data_dict)[source]

Parse the data of type string to int and float values where possible

Parameters

data_dict:Dictionary as returned by csv2dict.
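
The parsing rule (try int, then float, otherwise keep the string) can be sketched as below; `parse_data_sketch` is a hypothetical stand-in.

```python
def parse_data_sketch(data_dict):
    """Sketch: convert each string to int or float where possible."""
    for key, values in data_dict.items():
        parsed = []
        for value in values:
            for cast in (int, float):
                try:
                    parsed.append(cast(value))
                    break
                except ValueError:
                    pass
            else:
                parsed.append(value)  # not numeric, keep the string
        data_dict[key] = parsed
    return data_dict

p = parse_data_sketch({'AUC': ['1', '0.5', 'unknown']})
```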

check_for_failures()

pySPACE.tools.csv_analysis.check_for_failures(data, num_splits, conditions, remove_count=False)[source]

Compute a list of conditions for which the classification failed

Given a possibly incomplete results.csv and a set of parameters as defined in an operation.yaml, this function compares all the expected combinations of parameters with what has actually been evaluated according to results.csv. It returns a list of failures, i.e., a list of dictionaries, each representing one combination of parameters for which results are missing.

Besides the actual parameters, the dictionaries in failures have one additional key ‘count’. The value of ‘count’ is the number of times this particular parameter setting occurred in the results file. The expected number of occurrences is the number of splits, ‘num_splits’. If the failures list is to be further used, it might be necessary to remove the count key again - if remove_count=True, this will be done automatically.

Note

Even though __Dataset__ is not explicitly stated in the operation.yaml, this function always requires you to specify the datasets as a parameter. See the following example.

Note

This implementation is highly inefficient, as it just loops through the results list and the list of expected parameter settings instead of using a sophisticated search algorithm. Large problems might thus take some time.

Parameters

data:Dictionary as returned by csv2dict. Usually this dictionary should contain the (incomplete) analysis results, hence it will in most cases be the product of something like csv2dict(‘results.csv’).
num_splits:Number of splits. The decision if the condition is interpreted as failure depends on this parameter.
conditions:A dictionary containing the parameter ranges as specified in the operation.yaml. Additionally, __Dataset__ has to be specified. See the following example.
remove_count:optional: controls if the count variable will be removed from the entries in the failures list. default=False

Exemplary Workflow

import csv_analysis
data=csv_analysis.csv2dict('results.csv')
conditions={}
conditions['__CLASSIFIER__']=['1RMM', '2RMM']
conditions['__C__']=[0.01, 0.1, 1.0, 10.0]
conditions['__Dataset__']=['Set1','Set2','Set3']
nsplits = 10
failures=csv_analysis.check_for_failures(data,nsplits,conditions,True)
Author:David Feess
Created:2011/04/05
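
The comparison of expected versus evaluated combinations can be sketched with itertools.product. `check_for_failures_sketch` is a hypothetical, deliberately naive illustration of the loop-based approach described in the note above, not the actual implementation.

```python
import itertools

def check_for_failures_sketch(data, num_splits, conditions):
    """Sketch: count the rows matching every expected parameter combination;
    report those that occur fewer than num_splits times."""
    keys = list(conditions)
    n_rows = len(next(iter(data.values())))
    failures = []
    for combo in itertools.product(*(conditions[k] for k in keys)):
        count = sum(1 for i in range(n_rows)
                    if all(data[k][i] == v for k, v in zip(keys, combo)))
        if count < num_splits:
            failure = dict(zip(keys, combo))
            failure['count'] = count  # removed again if remove_count=True
            failures.append(failure)
    return failures

data = {'__C__': ['0.1', '0.1', '1.0'], '__Dataset__': ['Set1', 'Set1', 'Set1']}
conditions = {'__C__': ['0.1', '1.0'], '__Dataset__': ['Set1']}
failures = check_for_failures_sketch(data, 2, conditions)
```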

check_op_libSVM()

pySPACE.tools.csv_analysis.check_op_libSVM(input_dir='.', delete_file=True)[source]

Perform terminal operation to identify possible classification failures on the basis of the number of files.

This works only for libSVM classification with stored results, as it relies on files stored in the persistency directories.

This function navigates to input_dir (which is the result directory of the classification) and checks the number of files starting with ‘features’ in ‘persistency_run0/LibSVMClassifierNode/’ in each subdirectory. In case the classification was successfully performed, the number of files here should equal the number of splits used. If not, this is a hint that something went wrong! The list returned by this function contains alternating (i) name of ‘root directory’ for the respective condition (ii) number of files ...

Note

This function only works if the feature*.pickle files are explicitly saved in your NodeChain!

Parameters

input_dir:optional: string with the path where csv files are stored. default=’.’
delete_file:optional: controls if the file ‘temp_check_op.txt’ will be removed default=True
Author:Sirko Straube, Anett Seeland
Created:2010/11/09

report_failures()

pySPACE.tools.csv_analysis.report_failures(oplist, num_splits)[source]

Sort output of terminal operation (e.g. performed by check_op_libSVM).

This function returns a list in which each element contains the parameters of a condition where the classification probably failed. This judgment is made by comparing the number of files found with the number expected from the number of splits used. See also: check_op_libSVM

Parameters

oplist:An iterable that has to contain (i) name of ‘root directory’ for the respective condition (ii) number of files ...

This parameter can either be the list returned by check_op_libSVM or a file type object (pointing to a manually constructed file).

num_splits:Number of splits. The decision if the condition is interpreted as failure depends on this parameter.
Author:Mario Krell, Sirko Straube
Created:2010/11/09

reconstruct_failures()

pySPACE.tools.csv_analysis.reconstruct_failures(csv_dict, missing_conds, num_splits, default_dict=None)[source]

Reconstruct classification failures in csv dictionary according to known parameters and default values.

This function takes the csv dictionary (probably constructed using merge_performance_results from PerformanceResultSummary) and reconstructs the classification failures defined in missing_conds (probably constructed using report_failures), using the known parameters given in missing_conds and the default values that may be specified in default_dict (probably constructed with the help of empty_dict and a subsequent modification). All other keys are set to the value ‘unknown’. Finally, the reconstructed dictionary is merged with the original csv dictionary and returned.

Parameters

csv_dict:

The data dictionary. Has the form returned by csv2dict.

missing_conds:

A list of dictionaries specifying the missing conditions. Has the form returned by report_failures.

num_splits:

Number of splits used for classification.

default_dict:

optional: A dictionary specifying default values for missing conditions. This dictionary can e.g. be constructed using empty_dict(csv_dict) and subsequent modification, e.g. default_dict[‘Metric’].append(0).

(optional, default: None)

Author:Mario Krell, Sirko Straube
Created:2010/11/09