subgroups.utils package

Submodules

subgroups.utils.dataframe_filters module

This file contains the implementation of different functions used to filter a pandas DataFrame according to certain criteria.

subgroups.utils.dataframe_filters.filter_by_list_of_selectors(pandas_dataframe, list_of_selectors)

Filter a pandas DataFrame, keeping only the rows covered by all the selectors in the parameter ‘list_of_selectors’. IMPORTANT: if the attribute name of any selector is not a column of the DataFrame passed as a parameter, a KeyError exception is raised.

Parameters:
  • pandas_dataframe (pandas.core.frame.DataFrame) – the DataFrame which is filtered.

  • list_of_selectors (list[subgroups.core.selector.Selector]) – the list of selectors used in the filtering process. IMPORTANT: the parameter ‘list_of_selectors’ is assumed to contain only selectors.

Return type:

pandas.core.frame.DataFrame

Returns:

the pandas DataFrame obtained after the filtering process.
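
To illustrate the filtering semantics described above, here is a minimal sketch that keeps only the rows satisfying every condition in a list. It deliberately does not use the library's Selector class: each condition is a hypothetical (attribute, comparison function, value) tuple, and filter_by_conditions is an illustrative name, not part of subgroups.utils. Indexing a missing column raises KeyError, which is consistent with the behaviour documented above.

```python
import operator

import pandas as pd


def filter_by_conditions(df, conditions):
    """Keep the rows of df that satisfy ALL (attribute, op, value) conditions.

    Hypothetical stand-in for selector-based filtering; not the library's code.
    """
    mask = pd.Series(True, index=df.index)
    for attribute, op, value in conditions:
        # df[attribute] raises KeyError when the column does not exist.
        mask &= op(df[attribute], value)
    return df[mask]


df = pd.DataFrame({"age": [25, 40, 61], "city": ["Rome", "Oslo", "Rome"]})
conditions = [("city", operator.eq, "Rome"), ("age", operator.gt, 30)]
result = filter_by_conditions(df, conditions)
print(result)  # only the row with index 2 satisfies both conditions
```

Combining the conditions with a logical AND means that an empty list of conditions leaves the DataFrame unchanged in this sketch.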

subgroups.utils.file_format_transformations module

This file contains the implementation of different functions used to transform the output files produced by the algorithms.

subgroups.utils.file_format_transformations.to_input_format_for_subgroup_list_algorithms(original_file_path, transformed_file_path)

Transform a file generated by a traditional SD algorithm (one that mines a subgroup set) into the input file format of the algorithms that mine subgroup lists.

Parameters:
  • original_file_path (str) – path of the original file.

  • transformed_file_path (str) – path of the transformed file.

Return type:

tuple[int, int]

Returns:

a 2-tuple of the form: (number of subgroups correctly read, number of subgroups not correctly read).

subgroups.utils.mdl module

This file contains the implementation of different functions used to apply the MDL (Minimum Description Length) principle.

subgroups.utils.mdl.log2_multinomial_with_recurrence(number_of_categories, number_of_samples)

Compute the logarithm to base 2 of the multinomial distribution complexity.

Parameters:
  • number_of_categories (int) – number of categories of the multinomial distribution.

  • number_of_samples (int) – number of samples (that is, instances or rows of the dataset).

Return type:

float

Returns:

the logarithm to base 2 of the multinomial distribution complexity or 0 if the multinomial distribution complexity is 0.

subgroups.utils.mdl.multinomial_with_recurrence(number_of_categories, number_of_samples)

Compute the multinomial distribution complexity.

Parameters:
  • number_of_categories (int) – number of categories of the multinomial distribution.

  • number_of_samples (int) – number of samples (that is, instances or rows of the dataset).

Return type:

float

Returns:

the multinomial distribution complexity.
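
For orientation, the sketch below computes the multinomial complexity C(k, n), the normalising term of the normalised maximum likelihood code for a multinomial with k categories and n samples, together with its base-2 logarithm. It assumes the well-known recurrence C(k + 2, n) = C(k + 1, n) + (n / k) * C(k, n), with C(1, n) = 1 and C(2, n) computed as a direct sum. The function names are illustrative and the library's own implementation may differ (for example, in how it handles large n or edge cases).

```python
from math import comb, log2


def binomial_complexity(n):
    """C(2, n) = sum over h of comb(n, h) * (h/n)^h * ((n-h)/n)^(n-h)."""
    if n == 0:
        return 1.0
    total = 0.0
    for h in range(n + 1):
        # In Python, 0.0 ** 0 evaluates to 1.0, so the boundary terms are 1.
        total += comb(n, h) * (h / n) ** h * ((n - h) / n) ** (n - h)
    return total


def multinomial_complexity(k, n):
    """C(k, n) for k >= 1 categories and n >= 0 samples (illustrative sketch)."""
    if k < 1 or n < 0:
        raise ValueError("expected k >= 1 and n >= 0")
    previous = 1.0  # C(1, n)
    if k == 1:
        return previous
    current = binomial_complexity(n)  # C(2, n)
    for j in range(1, k - 1):
        # Recurrence step: C(j + 2, n) = C(j + 1, n) + (n / j) * C(j, n)
        previous, current = current, current + (n / j) * previous
    return current


def log2_multinomial_complexity(k, n):
    """Base-2 logarithm of C(k, n)."""
    return log2(multinomial_complexity(k, n))


print(multinomial_complexity(3, 2))       # 4.5
print(log2_multinomial_complexity(3, 2))  # log2(4.5)
```

The direct sum multiplies exact integers by floats, so this version is only suitable for moderate n; a production implementation would typically work in log space or use an approximation.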

subgroups.utils.mdl.universal_code_for_integer(input_integer_value)

Compute the universal code L_N(i) for the input integer value.

Parameters:

  • input_integer_value (int) – integer value on which to compute the universal code.

Return type:

float

Returns:

the universal code L_N(i) for the input integer value.
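
As a point of reference, the sketch below follows the usual definition of Rissanen's universal code for positive integers: L_N(i) = log2(c0) + log2(i) + log2(log2(i)) + ..., where only the positive terms are summed and c0 is approximately 2.865064. The function name universal_code is illustrative, and it is an assumption that the library function follows exactly this definition.

```python
from math import log2

C0 = 2.865064  # normalising constant of Rissanen's universal code (approximate)


def universal_code(i):
    """Code length in bits of the positive integer i (illustrative sketch)."""
    if i < 1:
        raise ValueError("the universal code is defined for integers >= 1")
    total = log2(C0)
    term = log2(i)
    # Add the iterated logarithms for as long as they remain positive.
    while term > 0:
        total += term
        term = log2(term)
    return total


for value in (1, 2, 16, 1000):
    print(value, round(universal_code(value), 4))
```

The code length grows only slightly faster than log2(i), which is why this code is often used to encode integers whose upper bound is not known in advance.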

subgroups.utils.mdl.universal_code_for_integer_with_maximum(input_integer_value, maximum_integer_value)

Compute the universal code L_N(i) for the input integer value, when a maximum integer value exists.

Parameters:
  • input_integer_value (int) – integer value on which to compute the universal code.

  • maximum_integer_value (int) – maximum integer value existing.

Return type:

float

Returns:

the universal code L_N(i) for the input integer value.
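
The documentation above does not say how the maximum is taken into account. One possible scheme, shown here purely as an assumption, is to renormalise the implied probabilities 2^(-L_N(j)) over the range 1..maximum, so that the code lengths of the admissible integers correspond to a proper distribution. The names below are illustrative, and the library may use a different adjustment.

```python
from math import log2

C0 = 2.865064  # normalising constant of Rissanen's universal code (approximate)


def universal_code(i):
    """Rissanen's universal code length of i >= 1, summing positive terms only."""
    total = log2(C0)
    term = log2(i)
    while term > 0:
        total += term
        term = log2(term)
    return total


def universal_code_with_maximum(i, maximum):
    """Code length of i when only 1..maximum can occur (assumed renormalisation)."""
    if not 1 <= i <= maximum:
        raise ValueError("expected 1 <= i <= maximum")
    # Probability mass that the unbounded code assigns to 1..maximum.
    mass = sum(2.0 ** -universal_code(j) for j in range(1, maximum + 1))
    # Dividing each probability by `mass` adds log2(mass) <= 0 to the length.
    return universal_code(i) + log2(mass)


print(universal_code(5), universal_code_with_maximum(5, 10))
```

Under this assumption the bounded code is never longer than the unbounded one, and when maximum is 1 the single admissible value costs zero bits.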